ETL Pipeline or Data Pipelines - What is the Future?

Written by Todd Goldman | Category: Data Engineering

What is ETL?

It used to be that you had to extract (“E”) data from some source system, transform (“T”) it or integrate it with other data sources, and then load (“L”) the data into some target system for use by other applications, very often a business intelligence or data visualization layer.

Of course, the order might be different: you might do the “E,” then the “L,” then the “T” (the ELT variant).
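To make the pattern concrete, here is a minimal sketch in plain Python; the file, table, and column names are invented for illustration, and a real pipeline would handle far more data and far more edge cases.

```python
import csv
import sqlite3

# Extract: pull raw records from a source file (a hypothetical CSV export
# from some operational system).
with open("orders_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape the data before it reaches the target.
transformed = [
    {
        "order_id": int(r["order_id"]),
        "customer": r["customer"].strip().title(),
        "amount_usd": round(float(r["amount"]), 2),
    }
    for r in rows
    if r["amount"]  # drop records with a missing amount
]

# Load: write the cleaned records into a target table that a BI or
# visualization layer can query.
conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount_usd REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (:order_id, :customer, :amount_usd)", transformed
)
conn.commit()
conn.close()
```

In the ELT variant, the raw rows would be loaded into the target first and the cleanup would run there, typically as SQL.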

You also had to worry about data quality, and about issues around data governance and security: most importantly, defining who should have access to which data and what the retention policies should be, not to mention tracking where data came from and who was using it.

Difference Between ETL Pipeline and Data Pipeline

Today, there are more modern data architectures based on Spark and Hadoop, but the same basic challenges still exist.

You still need to:

  • Extract data from the legacy systems and load it into your data lake, whether it is on-premise or in the cloud.
  • Transform and integrate that data so it is in the right format and structure for use (a minimal sketch follows this list).
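As a rough sketch of those two steps on a Spark-based stack, the snippet below reads from a legacy database and lands the result in a data lake; the JDBC URL, credentials, table, and lake path are all placeholders, not references to a real system.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("legacy-to-data-lake").getOrCreate()

# Extract: read a table from a legacy relational system over JDBC
# (hypothetical connection details).
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://legacy-db:5432/crm")
    .option("dbtable", "public.customers")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Transform: standardize columns and drop duplicate records.
cleaned = (
    customers
    .withColumn("email", F.lower(F.col("email")))
    .dropDuplicates(["customer_id"])
)

# Load: land the result in the data lake in a columnar format. The same code
# works whether the path points at on-premise HDFS or cloud object storage.
cleaned.write.mode("overwrite").parquet("s3a://example-data-lake/curated/customers/")
```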

And you still have data integration development tools with GUIs, where icons represent the data and the actions taken on it, all connected by lines that form a data pipeline.

The new GUIs for modern enterprise data operations and orchestration (EDO2) systems have a more contemporary look and feel. And from people who don’t get into the details of new data integration solutions, I have often heard the comment, “Hey, this stuff looks like <name your favorite ETL vendor>!” That’s like saying a Toyota Corolla is the same as a Tesla Model S.

They both have a steering wheel, air conditioning, windows, seats, etc. However, when you open the hood, you find that a Corolla and a Tesla are wildly different.

At the 30,000-foot level, people say the same thing about legacy data integration in comparison to modern Enterprise Data Operations and Orchestration (EDO2) systems. However, when you look under the hood, some significant changes have occurred.

Previously, every data integration vendor had its own proprietary data integration engine, and the power and scalability of that engine was a key part of those vendors’ differentiation. When I was at Informatica, we would talk about how much easier it was to set up a grid computing structure to run our ETL engine, and about how we parallelized data ingestion. And if you had to scale, you would add more nodes to your dedicated ETL cluster.

The key word here is “dedicated”: the systems you ran your ETL processing on were dedicated to data integration and nothing else.

Advantages of Modern Data Integration Tools

Modern EDO2 systems turn the above concepts on their head. The “engine” is now baked into the distributed storage/compute cluster that you also use for running queries against your data. Data integration solutions now use the native execution engine available in the cluster to provide compute and processing scalability.

Companies expect modern data integration solutions to use the same compute engine (usually Spark or some flavor of Hadoop) that is already in place. Data integration jobs should also be portable from one compute environment to another.
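As a hedged illustration of that expectation, the sketch below keeps the integration logic identical and confines the differences between environments to configuration; the environment names and warehouse paths are invented for the example.

```python
import sys
from pyspark.sql import SparkSession

# One logical job, several possible execution environments. Only the storage
# locations differ between them; the transformation logic does not change.
ENVIRONMENTS = {
    "on_prem": {"warehouse": "hdfs://namenode:8020/lake"},
    "cloud_a": {"warehouse": "s3a://example-lake"},
    "cloud_b": {"warehouse": "abfss://lake@examplestore.dfs.core.windows.net"},
}


def run(env_name: str) -> None:
    warehouse = ENVIRONMENTS[env_name]["warehouse"]
    # The job runs on whichever Spark cluster it is submitted to; there is no
    # separate, dedicated integration engine.
    spark = SparkSession.builder.appName("orders-pipeline").getOrCreate()

    orders = spark.read.parquet(f"{warehouse}/raw/orders")
    daily = orders.groupBy("order_date").sum("amount")
    daily.write.mode("overwrite").parquet(f"{warehouse}/curated/daily_orders")


if __name__ == "__main__":
    run(sys.argv[1] if len(sys.argv) > 1 else "on_prem")
```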

The idea is that a user creates a data integration job in a GUI. That job can then be run on any number of big data platforms, whether on-premise or in the cloud. Conceptually, the picture below depicts this scenario for environments running some flavor of Spark or Hadoop.

ETL Pipeline vs. Data Pipeline: The Future of Modern Data Integration

As it turns out, all of the execution engines and storage systems have significant differences from one another, so if you decide to hand-code your data pipelines, you need to be prepared to rewrite them completely if you want them to perform well on a different execution engine.

That’s part of the value of enterprise data operations and orchestration (EDO2) systems. They are independent of the underlying execution engine but optimize their output to run at high performance on whatever execution engines they support. The value of taking a logical representation of a data pipeline and being able to run it on any number of big data platforms has some significant and very positive implications:

Enterprise Data Operations and Orchestration (EDO2) vs. ETL Pipeline Tools

  • Platform independence. EDO2: You can run on-premise one day and decide to move into the cloud without having to recode your data pipelines, or, given that most organizations have multiple clouds, move from cloud to cloud as necessary without recoding. ETL tools: The old approach tied you to a specific data integration vendor and its proprietary engine. If you chose to hand-code, you would have to re-code every time you moved your pipeline code from one cloud vendor to another; there are too many little differences that tend to force a complete rewrite.

  • Engine investment. EDO2: Investment in the distributed computing engines is shared across the open-source community and serves many use cases, so these engines attract more global investment and scale much better than the old proprietary engines. ETL tools: Investment goes into a single vendor’s engine, and those engines tend to evolve very slowly, limited by the brainpower within a given software company.

  • Purpose of the platform. EDO2: The platform used for data integration also serves other purposes, such as the actual analytics and business intelligence, which has the side effect of requiring far less movement of the data itself. ETL tools: The compute platform was only for data integration, which meant a separate investment in a compute environment and forced users always to move data.

  • Batch and real time. EDO2: Batch and real-time data integration can use the same development platform and compute infrastructure (see the sketch below). ETL tools: Batch and real-time data integration tend to use completely different development and computing infrastructure.
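As one illustration of that last point, Spark Structured Streaming lets the same transformation function serve both a batch backfill and a real-time feed; the Kafka broker, topic, and lake paths below are placeholders.

```python
from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.appName("orders-batch-and-streaming").getOrCreate()


def enrich(orders: DataFrame) -> DataFrame:
    # The same integration logic, regardless of how the data arrives.
    return (
        orders
        .withColumn("amount_usd", F.col("amount").cast("double"))
        .filter(F.col("amount_usd") > 0)
    )


# Batch: a scheduled or one-off run over files already in the lake.
batch = enrich(spark.read.parquet("s3a://example-lake/raw/orders"))
batch.write.mode("append").parquet("s3a://example-lake/curated/orders")

# Real time: the identical function applied to a streaming source
# (hypothetical Kafka broker and topic).
raw_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "order_id LONG, amount STRING").alias("o"))
    .select("o.*")
)
(enrich(raw_stream).writeStream
    .format("parquet")
    .option("path", "s3a://example-lake/curated/orders")
    .option("checkpointLocation", "s3a://example-lake/checkpoints/orders")
    .start())
```

The point is not the specific API but that one development platform and one cluster cover both cases.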

But for those folks who think that modern data operations and orchestration systems are the same as old-fashioned ETL, I would say it is time to take a much closer look under the hood.

You don’t know what you are missing.

About this Author
Todd Goldman
Todd is the VP of Marketing and a Silicon Valley veteran with more than 20 years of experience in marketing and general management. Prior to Infoworks, Todd was the CMO of Waterline Data and COO at Bina Technologies (acquired by Roche Sequencing). Before Bina, Todd was Vice President and General Manager for Enterprise Data Integration at Informatica, where he was responsible for their $200MM PowerCenter software product line. Todd has also held marketing and leadership roles at both start-ups and large organizations including Nlyte, Exeros (acquired by IBM), ScaleMP, Netscape/AOL and HP.

