Automation

Agile Data Engineering & DataOps vs Old Fashioned ETL

Posted by Todd Goldman

It used to be that you had to extract ("E") data from some source system, transform ("T") it or integrate it with other data sources, and then load ("L") the data into some target system for use by other applications, very often a business intelligence or data visualization layer.  Of course, the order might be different: you might do the "E", then the "L", then the "T". You also had to worry about data quality, and about data governance and security issues such as who should have access to what data, what the retention policies should be, and tracking where data came from and where it was being used.
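For a concrete, if toy, picture of that classic pattern, here is a minimal sketch in plain Python; the file, table, and column names are hypothetical placeholders, not anything from a particular tool.

```python
# Minimal ETL sketch: extract from a CSV export, transform, load into SQLite.
# File name, table name, and columns are hypothetical placeholders.
import csv
import sqlite3

def extract(path):
    """E: pull raw rows out of a source system export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """T: clean and reshape the data for the target model."""
    return [
        (row["customer_id"], row["country"].strip().upper(), float(row["amount"]))
        for row in rows
        if row["amount"]  # drop records with a missing amount (a simple quality rule)
    ]

def load(records, conn):
    """L: write the conformed records into the target system."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("crm_export.csv")), conn)
```

In an ELT variant of the same sketch, the raw rows would be loaded first and the transformation would run inside the target system.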

Today, there are more modern data architectures based on Hadoop and Spark, but the same basic challenges still exist.  You still need to extract data from legacy systems and load it into your data lake, whether it is on premises or in the cloud.  You still need to transform and integrate data so it is in the right format and structure for use. And you still have data integration development tools with GUIs, where icons represent the data and the actions taken on it, connected by lines that form a data pipeline.
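A rough equivalent on the modern stack, assuming PySpark and hypothetical paths and column names, might land a legacy export in the lake and conform it like this:

```python
# Sketch of the same idea on a modern stack: land legacy data in the lake,
# then transform it with Spark. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("legacy_to_lake").getOrCreate()

# Extract: read the raw export that was dumped out of the legacy system.
raw = spark.read.option("header", True).csv("/landing/crm_export.csv")

# Transform: conform types and formats so the data is ready for use.
conformed = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("country", F.upper(F.trim(F.col("country"))))
       .dropna(subset=["amount"])
)

# Load: write it into the data lake (on-premises HDFS or cloud object storage).
conformed.write.mode("overwrite").parquet("/datalake/sales/")
```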

And while the GUIs for more modern and agile data engineering platforms have a more modern look and feel, at a cursory glance they are fundamentally very similar to the old ETL tools.  In fact, from folks who don't get into the details of new data integration solutions, I have often heard the comment, "Hey, this stuff looks like <name your favorite ETL vendor>!" That is like saying a Toyota Corolla is the same as a Tesla.  They both have a steering wheel, four wheels, windows, seats, and so on. However, when you open the hood, you find that a Corolla and a Tesla are completely different.

As it turns out, at the 30,000-foot level, the same thing can be said about legacy data integration compared to a more modern agile data engineering and DataOps platform.  However, when you look under the hood, some significant changes have occurred. It used to be that every data integration vendor had its own proprietary data integration engine.  In fact, the power and scalability of that engine was a key part of those vendors' differentiation.

When I was at Informatica, we would talk about how much easier it was to set up a grid computing structure to run our ETL engine.  We would also talk about how we parallelized data ingestion, and how, if you needed to scale, you just added more nodes to your dedicated ETL cluster.  The key word here is "dedicated": the systems you chose to run your ETL processing on were dedicated to data integration and nothing else.

Modern agile data engineering and DataOps turn these concepts on their head. The "engine" is now baked into the distributed storage/compute cluster that you also use for running queries against your data.  Data integration solutions now use the native execution engine available in the cluster, whether it is Hive, Impala, Spark, etc., to provide compute and processing scalability.  Companies now expect a modern data integration solution to use the same compute engine (usually some flavor of Hadoop) they have already selected, and they expect data integration jobs to be portable from one compute environment to another.
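As a hedged illustration of what "using the native engine" can look like, here is a sketch in which an integration step is simply handed to Spark SQL running against Hive tables already in the cluster; the database, table, and column names are placeholders:

```python
# Sketch: a data integration step pushed down to the cluster's own engine.
# Here the "engine" is Spark SQL running against Hive tables that already
# live in the cluster; database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("native_engine_integration")
    .enableHiveSupport()          # use the cluster's Hive metastore
    .getOrCreate()
)

# The transformation is expressed once; the shared cluster engine does the work,
# rather than a dedicated, proprietary ETL grid.
spark.sql("""
    CREATE TABLE IF NOT EXISTS curated.sales_by_country AS
    SELECT country, SUM(amount) AS total_amount
    FROM raw.sales
    GROUP BY country
""")
```

The point of the sketch is that no separate, dedicated ETL grid is involved; the same shared cluster that serves queries also executes the transformation.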

The idea is that a user creates a data integration job using a GUI, and that job can then be run on any number of big data platforms, whether they are on premises or in the cloud.  Conceptually, this is what is shown in the picture below. And while the environments in the picture are all some flavor of Hadoop, they have enough significant differences that if you hand-code your data pipelines to run well on one big data platform, you pretty much have to rewrite them to perform well on another.  That is part of the value of agile data engineering platforms: they are independent of the underlying execution engine.
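One simplified way to picture that separation of pipeline logic from platform is sketched below; the storage URIs are hypothetical, and a real agile data engineering platform would also take care of the engine-specific tuning that hand-coded jobs need:

```python
# Sketch of the portability idea: the pipeline logic is written once and the
# target environment is injected as configuration. The URIs are hypothetical.
from pyspark.sql import SparkSession, functions as F

ENVIRONMENTS = {
    "on_prem": {"input": "hdfs:///landing/sales/",    "output": "hdfs:///datalake/sales/"},
    "cloud":   {"input": "s3a://acme-landing/sales/", "output": "s3a://acme-lake/sales/"},
}

def run_pipeline(env_name):
    env = ENVIRONMENTS[env_name]
    spark = SparkSession.builder.appName(f"sales_pipeline_{env_name}").getOrCreate()

    # The logical pipeline: read, aggregate, write. Nothing here is tied to
    # a particular cluster; only the configuration above changes.
    df = spark.read.parquet(env["input"])
    result = df.groupBy("country").agg(F.sum("amount").alias("total_amount"))
    result.write.mode("overwrite").parquet(env["output"])

run_pipeline("on_prem")   # the same logical job, pointed at a different platform
```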

In the end, the value of taking a logical representation of a data pipeline and being able to run it on any number of big data platforms has some significant and very positive implications:

 

New Agile Data Engineering Platform vs. Old Data Integration

New: This new approach provides platform independence.  You can run on premises one day and decide to move into the cloud the next, without having to recode your data pipelines.
Old: The old approach tied you to a specific data integration vendor and their proprietary engine.

New: Investment in the distributed computing engines is shared across the open source community, and the engines are used for many use cases.  As a result, they receive far more global investment and scale much better than the old proprietary engines.
Old: Investment was made by a single vendor in its own engine.  These engines tended to evolve very slowly, limited by the brain power within a single software company.

New: The platform used for data integration is also used for other purposes, like the actual analytics and business intelligence.  A side effect is that much less movement of the data itself is required.
Old: The compute platform used for data integration was used only for data integration.  That meant a separate investment in a compute environment, and it forced users to constantly move data.

New: Batch and real-time data integration can be done using the same development platform and the same compute infrastructure (a minimal sketch follows this comparison).
Old: Batch and real-time data integration tend to use completely different development and compute infrastructure.
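Here is that sketch: a rough illustration, assuming Spark Structured Streaming and hypothetical paths and schema, of the same transformation logic serving both a batch run and a streaming run.

```python
# Sketch: the same transformation applied in batch and in streaming mode,
# using Spark's unified DataFrame API. Paths and schema are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("batch_and_streaming").getOrCreate()

schema = StructType([
    StructField("country", StringType()),
    StructField("amount", DoubleType()),
])

def summarize(df):
    # One piece of integration logic, reused for both modes.
    return df.groupBy("country").agg(F.sum("amount").alias("total_amount"))

# Batch: process everything that has already landed.
batch = spark.read.schema(schema).json("/landing/sales/")
summarize(batch).write.mode("overwrite").parquet("/datalake/sales_summary/")

# Streaming: apply the same logic to files as they arrive.
stream = spark.readStream.schema(schema).json("/landing/sales/")
(summarize(stream)
    .writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("sales_summary_live")
    .start())
```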

 

Note that this blog post covers only one aspect of the improvement provided by modern data integration.  In future blog posts we will cover additional value-added capabilities, like the higher level of automation delivered in the newer data integration solutions, and their ability to extend beyond ingestion and data transformation all the way to generating cubes and in-memory models, so data is truly ready for consumption by data visualization tools like Tableau.

But for those folks who think that modern agile data engineering and data integration platforms are the same as old-fashioned ETL, I would say it is time to take a much closer look under the hood.  You don't know what you are missing.

About this Author
Todd Goldman