It used to be you had to extract data (“E”) from some source system, transform (“T”) it somehow or integrate it with other data sources and then load (“L”) the data into some target system for use by some other applications, very often a business intelligence or data visualization layer.
Of course, the order might be different. You might do the “E,” then the “L,” then the “T.”
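To make the ordering difference concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the in-memory lists stand in for real source and target systems, and the `extract`, `transform`, and `load` functions are illustrative names, not any vendor's API. It only shows that ETL transforms data in flight, while ELT lands the raw data first and transforms it inside the target.

```python
# Hypothetical in-memory stand-in for a source system.
source_rows = [
    {"customer": " alice ", "amount": "10.50"},
    {"customer": "BOB", "amount": "3.25"},
]

def extract(source):
    """E: pull raw records out of the source system."""
    return [dict(row) for row in source]

def transform(rows):
    """T: clean up names and convert amounts to numbers."""
    return [
        {"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, target):
    """L: write records into the target system (here, a plain list)."""
    target.extend(rows)
    return target

def run_etl(source):
    """ETL: transform in flight, then load the cleaned rows."""
    return load(transform(extract(source)), [])

def run_elt(source):
    """ELT: load the raw rows first, then transform inside the target."""
    target = load(extract(source), [])
    target[:] = transform(target)
    return target

etl_result = run_etl(source_rows)
elt_result = run_elt(source_rows)
```

Both orderings end with the same cleaned rows in the target. What changes is where the transformation work happens.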
You also had to worry about data quality and about data governance and security. Most importantly, you had to define who should have access to what data and what the retention policies should be, not to mention track where data came from and who was using it.
Today, there are more modern data architectures based on Hadoop and Spark, but the same primary challenges still exist.
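You still need to extract, transform, and load data, and you still need to manage data quality, governance, and security.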
And you still have data integration development tools with GUIs, in which icons represent the data and the actions taken on that data, all connected by lines that form a data pipeline.
The new GUIs for more modern and agile data engineering platforms have a more contemporary look and feel. But at a cursory glance, they look fundamentally very similar to the old ETL tools.
From folks who don’t get into the details of new data integration solutions, I have often heard the comment, “Hey, this stuff looks like <name your favorite ETL vendor>!” That’s like saying a Toyota Corolla is the same as a Tesla Model S.
They both have a steering wheel, air conditioning, windows, seats, etc. However, when you open the hood, you find that a Corolla and a Tesla are wildly different.
At the 30,000-foot level, people say the same thing about legacy data integration in comparison to a modern dataops platform. However, when you look under the hood, some significant changes have occurred.
Previously, every data integration vendor had its own proprietary data integration engine. Moreover, the power and scalability of that engine was a key part of the differentiation of those vendors.
When I was at Informatica, we would talk about how it was much easier to set up a grid computing structure to run our ETL engine. We would also talk about how we parallelized data ingestion. And if you had to scale, you needed to add more nodes to your dedicated ETL cluster.
The key word here is “dedicated.” The systems you chose to run your ETL processing on were dedicated to data integration and nothing else.
Modern agile data engineering and dataops turn the above concepts on their head.
The “engine” is now baked into the distributed storage/compute cluster that you also use for running queries against your data. Data integration solutions now use the native execution engine available in the cluster to provide compute and processing scalability.
Companies expect that modern data integration solutions should use the same compute engine—usually some flavor of Hadoop—already in place. Also, data integration jobs should be portable from one compute environment to another.
The idea is that a user will create a data integration job using some GUI. The data integration job could then be run on any number of big data platforms, whether they are on-premises or in the cloud.
Conceptually, the picture below depicts the above scenario for all of the environments featuring some flavor of Hadoop. As it turns out, they all have significant differences from one another.
One issue arises should you decide to hand code your data pipelines to run well on one big data platform. You pretty much have to rewrite these data pipelines if you want them to perform well on a different big data platform.
That’s part of the value of agile data engineering platforms. They are independent of the underlying execution engine.
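Here is a rough sketch of what that independence means. The pipeline is described once as a list of logical steps, and two hypothetical engines, `LocalEngine` and `LazyEngine`, each decide how to execute it. These are toy stand-ins written for this illustration, not real Hadoop or Spark bindings. A real platform would translate the same logical steps into jobs for whatever engine the cluster provides.

```python
# A logical pipeline: an ordered list of (step kind, function) pairs.
# Nothing here refers to a particular execution engine.
pipeline = [
    ("filter", lambda r: r["amount"] > 0),
    ("map", lambda r: {**r, "amount_cents": int(round(r["amount"] * 100))}),
]

class LocalEngine:
    """Hypothetical engine: runs each step eagerly over a Python list."""

    def run(self, steps, rows):
        data = list(rows)
        for kind, fn in steps:
            if kind == "filter":
                data = [r for r in data if fn(r)]
            elif kind == "map":
                data = [fn(r) for r in data]
            else:
                raise ValueError(f"unknown step kind: {kind}")
        return data

class LazyEngine:
    """Hypothetical engine: chains lazy iterators and evaluates at the end."""

    def run(self, steps, rows):
        stream = iter(rows)
        for kind, fn in steps:
            if kind == "filter":
                stream = filter(fn, stream)
            elif kind == "map":
                stream = map(fn, stream)
            else:
                raise ValueError(f"unknown step kind: {kind}")
        return list(stream)

rows = [{"amount": 1.25}, {"amount": -2.0}, {"amount": 0.1}]

# The same logical pipeline runs unchanged on either engine.
local_result = LocalEngine().run(pipeline, rows)
lazy_result = LazyEngine().run(pipeline, rows)
```

The pipeline definition never changes. Only the engine that interprets it does, which is the property that lets a job move from one platform to another without being recoded.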
But in the end, being able to take a logical representation of a data pipeline and run it on any number of big data platforms has some significant and very positive implications:
| New Agile Data Engineering Platform | Old Data Integration |
|---|---|
| This new approach provides platform independence. You can run on-premise one day and decide to move into the cloud without having to recode your data pipelines. | The old approach tied you to a specific data integration vendor and their proprietary engine. |
| Investment in the distributed computing engines is shared in the open source community and used for lots of use cases. The result is that these engines have more global investment and scale much better than the old proprietary engines. | Investment in a single vendor for its engine. These engines tend to evolve very slowly, limited by the brain power within a given software company. |
| The platform for data integration is also meant for other purposes like doing the actual analytics and business intelligence. This has the side effect of requiring a lot less movement of the data itself. | The compute platform for data integration was only for data integration. This meant separate investment in a compute environment and also forces users always to have to move data. |
| Batch and real-time data integration can use the same development platform and compute infrastructure. | Batch and real-time data integration tend to use completely different development and computing infrastructure. |
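The last point in the comparison, batch and real-time sharing the same development platform, can be sketched as follows. This is an illustration under stated assumptions: `enrich` is a made-up piece of business logic with a hypothetical 10% markup, and the "stream" is just a Python iterator rather than a real message queue.

```python
from typing import Dict, Iterable, Iterator, List

def enrich(event: Dict[str, float]) -> Dict[str, float]:
    """Shared logic: add a total that includes a hypothetical 10% markup."""
    return {**event, "total": round(event["amount"] * 1.10, 2)}

def run_batch(events: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Batch mode: process the whole data set in one pass."""
    return [enrich(e) for e in events]

def run_streaming(events: Iterable[Dict[str, float]]) -> Iterator[Dict[str, float]]:
    """Streaming mode: process each event as it arrives."""
    for e in events:
        yield enrich(e)

events = [{"amount": 100.0}, {"amount": 20.0}]

batch_result = run_batch(events)
stream_result = list(run_streaming(iter(events)))
```

The transformation is written once and reused in both modes, so there is no second code base to keep in sync.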
Note that this blog post is only covering one aspect of the improvement provided by modern data integration. In future blog posts, we will cover additional value-added capabilities.
For example, we will cover the higher level of automation delivered in the newer data integration solutions. We will also cover the ability of the newer solutions to extend beyond ingestion and data transformation, all the way to generating cubes and in-memory models, so data is truly ready for consumption by data visualization tools like Tableau.
But for those folks who think that modern agile data engineering and data integration platforms are the same as old-fashioned ETL, I would say it is time to take a much closer look under the hood.
You don’t know what you are missing.