I recently wrote a blog post about how “There is greater demand for professionals in Data Engineering than Data Science” and got some great comments and questions on the LinkedIn version of that post. One of the questions came from Amir Bahmanyari, who wrote:
So we know by now what a DS’s (Data Scientists) skills are. Besides being ETL experts, which have been around for decades now, what are the MODERN skills necessary in Data Engineering?
Great question, Amir! First, let me start by defining “Data Engineering” and then expand from there. My definition is:
Data Engineering comprises all engineering and operational tasks required to make data available for analytics. This includes but is not limited to: Data Ingestion, Data Synchronization, Data Transformation, Data Models, Data Governance, Performance Optimization, and Production Orchestration.
While I don’t think that data engineering professionals deal solely with big data, you may notice that most data engineering positions are associated with more modern technologies. In fact, they are less associated with traditional big data ETL tools.
Here’s a more involved breakdown for those interested in Data Engineering.
Data Ingestion
Data ingestion involves getting data out of source systems and ingesting it into a data lake. A data engineer would need to know how to efficiently extract the data from a source, including multiple approaches for both batch and real-time extraction.
Additionally, they need to know about both standard connections like JDBC and high-speed proprietary connections like Teradata Parallel Transporter (TPT).
Modern data engineering professionals should also know how to deal with incremental data loading, how to fit extraction within small source windows, and how to parallelize data loading.
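To make incremental and parallel loading a little more concrete, here is a minimal Python sketch. It assumes each source row carries an `updated_at` value that only grows (a "high-watermark" approach), and that the source can be split into independent partitions. The function names and row layout are hypothetical, chosen only for illustration, and a real extraction would read from a database connection rather than in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_increment(rows, last_watermark):
    """Return rows newer than the watermark, plus the highest value seen."""
    fresh = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=last_watermark)
    return fresh, new_watermark


def parallel_extract(partitions, last_watermark, max_workers=4):
    """Extract every partition concurrently and combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the input partitions.
        results = list(
            pool.map(lambda part: extract_increment(part, last_watermark), partitions)
        )
    rows = [row for fresh, _ in results for row in fresh]
    watermark = max((wm for _, wm in results), default=last_watermark)
    return rows, watermark


# Hypothetical source data, already split into two partitions.
partitions = [
    [{"id": 1, "updated_at": 5}, {"id": 2, "updated_at": 12}],
    [{"id": 3, "updated_at": 15}],
]
new_rows, next_watermark = parallel_extract(partitions, last_watermark=10)
```

Storing `next_watermark` after each successful run means the next run only touches rows that changed since then, which is one way to stay inside a small source window.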
Data Synchronization
Some consider data synchronization to be a subtask of data ingestion, but I have listed it separately because it is such a big issue in the big data world: Hadoop and other big data platforms don’t support incremental loading of data.
The data engineering professional needs to know how to detect changes in source data (change data capture, or CDC) and how to merge and sync the changed data from sources into a big data environment. For a more detailed discussion on this topic, read the incremental data synchronization section in a previous blog post.
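As a rough sketch of the detect-then-merge idea, the Python below compares two snapshots keyed by primary key and applies the resulting changes to a target copy. Snapshot comparison is only one of several ways to do CDC (log-based capture is another), and the function names and data here are assumptions for illustration rather than any particular tool's API.

```python
def detect_changes(previous, current):
    """Compare two snapshots (dicts keyed by primary key) and classify changes."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    deletes = [k for k in previous if k not in current]
    return inserts, updates, deletes


def apply_changes(target, inserts, updates, deletes):
    """Merge detected changes into a copy of the target data set."""
    merged = dict(target)
    merged.update(inserts)
    merged.update(updates)
    for key in deletes:
        merged.pop(key, None)
    return merged


# Hypothetical snapshots of a small customer table.
yesterday = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
today = {1: {"name": "Ann"}, 2: {"name": "Robert"}, 4: {"name": "Dee"}}

inserts, updates, deletes = detect_changes(yesterday, today)
synced = apply_changes(yesterday, inserts, updates, deletes)
```

After the merge, the target matches the current source without a full reload of unchanged rows.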
Data Transformation
Data transformation is the “T” in ETL and focuses on integrating and transforming data for a specific use case.
The major skill set here is knowledge of SQL. As it turns out, not much has changed in terms of the type of data transformations that people are doing now vs. in a purely relational environment. Even the big data environments are often SQL-like, or they support graphical or spreadsheet-like tools for doing the transformation.
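For a sense of what such a SQL transformation looks like, here is a small sketch that runs a filter-and-aggregate query through Python's built-in `sqlite3` module. SQLite simply stands in for whatever engine is actually in use, and the `orders` table, its columns, and the function name are hypothetical.

```python
import sqlite3


def revenue_by_customer(orders):
    """Load raw order tuples and aggregate completed orders per customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    rows = conn.execute(
        """
        SELECT customer, SUM(amount) AS revenue, COUNT(*) AS order_count
        FROM orders
        WHERE status = 'complete'
        GROUP BY customer
        ORDER BY customer
        """
    ).fetchall()
    conn.close()
    return rows


# Hypothetical raw orders: (customer, amount, status).
raw_orders = [
    ("acme", 10.0, "complete"),
    ("acme", 5.0, "complete"),
    ("zeta", 7.0, "cancelled"),
    ("zeta", 3.0, "complete"),
]
summary = revenue_by_customer(raw_orders)
```

The same `SELECT ... GROUP BY` shape carries over largely unchanged to SQL-like engines on big data platforms, which is the point made above.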
Data Governance
Data engineering teams are not responsible for the governance of the data per se. But they do have to ensure that the systems needed for data access control and data lineage, for example, are in place and support the capabilities necessary for good data governance.
When data engineering teams implement a set of DataOps tools for data ingestion, sync, transformation, and models, they need to be aware of data governance concepts and be sure that the tooling and platform also support good governance.
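To illustrate one of these capabilities, data lineage, here is a toy Python sketch that records which inputs each data set was built from and then walks back to every upstream source. The `LineageLog` class and the data set names are hypothetical, and real governance platforms track far more detail (owners, columns, access policies) than this.

```python
from collections import defaultdict


class LineageLog:
    """Toy lineage registry: remembers which inputs produced each data set."""

    def __init__(self):
        self._parents = defaultdict(set)

    def record(self, output, inputs):
        """Note that `output` was derived from the given input data sets."""
        self._parents[output].update(inputs)

    def upstream(self, dataset):
        """Return every data set that `dataset` depends on, directly or not."""
        seen = set()
        stack = [dataset]
        while stack:
            node = stack.pop()
            for parent in self._parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical pipeline: two raw tables feed a cleaned table, which feeds a report.
lineage = LineageLog()
lineage.record("orders_clean", ["raw_orders", "raw_customers"])
lineage.record("sales_report", ["orders_clean"])
sources = lineage.upstream("sales_report")
```

If each pipeline step calls something like `record` as it runs, questions such as "which sources feed this report?" can be answered without reading the pipeline code.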
Performance Optimization (and Data Models)
Performance optimization and data models are tougher areas.
Anyone can build a slow-performing system; the challenge is to build data pipelines that are both scalable and efficient. Understanding how to optimize the performance of an individual data pipeline and of the overall system is a higher-level data engineering skill.
For example, big data platforms continue to be challenging with regard to query performance and have added complexity to the modern data engineer’s job.
To optimize the performance of queries and the creation of reports and interactive dashboards, the data engineering group needs to know how to de-normalize, partition, and index data models, and needs to understand the tools and concepts behind in-memory models and OLAP cubes.
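As one small example of this kind of tuning, the sketch below adds an index to a table and asks the engine for its query plan before and after. It uses Python's built-in `sqlite3` module purely as a stand-in, so the table, index name, and helper functions are assumptions, and the plan text will look different on other engines.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def build_sales_table():
    """Create a small in-memory sales table with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, sale_date TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("east", "2024-01-01", 10.0),
            ("west", "2024-01-01", 20.0),
            ("east", "2024-01-02", 5.0),
        ],
    )
    return conn


QUERY = "SELECT SUM(amount) FROM sales WHERE region = ?"

conn = build_sales_table()
plan_before = query_plan(conn, QUERY, ("east",))  # no index yet: full scan
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = query_plan(conn, QUERY, ("east",))  # plan now mentions the index
total_east = conn.execute(QUERY, ("east",)).fetchone()[0]
```

Comparing `plan_before` and `plan_after` shows whether the engine actually uses the index, which is worth checking rather than assuming. Partitioning and de-normalization follow the same habit: change the model, then verify the plan and the timings.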
Production Orchestration
It is one task to build a data pipeline that can run in an experimental sandbox.
It is a second to get that pipeline to perform well, as mentioned in the previous section.
It is another skill entirely to build a system that allows you to rapidly promote data pipelines from prototype to production, monitor the health and performance of those pipelines and ensure fault tolerance of the entire operational environment. Like performance optimization, I would argue that this is also a higher-level skill that you tend to see in much more senior engineers.
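A tiny piece of that operational picture, retrying a failed step and recording what happened, might be sketched as follows. The `run_with_retries` helper and the record it returns are hypothetical; a production orchestrator would add scheduling, dependencies, and alerting on top of something like this.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Run a pipeline step, retrying on failure, and return a run record."""
    errors = []
    started = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            return {
                "status": "success",
                "attempts": attempt,
                "result": result,
                "errors": errors,
                "seconds": time.monotonic() - started,
            }
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(repr(exc))
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return {
        "status": "failed",
        "attempts": max_attempts,
        "result": None,
        "errors": errors,
        "seconds": time.monotonic() - started,
    }


# Hypothetical flaky step: fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient source error")
    return "loaded"


run = run_with_retries(flaky_load, max_attempts=3, sleep=lambda seconds: None)
```

The returned record is the kind of thing a monitoring dashboard could collect to track pipeline health, such as how often steps need retries and how long they take.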
Automation to Improve Data Engineer Productivity
Many of the areas and tasks covered in this post can be automated.
But even for those that are automated, data engineering professionals need a good fundamental understanding of how the basic systems work. If they don’t know what’s happening under the covers, they can end up misusing the automation systems.
That said, those in the data engineering profession have a tough job. To do it well, you need to develop a lot of skills across a wide variety of disciplines.