I recently wrote a blog post about how “Data Engineers are in Greater Demand than Data Scientists” and got some great comments and questions on my LinkedIn version of this post. One of the questions came in from Amir Bahmanyari who wrote:
So we know by now what a DS’s (Data Scientists) skills are. Besides being ETL experts, which have been around for decades now, what are the MODERN skills a DE (Data Engineer) is expected to have?
Great question Amir! First, let me start by defining “Data Engineering” and then expand from there. My definition is:
Data Engineering comprises all engineering and operational tasks required to make data available for analytics. This includes but is not limited to: Data Ingestion, Data Synchronization, Data Transformation, Data Models, Data Governance, Performance Optimization, and Production Orchestration.
While I don’t think that data engineering is exclusively about big data, you may notice that most data engineering job descriptions are attached to positions dealing with more modern technologies, and less often to traditional ETL tools.
Here’s a more involved breakdown of the specific abilities modern data engineers need to possess.
Data ingestion involves getting data out of source systems and ingesting it into a data lake. A data engineer would need to know how to efficiently extract the data from a source, including multiple approaches for both batch and real-time extraction.
Additionally, they need to know about both standard connections like JDBC and high-speed proprietary connections like TPT.
Modern data engineers should also know how to handle incremental data loading, fit extractions within small source-system load windows, and parallelize the loading of data.
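To make the incremental-loading idea concrete, here is a minimal sketch of watermark-based extraction. The table, column names, and in-memory SQLite source are hypothetical stand-ins for a real source system reached over JDBC; the pattern is simply "pull only rows past the last successful load's high-water mark."

```python
import sqlite3

def extract_incremental(conn, table, watermark_col, last_watermark):
    # Pull only rows changed since the last successful load, using a
    # monotonically increasing "watermark" column (e.g. updated_at).
    query = (
        f"SELECT id, name, {watermark_col} FROM {table} "
        f"WHERE {watermark_col} > ? ORDER BY {watermark_col}"
    )
    rows = conn.execute(query, (last_watermark,)).fetchall()
    # The new watermark is the highest value seen in this batch;
    # persist it so the next run picks up where this one left off.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Demo against an in-memory stand-in for a source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 150), (3, "Linus", 200)],
)

batch, watermark = extract_incremental(conn, "customers", "updated_at", 120)
print(len(batch), watermark)  # 2 200
```

The same query shape parallelizes naturally: split the watermark range into sub-ranges and run one extraction per range.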
Some consider data synchronization to be a subtask of data ingestion, but I have listed it separately because it is such a big issue in the big data world: Hadoop and other big data platforms don’t support incremental, in-place updates to data.
Here the data engineer needs to know how to detect changes in source data (change data capture, or CDC) and how to merge and sync that changed data into a big data environment. For a more detailed discussion on this topic, read the incremental data synchronization section in a previous blog post.
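As an illustration of the merge step, here is a toy sketch of applying a CDC changeset to a snapshot. The `(op, key, row)` change format is an assumption made for this example; real CDC feeds have their own record shapes, but the rewrite-the-snapshot pattern is the same one platforms without in-place updates rely on.

```python
def apply_changes(snapshot, changes):
    # Merge a batch of CDC records into an immutable snapshot by
    # producing a new copy -- mirroring how platforms that lack
    # in-place updates rewrite a dataset during synchronization.
    merged = dict(snapshot)
    for op, key, row in changes:
        if op == "D":        # delete: drop the key if present
            merged.pop(key, None)
        else:                # "I" insert or "U" update: last writer wins
            merged[key] = row
    return merged

snapshot = {1: "alice@old.com", 2: "bob@old.com"}
changes = [("U", 1, "alice@new.com"), ("D", 2, None), ("I", 3, "carol@new.com")]
merged = apply_changes(snapshot, changes)
print(merged)  # {1: 'alice@new.com', 3: 'carol@new.com'}
```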
This is the “T” in ETL and is focused on integrating and transforming data for a specific use case.
The major skill set here is knowledge of SQL. As it turns out, not much has changed in the type of data transformations people are doing now versus in a purely relational environment. Even big data environments often expose SQL-like interfaces, or support graphical or spreadsheet-like tools for doing the transformation.
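A typical “T” step looks much like it always has: join source tables and aggregate into an analytics-friendly shape. The tables and data below are invented for illustration, run against SQLite only so the SQL is executable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
CREATE TABLE customers (customer_id INTEGER, region TEXT);
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
""")

-- = None  # (placeholder removed)
# Classic transformation: join and aggregate raw records into a
# summary table shaped for reporting.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o JOIN customers c USING (customer_id)
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(rows)  # [('APAC', 7.5), ('EMEA', 25.0)]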
Data engineers are not responsible for the governance of the data per se. But they do have to ensure that the systems necessary for capabilities such as data access control and data lineage are put in place and support good data governance.
When a data engineer implements a set of tools for data ingestion, synchronization, transformation, and modeling, they need to apply data governance concepts so that the tooling and platform also support the need for good governance.
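One small example of what “supporting governance from the tooling side” can mean: recording lineage every time a pipeline step runs. The step names, dataset labels, and wrapper below are all hypothetical; real platforms capture far richer metadata, but the hook pattern is the point.

```python
lineage_log = []

def with_lineage(step_name, inputs, outputs, fn):
    # Run a pipeline step and record what it read and wrote, so
    # downstream governance tooling can answer "where did this
    # dataset come from?" (a toy stand-in for real lineage capture).
    result = fn()
    lineage_log.append({"step": step_name, "inputs": inputs, "outputs": outputs})
    return result

# Hypothetical step: load raw orders into a lake table.
with_lineage("load_orders", ["raw.orders_extract"], ["lake.orders"], lambda: "done")
print(lineage_log[0]["step"], lineage_log[0]["outputs"])
```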
Performance optimization and data models are tougher areas for data engineers.
Anyone can build a slow-performing system; the challenge is to build data pipelines that are both scalable and efficient. Understanding how to optimize the performance of an individual data pipeline, and of the overall system, is a higher-level data engineering skill.
For example, big data platforms continue to be challenging with regard to query performance and have added complexity to the modern data engineer’s job.
In order to optimize the performance of queries and the creation of reports and interactive dashboards, the data engineer needs to know how to denormalize, partition, and index data models, and to understand tools and concepts such as in-memory models and OLAP cubes.
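The intuition behind partitioning, at least, fits in a few lines. This toy sketch (invented event data) groups rows by a key so a query for one key touches only its own partition, which is the same idea behind date-partitioned tables in big data platforms.

```python
from collections import defaultdict

def partition_by(rows, key):
    # Group rows into partitions keyed by a column value, so a query
    # filtered on that column can prune to one partition instead of
    # scanning the full table.
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)

events = [
    {"day": "2019-01-01", "user": "a"},
    {"day": "2019-01-01", "user": "b"},
    {"day": "2019-01-02", "user": "c"},
]
parts = partition_by(events, "day")
# A query for one day now reads one partition, not the whole dataset.
print(len(parts["2019-01-01"]))  # 2
```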
It is one task to build a data pipeline that can run in an experimental sandbox.
It is a second to make that pipeline perform well, as discussed in the previous section.
It is another skill entirely to build a system that allows you to rapidly promote data pipelines from prototype to production, monitor the health and performance of those pipelines and ensure fault tolerance of the entire operational environment. Like performance optimization, I would argue that this is also a higher-level skill that you tend to see in much more senior data engineers.
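The ideas above (promotion to production, health monitoring, fault tolerance) are what orchestration frameworks provide. This is a deliberately minimal sketch, not any particular framework's API: run stages in order, retry each on failure, and record per-stage health.

```python
import time

def run_pipeline(stages, retries=2, delay=0.0):
    # Run (name, callable) stages in order; retry each stage on
    # failure and record its health -- a toy stand-in for what a
    # production orchestration framework does at scale.
    health = {}
    for name, stage in stages:
        for _attempt in range(retries + 1):
            try:
                stage()
                health[name] = "ok"
                break
            except Exception as exc:
                health[name] = f"failed: {exc}"
                time.sleep(delay)
        else:
            # All attempts failed: surface the fault instead of
            # silently continuing with bad data.
            raise RuntimeError(f"stage {name!r} exhausted retries")
    return health

# Demo: a stage that fails once (transient error), then succeeds.
calls = {"n": 0}
def flaky_ingest():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ValueError("transient source error")

status = run_pipeline([("ingest", flaky_ingest), ("transform", lambda: None)])
print(status)  # {'ingest': 'ok', 'transform': 'ok'}
```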
Many of the areas and tasks covered in this post can be automated.
But even for those that are automated, the data engineer needs a good fundamental understanding of how the underlying systems work. If they don’t know what’s happening under the covers, they can end up misusing those automation systems.
That said, to be a data engineer is clearly a tough job. To do it well, you need to develop a lot of skills across a wide variety of disciplines.