When enterprises get started with big data initiatives, the first step is to get data into the big data infrastructure. A variety of data ingestion tools and frameworks are available, and most will appear suitable in a proof of concept.
However, appearances can be deceptive. The transition from a proof of concept or development sandbox to a production DataOps environment is where most of these projects fail. It is one thing to load data into your environment once, over a slow pipe, so that a data scientist can explore it and try to discover some new insight. It is an entirely different problem to do this over and over again while meeting service level agreements (SLAs).
This is because big data ingestion is complex and requires practical solutions to numerous technical challenges.
Many projects start ingesting data into Hadoop using test data sets, and tools like Sqoop or other vendor products do not surface any performance issues at that stage. However, large tables with billions of rows and thousands of columns are typical in enterprise production systems, so a job that completed in minutes in the test environment can take many hours or even days to ingest production volumes.
The impact of this is two-fold:
There are many ways around this situation, including:
a) parallelizing ingestion across tables
b) parallelizing ingestion within tables, by reading separate segments simultaneously
c) incrementally loading the changed data
d) utilizing source-native paths to accelerate data unloading from the sources (e.g., using TPT instead of JDBC to access Teradata systems)
All of these techniques can help, but each adds to the complexity and resource intensity of data ingestion projects. Treating any of them as SMOP (a small matter of programming) underestimates the challenge ahead.
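As a rough illustration of technique (b), here is a minimal Python sketch that splits a table's primary-key space into contiguous ranges and reads the segments concurrently. The table, key, and the `fetch_range` reader are hypothetical stand-ins; a real pipeline would issue range-bounded queries against the source and land the results on the big data platform.

```python
# Sketch: parallelizing ingestion within a table by splitting it into
# key ranges and reading the ranges concurrently (technique b).
# The table, key, and `fetch_range` reader are illustrative only.
from concurrent.futures import ThreadPoolExecutor


def split_into_ranges(min_id: int, max_id: int, num_segments: int):
    """Divide the primary-key space into roughly equal, contiguous ranges."""
    step = (max_id - min_id + 1) // num_segments or 1
    bounds = list(range(min_id, max_id + 1, step)) + [max_id + 1]
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:])]


def fetch_range(lo: int, hi: int):
    """Placeholder reader: in practice this would run something like
    SELECT * FROM orders WHERE order_id >= :lo AND order_id < :hi
    against the source and write the rows to the landing zone."""
    return [{"order_id": i} for i in range(lo, hi)]  # simulated rows


def parallel_ingest(min_id: int, max_id: int, num_segments: int = 8):
    ranges = split_into_ranges(min_id, max_id, num_segments)
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        results = pool.map(lambda r: fetch_range(*r), ranges)
        return sum(len(rows) for rows in results)


if __name__ == "__main__":
    print(parallel_ingest(1, 100_000, num_segments=8), "rows ingested")
```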
One way to minimize the time it takes to load large source data sets is to load the entire data set once and then load only the incremental changes to that source data on subsequent runs. This process, called Change Data Capture (CDC), is well known in the data integration and ETL space. CDC is complex to set up, since the ideal way to capture incremental changes is to monitor the source database logs or to use lightweight queries.
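To make the lightweight-query flavor of CDC concrete, the sketch below keeps a high-watermark on a `last_modified` column and pulls only the rows changed since the previous run. The table, column names, and SQLite usage are illustrative assumptions; log-based CDC would instead tail the source database's transaction log.

```python
# Sketch: query-based ("lightweight query") change data capture using a
# high-watermark column. Table and column names are illustrative.
import sqlite3


def extract_changes(conn: sqlite3.Connection, last_watermark: str):
    """Return only rows modified since the previous run, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, last_modified FROM customers "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, last_modified TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Ann", "2024-01-01"), (2, "Bob", "2024-02-01"), (3, "Cid", "2024-03-01")],
    )
    # The first run does a full load; later runs pass the stored watermark.
    changes, wm = extract_changes(conn, last_watermark="2024-01-15")
    print(changes, wm)  # only the rows newer than the watermark come back
```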
An additional problem is that the incremental changes then need to be merged with the base data on the big data platform. Most big data stores do not support Merge or Update operations, which leaves downstream analytics users with inefficient and complex access to the refreshed data. Failing to merge the incremental data also imposes a performance penalty every time downstream users access it.
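The reconciliation step can be thought of as a batch upsert performed outside the store. The sketch below, with hypothetical `op`/`id`/`row` fields, shows the core logic: replay a batch of change records onto the previous base snapshot to produce a new, merged snapshot. In practice this is usually done as a partition-level rewrite rather than in memory.

```python
# Sketch: merging (compacting) an incremental change batch into the base data
# set when the target store has no native MERGE/UPDATE. Field names are
# illustrative.
def merge_increment(base: dict, changes: list) -> dict:
    """Apply insert/update/delete change records onto a base snapshot keyed by id."""
    merged = dict(base)  # never mutate the previous snapshot in place
    for change in changes:
        if change["op"] == "delete":
            merged.pop(change["id"], None)
        else:  # insert and update both become an upsert
            merged[change["id"]] = change["row"]
    return merged


base = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
changes = [
    {"op": "update", "id": 2, "row": {"name": "Robert"}},
    {"op": "insert", "id": 3, "row": {"name": "Cid"}},
    {"op": "delete", "id": 1},
]
print(merge_increment(base, changes))  # {2: {'name': 'Robert'}, 3: {'name': 'Cid'}}
```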
Finally, the history of incremental data changes typically needs to be tracked and stored in the big data platform, a pattern known as Slowly Changing Dimensions (Type 2). This means that every incremental data change needs to be recorded and made accessible to users who want to audit the data, see changes over time, or do point-in-time analytics.
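A minimal sketch of the Type 2 pattern follows: each change closes out the current version of a record and appends a new one, so point-in-time queries can filter on the effective-date range. The field names and the open-ended high date are illustrative assumptions.

```python
# Sketch: Slowly Changing Dimensions (Type 2) history keeping. Each change
# closes the current version of a record and appends a new version, enabling
# audits, change history, and point-in-time analytics. Field names are
# illustrative.
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # open-ended marker for the current version


def apply_scd2(history: list, key, new_row: dict, change_date: date) -> list:
    """Close the current version of `key` (if any) and append the new version."""
    updated = []
    for version in history:
        if version["key"] == key and version["effective_to"] == HIGH_DATE:
            version = {**version, "effective_to": change_date}  # close it out
        updated.append(version)
    updated.append(
        {"key": key, "row": new_row,
         "effective_from": change_date, "effective_to": HIGH_DATE}
    )
    return updated


history = [{"key": 1, "row": {"city": "Austin"},
            "effective_from": date(2023, 1, 1), "effective_to": HIGH_DATE}]
history = apply_scd2(history, key=1, new_row={"city": "Denver"},
                     change_date=date(2024, 6, 1))
for version in history:
    print(version)  # the Austin row is closed on 2024-06-01; Denver is current
```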
So once again, what starts off as a seemingly simple challenge gets progressively harder as we peel back the layers of the onion and find yet another layer (and then another, and another).
Just as data changes constantly, schemas in the source systems also change. One enterprise I talked to recently sees weekly schema changes (column adds, data type changes, etc.) in its EDWs. Unfortunately, many of these source schema changes happen on weekends, and production support is needed to fix the ingestion pipelines and target data sets on the big data system. Most ingestion pipelines, especially developer-focused ones built on handwritten code, will simply break when source schemas change. The result is yet one more data engineering challenge that keeps big data projects from making it out of the sandbox and into production.
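One common mitigation, sketched below under simplified assumptions (schemas represented as plain name-to-type maps, additive changes only), is to diff the source schema against the target before each run and evolve the target automatically instead of letting the pipeline break.

```python
# Sketch: detecting source schema drift before an ingestion run and evolving
# the target by adding new columns. Only additive changes are handled here;
# type changes and column drops would need their own policies.
def diff_schemas(source: dict, target: dict):
    """Compare name->type maps and report new columns and type changes."""
    added = {c: t for c, t in source.items() if c not in target}
    type_changed = {c: (target[c], t) for c, t in source.items()
                    if c in target and target[c] != t}
    return added, type_changed


def evolve_target(target: dict, added: dict) -> dict:
    """Additive evolution: append new source columns to the target schema."""
    return {**target, **added}


source = {"id": "BIGINT", "name": "VARCHAR", "loyalty_tier": "VARCHAR"}
target = {"id": "BIGINT", "name": "VARCHAR"}
added, changed = diff_schemas(source, target)
print("new columns:", added)            # {'loyalty_tier': 'VARCHAR'}
print("type changes to review:", changed)
target = evolve_target(target, added)
```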
In subsequent blogs, I will describe how an agile data engineering platform like Infoworks handles all of these issues and more. In the meantime, it is extremely important to know these issues exist before you underestimate the effort involved in achieving production success. Fortunately, with automation, new data sets can be rapidly onboarded and then kept synchronized in production with no coding needed… stay tuned to learn how this works.