3 Important Gotchas when Ingesting Data to Hadoop or Cloud (part I)

Written by Ramesh Menon - May 14, 2018 | Category: Data Ingestion

When enterprises are getting started with big data initiatives, the first step is to get data into the big data infrastructure. There are a variety of data ingestion tools and frameworks, and most will appear suitable in a proof of concept.

However, appearances can be extremely deceptive. Making the transition from a proof of concept or development sandbox to a production DataOps environment is where most of these projects fail. It is one thing to pull data into your environment once, over a slow pipe, just so a data scientist can play with it and try to discover some new insight. It is an entirely different problem to do this over and over again while meeting service-level agreements (SLAs).

This is because big data ingestion is complex and requires practical solutions to numerous technical challenges.

3 Data Ingestion Challenges When Moving Your Pipelines Into Production:

1. Large tables take forever to ingest

Many projects start data ingestion to Hadoop using test data sets, and tools like Sqoop or other vendor products do not surface any performance issues at that phase. However, large tables with billions of rows and thousands of columns are typical in enterprise production systems, so a job that once completed in minutes in the test environment can take many hours or even days at production volumes.

The impact of this is two-fold:

  1. Meeting SLAs around data availability is a challenge, and
  2. The source system administrators get very unhappy about the added load on their systems.

There are many ways around this situation including:

a) parallelizing ingestion across tables
b) parallelizing ingestion within tables, by reading separate segments simultaneously
c) incrementally loading the changed data
d) utilizing source native paths to accelerate data unloading from the sources (e.g using TPT instead of JDBC to access Teradata systems)

All of these techniques can help, but they add to the complexity and resource intensity of data ingestion projects. Treating each of them as a SMOP (a small matter of programming) underestimates the challenge ahead.
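As an illustration of technique (b), the sketch below range-partitions a table on a numeric key and reads the segments in parallel, much as Sqoop's `--split-by` option does. Everything here is illustrative: `fetch_segment` is a stand-in for a real range query (e.g. `SELECT ... WHERE id BETWEEN lo AND hi`), not any specific tool's API.

```python
from concurrent.futures import ThreadPoolExecutor

def split_ranges(min_id, max_id, num_splits):
    """Divide [min_id, max_id] into contiguous, non-overlapping key ranges."""
    step = (max_id - min_id + 1) // num_splits
    ranges, lo = [], min_id
    for i in range(num_splits):
        hi = max_id if i == num_splits - 1 else lo + step - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

def fetch_segment(lo, hi):
    """Placeholder for a range-bounded read against the source table."""
    return list(range(lo, hi + 1))  # pretend each key maps to one row

def parallel_ingest(min_id, max_id, num_splits=4):
    """Read all segments concurrently and concatenate the results."""
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        segments = pool.map(lambda r: fetch_segment(*r),
                            split_ranges(min_id, max_id, num_splits))
    rows = []
    for seg in segments:
        rows.extend(seg)
    return rows

rows = parallel_ingest(1, 100, num_splits=4)
```

In practice the split column must be reasonably uniformly distributed, or some segments will be far larger than others and the slowest reader dominates the job.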

2. Handling incremental data synchronization

One way to minimize the time it takes to load large source data sets is to load the entire data set once, and then subsequently load only the incremental changes to that source data. This process, called Change Data Capture (CDC), is well known in the data integration and ETL space. CDC is complex to set up, since the ideal way to capture incremental changes is to monitor source database logs or to use lightweight queries.
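A minimal sketch of the lightweight-query flavor of CDC, assuming the source table carries a last-modified timestamp column (the column name `updated_at` here is hypothetical): each run pulls only rows changed since the stored watermark, then advances the watermark.

```python
def extract_increment(rows, last_watermark, ts_col="updated_at"):
    """Lightweight-query CDC: return rows modified since the last run,
    plus the new high-water mark to persist for the next run."""
    changed = [r for r in rows if r[ts_col] > last_watermark]
    new_watermark = max((r[ts_col] for r in changed), default=last_watermark)
    return changed, new_watermark

# Illustrative source rows; in reality this would be a bounded query
# like SELECT ... WHERE updated_at > :last_watermark.
source_rows = [
    {"id": 1, "updated_at": "2018-05-01"},
    {"id": 2, "updated_at": "2018-05-10"},
    {"id": 3, "updated_at": "2018-05-12"},
]
changed, watermark = extract_increment(source_rows, "2018-05-05")
```

Note the limitation that makes log-based CDC preferable when available: a timestamp query cannot see hard deletes, and it misses rows whose clock-skewed timestamps land behind the watermark.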

An additional problem is that incremental changes must then be merged with the base data on the big data platform. Most big data stores do not support Merge or Update operations, which leaves downstream analytics users with inefficient and complex access paths to the refreshed data. Failing to merge incremental data also imposes a performance penalty every time downstream users read it.
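Where the target store cannot Merge or Update in place, the merge often has to happen in the pipeline itself. A toy upsert over in-memory rows, keyed on a hypothetical `id` column, shows the idea: incremental rows replace matching base rows, and new keys are appended.

```python
def merge_increment(base, increment, key="id"):
    """Upsert-style merge: rows in `increment` win over rows in `base`
    with the same key; unmatched incremental rows are inserts."""
    merged = {r[key]: r for r in base}
    for r in increment:
        merged[r[key]] = r          # update if key exists, insert otherwise
    return sorted(merged.values(), key=lambda r: r[key])

base = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
increment = [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}]
merged = merge_increment(base, increment)
```

On a real platform this step typically becomes a full partition rewrite (base joined with increment, written to a fresh location and swapped in), which is exactly the inefficiency the paragraph above describes.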

Finally, incremental data changes typically need to be tracked and stored in the big data platform – a process known as Slowly Changing Dimensions (Type 2).  This means that every incremental data change needs to be recorded and accessible by users who want to audit the data, see changes over time, or do point-in-time analytics.
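A bare-bones sketch of the SCD Type 2 bookkeeping just described, using hypothetical `valid_from`/`valid_to` columns: when a tracked row changes, its current version is closed out and the new version appended, so every historical state remains queryable for audits or point-in-time analysis.

```python
def apply_scd2(history, change, as_of, key="id"):
    """Record a changed row SCD Type 2 style: close the open version
    (valid_to is None) for this key, then append the new version."""
    for rec in history:
        if rec[key] == change[key] and rec["valid_to"] is None:
            rec["valid_to"] = as_of
    history.append(dict(change, valid_from=as_of, valid_to=None))
    return history

history = []
apply_scd2(history, {"id": 1, "city": "NYC"}, as_of="2018-01-01")
apply_scd2(history, {"id": 1, "city": "SF"}, as_of="2018-06-01")
```

A point-in-time query then filters on `valid_from <= t` and (`valid_to` is null or `valid_to > t`); the current state is simply the rows where `valid_to` is null.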

So once again, what starts off as a seemingly simple challenge gets increasingly difficult as we peel back the layers of the onion and find yet one more layer (and then another, and another, and so on).

3. What about schema changes in the source?

Just as data changes constantly, schemas in the source systems also change. One enterprise that I talked to recently sees weekly schema changes (column adds, data type changes, etc.) in their EDWs. Unfortunately, many of these source schema changes happen on weekends, and production support is needed to fix the ingestion pipelines and target data sets on the big data system. Most ingestion pipelines, especially developer-focused ones built on handwritten code, will simply break when the source schemas change. The result is yet one more data engineering challenge that keeps big data projects from making it out of the sandbox and into production.
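Surviving schema drift usually starts with detecting it: compare the schema the source reports now against the schema the pipeline last ingested. A simple sketch, assuming each schema is a column-name-to-type mapping (the column names and types below are made up for illustration):

```python
def diff_schemas(source, target):
    """Report drift between a source schema and the target's last-known
    schema, each given as a {column_name: type_name} mapping."""
    common = set(source) & set(target)
    return {
        "added":   sorted(set(source) - set(target)),   # new source columns
        "dropped": sorted(set(target) - set(source)),   # columns gone from source
        "retyped": sorted(c for c in common if source[c] != target[c]),
    }

drift = diff_schemas(
    {"id": "int", "name": "varchar", "ts": "timestamp"},  # source, today
    {"id": "int", "name": "text"},                        # target, last run
)
```

An automated pipeline can then react per category: additions are usually safe to propagate, while drops and type changes may need the target table and historical data to be evolved before the next load.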

In subsequent blogs, I will describe how an agile data engineering platform like Infoworks handles all these issues and more.  In the meantime, it is extremely important that you know these issues exist before underestimating the amount of effort involved to achieve production success.  Fortunately, with automation, new data sets can be rapidly onboarded, and then kept synchronized, in production, with no coding needed… stay tuned to learn how this works.

About this Author
Ramesh Menon
Prior to Infoworks, Ramesh led the team at Yarcdata that built the world’s largest shared-memory appliance for real-time data discovery, and one of the industry’s first Spark-optimized platforms. At Informatica, Ramesh was responsible for the go-to-market strategy for Informatica’s MDM and Identity Resolution products. Ramesh has over 20 years of experience building enterprise analytics and data management products.