Data Lake Creation

Automated Data Lake Ingestion - It is Like Magic

Posted by Ramesh Menon

In a previous blog post, I wrote about the 3 top “gotchas” when ingesting data into big data or cloud.  In this blog, I’ll describe how Infoworks automation can speed up the process of ingesting data, keeping it synchronized, in production, with zero coding.  Overcoming these “gotchas” manually is a complex, time-consuming and error-prone process, so automation is essential if you want agility and speed in implementing your big data projects.


Challenge 1.     Large tables take forever to ingest:

Before getting started with ingesting data, it is important to understand what kind of data you are dealing with first.  This is why Infoworks has an automated data source crawling capability which first crawls the source metadata. In this step, we discover the source schema including table sizes, source data patterns and data types.  Understanding what’s in the source with regard to data volumes is of course important, but discovering data patterns and distributions will help with ingestion optimization later.

Now onto the actual ingestion process:  

Challenge 1:  Always parallelize!  

Ingesting data in parallel is essential if you want to meet Service Level Agreements (SLAs) with very large source datasets.  This means don’t bottleneck the ingestion process by funneling data through a single server or edge node. Big data’s scalability and economics are based on distributed processing. To ensure the ingestion process scales with your cluster, data needs to be ingested directly from all your nodes in parallel.

Parallelized ingestion across tables provides the next level of optimization, but automation is key to manage the right number of source connections.  Too many connections to the source system and you run the risk of raising the anger of the source system DBAs. Too few connections will result in longer ingestion times which risks missing ingestion SLAs. Easily managing connection pools, with automatic prioritization between several tables, is the key to finding the right balance. Infoworks automates the process of working within source connection limits, and pooling connections among the various tables that need to be ingested, based on data size or other user-defined priorities.

What do you do if the table you are ingesting is so large that ingesting it all at once will simply take too long within the source system windows? One answer is to parallelize ingestion within the table.  Automatically loading such large tables or files in segments, with segments running within allowed source windows is an easy way to get your bulk loads completed despite source system constraints. Automation is key here first to identify those tables that will require this approach, but is also needed to provide  fault-tolerance, error handling, dependency management and rescheduling.

One last way to speed up large source loads is to use fast database paths for ingestion.  For example, using JDBC to ingest data from Teradata data warehouses is typically 5-10x slower than using Teradata Parallel Transporter (TPT).  Automatic use of fast unload paths for various source systems is essential for making the leap from development to production deployments where high speed data loads are an absolute must.


Challenge2:  Handling incremental data synchronization:

Loading large datasets is hard enough for the initial bulk loads, but when your projects demands continuous availability to fresh source data, incremental synchronization is a must because it is usually unrealistic to do a full load, over and over again without disrupting the source systems ( and once again raising the anger of the DBAs) One of our customers had a multi-billion row source table that took 6 hours to fully load into their big data system.  Business users then requested access to the latest data, no more than 60 mins after it was added or updated in the source system. The full load ingestion that took 6 hours was already fully optimized, so another solution other than reloading the base data was needed. Infoworks’ automated the incremental data synchronization to bring in around one million changed records from the source table in less than 10 min, every hour, thereby meeting the business SLA.

Incremental ingestion can be done in a variety of ways, but the first step is to capture the changes in the source system since the last ingestion (also known as CDC or change data capture).  This can be done by accessing the source system logs (log-based CDC) or through querying the source table using Batch-IDs, Timestamp or other queries. Log-based CDC is less intrusive for the source database, and has numerous other advantages.  However in some databases or source types, access to the source logs is not possible, so query-based techniques are needed. Once again, the details matter: most query-based approaches typically doesn’t detect deletes, and even when deleted data is provided, production deployments cannot easily handle it without much rework.  Infoworks automates all kinds of CDC whether log-based or query-based.

Once the incremental data arrives in the big data or cloud system, there’s an additional challenge in reconciling or merging the incremental data with the base data that has already been ingested earlier. While that merge process is trivial in traditional RDBMS or DW systems, most big data stores are immutable, or have immature and problematic ACID implementations.  To deal with this issue, Infoworks built a continuous merge capability that supports fast ingestion and continuous fresh data availability, while keeping the data optimized for downstream query performance.

In addition, automatically maintaining history of all changes that have propagated through the incremental changes is necessary for point-in-time analytics, audits and slowly-changing-dimension type 2 (SCD2) support. Without automation, implementing time axes and SCD2 manually or with low-level tools makes it even more time-consuming to move into production.  

The bottom line however is that there are many steps and challenges involved in dealing with CDC.  If you think you are going to just code it yourself with a vanilla hadoop implementation, be sure to give yourself plenty of time because it is more complicated than you think to do it properly.


Challenge 3: What about schema changes in the source?:

Data is not the only thing that changes in production, schemas also change.  At one of our customers, schemas in a number of the data warehouses change on a weekly basis.  New columns are added, data is filled sometimes with nulls, and sometimes with history, and data types also change.  Back in the classic data warehousing days, this would have been an enormous problem. Propagating the additional columns into the data warehouse would have required discussion on where the new columns should go, plus a DBA would have to add the column to the data warehouse schema.  All of this could take weeks.

The good news is that in the “big data” world, deciding where to put those new columns is trivial.  The bad news, the works still has to be done. WIthout automation to identify that a source change has happened and handle these real-life production scenarios, ingestion pipelines break and their developers are called for production support,usually on a weekend when these schema changes happen.  By automating the this process , schema evolution is automatic: schema changes are identified when they happen, are propagated to the the data lake where new columns are automatically added, relevant data is backfilled, and notifications to downstream consumers completed, all without a single production ticket being raised.


At the 30,000 foot level, none of this seems very difficult. It is just that the devil is in the details.  Perhaps as you read this article, you might have thought, “ this isn’t that difficult.” And while any single aspect of the challenges presented above may not be difficult, as you layer on additional challenges, the issues slowly but surely become overwhelming.  In fact, one of our customers spend 5 engineers over two years to build out a complete Hadoop data ingestion framework for their on premise  cluster. Then they had to keep those engineers working on keeping that ingestion framework running. When they originally built the framework, there was not commercial alternative.  However, 2 years after completion they realized that they could replace all of that effort with commercial data ingestion tools plus one part time employee. So don’t be deceived, ingesting data into a data lake while maintaining production SLAs is hard. The good news is, automation can help.

About this Author
Ramesh Menon