In a previous blog post, I wrote about the 3 top “gotchas” when ingesting data into big data or cloud. In this blog, I’ll describe how automated data ingestion software can speed up the process of ingesting data, keeping it synchronized, in production, with zero coding.
Overcoming these “gotchas” manually is a complex, time-consuming and error-prone process. Automation is essential if you want agility and speed in implementing your big data projects.
Before getting started with ingesting data, it is important to understand what kind of data you are dealing with first. This is why Infoworks has an automated data source crawling capability which first crawls the source metadata.
In this step, we discover the source schema including table sizes, source data patterns, and data types.
Understanding what’s in the source concerning data volumes is important, but discovering data patterns and distributions will help with ingestion optimization later.
Ingesting data in parallel is essential if you want to meet Service Level Agreements (SLAs) with very large source datasets. This means that you don’t bottleneck the ingestion process by funneling data through a single server or edge node.
Distributed processing is the basis for big data’s scalability and economics. To ensure the ingestion process scales with your cluster, you need to ingest your data directly from all of your nodes in parallel.
Parallelized ingestion across tables provides the next level of optimization, but automation is key to manage the right number of source connections.
Easily managing connection pools, with automatic prioritization between several tables, is the key to finding the right balance. Infoworks automates the process of working within source connection limits and pooling connections among the tables you wish to ingest. This process depends on data size or other user-defined priorities.
What if the table is so large that ingesting it all at once will take too long within the source system windows?
One answer is to parallelize ingestion within the table. Automatically loading such large tables or files in segments, with segments running within allowed source windows is an easy way to complete your bulk loads despite source system constraints.
Automation is key here first to identify those tables that will require this approach. But it is also necessary to provide fault-tolerance, error handling, dependency management, and rescheduling.
One last way to speed up large source loads is to use fast database paths for ingestion.
For example, using JDBC to ingest data from Teradata data warehouses is typically 5-10x slower than using Teradata Parallel Transporter (TPT). Automating fast unload paths for various source systems is essential for leaping from development to production deployments. Particularly where high-speed data loads are an absolute must.
Loading large datasets is hard enough for the initial bulk loads. But when your projects demand continuous availability to fresh source data, incremental synchronization is a must.
This is because it is usually unrealistic to do a full load repeatedly without disrupting the source systems. (In turn, raising the anger of the DBAs once again.)
One of our customers had a multi-billion row source table that took six hours to load into their big data system fully. Business users then requested access to the latest data, no more than 60 mins after it was added or updated in the source system.
The full load ingestion that took 6 hours was already fully optimized. So, a different solution other than reloading the base data became obligatory.
Infoworks automated the incremental data synchronization, bringing in around one million changed records from the source table at a time. This all occurred in less than 10 minutes every hour, thus meeting the business SLA.
You can complete incremental ingestion in a variety of ways. The first step is always to capture the changes in the source system since the last ingestion. This process is also known as change data capture (CDC).
One way to complete CDC can be by accessing the source system logs (log-based CDC). Or you can query the source table using Batch-IDs, Timestamp or other queries. Log-based CDC is less intrusive for the source database and has numerous other advantages.
However, in some databases or source types, access to the source logs is not possible, so query-based techniques are needed.
Once again, the details matter. Most query-based approaches typically don’t detect deletes. And even when deleted data is available, production deployments cannot easily handle it without much rework.
Infoworks automates all kinds of CDC whether log-based or query-based.
Once the incremental data arrives in the big data or cloud system, there’s an additional challenge. This involves reconciling or merging the incremental data with the base data from earlier ingestion.
While that merge process is trivial in traditional RDBMS or DW systems, most big data stores are immutable or have immature and problematic ACID implementations.
To deal with this issue, Infoworks built a continuous merge capability that supports fast ingestion and continuous fresh data availability. All while keeping the data optimized for downstream query performance.
Also, automatically maintaining the history of all changes that have propagated through the incremental changes is necessary for point-in-time analytics, audits and slowly-changing-dimension type 2 (SCD2) support. Without automation, implementing time axes and SCD2 manually or with low-level tools makes it even more time-consuming to move into production.
The bottom line is that there are many steps and challenges involved in dealing with the CDC. If you think you are going to just code it yourself with a vanilla Hadoop implementation, be sure to give yourself plenty of time.
It is far more complicated than you think to do it properly.
Data is not the only thing that changes in production; schemas also change.
For one of our customers, schemas in a number of the data warehouses change on a weekly basis. Some of that includes adding new columns, data is filled sometimes with nulls, and sometimes with history, and data types also change.
Back in the classic data warehousing days, this would have been an enormous problem. Propagating the additional columns into the data warehouse would have required discussion on where the new columns should go. Additionally, a DBA would have to add the column to the data warehouse schema.
All of this could take weeks.
The good news is that in the “big data” world, deciding where to put those new columns is trivial. The bad news: the work is still ahead.
Without automation to identify that a source change has happened, ingestion pipelines break, thus calling in their developers for production support. In many cases, this real-life scenario plays out over the weekend when these schema changes happen.
By automating this process, schema evolution is automatic:
All of this occurs without raising a single production ticket.
At the 30,000 foot level, none of this seems very difficult. It just means that the devil is in the details.
Perhaps as you read this article, you might have thought, “this isn’t that difficult.” And while any single aspect of the challenges presented above may not be difficult, as you layer on additional challenges, the issues slowly but surely become overwhelming.
One of our customers spends five engineers over two years to build out a complete Hadoop data ingestion framework for their on-premise cluster. Then they had to keep those engineers working on keeping that ingestion framework running.
When they originally built the framework, there was not a commercial alternative. However, two years after completion they came to the realization that they could replace all of that effort with commercial data ingestion tools plus one part-time employee.
So, don’t be deceived; ingesting data into a data lake while maintaining production SLAs is hard. The good news is that automation can help.