The Top 5 Data Preparation Challenges to Get Big Data Pipelines to Run in Production


Written by Ramesh Menon - February 12, 2019 | Category: Data Engineering


Many organizations have been looking to big data to drive game-changing business insights and operational agility. Unfortunately, big data often ends up being so complicated and costly to configure, deploy, and manage that most data projects never make it into production.

The agility businesses require today means being able to add new big data use cases constantly. But so many of the available resources are consumed by the sheer ongoing maintenance of the pipelines already in place. So what can organizations do to overcome the problem?


Common Data Preparation Questions

Following the ingestion of data into a data lake, data engineers need to transform this data in preparation for downstream use by business analysts and data scientists. Challenges in data preparation tend to be a collection of problems that accumulate over time into ongoing issues. Some of these include:

  • How do you develop pipelines for incrementally loading data?
  • How do you debug transformation logic in a highly distributed environment?
  • How do you optimize the running of pipelines and ensure reliability and availability?
  • How do you deal with endless changes in the underlying technology stack?
  • How does the system handle the propagation of upstream changes?
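To make the first of these questions concrete, a common pattern for incremental loads is to track a high-water mark (typically a modification timestamp) and only pull rows newer than the last successful run. Here is a minimal pure-Python sketch of the idea; all names and the toy source table are hypothetical, not any particular product's implementation:

```python
from datetime import datetime

# Toy source table: each row carries an updated_at timestamp.
SOURCE = [
    {"id": 1, "value": "a", "updated_at": datetime(2019, 1, 1)},
    {"id": 2, "value": "b", "updated_at": datetime(2019, 1, 5)},
    {"id": 3, "value": "c", "updated_at": datetime(2019, 1, 9)},
]

def incremental_load(source, watermark):
    """Return rows newer than the stored watermark, plus the new watermark."""
    fresh = [row for row in source if row["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

# First run: everything after an initial epoch-like watermark is loaded.
rows, wm = incremental_load(SOURCE, datetime(2018, 12, 31))

# Later run: only rows written after the stored watermark are picked up.
SOURCE.append({"id": 4, "value": "d", "updated_at": datetime(2019, 1, 12)})
delta, wm = incremental_load(SOURCE, wm)
```

Persisting the watermark between runs (and deciding what "newer" means when source clocks drift) is where much of the real engineering effort goes.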

Once data has been prepared, optimizing it for use by consuming applications and users is also complex.

For example, accessing data located in a big data repository with data visualization tools like Tableau or Qlik is problematic. These tools simply can't handle data at that volume. Nor are data sources like Hive designed to deliver sub-second response times for complex queries.

The big data industry has tried to address this issue by creating environments that support in-memory data models as well as the creation of OLAP cubes to provide the fast response times enterprises need from their interactive dashboards. But new questions emerge:

  • How do you tune the data models to be highly performant for the appropriate use case?
  • How does the user know when to query a cube versus an in-memory model?
  • How do you scale up to support the need for terabyte size OLAP cubes?
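The core idea behind an OLAP cube, stripped of any specific engine, is pre-aggregation: totals for every combination of dimension values are computed once at build time, so a dashboard query becomes a constant-time lookup instead of a scan. A toy sketch (the fact table and names are hypothetical):

```python
from collections import defaultdict

# Toy fact table: (region, product, revenue).
FACTS = [
    ("EMEA", "widget", 100),
    ("EMEA", "gadget", 50),
    ("APAC", "widget", 70),
]

def build_cube(facts):
    """Pre-aggregate revenue for every subset of the two dimensions.

    None in a key position means "all values" for that dimension.
    """
    cube = defaultdict(int)
    for region, product, revenue in facts:
        for key in [(region, product), (region, None), (None, product), (None, None)]:
            cube[key] += revenue
    return dict(cube)

CUBE = build_cube(FACTS)

# A dashboard query is now a dictionary lookup:
emea_total = CUBE[("EMEA", None)]     # total EMEA revenue
grand_total = CUBE[(None, None)]      # total revenue across all dimensions
```

The scaling question in the list above is exactly the downside of this approach: the number of pre-aggregated keys grows combinatorially with the number of dimensions, which is why terabyte-scale cubes are hard.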

What is clear: it’s essential to enter into this space with your eyes wide open.

Production-Ready Data Pipeline Challenges

Here are our top five challenges to be aware of when developing production-ready data pipelines for a big data world. You’ll also find several links to solutions (at the bottom of this article) that can alleviate these issues through the power of automated data engineering:

1. Developing the data pipeline itself is challenging

This was true when comparing hand coding to old-fashioned ETL, and it's even more so now. Development of data pipelines on a distributed computing framework is an order of magnitude more complicated than writing transformation logic in non-distributed, single-server environments.

The problem has been getting worse as the world moves to Spark, which has become the most common data transformation technology in big data and cloud environments today. Unfortunately, development in Spark usually requires a very knowledgeable developer who can write low-level code and who is familiar with optimizing distributed computing frameworks.

Additionally, data pipelines in the modern data world have to deal with batch and incremental loads of data plus streaming, all intermingled. This means building expertise across a variety of different technologies, creating new headaches for data engineering specialists who must now work in a distributed environment.
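One way teams contain the batch/streaming split is to write the transformation logic once, as a function over individual records, and then apply it in both delivery modes. A minimal sketch of that separation, with hypothetical names (in Spark this unification is what Structured Streaming aims for):

```python
def clean(record):
    """Transformation logic written once, independent of delivery mode."""
    return {"id": record["id"], "amount": round(record["amount"], 2)}

def run_batch(records):
    """Apply the transform to a complete, bounded dataset."""
    return [clean(r) for r in records]

def run_stream(record_iter):
    """Apply the same transform lazily to an unbounded record stream."""
    # In a real system this would be a micro-batch or event loop.
    for record in record_iter:
        yield clean(record)

batch_out = run_batch([{"id": 1, "amount": 3.14159}])
stream_out = list(run_stream(iter([{"id": 2, "amount": 2.71828}])))
```

The hard part in practice is not the transform itself but the surrounding semantics (late data, ordering, exactly-once delivery), which differ between the two modes.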

2. Debugging your transformation logic

Conceptually, this problem is the same as it was back in the old ETL days. 

A data pipeline consists of multiple steps where data from different sources is combined, normalized, and cleansed as it progresses through the pipeline. Even in the ETL days, graphical development tools allowed users to create visual pipelines and see the results of their actions every step of the way. 

Displaying the visualization of the data input and output in real time while developing your data pipelines enables the developer to work using an agile methodology, making it possible to debug their visual code as they progress.

However, in the big data world, there is an additional technical challenge. Because data is stored in a distributed manner, performing tasks like pivots and aggregations requires smart sampling to support debugging of pipelines when datasets are spread across multiple nodes in your cluster. The bottom line: if you accept that visual pipeline development was faster back in the ETL days (and there is a lot of support for that point), then it is even more valid today.
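The sampling idea above can be sketched simply: pull a small, reproducible sample from each partition so the developer can preview an intermediate step's output without collecting the whole distributed dataset. The partition layout and names here are hypothetical:

```python
import random

def sample_per_partition(partitions, n=2, seed=42):
    """Pull a small, seeded sample from each partition for debug preview.

    Sampling per partition (rather than from one node) keeps the preview
    representative of data spread across the cluster.
    """
    rng = random.Random(seed)  # fixed seed so reruns show the same rows
    preview = []
    for part in partitions:
        k = min(n, len(part))
        preview.extend(rng.sample(part, k))
    return preview

# Data spread across three "nodes" of a toy cluster.
partitions = [list(range(0, 10)), list(range(10, 20)), list(range(20, 30))]
preview = sample_per_partition(partitions)
```

Real engines have to do this without shuffling data, which is why naive `LIMIT`-style previews often show rows from only one partition.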

3. Creating a repeatable pipeline

Writing a pipeline that will run once for ad hoc queries is much easier than writing a pipeline that will run in production. In a production environment, you have to get that same pipeline to run repeatedly. This requires dealing with error handling, process and performance monitoring, and process availability to name just a few issues you will encounter.
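A small taste of what "run repeatedly" adds on top of "run once": even a minimal production wrapper needs retries on transient failures and a record of each attempt for monitoring. A hedged sketch, with all names hypothetical:

```python
import time

def run_with_retries(step, max_attempts=3, backoff_s=0.0):
    """Run one pipeline step, retrying on failure and logging each attempt."""
    attempts = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = step()
            attempts.append(("ok", attempt))
            return result, attempts
        except Exception:
            attempts.append(("error", attempt))
            if attempt == max_attempts:
                raise  # exhausted retries: surface the failure to the scheduler
            time.sleep(backoff_s)

# A flaky step that fails once before succeeding.
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "loaded 1000 rows"

result, log = run_with_retries(flaky_step)
```

Multiply this by alerting, SLA tracking, backfills, and schema drift, and the gap between an ad hoc script and a production pipeline becomes clear.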

In addition, dealing with constantly changing source data requires writing two distinct data pipelines: one for the initial data load and one for the incremental loading of data. This takes additional time and often results in errors, since it requires writing two different but similar data pipeline processes.

The problem is made worse by the fact that big data file systems are immutable and have no built-in way to handle incremental changes to a row of data in a table. So data pipelines become even more complex to manage.
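Because the underlying files are immutable, an "update" is typically implemented as a merge-and-rewrite: read the current snapshot, overlay the changed rows keyed by primary key, and write a new snapshot (this is the essence of what table formats like Delta Lake automate with MERGE). A toy sketch with hypothetical data:

```python
def merge_snapshot(current, changes, key="id"):
    """Build a new snapshot by overlaying changed rows on an immutable one."""
    merged = {row[key]: row for row in current}   # existing snapshot, keyed
    for row in changes:
        merged[row[key]] = row                    # insert or replace by key
    return sorted(merged.values(), key=lambda r: r[key])

snapshot_v1 = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}]

# The old snapshot is never modified; a new version is written instead.
snapshot_v2 = merge_snapshot(snapshot_v1, changes)
```

At scale the expensive part is limiting the rewrite to the affected partitions rather than the whole table.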

4. Pipeline performance tuning and optimization

Writing a pipeline that runs slowly is easy. Tuning for performance to meet your SLAs is a different beast altogether. 

The challenge here is exacerbated by the fact that the underlying big data fabrics don't all perform similarly, because of varying versions and optimizations. So tuning Spark to run on Cloudera on-premises will be different than tuning Spark on Databricks on Azure.

In addition, depending on the underlying compute and storage environment you are running (HDFS, Hive, Azure ADLS, Azure Blob Storage, Google Cloud Storage, AWS S3, etc.) you may have to make additional adjustments. And don’t forget that optimizing and debugging Spark code requires scarce specialist skills for ongoing pipeline maintenance.

5. Ensuring high-speed query performance


Market-leading business intelligence and data visualization tools can’t handle the large data volumes typically associated with “big data”. At the same time, data sources like Hive and NoSQL are not able to deliver sub-second response times for complex queries. So, while merely pointing a visualization tool like Tableau at a Hive table may “work” in the strictest sense, it will deliver very unsatisfactory end-user performance.

Moreover, other approaches require yet another kind of big data expertise that can be hard to find. Some of these approaches include creating scalable OLAP cubes and in-memory models that do work well with common data visualization tools. And depending on the number of end users, even more scalability issues can arise.

Other use cases or applications may require the prepared data to be made available in an external system (such as an EDW or cloud store), although many use cases require accessing the data prepared via the pipelines directly on the big data platform. The bottom line is that you will need one of these approaches if you expect good end-user query performance.

Automated Data Pipeline & Data Preparation Solutions

Now that you’re aware of the challenges, here are some Infoworks solutions that can help you achieve the agility you need.

  • For developing and debugging data pipelines and pipeline tuning optimization, you can read here how Infoworks automates this process via a no-code data engineering development environment. (Surprise! You don't need to be a Spark programmer to develop and debug complex data pipelines!)
  • For creating pipelines that run reliably and repeatedly, you can read here about how Infoworks provides DataOps tool capabilities that automatically manage data pipelines built in our no-code development environment.
  • To learn about how to ensure high-speed query performance for end users, you can read about how Infoworks helps data engineers develop OLAP cubes and in-memory models that scale without requiring performance tuning experts.

About this Author
Ramesh Menon
Prior to Infoworks, Ramesh led the team at Yarcdata that built the world’s largest shared-memory appliance for real-time data discovery, and one of the industry’s first Spark-optimized platforms. At Informatica, Ramesh was responsible for the go-to-market strategy for Informatica’s MDM and Identity Resolution products. Ramesh has over 20 years of experience building enterprise analytics and data management products.