
The Top 5 Data Preparation Challenges to Get Big Data Pipelines to Run in Production

Posted by Ramesh Menon

Many organizations have looked to big data to drive game-changing business insights and operational agility, but big data has turned out to be so complex and costly to configure, deploy, and manage that most data projects never make it into production. The agility businesses require today means being able to constantly add new analytics use cases, yet much of the available engineering capacity is consumed by the sheer ongoing maintenance of the pipelines that were already built. So what can organizations do to overcome the problem?

After data is ingested into a data lake, data engineers need to transform it in preparation for downstream use by business analysts and data scientists. The challenges in data preparation tend to be a collection of problems that accumulate over time into an ongoing maintenance and management burden. Some of these include:

  • How do you create pipelines for incrementally loading data?
  • How do you debug transformation logic in a highly distributed environment?
  • How do you optimize the running of pipelines and ensure reliability and availability?
  • How do you deal with endless changes in the underlying technology stack?
  • How does the system handle the propagation of upstream changes?

Once the data has been prepared, optimizing it for use by consuming applications and users is also complex. For example, accessing data located in a big data repository with data visualization tools like Tableau or Qlik is problematic. These tools simply aren’t designed to deal with that volume of data, nor are data sources like Hive designed to give sub-second response times to complex queries. The big data industry has tried to address this issue by creating environments that support in-memory data models, as well as the creation of OLAP cubes, to provide the fast response times enterprises need from their interactive dashboards. But new questions emerge:

  • How do you tune the data models to be highly performant for the appropriate use case?
  • How does the user know when to query a cube versus an in-memory model?
  • How do you scale up to support the need for terabyte-sized OLAP cubes?

What is clear: it’s important to enter into this space with your eyes wide open.

To help with this, here are our top five challenges to be aware of when developing production-ready data pipelines for a big data world, along with (at the bottom of this article) links to solutions that can alleviate these issues through the power of automated data engineering:

 

Production-Ready Data Pipeline Challenges

  • Developing the data pipeline itself is challenging

This was true when comparing hand coding to old-fashioned ETL, and it’s even more so now. Developing data pipelines on a distributed computing framework is an order of magnitude more complicated than writing transformation logic in non-distributed, single-server environments. And the problem has been getting worse as the world moves to Spark, which has become the most common data transformation technology used in big data and the cloud today. Unfortunately, development in Spark usually requires a very knowledgeable developer who can write low-level code and is familiar with optimizing distributed computing frameworks.
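To make that concrete, here is a minimal PySpark sketch (the paths, table names, and columns are hypothetical) of the kind of distributed-computing detail a developer has to think about even for a simple enrichment join: which side of the join is small enough to broadcast, how much shuffle parallelism the cluster can handle, and how the output should be partitioned.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_enrichment").getOrCreate()

# Hypothetical inputs: a large fact table and a much smaller dimension table.
orders = spark.read.parquet("/data/lake/orders")        # very large
customers = spark.read.parquet("/data/lake/customers")  # comparatively small

# Even a simple enrichment forces distributed-systems decisions:
# broadcasting the small side avoids a full shuffle of the large table.
enriched = (
    orders
    .join(F.broadcast(customers), on="customer_id", how="left")
    .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
)

# Shuffle parallelism is a cluster-specific tuning decision, not business logic.
spark.conf.set("spark.sql.shuffle.partitions", 400)

enriched.write.mode("overwrite").partitionBy("order_date").parquet(
    "/data/lake/enriched_orders"
)
```

None of these decisions exist in a single-server transformation script, which is exactly why Spark development demands that extra layer of expertise.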

Additionally, data pipelines in the modern data world have to deal with batch and incremental loads of data plus streaming, all intermingled. This means building expertise across a variety of different technologies, which creates new headaches for data engineering specialists who must now work in a distributed environment.
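As a rough illustration, assuming a Kafka topic and file paths that are purely hypothetical, the sketch below wires the same transformation function once for a batch load and once for a streaming source; keeping both paths consistent is exactly the kind of extra work this intermingling creates.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream_prep").getOrCreate()

def clean_events(df: DataFrame) -> DataFrame:
    """Shared transformation logic, kept independent of how the data arrives."""
    return (
        df.filter(F.col("event_type").isNotNull())
          .withColumn("event_date", F.to_date("event_ts"))
    )

# Batch: scheduled reads of files already landed in the lake.
batch_events = clean_events(spark.read.json("/data/lake/raw/events/"))
batch_events.write.mode("append").parquet("/data/lake/prepared/events/")

# Streaming: the same logic applied to a continuous source (Kafka here is
# illustrative), with checkpointing added so the query can recover.
stream_raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "event_type STRING, event_ts TIMESTAMP").alias("e"))
    .select("e.*")
)

(clean_events(stream_raw)
    .writeStream.format("parquet")
    .option("path", "/data/lake/prepared/events/")
    .option("checkpointLocation", "/data/lake/_checkpoints/events/")
    .start())
```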

 

  • Debugging your transformation logic

Conceptually, this problem is exactly the same as it was back in the old ETL days. A data pipeline consists of multiple steps in which data from different sources is combined, normalized, and cleansed as it progresses through the pipeline. Even in the ETL days, graphical development tools allowed users to create visual pipelines and see the results of their actions every step of the way. Displaying the input and output data in real time while developing a pipeline lets developers work using an agile methodology, debugging their visual code as they progress.

However, in the big data world there is an additional technical challenge. Because data is stored in a distributed manner, tasks like pivots and aggregations require smart sampling to support debugging of pipelines when datasets are spread across multiple nodes in your cluster. The bottom line is that if you accept that visual pipeline development was faster back in the ETL days (and there is a lot of support for that point), it is even more true today.
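In code terms, a common (if much cruder) stand-in for that kind of smart sampling is to pull a small, representative slice of each intermediate step back to the driver for inspection. The snippet below is a minimal sketch of the idea; the dataset and columns are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline_debug").getOrCreate()

events = spark.read.parquet("/data/lake/prepared/events/")  # hypothetical path

# Intermediate transformation we want to verify before wiring the next step.
daily = events.groupBy("event_date", "event_type").agg(F.count("*").alias("n"))

# Pull a small sample to the driver to eyeball intermediate results without
# scanning the full distributed dataset. sample() draws from all partitions,
# which gives a more representative picture than limit() alone.
daily.sample(fraction=0.01, seed=42).orderBy("event_date").show(20, truncate=False)

# Spot-check a specific slice when a downstream number looks wrong.
daily.filter(F.col("event_type") == "purchase").describe("n").show()
```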

 

  • Creating a repeatable pipeline

Writing a pipeline that will run once for ad hoc queries is much easier than writing a pipeline that will run in production. In a production environment you have to get that same pipeline to run repeatedly, which requires dealing with error handling, process monitoring, performance monitoring, and process availability, to name just a few of the issues you will encounter.
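The sketch below is a deliberately minimal stand-in for what an orchestrator such as Airflow or Oozie would normally provide; it only illustrates the kind of retry and logging plumbing that "run repeatedly" implies on top of the transformation logic itself (the step names and functions are hypothetical).

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_step(name, fn, retries=3, backoff_seconds=60):
    """Run one pipeline step with basic retries and structured logging.

    A real production pipeline would usually delegate this to an orchestrator;
    this only shows the plumbing repeatable runs require.
    """
    for attempt in range(1, retries + 1):
        start = time.time()
        try:
            fn()
            log.info("step=%s status=ok attempt=%d seconds=%.1f",
                     name, attempt, time.time() - start)
            return
        except Exception:
            log.exception("step=%s status=failed attempt=%d", name, attempt)
            if attempt == retries:
                raise
            time.sleep(backoff_seconds * attempt)

# Usage (load_orders_fn is a placeholder for a real pipeline step):
# run_step("load_orders", load_orders_fn)
```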

In addition, dealing with constantly changing source data requires writing two distinct data pipelines: one for the initial data load and one for the incremental loading of data. This takes additional time and often results in errors, since it means writing two different but similar data pipeline processes. The problem is compounded by the fact that big data file systems are immutable and have no native way to apply incremental changes to a row of data in a table, so the pipelines become even more complex to deal with.
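One way to see the duplication is to sketch the two loads side by side. The example below assumes a JDBC source, a Delta Lake table as the target (one of several table formats that layer row-level updates on top of immutable files; the article does not prescribe any particular one), and hypothetical paths and column names.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Requires the delta-spark package and a Delta-enabled SparkSession.
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("orders_load").getOrCreate()

TARGET = "/data/lake/curated/orders"                  # hypothetical path
SOURCE = "jdbc:postgresql://source-db:5432/sales"     # hypothetical source

def initial_load():
    # Full snapshot of the source table on the very first run.
    df = (spark.read.format("jdbc")
          .option("url", SOURCE).option("dbtable", "orders").load())
    df.write.format("delta").mode("overwrite").save(TARGET)

def incremental_load():
    # High-watermark: only pull rows changed since the last successful run.
    last_ts = (spark.read.format("delta").load(TARGET)
               .agg(F.max("updated_at")).collect()[0][0])
    changes = (spark.read.format("jdbc").option("url", SOURCE)
               .option("query",
                       f"SELECT * FROM orders WHERE updated_at > '{last_ts}'")
               .load())

    # MERGE emulates row-level updates on top of immutable storage.
    (DeltaTable.forPath(spark, TARGET).alias("t")
        .merge(changes.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```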

 

  • Pipeline performance tuning and optimization

Writing a pipeline that runs slowly is easy. Tuning it to meet your SLAs is a different beast altogether. The challenge is exacerbated by the fact that the underlying big data fabrics don’t all perform the same, because of varying versions and optimizations. So tuning Spark to run on Cloudera on-premises will be different from tuning Spark for Databricks on Azure. In addition, depending on the underlying compute and storage environment you are running (HDFS, Hive, Azure ADLS, Azure Blob Storage, Google Cloud Storage, AWS S3, etc.), you may have to make additional adjustments. And don’t forget that optimizing and debugging Spark code requires scarce specialist skills for the ongoing maintenance of pipelines.
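The configuration keys below are standard Spark settings, but the values are purely illustrative; the point is that each environment (on-premises YARN versus a managed cloud service, HDFS versus object storage) tends to need its own numbers, and someone has to know which knobs matter.

```python
from pyspark.sql import SparkSession

# Standard Spark config keys; the values would need to be chosen per cluster,
# per workload, and revisited as data volumes and platform versions change.
spark = (
    SparkSession.builder.appName("tuned_pipeline")
    # Shuffle parallelism: a function of cluster cores and data volume.
    .config("spark.sql.shuffle.partitions", "800")
    # Adaptive query execution can coalesce shuffle partitions at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    # Broadcast-join threshold: how small a table must be to skip the shuffle.
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))
    # Executor sizing differs between on-premises YARN and cloud services.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```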

 

  • Ensuring high-speed query performance

Market-leading business intelligence and data visualization tools are not designed to handle the large data volumes typically associated with “big data”. At the same time, data sources like Hive and NoSQL are not able to deliver sub-second response times for complex queries. So, while merely pointing a visualization tool like Tableau at a Hive table may “work” in the strictest sense, it will deliver very unsatisfactory end-user performance.

Moreover, other approaches, like creating scalable OLAP cubes and in-memory models that do work well with common data visualization tools, require yet another kind of big data expertise that can be hard to find. And depending on the number of end users, even more scalability issues can arise. While many use cases require accessing the data prepared via the pipelines directly on the big data platform, other use cases or applications may require the prepared data to be made available in an external system (such as an EDW or cloud store). The bottom line is that you will need one of these approaches if you expect good end-user query performance.
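As a simple illustration of one such approach, the sketch below pre-aggregates detail data to the grain a dashboard actually queries and publishes it both back to the lake and out to an external store over JDBC; every path, table, and column name here is hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bi_serving_layer").getOrCreate()

detail = spark.read.parquet("/data/lake/curated/orders")  # hypothetical detail table

# Roll the detail data up to the grain the dashboards query
# (day x region x product category), shrinking billions of rows to a
# size a BI tool or fast serving store can handle interactively.
summary = (
    detail.groupBy("order_date", "region", "product_category")
          .agg(F.sum("order_total").alias("revenue"),
               F.countDistinct("customer_id").alias("customers"),
               F.count("*").alias("orders"))
)

# Publish the aggregate back to the lake for a cube or serving engine to pick up...
summary.write.mode("overwrite").parquet("/data/lake/serving/daily_sales_summary")

# ...or out to an external warehouse (credentials and driver options omitted).
(summary.write.format("jdbc")
    .option("url", "jdbc:postgresql://edw-host:5432/analytics")
    .option("dbtable", "daily_sales_summary")
    .mode("overwrite").save())
```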

 

Automated Data Pipeline & Data Preparation Solutions

Now that you’re aware of the challenges, here are some Infoworks solutions that can help you achieve the agility you need.

About this Author
Ramesh Menon
Prior to Infoworks, Ramesh led the team at Yarcdata that built the world’s largest shared-memory appliance for real-time data discovery, and one of the industry’s first Spark-optimized platforms. At Informatica, Ramesh was responsible for the go-to-market strategy for Informatica’s MDM and Identity Resolution products. Ramesh has over 20 years of experience building enterprise analytics and data management products.