Many organizations have been looking to big data to drive game-changing business insights and operational agility. Unfortunately, big data often turns out to be so complicated and costly to configure, deploy, and manage that most data projects never make it into production.
The agility businesses require today means being able to add new big data use cases constantly. But so many of the available resources are consumed by the sheer ongoing maintenance of the pipelines already in place. So, what can organizations do to overcome the problem?
Following the ingestion of data into a data lake, data engineers need to transform this data in preparation for downstream use by business analysts and data scientists. Challenges in data preparation tend to be a collection of problems that add up over time to create ongoing issues.
Once the data has been prepared, optimizing it for use by consuming applications and users is also complex.
For example, accessing data located in a big data repository with data visualization tools like Tableau or Qlik is problematic. These tools simply can't handle data at that volume, nor are data sources like Hive designed to give sub-second response times to complex queries.
The big data industry has tried to address this issue by creating environments that support in-memory data models, as well as the creation of OLAP cubes, to provide the fast response times enterprises need from their interactive dashboards. But new questions emerge.
What is clear: it’s essential to enter into this space with your eyes wide open.
Here are our top five challenges to be aware of when developing production-ready data pipelines for a big data world. You'll also find several links to solutions (at the bottom of this article) that can alleviate these issues through the power of automated data engineering.
This was true when comparing hand coding to old-fashioned ETL, and it's even more so now. Developing data pipelines on a distributed computing framework is an order of magnitude more complicated than writing transformation logic in a non-distributed, single-server environment.
The problem has been getting worse as the world moves to Spark, which has become the most common data transformation technology in big data and cloud environments today. Unfortunately, development in Spark usually requires a very knowledgeable developer who can write low-level code and who understands how to optimize distributed computing frameworks.
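To make that concrete, here is a minimal PySpark sketch of the kind of decisions a Spark developer is expected to make by hand, such as explicit repartitioning and broadcast hints; the paths, column names, and partition count are purely illustrative.

```python
# Minimal sketch only; paths, columns, and the partition count are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_enrichment").getOrCreate()

orders = spark.read.parquet("/data/raw/orders")        # large fact table
customers = spark.read.parquet("/data/raw/customers")  # small dimension table

# The developer has to know that broadcasting the small side avoids an
# expensive shuffle join, and that the partitioning of the large table
# often needs to be adjusted to the size of the cluster.
enriched = (
    orders
    .repartition(200, "customer_id")                   # tuned per cluster
    .join(F.broadcast(customers), "customer_id", "left")
    .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
)

enriched.write.mode("overwrite").parquet("/data/curated/orders_enriched")
```

None of these choices is conceptually difficult in isolation; the point is that every one of them is left to the developer and has to be revisited as data volumes and cluster shapes change.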
Additionally, data pipelines in the modern data world have to deal with batch and incremental loads of data, plus streaming, all intermingled. This means building expertise across a variety of different technologies, creating new headaches for data engineering specialists who must now work in a distributed environment.
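The hypothetical sketch below shows why: the same simple "keep only purchase events" logic has to be written twice, once as a batch file load and once as a Structured Streaming job, each with its own API and operational model. Paths, the topic name, and the broker address are placeholders, and the Kafka source assumes the spark-sql-kafka connector is available.

```python
# Hypothetical sketch: the same filter expressed as batch and as streaming.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("events_pipeline").getOrCreate()
schema = StructType([
    StructField("event_type", StringType()),
    StructField("user_id", StringType()),
])

# Batch: a periodic file load, a finite job with a clear start and end.
batch = spark.read.schema(schema).json("/data/landing/events/")
batch.filter(F.col("event_type") == "purchase") \
     .write.mode("append").parquet("/data/curated/purchases")

# Streaming: the same filter, but a different API, delivery semantics, and
# operational model (checkpoints, long-running queries to keep alive).
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events")                       # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .filter(F.col("e.event_type") == "purchase")
)
stream.writeStream.format("parquet") \
    .option("path", "/data/curated/purchases_stream") \
    .option("checkpointLocation", "/checkpoints/purchases_stream") \
    .start()
```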
Conceptually, this problem is the same as it was back in the old ETL days.
A data pipeline consists of multiple steps where data from different sources is combined, normalized, and cleansed as it progresses through the pipeline. Even in the ETL days, graphical development tools allowed users to create visual pipelines and see the results of their actions every step of the way.
Visualizing the data inputs and outputs in real time while developing a data pipeline enables the developer to work using an agile methodology, debugging the visual code as they go.
However, in the big data world there is an additional technical challenge. Because data is stored in a distributed manner, performing tasks like pivots and aggregations requires smart sampling to support debugging when datasets are spread across multiple nodes in your cluster. The bottom line: if you accept that visual pipeline development was faster back in the ETL days (and there is a lot of support for that point), then the argument is even stronger today.
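As a rough illustration of the idea (not how any particular product implements it), the sketch below pulls a small, reproducible sample from a distributed dataset and inspects the input and output of each step, which is essentially what a visual development tool surfaces alongside the pipeline.

```python
# Simple sketch: sample intermediate results to debug a distributed pipeline
# step by step. Real "smart sampling" is more sophisticated than a blind
# random sample; paths and columns are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pipeline_debug").getOrCreate()

raw = spark.read.parquet("/data/raw/transactions")

# A small, reproducible sample that is cheap to inspect on the driver.
sample = raw.sample(fraction=0.001, seed=42).cache()

cleansed = sample.dropna(subset=["account_id", "amount"])
aggregated = cleansed.groupBy("account_id").agg(F.sum("amount").alias("total"))

# Eyeball the input and output of every stage.
sample.show(5)
cleansed.show(5)
aggregated.show(5)
```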
Writing a pipeline that will run once for ad hoc queries is much easier than writing a pipeline that will run in production. In a production environment, you have to get that same pipeline to run repeatedly. This requires dealing with error handling, process and performance monitoring, and process availability to name just a few issues you will encounter.
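As a bare-bones, hypothetical sketch of that operational scaffolding (the run_pipeline function is just a placeholder for the actual transformation logic, and the retry and exit-code conventions are assumptions about whatever scheduler invokes the job):

```python
# Hypothetical scaffolding around a pipeline run: logging, retries, and a
# clear failure signal for the scheduler. run_pipeline is a placeholder.
import logging
import sys
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders_pipeline")


def run_pipeline():
    pass  # placeholder for the real transformation logic


def main(max_retries: int = 3, backoff_seconds: int = 60) -> int:
    for attempt in range(1, max_retries + 1):
        start = time.time()
        try:
            run_pipeline()
            log.info("Run succeeded in %.1fs (attempt %d)", time.time() - start, attempt)
            return 0
        except Exception:
            log.exception("Run failed on attempt %d/%d", attempt, max_retries)
            time.sleep(backoff_seconds * attempt)
    return 1  # non-zero exit code so the scheduler can alert and retry


if __name__ == "__main__":
    sys.exit(main())
```

And this sketch still leaves out performance monitoring and process availability entirely.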
In addition, dealing with constantly changing source data requires writing two distinct data pipelines: one for the initial data load and one for the incremental loading of data. This takes additional time and often results in errors, since it means maintaining two different but very similar data pipeline processes.
The problem is made worse by the fact that big data file systems are immutable and have no built-in way to apply incremental changes to a row of data in a table. So data pipelines become even more complex to manage.
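As a hedged sketch of the workaround many teams end up hand-coding (assuming a plain Parquet snapshot on immutable storage; paths, keys, and columns are illustrative), the incremental load has to merge the incoming changes with the existing snapshot, keep the latest version of each row, and rewrite the result.

```python
# Sketch only: upserting into an immutable Parquet snapshot by rewriting it.
# Table formats such as Delta Lake or Iceberg exist largely to hide this dance.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("customers_upsert").getOrCreate()

existing = spark.read.parquet("/data/curated/customers")       # current snapshot
changes = spark.read.parquet("/data/landing/customers_delta")  # today's changes

# Keep only the most recent version of each customer_id.
latest = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
merged = (
    existing.unionByName(changes)
    .withColumn("rn", F.row_number().over(latest))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Write to a new location and switch downstream readers over, since the
# snapshot being read cannot be overwritten in place mid-job.
merged.write.mode("overwrite").parquet("/data/curated/customers_v2")
```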
Writing a pipeline that runs slowly is easy. Tuning for performance to meet your SLAs is a different beast altogether.
The challenge here is exacerbated by the fact that the underlying big data fabrics don't all perform similarly, because of varying versions and optimizations. So tuning Spark to run on Cloudera on-premises will be different from tuning Spark or Databricks on Azure.
In addition, depending on the underlying compute and storage environment you are running (HDFS, Hive, Azure ADLS, Azure Blob Storage, Google Cloud Storage, AWS S3, etc.) you may have to make additional adjustments. And don’t forget that optimizing and debugging Spark code requires scarce specialist skills for ongoing pipeline maintenance.
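For illustration only, here are a few of the Spark settings that typically have to be revisited whenever the same pipeline moves between environments; the values below are placeholders, not recommendations.

```python
# Illustrative only: settings that usually need retuning per environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("tuned_pipeline")
    # Shuffle parallelism: depends on cluster size and data volume.
    .config("spark.sql.shuffle.partitions", "400")
    # Broadcast join threshold: depends on executor memory.
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))
    # Adaptive query execution: availability and defaults vary by Spark version.
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
```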
Market-leading business intelligence and data visualization tools can’t handle the large data volumes typically associated with “big data”. At the same time, data sources like Hive and NoSQL are not able to deliver sub-second response times for complex queries. So, while merely pointing a visualization tool like Tableau at a Hive table may “work” in the strictest sense, it will deliver very unsatisfactory end-user performance.
Moreover, the alternative approaches require yet another kind of big data expertise that can be hard to find: creating scalable OLAP cubes, for example, or in-memory models that do work well with common data visualization tools. And depending on the number of end users, even more scalability issues can arise.
Other use cases or applications may require the prepared data to be made available in an external system (such as an EDW or cloud store), although many use cases access the data prepared by the pipelines directly on the big data platform. The bottom line is that you will need one of these approaches if you expect good end-user query performance.
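As a rough sketch of the pre-aggregation idea (the paths, dimensions, and JDBC target below are illustrative assumptions, and purpose-built OLAP engines do far more), the snippet rolls the detail data up along the dimensions dashboards actually filter on and publishes the much smaller result to a store the BI tools can query quickly.

```python
# Rough sketch of pre-aggregation for BI serving. Paths, columns, and the
# JDBC warehouse are illustrative; the JDBC driver must be on the classpath.
import os
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales_rollup").getOrCreate()

sales = spark.read.parquet("/data/curated/sales")

# Roll up to the grain the dashboards actually query.
summary = (
    sales.groupBy("region", "product_category", F.to_date("sold_at").alias("day"))
    .agg(
        F.sum("amount").alias("revenue"),
        F.countDistinct("order_id").alias("orders"),
    )
)

# Publish the compact summary for Tableau/Qlik to hit instead of the raw detail.
(
    summary.write.format("jdbc")
    .option("url", "jdbc:postgresql://warehouse:5432/analytics")
    .option("dbtable", "sales_daily_summary")
    .option("user", "etl")
    .option("password", os.environ.get("WAREHOUSE_PASSWORD", ""))
    .mode("overwrite")
    .save()
)
```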
Now that you’re aware of the challenges, here are some Infoworks solutions that can help you achieve the agility you need.