In February 2018, we documented the challenges of getting big data projects to production as well as the solutions to those challenges.
As the big data landscape continues to change, we have updated the problems and solutions for getting big data projects to production in 2019.
In a November 2018 paper, Gartner analyst Summit Pai wrote, “In 2017, only 17% of Apache Hadoop deployments were in production.” Our experience is that the same is true for Spark deployments, and the numbers are not improving significantly. When initiating a big data project, many data engineers and data scientists are able to hand-code data pipelines that ingest, transform and prepare data for one or even a few analytics use cases in a sandbox environment.
But prototyping is one thing. Turning that prototype into a data workflow that runs every day without failing, and that recovers gracefully when a job does fail, is a completely different challenge. And even when you do create a production-quality data pipeline, building a repeatable and scalable process that can serve hundreds or thousands of analytics projects is yet another major challenge. While “the cloud” simplifies some of this work, implementing an enterprise-class, agile data architecture is still very difficult even on a cloud big data platform. In fact, because most organizations now operate in a multi-cloud and hybrid-cloud world, the introduction of the cloud is actually making overall orchestration of data and data flows even more difficult.
The reality is that organizations that hand-code to address this issue cannot keep up with the pace of new analytics use cases because they end up spending more time fixing or enhancing their data analytics and machine learning pipelines and less time adding new use cases. Every successive analytics project takes longer and costs more. Organizations can’t justify throwing more people at these big data projects… it never ends. They are looking for a better way.
While there are lots of tools that support data ingest to get data from legacy sources into a data lake whether that lake is on-premise or in the cloud, you still need an expert to make it work. What do you do with really large files? How do you partition the data on the source and then reconstruct it properly?
If you can’t properly parallelize the ingest of data, ingestion tasks that could be done in an hour can take 10 to 20 times longer. The problem is that most people don’t know how to tune this properly.
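To make the parallelization point concrete, here is a minimal sketch of what an ingestion tool does under the hood: split the source into roughly equal partitions and load them concurrently instead of serially. The function names (`partition`, `ingest_chunk`) are illustrative, and `ingest_chunk` is a stand-in for real parse-and-load work.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, num_partitions):
    """Split the record set into roughly equal slices, one per worker."""
    size = max(1, len(records) // num_partitions)
    return [records[i:i + size] for i in range(0, len(records), size)]

def ingest_chunk(chunk):
    """Stand-in for the real work: parse, validate, and land one slice."""
    return len(chunk)  # e.g. rows written

def parallel_ingest(records, num_partitions=4):
    """Ingest all partitions concurrently and report total rows loaded."""
    chunks = partition(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return sum(pool.map(ingest_chunk, chunks))

rows_loaded = parallel_ingest(list(range(100)), num_partitions=4)
```

The hard part a tool automates is not this loop itself but choosing partition boundaries that keep workers evenly loaded, which is exactly the tuning most teams get wrong.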
Don’t hand-code a solution. Many vendors tackle this problem for Hadoop, Spark, and cloud-based file systems, which means you don’t have to re-invent the wheel by writing a lot of code to resolve your data ingest issues. You have a broad choice of vendors to address this challenge, including Infoworks.io. The worst thing you can do is hand-code your way into an ever-deepening maintenance burden. This is true whether your analytics environment is on-premise or in the cloud. While I would prefer you consider Infoworks.io first, the most important concept here is that hand-coding is a poor solution relative to the variety of options available.
According to studies by Gartner Research, hand-coding a simple data ingestion job takes 20% less effort than using an automated data ingestion framework, but maintaining that same code takes 200% more effort. Hand-coding doesn’t embed capabilities like lineage, change management tracking or process monitoring. So the 20% savings up front only happens because less error handling is built in at development time, which results in larger costs when you go to operationalize. And when it comes to more complex data ingestion jobs that require parallelization, hand-coding is no longer faster than using an ingestion tool.
When you evaluate data ingestion vendors, consider how much coding automation their solution provides. Automation is a hedge against hand-coding, something you will want to avoid because, as noted above, it radically inflates support and maintenance costs down the road. Examine the areas where the vendor does not provide automation. Are those areas laborious and complex? Or are they areas your company can easily resolve?
Do they connect to all the data sources that are imperative to your business? Does the vendor provide fast, parallel, native path access to your sources or is it just JDBC?
Be warned: some user interfaces appear shiny and polished on the exterior, but when you click an icon you find yourself buried in grisly code. That’s not why you chose a third-party data ingestion vendor.
Most organizations aren’t moving their entire operations onto a big data environment. They move data there from existing operational systems to perform new kinds of analysis or machine learning. This means that they need to keep loading new data as it arrives.
The problem is that these environments are immutable, file-based systems that don’t support in-place inserts or updates. This means you either have to reload the entire data set again (see point 1 above) or code your way around this classic change-data-capture plus merge-and-sync problem.
Once again, the solution is to automate the process and many of the vendors who automate ingest problems help with this process as well. Note that you need to be able to deal with two challenges here.
The first is change data capture on the source. You have to have a way to identify that new rows or columns have been added to the source system and then move just those changed rows or columns into your data lake.
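A common way to implement source-side change data capture, sketched below under the assumption that the source table carries a last-modified timestamp column (here called `updated_at`, a hypothetical name): keep a watermark from the last successful extract and pull only rows modified since then.

```python
def extract_changes(source_rows, last_watermark):
    """Pull only rows modified since the last successful extract,
    and advance the watermark to the newest change seen."""
    changed = [r for r in source_rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    return changed, new_watermark

source = [
    {"id": 1, "updated_at": 100, "val": "a"},
    {"id": 2, "updated_at": 205, "val": "b"},
    {"id": 3, "updated_at": 210, "val": "c"},
]
changed, wm = extract_changes(source, last_watermark=200)
```

Note this timestamp approach is only one of several CDC strategies (log-based capture and trigger-based capture are others), which is why a tool that lets you pick the approach per source is valuable.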
The second challenge is handling the merge and sync of the new data into the target big data system, which, once again, doesn’t support updates, deletes, or inserts into existing files. That means that whichever ingest vendor you choose, they had better take care of this issue for you as well.
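Conceptually, the merge step works around immutable storage by rewriting the affected data rather than updating it in place: read the existing rows, apply the change set keyed by primary key, and emit a fresh copy. A minimal sketch (the `merge_upsert` name and row shape are illustrative):

```python
def merge_upsert(base_rows, change_rows, key="id"):
    """Apply inserts and updates from a change set to existing data.
    Immutable storage means we emit a fresh copy of the data
    rather than modifying files in place."""
    merged = {r[key]: r for r in base_rows}
    for r in change_rows:
        merged[r[key]] = r  # insert new key, or overwrite existing one
    return sorted(merged.values(), key=lambda r: r[key])

base = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
changes = [{"id": 2, "val": "b2"}, {"id": 3, "val": "c"}]
result = merge_upsert(base, changes)
```

At scale the expensive part is rewriting only the affected partitions rather than the whole data set, which is precisely the logic an automated ingest tool handles for you.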
Note that as of this writing there are some new open source technologies to better support these concepts, but also as of this writing, they are not very mature and not very good.
With that as background, once again, Infoworks addresses the incremental ingestion of data and fully automates it. In fact, the only effort required to configure Infoworks is choosing the approach you want to use to monitor changing data on the source. For this, we give you a simple pull-down menu and a single checkbox to select.
In addition, if there is a column added to a table on the source you are ingesting, we will detect that as well and automatically add that new column to the ingest process and merge it properly into the data lake.
Many organizations have been able to identify the potential for new insights from the data scientist working within their sandbox environment. Once they have identified a new “recipe” for analytics, they need to move from an individual data scientist running this analysis in their sandbox to a production environment that can run every day. Moving from dev to production is too often a complete lift and shift operation that is generally done manually. And while it ran just fine on the dev cluster, now that same data pipeline has to be re-optimized on the production cluster. Worse yet, many organizations are operating in a multi-cloud and hybrid environment. This means that the development of a data pipeline might occur in an on-premise development sandbox but then go into production in the cloud.
The result is that just getting the pipeline to work in production can require re-coding, and the tuning process can also require significant rework to get it to perform efficiently. This is especially true if the dev environment differs in any way from the production environment (as noted above). This is a problem that people expected “the cloud” would resolve. And while moving to the cloud does make it easier to spin clusters up and down to process data, getting those clusters to scale and perform reliably is still a challenge.
The challenge here is that in this case, there isn’t a long list of vendors who actually tackle this problem. There are a lot of “data prep” applications out there that are great for data scientists who are basically mining the data and prototyping potential “recipes” that could be used for decision making. Once they discover these recipes they leave it as an exercise for the user to convert the query or analytic or machine learning algorithm into a repeatable process that can be continuously run at scale.
The obvious answer to this challenge, once again, is automation. Here too, Infoworks automates the process of promoting a project from dev to test to production. Along the way, Infoworks automatically adjusts and optimizes your data pipelines to take advantage of the size of the production cluster. No re-coding or reimplementation of the pipeline is required.
This means that the same self-service that data scientists are taking advantage of for data discovery, can be delivered as well, all the way through to the push to full production.
Most organizations have focused on tooling up so their data analysts and data scientists can more easily identify new insights. They have not, however, invested in similar tooling for running data workflows in production, where you have to worry about starting, pausing and restarting jobs. You also have to worry about ensuring fault tolerance of your jobs, handling notifications, and orchestrating multiple workflows to avoid “collisions”.
What can you do about this? Fortunately, there are now software products that directly address and automate away the complexity of big data. Don’t assume you are relegated to hiring an army of Hadoop or Spark specialists; hiring your way out of this is not realistic. The only realistic path is to automate away the complexity.
Here you could attempt to use Apache Airflow to manage the workflow process for your data pipelines. The issue is that it is really a workflow “scaffold”. One positive review of Airflow stated, “When authoring a workflow, you have to think how it could be divided into tasks which can be executed independently.” The implication is that you have to be a developer to use it and need a good understanding of distributed computing to get it to work well.
In addition, it doesn’t deal with recovery or retry of failed jobs, nor does it work across multiple clouds or hybrid cloud/on-premise environments (more on that later). The bottom line is that Airflow is table stakes for operationalizing your data pipelines; you will need to either buy or build capabilities for data pipeline production management at scale, including retry and recovery of failed jobs, fault tolerance, notifications, and orchestration across multi-cloud and hybrid environments.
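As one example of what “build it yourself” means here, retry with backoff is the kind of recovery logic you end up wrapping around every task if your orchestrator doesn’t provide it. A minimal sketch (the `run_with_retry` helper and `flaky_load` task are hypothetical, for illustration only):

```python
import time

def run_with_retry(task, max_attempts=3, base_delay=0.01):
    """Retry a failing pipeline task with exponential backoff,
    re-raising the error only after the final attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky_load():
    """Simulated task that fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"

status = run_with_retry(flaky_load)
```

And this is only retry; checkpointing, notifications, and cross-workflow scheduling each add a comparable layer of code, which is the argument for buying this capability rather than building it.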
Fortunately, Infoworks addresses this problem as well, providing a distributed orchestrator with a visual development environment that monitors production workloads and makes them fault-tolerant, reducing the load on system and production administrators.
As noted earlier, 80% of organizations that are moving to the cloud — and that is almost 100% of all organizations — are moving to a multi-cloud environment. This is happening for two main reasons. The first is that large organizations often have different parts of their business selecting different cloud providers. The second is that even when a business is more coordinated across its departments, it will often select one cloud vendor because it has superior capabilities for a specific use case and a different cloud vendor for a different use case.
The result is that data has to flow and be replicated across cloud and on-premise environments, and data pipelines have to be orchestrated across different cloud execution environments.
As noted above, there are two challenges you have to deal with. The first is the high-speed replication of data across environments. To do this you have to be able to duplicate data reliably and bi-directionally. In addition, you also have to be able to replicate data en masse as well as incrementally. This means you have to address challenges 1 and 2 above as a baseline.
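The en-masse versus incremental distinction can be sketched as a single replication routine that does a full bulk copy on the first sync and watermark-based incremental copies thereafter. All names here (`replicate`, the `ts` timestamp field, the `target_state` dict) are illustrative assumptions, not any particular product’s API:

```python
def replicate(source_rows, target_state):
    """Full copy on first sync; incremental, watermark-based copies thereafter."""
    wm = target_state.get("watermark")
    if wm is None:
        rows = list(source_rows)                          # initial bulk load
    else:
        rows = [r for r in source_rows if r["ts"] > wm]   # only new changes
    if rows:
        target_state["watermark"] = max(r["ts"] for r in rows)
    target_state.setdefault("rows", {}).update({r["id"]: r for r in rows})
    return len(rows)

state = {}
source = [{"id": 1, "ts": 10}, {"id": 2, "ts": 20}]
first = replicate(source, state)      # bulk load on first sync
source.append({"id": 3, "ts": 30})
second = replicate(source, state)     # incremental on the next sync
```

Doing this reliably in both directions, across cloud and on-premise targets, is where the real engineering effort lies.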
Second, you have to also be able to orchestrate pipelines that run from cloud to cloud, cloud to on-premise, and on-premise to cloud. This is yet another set of problems that Infoworks addresses, and whether or not you use Infoworks software, you will have to address these issues.
The bottom line is that you don’t need nearly as much expertise as was required 5 years ago when Hadoop first started to get big.
The first wave of automation came into existence about 3 years ago and automated individual slices of data pipeline development from ingest to consumption. Infoworks represents a second wave that doesn’t just automate an individual slice but automates the entire end-to-end data pipeline in a fully integrated fashion across multiple deployment infrastructures both on-premise and in the cloud.
Regardless of whether you go with the first wave of automation, or what is now appearing as a second wave, you should not have to hand-code any of your big data pipelines either in development or in production.
So if you find the majority of your big data effort turning into a coding effort in Python, Pig, Hive, Scala, etc., you are doing something wrong. Tools and platforms are now available that let your existing data and business analysts achieve a relatively high level of self-service without having to become big data experts.