Enterprises are struggling to deploy big data workloads in production. This was captured very well by Gartner in a late 2016 press release, which stated, “Only 15 percent of businesses reported deploying their big data project to production.” Gartner was pretty careful in their word selection. They didn’t mean there isn’t a lot of experimentation, or that data scientists haven’t found new insights by using big data techniques. They specifically stated that these projects didn’t make it into production. The problem isn’t with big data analytics or even with most of the data science experiments. The challenge is the lack of big data automation to make it easy to promote initial experiments out of the sandbox and into a fully functional production environment.
And this isn’t completely surprising.
Most people think that getting analytics into production is just about tuning the cluster. Sure, you can write a Sqoop script and bring a table in once. But it is another challenge to bring it in multiple times without affecting the source systems. Then you have to be sure that the data pipeline you’ve built delivers the data within the timeframe set by the service level agreements (SLAs). Additionally, the data models have to be optimized for consumption by your users with the tools they are currently using, like Tableau and Qlik, and with the responsiveness they have grown to expect.
There has been a ton of effort and investment in using tools on top of Hadoop and Spark to do rapid prototyping against large data sets. But prototyping is one thing. It is a completely different challenge to turn that prototype into a data workflow that runs every day without failing, or that recovers gracefully when a data flow job does fail.
While tools like Sqoop support parallelization for data ingest to get data from legacy sources into a data lake, you need an expert to make it work. How do you partition the data? How do you know how many containers to run? If you can’t properly parallelize the ingest of data, ingestion tasks that could be done in an hour can take 10 to 20 times longer. The problem is that most people don’t know how to tune this properly.
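To make the partitioning question concrete, here is a minimal Python sketch of one common approach: dividing a numeric key range into contiguous, near-equal chunks, one per parallel ingest task. The function name split_ranges, the orders table and the id column are hypothetical and only for illustration. The sketch also assumes the key is roughly evenly distributed; with skewed keys, equal-width ranges produce unequal work, which is part of why this tuning is hard.

```python
def split_ranges(min_id, max_id, num_tasks):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks.

    Returns a list of (start, end) tuples, one per ingest task. Chunk sizes
    differ by at most one key.
    """
    total = max_id - min_id + 1
    if total <= 0:
        raise ValueError("max_id must be >= min_id")
    # Never create more tasks than there are keys to read.
    num_tasks = max(1, min(num_tasks, total))
    base, extra = divmod(total, num_tasks)

    ranges = []
    start = min_id
    for i in range(num_tasks):
        # The first `extra` chunks absorb the remainder, one key each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    for lo, hi in split_ranges(1, 1_000_000, 8):
        print(f"SELECT * FROM orders WHERE id BETWEEN {lo} AND {hi}")
```

Each generated range can then be handed to a separate task. Choosing the number of tasks still depends on how much load the source system can tolerate and how many containers the cluster can spare.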
Most organizations aren’t moving their entire operations onto a big data environment. They move data there from existing operational systems to perform new kinds of analysis or machine learning. This means that they need to keep loading new data as it arrives. The problem is that these environments don’t natively support the concept of updates, deletes or inserts. This means you either have to reload the entire data set again (with all of the ingestion cost described above) or code your way around this classic change data capture problem.
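As a rough illustration of what "coding your way around" change data capture involves, the following Python sketch applies an ordered batch of change records to an existing snapshot keyed by primary key. The record layout (operation, key, row) and the function name apply_changes are assumptions made for this example, not the format of any particular tool. It works in memory on small dictionaries; on a real cluster the same merge has to be expressed as a distributed job that rewrites the affected files.

```python
def apply_changes(snapshot, changes):
    """Return a new snapshot with insert/update/delete records applied in order.

    snapshot: dict mapping primary key -> row
    changes:  iterable of (op, key, row) tuples, where op is "insert",
              "update" or "delete" (row is ignored for deletes)
    """
    merged = dict(snapshot)  # copy so the previous snapshot stays intact
    for op, key, row in changes:
        if op in ("insert", "update"):
            merged[key] = row
        elif op == "delete":
            merged.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return merged


if __name__ == "__main__":
    current = {1: {"name": "alice"}, 2: {"name": "bob"}}
    delta = [
        ("update", 1, {"name": "alicia"}),
        ("delete", 2, None),
        ("insert", 3, {"name": "carol"}),
    ]
    print(apply_changes(current, delta))
```

Only the delta has to be read from the source system, which is the point: the load on the operational database is proportional to what changed, not to the size of the table.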
Imagine you have 1000 BI analysts, and none of them want to use your data models because they take too long to query. Actually, you only need one data analyst to make this unbearable. This is a classic problem with Hadoop and is the reason why lots of companies only use Hadoop for preprocessing and applying specific machine learning algorithms but then move the final data set back to a traditional data warehouse for use by a BI tool. Regardless, this adds yet one more step in the process that gets in the way of successfully completing a big data project.
Many organizations have been able to identify the potential for new insights from data scientists working within their sandbox environments. Once they have identified a new “recipe” for analytics, they need to move from an individual data scientist running this analysis in a sandbox to a production environment that can run it every day. Moving from dev to production is a complete lift-and-shift operation that is generally done manually. And while the data pipeline ran just fine on the dev cluster, it now has to be re-optimized on the production cluster. This tuning can often require significant rework to get it to perform efficiently. This is especially true if the dev environment is in any way different from the production environment.
Most organizations have focused on tooling up so their data analysts and scientists can more easily identify new insights. They have not, however, invested in similar tooling for running data workflows in production, where you have to worry about starting, pausing and restarting jobs. You also have to worry about ensuring fault tolerance of your jobs, handling notifications, and orchestrating multiple workflows to avoid “collisions”.
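To give a sense of the plumbing involved, here is a minimal Python sketch of just one of those concerns: rerunning a failed job with exponential backoff and sending a notification on every failure. The function name run_with_retries and its parameters are hypothetical, and a real scheduler also has to handle pausing, dependencies between workflows and partial output cleanup, none of which is shown here.

```python
import time


def run_with_retries(job, max_attempts=3, base_delay=1.0,
                     notify=print, sleep=time.sleep):
    """Run job(), retrying with exponential backoff when it raises.

    notify is called with a message after every failed attempt; sleep is
    injectable so the backoff can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            notify(f"attempt {attempt}/{max_attempts} failed: {exc}")
            if attempt == max_attempts:
                raise  # out of retries: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_load():
        # Stand-in for a data flow job that fails twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("source system timed out")
        return "loaded"

    print(run_with_retries(flaky_load, base_delay=0.1))
```

Even this small wrapper raises design questions, such as whether a retried job is safe to run twice, and those questions multiply once many workflows share a cluster.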
What can you do about this? Well, fortunately, there are now software products that directly address and automate away the complexity of big data. Don’t assume you are relegated to hiring an army of Hadoop specialists, because hiring your way out of this is not realistic. The only realistic path is to automate away the complexity.
In the next post, I will talk about exactly what can be, and has been, done to address some of these issues. In the meantime, visit www.infoworks.io to learn more about how we are helping with big data automation to simplify away the complexity and accelerate the time from prototype to big data production.