Cloud Data Analytics: 4 Data Engineering Impediments to Success

Written by Todd Goldman | Category: Cloud and Hybrid Cloud

There is a massive trend to move data analytics and data engineering to the cloud, and for good reason. Cloud environments offer cheap storage and pay-as-you-go compute, which makes the initial startup cost of cloud analytics far more affordable than on-premises deployments: there is no up-front capital outlay. In addition, big data and data warehouse environments in the cloud are easier to use than their on-premises counterparts. They spin up and spin down automatically, and they require significantly less infrastructure management.

That said, there are still significant challenges you have to overcome to implement production-class data engineering pipelines successfully. These are issues the cloud vendors are reluctant to talk about, but if you are moving to the cloud, you need to take them into account:

  • Hand coding by experts is still required
  • There is a big delta between ad hoc and production-ready data pipelines
  • Data and data pipelines are not easily portable
  • Cross-cloud and hybrid deployments are hard to manage

 

Challenge 1: Hand Coding by Experts is Still Required

Cloud vendors will tell you that implementing your data analytics pipelines in the cloud is easy, and they will show you architecture diagrams full of the tools they provide to make it simple.

What they fail to mention is that to implement such an architecture, you have to write code, and quite a bit of it. The problem with hand-coded pipelines is that they:

    • Are error prone
    • Are hard to troubleshoot
    • Lack data lineage
    • Are not portable across big data fabrics
    • Become more difficult to support over time, so you end up spending more time fixing old data pipelines and less time adding new use cases

Challenge 2: There Is a Big Delta Between Ad Hoc and Production-Ready Data Pipelines

Operationalizing a data pipeline so it can run repeatedly with full monitoring and fault-tolerant recovery was difficult on-premises, and it is just as difficult in the cloud. In fact, the issues discussed in our blog post, “The Top 5 Data Preparation Challenges to Get Big Data Pipelines to Run in Production,” are just as true in the cloud.

But wait a minute: don’t open source tools like Airflow address this operationalization issue? Airflow is a good framework, but it is a bit of an empty toolbox:

    • You need to be a developer to use it
    • You have to understand distributed computing to get it to work well
      • According to one blog post (and this was a positive post),“When authoring a workflow, you have to think how it could be divided into tasks which can be executed independently”
    • It doesn’t handle recovery or retry of failed jobs
    • It doesn’t work across multiple clouds or in hybrid cloud/on-premises environments
      • This matters because, according to a 2019 Gartner survey, about 80% of companies have more than one cloud environment
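The task-decomposition requirement quoted above can be sketched without Airflow itself. The following is a minimal illustration using only the Python standard library; the task names and pipeline shape are invented for the example and are not any vendor's API:

```python
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical pipeline: two independent extract tasks, then a dependent join.
# Task names and data are illustrative only.
results = {}

def extract_orders():
    results["orders"] = [{"id": 1, "amount": 40}, {"id": 2, "amount": 60}]

def extract_customers():
    results["customers"] = [{"id": 1, "name": "Acme"}]

def join_and_total():
    # This step depends on both extracts, so it cannot be parallelized with them.
    results["total"] = sum(o["amount"] for o in results["orders"])

with ThreadPoolExecutor() as pool:
    # Tasks with no dependency on each other can run concurrently...
    wait([pool.submit(extract_orders), pool.submit(extract_customers)])
# ...while the dependent task must wait for both to finish.
join_and_total()

print(results["total"])  # 100
```

This is exactly the mental exercise the quote describes: before writing any orchestration code, you must decide which steps are independent and which must be serialized behind them.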

The bottom line is that you will need to either buy or build data pipeline production management capabilities that operate at scale and support:

    • Debugging of production runs
    • Resource management and performance optimization
    • Fault tolerance: retrying and restarting failed jobs
    • Dynamic control: passing parameters through the workflow
    • Distributed orchestration: running jobs in parallel, on and off the cluster
    • Multi-cloud orchestration: coordinating workflows across cloud and hybrid environments
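To make the fault-tolerance and dynamic-control items above concrete, here is a minimal retry wrapper that passes parameters through to a pipeline step. The function names and retry policy are a sketch for illustration, not a real product API:

```python
import time

def run_with_retries(task, params, max_retries=3, backoff_seconds=0.0):
    """Retry a failing pipeline step, passing workflow parameters through.

    A minimal sketch of fault tolerance (retries) and dynamic control
    (parameter passing); real systems also need logging and alerting.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return task(**params)
        except Exception:
            if attempt == max_retries:
                raise  # exhausted retries: surface the failure to the operator
            time.sleep(backoff_seconds * attempt)

# Example: a hypothetical flaky load step that succeeds on its third attempt.
calls = {"n": 0}
def flaky_load(table):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"loaded {table}"

out = run_with_retries(flaky_load, {"table": "sales"})
print(out)  # loaded sales
```

Hand coding even this much per pipeline is exactly the maintenance burden described in Challenge 1, which is why these capabilities are usually bought or built once as shared infrastructure.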

So keep this in mind as you implement your cloud data pipeline infrastructure.

 

Challenge 3: Data and Pipeline Portability

As noted briefly above, per Gartner, most companies (about 80%) have multiple cloud environments in addition to their on-premises data. When organizations decide to swap out their cloud data fabric or run multiple fabrics simultaneously, they have to be able to move their data pipelines and query logic, and they must also move the underlying data and keep it synchronized on an ongoing basis.

Unfortunately, migration and synchronization from one environment to another is complicated. You have to deal with storage and format inconsistencies, compute platform differences, and the fact that cloud compute platforms evolve quickly and tend to have poor backwards compatibility. On top of that, in a multi-cloud environment data must be replicated and then kept in sync on an ongoing basis. None of these issues is trivial; each of them is difficult in its own unique way.
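The ongoing-synchronization piece is often implemented with a high-watermark pattern: each run copies only the rows changed since the last run. A minimal sketch, where the row shape and the `updated_at` column are assumptions for illustration:

```python
def incremental_sync(source, target, last_watermark):
    """Copy only rows changed since the last sync (high-watermark pattern).

    A simplified sketch of keeping two stores in sync; real pipelines also
    handle deletes, schema drift, and late-arriving updates.
    """
    changed = [row for row in source if row["updated_at"] > last_watermark]
    for row in changed:
        target[row["id"]] = row
    # Advance the watermark so the next run skips already-copied rows.
    return max((r["updated_at"] for r in changed), default=last_watermark)

source = [
    {"id": 1, "updated_at": 10, "value": "a"},
    {"id": 2, "updated_at": 25, "value": "b"},
]
target = {}
wm = incremental_sync(source, target, last_watermark=15)
print(wm, sorted(target))  # 25 [2]
```

Even this toy version shows why the problem is hard: format differences between source and target, clock or watermark skew, and deletes all have to be handled before something like this is production ready.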

 

Challenge 4: Managing Cross-Cloud and Hybrid Deployments

While multi-cloud and hybrid environments create the need for portability, they also create the need to manage data pipelines that flow across different cloud environments. So everything noted in Challenge 2 about operationalizing your data pipelines is still true, but now you have to do it across multiple clouds!

In addition, you have to deal with data replication across clouds: two-way incremental replication is required so that multiple cloud clusters can be kept in sync and maintain a consistent operational state.
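One common way to reconcile two clusters that both accept writes is a last-writer-wins merge on an update timestamp. A minimal sketch, with the record shape invented for the example; real two-way replication also needs change capture, conflict logging, and tombstones for deletes:

```python
def two_way_merge(cluster_a, cluster_b):
    """Reconcile two cloud clusters with a last-writer-wins policy.

    A simplified sketch of two-way incremental replication: for every key,
    keep whichever side saw the most recent update, then write it to both.
    """
    for key in set(cluster_a) | set(cluster_b):
        a, b = cluster_a.get(key), cluster_b.get(key)
        winner = max((r for r in (a, b) if r), key=lambda r: r["updated_at"])
        cluster_a[key] = cluster_b[key] = winner

a = {"x": {"updated_at": 5, "value": "old"}}
b = {"x": {"updated_at": 9, "value": "new"},
     "y": {"updated_at": 1, "value": "b-only"}}
two_way_merge(a, b)
print(a["x"]["value"], a["y"]["value"])  # new b-only
```

Last-writer-wins is only one conflict policy, and timestamp skew between clouds makes even this simple version subtle in practice — which is the point of the challenge above.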

 

Keep in mind that even with all of these challenges, implementing your data analytics stack in the cloud is still going to be easier than doing it on-premises. Plus, cloud vendors and companies like Infoworks are constantly competing to improve their offerings and simplify the deployment of end-to-end analytics use cases.

So move to the cloud, but keep in mind that there are still plenty of challenges you will need to take into consideration.

 

__________________________________________________________________________________

 

NOTE: Infoworks addresses all of the challenges noted above by providing an agile data engineering platform that accelerates the delivery of business intelligence and machine learning analytics projects on any big data fabric, in the cloud or on-premises. Visit our product pages to learn more about our data engineering and DataOps automation solutions.

 

About this Author
Todd Goldman
Todd is the VP of Marketing and a Silicon Valley veteran with more than 20 years of experience in marketing and general management. Prior to Infoworks, Todd was the CMO of Waterline Data and COO at Bina Technologies (acquired by Roche Sequencing). Before Bina, Todd was Vice President and General Manager for Enterprise Data Integration at Informatica, where he was responsible for their $200MM PowerCenter software product line. Todd has also held marketing and leadership roles at both start-ups and large organizations including Nlyte, Exeros (acquired by IBM), ScaleMP, Netscape/AOL and HP.
