There is a massive trend to move data analytics and data engineering to the cloud, and there is good reason for it. Because cloud environments offer cheap storage and pay-as-you-go compute, the initial startup costs of cloud analytics are much lower than on premise: there is no up-front capital cost. Plus, big data and data warehouse environments in the cloud are easier to use than on premise offerings. They automatically spin up and spin down, and they require significantly less infrastructure management.
That said, there are still some significant challenges that you have to overcome to successfully implement production-class data engineering pipelines. These are issues that the cloud vendors are reluctant to talk about, but if you are moving to the cloud, you need to take them into account:
Cloud vendors will tell you that implementing your data analytics pipelines in the cloud is easy. And they will show diagrams that look like this, showing all of the tools they provide to make it simple.
What they fail to mention is that to implement the architecture above, you have to write code, and quite a bit of code for that matter. The problem with hand coding is that it is:
Operationalizing a data pipeline so it can run repeatedly with full monitoring and fault-tolerant recovery was difficult on premise, and it is just as difficult in the cloud. In fact, the issues discussed in our blog post “The Top 5 Data Preparation Challenges to Get Big Data Pipelines to Run in Production” are just as true in the cloud.
But wait a minute, don’t open-source tools like Airflow address this operationalization issue? The answer is that Airflow is a good framework but… a bit of an empty toolbox:
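To make "repeatable, monitored, fault-tolerant" a little more concrete, here is a minimal sketch in Python. It is not how any particular product or cloud service works; the `run_pipeline` function, the step names and the in-memory `done` set are all hypothetical, and a real system would persist the completed-step state somewhere durable and ship the log output to a monitoring tool.

```python
import logging
import time
from typing import Callable, List, Set, Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

# A step is a (name, callable) pair; both are illustrative placeholders.
Step = Tuple[str, Callable[[], None]]


def run_pipeline(
    steps: List[Step],
    done: Set[str],
    max_attempts: int = 3,
    backoff_s: float = 0.0,
) -> Set[str]:
    """Run steps in order, skipping completed ones and retrying failures."""
    for name, fn in steps:
        if name in done:
            # Recovery: a rerun resumes after the last successful step.
            log.info("skip %s (already completed)", name)
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                fn()
                done.add(name)
                log.info("ok %s (attempt %d)", name, attempt)
                break
            except Exception:
                # Monitoring: every failure is logged with its traceback.
                log.exception(
                    "step %s failed (attempt %d/%d)", name, attempt, max_attempts
                )
                if attempt == max_attempts:
                    raise
                time.sleep(backoff_s * attempt)
    return done
```

Even this toy version has to decide how to track state, how often to retry and what to log, and it still says nothing about scheduling, alerting or dependencies between pipelines.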
The bottom line is that you will need to either buy or build capabilities for data pipeline production management at scale that support:
So keep this in mind as you implement your cloud data pipeline infrastructure.
As noted briefly above, per Gartner, most companies (80%) have multiple cloud environments in addition to their on premise data. When organizations decide to swap out their cloud data fabric or run multiple fabrics simultaneously, they have to be able to move their data pipelines and query logic as well as move the data and keep it synchronized on an ongoing basis.
Unfortunately, migration and synchronization from one environment to another is complicated. You have to deal with complex issues like storage and format inconsistency, compute platform differences, and the fact that cloud compute platforms evolve and tend to have poor backwards compatibility. On top of that, in a multi-cloud environment, data must be replicated and then kept in sync on an ongoing basis. None of these issues is trivial. Each of them is difficult in its own way.
While multi-cloud and hybrid environments create the need for portability, they also create the need to be able to manage data pipelines that flow across different cloud environments. So everything noted earlier in Challenge 2 about operationalizing your data pipelines is still true, but now you have to do it across multiple clouds!
In addition, you have to deal with data replication across clouds: two-way incremental replication is required to keep multiple cloud clusters in sync so that each one stays operational.
Keep in mind, even with all of these challenges, implementing your data analytics stack in the cloud is still going to be easier than doing it on premise. Plus, the cloud vendors and companies like Infoworks are constantly competing to improve their product offerings to simplify deployment of end-to-end analytics use cases.
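As a rough illustration of what two-way incremental replication involves, the sketch below syncs two in-memory stores using a watermark and a last-writer-wins rule. Everything in it is an assumption made for the example: the `Store` layout, the integer versions standing in for timestamps, and the function names. It ignores deletes, leaves equal-version conflicts unresolved, and does not address network failures, all of which a real multi-cloud setup has to handle.

```python
from typing import Dict, Tuple

# Hypothetical record layout: key -> (version, value).
Record = Tuple[int, str]
Store = Dict[str, Record]


def changes_since(store: Store, watermark: int) -> Store:
    """Incremental part: only records modified after the last sync."""
    return {key: rec for key, rec in store.items() if rec[0] > watermark}


def apply_changes(target: Store, changes: Store) -> None:
    """Last-writer-wins: a change only replaces an older version."""
    for key, rec in changes.items():
        if key not in target or rec[0] > target[key][0]:
            target[key] = rec


def sync_two_way(a: Store, b: Store, watermark: int) -> int:
    """Exchange changes in both directions and return the new watermark."""
    from_a = changes_since(a, watermark)
    from_b = changes_since(b, watermark)
    apply_changes(b, from_a)
    apply_changes(a, from_b)
    versions = [rec[0] for rec in from_a.values()]
    versions += [rec[0] for rec in from_b.values()]
    return max([watermark] + versions)
```

Storing the returned watermark and passing it to the next call is what keeps each run incremental rather than a full copy.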
So move to the cloud, but keep in mind that there are still lots of challenges that you will need to take into consideration.
NOTE: Infoworks addresses all of the challenges noted above by providing an agile data engineering platform that accelerates the pace of delivering business intelligence and machine learning analytics projects running on any big data fabric, in the cloud or on premise. Visit our product pages to learn more about our data engineering and DataOps automation solutions.