It seems like every major enterprise these days is undergoing a digital transformation that includes a mass migration of its data analytics workloads to the Cloud. The Cloud's promise of greater agility at lower cost is just as applicable to data analytics as it is to every other layer of the technology stack, and it has empowered organizations large and small, and both data engineers in centralized IT and "citizen" data engineers in the lines of business, to stand up data analytics faster than ever before. The Cloud platform vendors have enabled this with a steady stream of products for data warehousing, data lakes, and the data operations tools that feed them. The problem is that many of these are proprietary to their respective cloud environments, risking the same degree of data lock-in that concerned enterprises back in the days of on-premises enterprise databases and data warehouses. And so, if you ever need your data migrated to some other environment, it might be singing this tune:
Last thing I remember, I was running for the door,
I had to find the passage back to the place I was before.
‘Relax, ‘ said the night man, ‘We are programmed to receive.
You can check-out any time you like, but you can never leave!’
The Eagles, “Hotel California”
This blogger might be dating himself here, but this tune is about as old as concerns over data lock-in in our industry. Technology platforms change, but the concern remains.
So how does one realize the agility and cost benefits of the Cloud, without giving up control of one’s Data?
First, let's recognize what exactly causes lock-in. It isn't the database and data lake technologies themselves. Whether Amazon Redshift and EMR, Google BigQuery and Dataproc, Azure HDInsight and Synapse, Snowflake, Databricks, or many others, they all do reasonably well at making it easy to get data in and out. Indeed, they are all designed to participate in modern data architectures that encourage using different engines for different workloads (e.g. data warehouses for structured analytics, data lakes for ad hoc and real-time analytics), and they expose the open APIs and compute flexibility needed to make this possible.

Instead, the problem arises in the data operations and orchestration itself. Using the data integration, data preparation, workflow, catalog, and other data management tools native to each environment, one has little choice but to spend data engineering time and skills building out data pipelines in those native tools, and the resulting artifacts are not easily migrated from one such tool to another. Even worse, much of the scripting and "glue code" that is often required cannot be reused in other environments at all. So, if you have data pipelines built for one analytic engine in one cloud and then want to reuse those pipelines for slightly different workloads in another engine on another cloud, you are out of luck: those pipelines require essentially a full rewrite, demanding time and skills that many organizations can ill afford. You have effectively lost control of your Data, handed over to tools hardwired to specific cloud environments. Programmed to receive, and never leave, indeed. What to do?
Infoworks can help! Infoworks was built from the ground up to be a heterogeneous, hybrid/multi-cloud data operations and orchestration platform. It includes an abstraction layer between (1) the design-time and ongoing lifecycle of data onboarding, preparation, and orchestration logic and (2) the underlying compute environments to which that logic can be deployed, with no coding or scripting required to deploy and redeploy from one environment to another. So if, for example, you have data pipelines that aggregate customer data from various applications into a customer-360, and you want to send that customer-360 data both to a cloud data warehouse (e.g. Snowflake) for historical reporting and to a cloud data lake (e.g. Databricks, AWS EMR, Dataproc) for data science and ad hoc analytics, you can define those pipelines once and target both environments with point-and-click ease. And if, for whatever reason, another department needs the same customer-360 data but has independently chosen a different data lake on a different cloud, no problem: just deploy the same pipelines there, with no recoding required. You stay in control of where your data goes, regardless of cloud, engine, or even on-premises deployment.
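To make the abstraction-layer idea concrete, here is a minimal, purely hypothetical sketch in Python. None of these class or method names are Infoworks APIs; this is only an illustration of the general pattern, under the assumption that the pipeline is defined once as engine-neutral metadata and each target adapter translates it for a specific compute environment:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of "define once, deploy anywhere": the pipeline
# definition is plain metadata, decoupled from any compute environment.
# All names here are illustrative, not actual Infoworks APIs.

@dataclass
class Pipeline:
    name: str
    steps: list = field(default_factory=list)  # engine-neutral step names

class Target:
    """Base adapter for a compute environment."""
    engine = "generic"

    def deploy(self, pipeline: Pipeline) -> str:
        # A real adapter would generate engine-specific jobs here;
        # this sketch just reports what would be deployed where.
        return f"{pipeline.name} -> {self.engine}: {len(pipeline.steps)} steps"

class SnowflakeTarget(Target):
    engine = "snowflake"

class DatabricksTarget(Target):
    engine = "databricks"

# Define the customer-360 pipeline once...
customer_360 = Pipeline("customer_360", ["join_crm", "join_billing", "dedupe"])

# ...then deploy the same definition to multiple environments.
for target in (SnowflakeTarget(), DatabricksTarget()):
    print(target.deploy(customer_360))
```

The key design point is that retargeting means adding an adapter, not rewriting the pipeline: the customer-360 logic is written once, and each environment supplies its own translation.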
So, Infoworks offers the best of both worlds. You can confidently embrace the Cloud's promise of data analytics agility and cost-effectiveness without giving up control of your Data. You can retarget and replatform your data operations when needed, without data re-engineering time and expense getting in the way. Even better, when the data operations logic itself does need to change (e.g. data sources change, the data model changes), Infoworks' metadata-driven, automated approach significantly reduces data engineering time and effort compared to cloud-native alternatives. It's a win throughout your data lifecycle, regardless of the nature of the change.
If you're interested in learning more, why wait? Try Infoworks for yourself! Visit our Test Drive at https://www.infoworks.io/try/