Data Lake Creation

Keep Your Data Lake in the Cloud from Turning into Another Data Swamp

Posted by Todd Goldman

“Data lakes are an epic fail,” a veteran tech reporter recently wrote. That reporter is not alone. There has been a lot of negative buzz about data lakes, their lack of governability, and their general lack of success.

However, data lakes have recently begun to recapture the luster they gained in 2010, when organizations began using them to store all their raw data in a cost-effective way. The problem was that the data lake turned out to be a great way to store data but a terrible way to generate actual value, with data left to languish, unknown and ungoverned, in a desolate dumping ground. But that’s changing with new approaches that better integrate data lakes with data warehouses, and with emerging solutions that help automate management of the data lake so data is governed, searchable, and accessible. A recent report by the Eckerson Group, The Future of Data Warehousing: Integrating with Data Lakes, Cloud, and Self-Service (webinar version here), covers a wide variety of approaches and architectures for integrating data lakes and data warehouses.

The rebirth of the data lake is happening in parallel with the Great Cloud Migration, now in full swing as more organizations look to reap potential cost savings, increased agility, and other benefits. AWS, GCP, and Azure all tout cloud services that let organizations build data lakes where data can be analyzed in a secure, scalable, and flexible fashion. In addition, setting up a data lake in the cloud is much easier than doing it on premises. There is no hardware to buy, no data center to cool, no networking to set up, and no Hadoop to install; the cloud vendors take care of all of that for you. So setting up the foundation for a data lake is significantly easier than it was eight years ago.

However, the fundamental challenge after moving your data mess from an existing on-premises cluster to a cloud-based platform is that you still have a mess! Data lakes failed initially because of a lack of pre-planning. Instead of building their data lakes in accordance with very specific needs, organizations turned them into data swamps by just dumping data into them. The same thing can happen to your data lake in the cloud if care isn’t taken to determine the function your new data lake is supposed to serve. The net net? Think first about your use cases. Your data lake in the cloud should be built around very specific, predefined business needs and goals.

You also want to start small. Rather than move everything and provide structure later, move the data you know you’re going to use and then scale in a managed way with clear purpose. This lets you establish actual value and ROI early. You want to make sure your cloud provider supports all of your business operations and IT requirements. And don’t forget to talk to your cloud provider about their compliance capabilities and certifications to make sure your data governance policies carry over.

Especially important is implementing an agile methodology and project management framework. Automation is key. Another reason initial data lakes failed was that organizations had to hand code much of the data pipeline and workflow processes on top of the lake itself. The result was rigid, brittle code that had to be modified with the slightest change in upstream schemas or the introduction of a new file type into the environment. One Infoworks customer in the retail space recently told us that they spent more time fixing their data ingestion and transformation code than they did adding new use cases. (By the way, this is why they moved to the Infoworks agile data engineering platform, which is automating away all of their hand coding.)
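
To make that brittleness concrete, here is a minimal PySpark sketch; the paths, table, and column names are hypothetical stand-ins, not from any particular customer. It contrasts a hand-pinned schema, which breaks (or silently drops data) whenever the source changes, with ingestion that tolerates schema evolution:

    # Hypothetical illustration: a hand-coded ingestion job with a pinned schema.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("hand-coded-ingest").getOrCreate()

    # The schema is frozen in code. If the source system adds, renames, or
    # retypes a column, this job misbehaves and someone has to edit and
    # redeploy it by hand.
    orders_schema = StructType([
        StructField("order_id", StringType()),
        StructField("store_id", StringType()),
        StructField("amount", DoubleType()),
    ])
    orders = spark.read.csv("s3://my-lake/raw/orders/",
                            schema=orders_schema, header=True)

    # What an automated platform handles for you: detecting and merging
    # schema changes as new columns appear, instead of failing.
    orders_evolving = (
        spark.read
        .option("mergeSchema", "true")  # Parquet option: union schemas across files
        .parquet("s3://my-lake/raw/orders_parquet/")
    )

This is only a sketch of the failure mode, but it shows why every upstream schema change turned into a maintenance ticket for teams that hand coded their pipelines.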

To avoid this problem, think about how you will automate the management processes that sit on top of the data lake: data ingestion, data preparation, data lineage, creation of high-speed queries (because querying HDFS or Hive directly from Tableau or similar tools is painfully slow), and the orchestration of your data pipelines in production. These are all processes you need to consider once you get your data lake running. Before you even get it fully operational, you have to migrate your data into the cloud with high-speed replication facilities. In addition, you will want to migrate legacy transformation logic so you don’t have to rewrite it all for the pre-existing analytics processes you decide to move to the cloud.
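
As a deliberately simplified illustration of what that orchestration layer looks like, here is a sketch using Apache Airflow 2.x; the DAG, task names, and callables are hypothetical placeholders for whatever ingestion, preparation, and aggregation logic your platform provides:

    # A minimal Airflow 2.x sketch of the pipeline layers described above:
    # ingest raw data, prepare it, then build pre-aggregated tables so BI
    # tools like Tableau never have to query HDFS or Hive directly.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest_raw():
        """Replicate source tables into the lake's raw zone."""

    def prepare_data():
        """Cleanse and conform raw data into a curated zone."""

    def build_aggregates():
        """Materialize aggregate tables for high-speed BI queries."""

    with DAG(
        dag_id="lake_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        ingest = PythonOperator(task_id="ingest_raw",
                                python_callable=ingest_raw)
        prepare = PythonOperator(task_id="prepare_data",
                                 python_callable=prepare_data)
        aggregate = PythonOperator(task_id="build_aggregates",
                                   python_callable=build_aggregates)

        # Dependencies double as coarse-grained lineage: raw -> curated -> serving.
        ingest >> prepare >> aggregate

Whether you assemble this layer yourself or buy it, the point is the same: ingestion, preparation, and serving should be orchestrated and repeatable, not stitched together by hand.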

But be forewarned: these are not processes that the data lake vendors themselves actually automate. So make sure to look for big data automation solutions that will accelerate the implementation of those processes without requiring knowledge of Hadoop and Spark to do the work. Vendors like Infoworks.io, along with our competitors, have solved this problem. The result is that automation up and down the big data stack, combined with cloud implementations, is making it much easier to implement successful data lakes.

To learn more, check out the Eckerson Group white paper I noted above.  
