Data Lake Creation

Rumors of the Data Lake’s Death are Greatly Exaggerated

Posted by Todd Goldman

Isn’t it funny when good ideas start out as good ideas that turn into bad ideas that turn into good ideas again?

History is littered with products that flopped at first before improvements brought them back to life, particularly during the tech boom of the 1990s and 2000s. Napster failed, but out of the ashes came iTunes. BlackBerry died so it could give life to the smartphone. Friendster bowed down to Facebook. Sometimes, as these products show, success can only exist on the back of initial failure.

The same thing is happening to the data lake. When Pentaho CTO James Dixon first coined the term ‘data lake’ while touting its benefits in 2010, organizations quickly fell in line. New sources of data were emerging all the time, passing new streams of potential information through businesses like sieves. Organizations needed a way to capture, store and analyze it. The data lake allowed them to do just that, keeping vast amounts of raw data in its native format in a flat architecture. It was cost-effective, easily accessible, and flexible. Organizations were breaking down their information silos, providing a single repository for their entire data estate. It was astonishingly simple.

But that was the problem. The data lake worked as long as the organization’s needs were relatively simple. Once organizations began dumping all of their data into the lake, they realized the data lake wasn’t quite the panacea they had grown to expect from all the marketing hype. Experts began calling it the data graveyard and the data swamp. For a time, it was a great way for organizations to put off deciding how they were going to deal with their data. But then they realized there were too many kinds of data co-existing without any way to sort out the valuable information from the junk.

So, the data lake grew and grew, and businesses steadily lost track of whatever potential intelligence they might have had. The whole point of the data lake, to enable big data analytics, had seemingly been declared null and void. Data lakes weren’t helping organizations derive value from data. There was no organization. No structure. Just a dumping ground. Projects that were supposed to tap into those lakes stalled until eventually budgets were cut and projects were canceled. Data lakes were failing; therefore, big data was failing. To this day, over 80% of the companies implementing big data projects never move into production.

But that’s changing. Data lakes are becoming data enablers again. Organizations have figured out how to properly manage their lakes. They’re transforming their sloppy swamps into governed systems that allow their data scientists, analysts and even everyday business users to easily mine and analyze the data they need to generate business value. They learned that, rather than collecting everything and providing structure later, it’s better to collect less data in the beginning and scale in a managed way with a clear purpose. They learned the data lake should be built around pre-defined business needs and goals. The data should be curated. It should be searchable. And, especially important, the processes for building and managing the data lake should be automated.

Infoworks has always been a big supporter and key enabler of the managed data lake. We automate the process of creating and operationalizing the lake so organizations can use it to fuel their big data strategies in days instead of months. But the important takeaways from all of this are that, first, the data lake should be just one part of your overall data strategy. And second, while the data lake is supposed to be more flexible and agile than a data warehouse, you don’t have to trade flexibility for manageability. With all of the automation available to manage your big data infrastructure and your data pipelines, you actually can have both.

The bottom line is that the data lake is making a comeback: we are seeing more organizations implement data lakes as part of an overall strategy, especially when it comes to cloud-based data architectures. If you want to learn more about how to get the most out of your data lake, check out our Solution Brief on Automated Data Lake Creation and Management. Just as the data lake is enjoying new life, so can your organization as a newly data-driven powerhouse.
