Data Lake Creation

High Tech Behemoth Taps Infoworks to Automate Big Data Ingestion Challenges

Posted by Todd Goldman

2019 is going to be a big year for big data for a couple of big reasons. Remember what big data and Hadoop promised in 2010? It’s finally starting to materialize for many organizations in 2019. And remember all that magic we were supposed to get with Hadoop in 2011? Well, that magic is being delivered now, but in other ways.

Allow me to restate in the most basic of terms: everything big data was supposed to be seven or eight years ago, it’s going to be now. I’m about to be even bolder still: a lot of what’s making this possible is being driven here at Infoworks.

Consider one of our customers, a very well-known high tech $50 billion industry bellwether. The organization had the money and resources to invest in a data operation that would give it all the agility such an operation was supposed to deliver. The problem was… it didn’t. Even a company this large was having trouble getting the big data delivery man to show up. The analytics framework they tried to establish to deliver a consolidated view across all of their business lines for customer churn, support, marketing, and supply chain management—regardless of the product line—hadn’t come to fruition.

What went wrong?

The company had attempted to build a consolidated enterprise data warehouse (EDW) many times over many years. However, the process of modeling the EDW and building ETL jobs could never keep up with the company’s fast growth and numerous acquisitions. As soon as they integrated a new business into the EDW, more data sources would pop up.

In 2011, Hadoop was seen as the answer to their EDW problems, so the company built its first data lake, into which teams began indiscriminately dumping data. It didn’t help that they had no governance policy, so different business users would find themselves requesting the same data from the same data source. Since they didn’t know what data had previously been ingested, these users would go back to that source many times and reload the data. The DBAs for the source systems quickly began to complain that the data lake was draining database computing resources needed to run the business. It was a data whirlpool sucking the data, and their time, into it.

In addition, data ingestion was being done by various data engineers who were all hand-coding their ingestion processes, so some ingestion jobs ran well, but most did not. The efficiency and performance of the ingestion jobs depended on the skill and experience of the data engineers, and most of them were not very experienced; after all, it was still only 2011. Additionally, Hadoop was, and still is, a complex distributed system, so hand-coding efficiently is quite difficult. The result was such a drain on operational systems that the data operation became a running joke within the company. It was all pain, no gain. Almost immediately, they realized their lake had been rendered useless.

Despite significant financial and personnel resources, this early big data adopter had a single achievement to its name: they were one of the very first companies to create a data swamp.

By 2012, the organization had converted all its learnings into a plan. They created a team of 8 data engineers that would work to eliminate data silos and ensure all data was available for downstream analytics. With a focus on efficiency, they eliminated poorly written code by getting rid of hand-coding altogether, implementing a metadata-driven ingestion system with proper governance in place to control access.

By 2015, 24 engineering years later, they had built a best-practices infrastructure that provided performance optimizations for scaling data ingestion as well as support for governance, security and compliance. The problem was that it was expensive to build and the underlying Hadoop technology kept evolving. They wanted to move their framework into maintenance mode, so they could move their chief data architect and other data talent into higher-value roles. However, these engineers were needed to update a perpetually evolving technology stack that still required 4 experienced data engineers just to keep the system up and running.

By 2017, this high tech manufacturer decided that it was time to look for a commercial replacement for their in-house code. After reviewing 30 vendors and bringing in 8 of them for proofs of concept, the company selected Infoworks as the data engineering and data ops platform to replace the in-house data ingestion code. Today, Infoworks is used to crawl all of the company’s data sources, import the metadata, and make that information available and searchable via a data shopping cart. When a user selects a data set and is authorized to access it, the data is then ingested into the data lake if it has not already been loaded.
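The request flow described above (check authorization, then ingest only if the data set is not already in the lake) can be sketched in a few lines of Python. This is a toy illustration, not Infoworks code: the `Catalog` class, its fields, and the `request` method are hypothetical names invented for this example, and the "ingestion" is simulated with a counter.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Catalog:
    """Hypothetical metadata catalog; not an Infoworks API."""

    # Data set name -> users authorized to access it (assumed metadata).
    permissions: Dict[str, Set[str]]
    # Data sets that are already in the lake.
    loaded: Set[str] = field(default_factory=set)
    # How many times a source system was actually read.
    source_reads: int = 0

    def request(self, user: str, dataset: str) -> str:
        if dataset not in self.permissions:
            raise KeyError(f"unknown data set: {dataset}")
        if user not in self.permissions[dataset]:
            raise PermissionError(f"{user} may not access {dataset}")
        if dataset in self.loaded:
            # Already ingested: do not go back to the source system.
            return "already in lake"
        # Simulated ingestion: one read of the source, then record the load.
        self.source_reads += 1
        self.loaded.add(dataset)
        return "ingested"


catalog = Catalog(permissions={"orders": {"ana", "raj"}})
first = catalog.request("ana", "orders")
second = catalog.request("raj", "orders")
```

In this sketch the second request does not touch the source system at all, which is the behavior the DBAs were missing when users kept reloading the same data.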

After 18 months, the legacy pipelines had been moved over to run through the Infoworks automated framework, including ingestion of 5000 data objects with over 100TB of data. We also helped the company solve the problem of change data capture on the source systems, incrementally merging and syncing data into the lake, and we used more efficient connectors to load data from Teradata and Oracle at much higher speeds than before. Meanwhile, Infoworks keeps the environment up to date with evolving technologies without requiring any manual code changes. This includes allowing them to move to the cloud, with Infoworks managing the entire migration without any recoding. The company now needs only one junior data engineer to support the Infoworks platform. The rest of the data team that was managing the system is now working on other projects that more directly affect their business.
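For readers unfamiliar with the idea, incremental merging from change data capture means applying only the inserts, updates and deletes recorded at the source, rather than reloading whole tables. The sketch below shows the general technique under simple assumptions: the lake table is a dictionary keyed by primary key, and each change record is a dictionary with hypothetical `op`, `key` and `row` fields. It does not represent how Infoworks implements this.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_changes(lake: Dict[int, Row], changes: List[Dict[str, Any]]) -> Dict[int, Row]:
    """Apply ordered change records to a copy of the lake table.

    The record layout (op/key/row) is an assumption made for this example.
    """
    merged = dict(lake)  # shallow copy so the input table is left untouched
    for change in changes:
        op, key = change["op"], change["key"]
        if op == "delete":
            merged.pop(key, None)  # ignore deletes for keys we never loaded
        elif op in ("insert", "update"):
            merged[key] = change["row"]  # upsert: the latest version wins
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


lake_table = {1: {"name": "Acme"}, 2: {"name": "Globex"}}
captured = [
    {"op": "update", "key": 1, "row": {"name": "Acme Corp"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"name": "Initech"}},
]
synced = merge_changes(lake_table, captured)
```

Only three change records are processed here, regardless of how large the source table is, which is why this approach places far less load on source systems than repeated full reloads.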

Let’s not forget: this is a multi-billion dollar company. They can afford to make large investments in their data operation. They could afford being an early adopter. They could afford their missteps. For most companies, making these kinds of early adopter mistakes is unacceptable. They can’t afford long data journeys. For them, slow and steady does not win the race. Without Infoworks, building an enterprise-class solution simply isn’t possible for them.

Looking to make your big data dreams a reality? Learn more about Infoworks and our Autonomous Data Engine here.


About this Author
Todd Goldman
Todd is the VP of Marketing and a Silicon Valley veteran with over 20 years of experience in marketing and general management. Prior to Infoworks, Todd was the CMO of Waterline Data and COO at Bina Technologies (acquired by Roche Sequencing). Before Bina, Todd was Vice President and General Manager for Enterprise Data Integration at Informatica, where he was responsible for their $200MM PowerCenter software product line. Todd has also held marketing and leadership roles at both start-ups and large organizations including Nlyte, Exeros (acquired by IBM), ScaleMP, Netscape/AOL and HP.