It’s 2019. Can We Drop the Term “Big Data” When We Mean Agile Data?

Written by Todd Goldman | Category: Big Data

Let’s face it, we have all been uncomfortable with the “big data” term for some time now. People in the data world avoid it because very often they don’t actually want to talk about large volumes of data or data moving at high velocity. What they want instead is to be able to implement data projects quickly and easily. As it turns out, some of the newer, non-relational data management technologies like Hadoop and Spark, which have become synonymous with the term “big data”, are also useful even when you don’t have volume, variety or velocity. Consider this real-life example:

A 130-year-old publishing company wanted to digitally transform its business. As one of the nation’s largest diversified media companies, with billions in annual revenue and more than 360 businesses, the organization had accumulated a multitude of data silos that kept it from gaining full visibility across its digital media brands. For years, the CTO had wanted consolidated cross-company reporting of IT expenditures across 60 of its properties. In addition, the company had been trying to retire both Essbase/Hyperion and DB2 databases running on the mainframe.

IT had given an estimate of 12 months to complete the first project and even longer to complete the database migrations using traditional relational data warehousing approaches. But using a combination of approaches that leveraged Spark, Hadoop, and NoSQL data storage, they completed the project in just a few weeks.

This was not a big data project in the purest sense. The volume of data was not particularly large, nor was the data changing at a rapid pace. And while there were a lot of data sources, you wouldn’t say there was high variety in data sources or types, at least not in the way we traditionally think of variety today. On the surface, this was a classic Enterprise Data Warehouse (EDW) project. There was nothing “big data” about it. However, by using what people consider to be “big data” approaches, the project was completed in much less time and at much lower cost.

This is exactly why we need a different term to describe what companies want to achieve right now, namely completing more data projects quickly and easily. As a result, we believe “agile data” is a much more appropriate term. Our industry has already adopted agile development methodologies for their benefits of efficiency, flexibility, and improved agility. So why not apply the same concept to data?

What Does Agile Data Mean?

First, allow me to define what I mean by Agile Data while avoiding specific technical terms, since tying the definition to particular technologies would only set the term up for eventual disposal the same way “big data” was.

Agile Data is the ability to quickly and easily:

  • Add new data processing, machine learning, and data analytics use cases in support of rapidly evolving business models and initiatives, regardless of data volume, variety or velocity
  • Promote data pipelines from development into production, potentially across multiple execution environments (on-prem, cloud, Hadoop, Spark, etc.)
  • Handle changes in upstream data sources with minimal or no manual intervention
  • Iterate your data engineering, machine learning, and analytics pipelines
  • Manage the operational environment for large numbers of data processing and analytics use cases

Here is some more commentary and thinking on each of the bullets above:

On “Agile Data is the ability to quickly and easily”: The words “quickly and easily” are emphasized because the core of the value of “agility” is baked into those two words.
On adding new use cases regardless of volume, variety or velocity: Agile data is only useful in the context of supporting some business objective, which is why this particular bullet is listed first. The focus here isn’t on the volume, variety or velocity of the data itself, but on the volume of use cases an organization can execute. “Big data” use cases do still need to be supported, but whether or not a given analytic use case is “big data” is of secondary importance.
On promoting data pipelines from development into production across multiple execution environments: A growing number of organizations are implementing a combination of data infrastructures, both on-premises and in the cloud, and the underlying technologies of these environments are continually evolving. To support an agile data approach, organizations need to be able to change the underlying technology they use to process and analyze data as that technology changes. This means the logical instruction set for how data is ingested, transformed, prepped and queried must be portable to different execution environments without re-coding the business logic. That is the only way organizations will remain “agile” as data fabrics continue to evolve from Hadoop/MapReduce to Spark to serverless and on to technologies that have yet to be developed.
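To make the portability idea concrete, here is a minimal sketch in Python. Everything in it, the step list, the column names, and the two runner functions, is an assumption invented for illustration rather than any particular product’s interface; the point is simply that the logic is defined once and executed by whichever engine a given environment provides.

```python
# A minimal, illustrative sketch (not any specific product's API): the business
# logic lives in an engine-agnostic step list, and only thin "runner" functions
# know about pandas or Spark, so promoting the pipeline to a new execution
# environment does not require re-coding the logic itself.
import pandas as pd

PIPELINE = [
    {"op": "filter", "column": "spend_usd", "predicate": "> 0"},
    {"op": "derive", "column": "spend_k",   "expr": "spend_usd / 1000"},
    {"op": "select", "columns": ["property", "spend_k"]},
]

def run_on_pandas(df, steps):
    """Execute the logical steps on a local pandas DataFrame (dev / small data)."""
    for step in steps:
        if step["op"] == "filter":
            df = df.query(f'{step["column"]} {step["predicate"]}')
        elif step["op"] == "derive":
            df = df.assign(**{step["column"]: df.eval(step["expr"])})
        elif step["op"] == "select":
            df = df[step["columns"]]
    return df

def run_on_spark(df, steps):
    """Execute the same logical steps on a Spark DataFrame (production / large data)."""
    from pyspark.sql import functions as F  # imported lazily; only needed where Spark runs
    for step in steps:
        if step["op"] == "filter":
            df = df.filter(f'{step["column"]} {step["predicate"]}')
        elif step["op"] == "derive":
            df = df.withColumn(step["column"], F.expr(step["expr"]))
        elif step["op"] == "select":
            df = df.select(*step["columns"])
    return df

# In development the pipeline runs locally; in production the same PIPELINE
# definition would be handed to run_on_spark instead.
frame = pd.DataFrame({"property": ["A", "B"], "spend_usd": [1200.0, 0.0]})
print(run_on_pandas(frame, PIPELINE))
```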
On handling upstream data source changes with minimal or no manual intervention: One problem with legacy ETL and traditional EDW architectures is that when a new column gets added to an upstream data source, the data pipelines break and have to be re-coded. To remain agile, data pipelines and analytics engines need to handle at least some upstream changes, such as a column being added or deleted, automatically.
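As a small illustration of what handling such changes automatically could look like, here is a hedged sketch in Python; the expected column list and the conform() helper are hypothetical names made up for this example, not part of any specific tool.

```python
# A hedged sketch of schema-drift tolerance; EXPECTED_COLUMNS and conform() are
# names invented for this example. Incoming records are aligned to the expected
# schema instead of letting an added or dropped upstream column break the pipeline.

EXPECTED_COLUMNS = ["property", "category", "spend_usd"]

def conform(record, expected=EXPECTED_COLUMNS, keep_new_columns=True):
    """Align one upstream record to the expected schema.

    Missing columns are filled with None rather than raising an error, and
    unexpected new columns are either carried along or dropped by policy.
    """
    conformed = {col: record.get(col) for col in expected}
    if keep_new_columns:
        conformed.update({k: v for k, v in record.items() if k not in expected})
    return conformed

# Example: the upstream feed dropped "category" and added "region".
upstream_row = {"property": "A", "spend_usd": 1200.0, "region": "EMEA"}
print(conform(upstream_row))
# {'property': 'A', 'category': None, 'spend_usd': 1200.0, 'region': 'EMEA'}
```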
On iterating your data engineering, machine learning and analytics pipelines: Agile programming is inherently an iterative process, and the same should be true for Agile Data. As business logic changes, it should be easy to evolve and iterate the data pipelines that support that logic. This is somewhat redundant with the prior two points, but it is included here for further clarification.
On managing the operational environment for large numbers of use cases: This last point is extremely important. Many organizations focus strictly on agility in developing data pipelines but don’t concern themselves with supporting those pipelines in production. Eventually, they find their developers spending all of their time keeping the initial data pipelines running in production, with no time left to add new ones. To scale the total number of data processing, machine learning and analytics use cases, agile operationalization of data pipelines, sometimes referred to as “DataOps”, is as critical as the development process.
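To round out the picture, here is a rough sketch of that operational uniformity in Python; the registry, retry policy, and alerting hook are assumptions for illustration only, not a description of any specific DataOps product.

```python
# A rough sketch of uniform operationalization; the registry, retry count, and
# logging calls are illustrative assumptions. Pipelines are registered once and
# then run, retried, and monitored the same way, so the fiftieth pipeline costs
# no more to operate than the first.
import logging

logging.basicConfig(level=logging.INFO)
PIPELINES = {}

def register(name):
    """Decorator that adds a pipeline function to the shared registry."""
    def wrap(fn):
        PIPELINES[name] = fn
        return fn
    return wrap

def run_all(max_retries=2):
    """Run every registered pipeline with uniform retry and alerting behavior."""
    for name, fn in PIPELINES.items():
        for attempt in range(1, max_retries + 1):
            try:
                fn()
                logging.info("pipeline %s succeeded", name)
                break
            except Exception as exc:
                logging.warning("pipeline %s failed (attempt %d): %s", name, attempt, exc)
        else:
            logging.error("pipeline %s exhausted retries; alert the on-call engineer", name)

@register("it_spend_rollup")
def it_spend_rollup():
    pass  # the ingest / transform / publish steps for one use case would go here

run_all()
```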

As we all settle into 2019, let’s use the beginning of the year to expand our vocabulary and start using “Agile Data” in place of “big data” as a broader, more inclusive and much more powerful term.

 

About this Author
Todd Goldman
Todd is the VP of Marketing and a Silicon Valley veteran with more than 20 years of experience in marketing and general management. Prior to Infoworks, Todd was the CMO of Waterline Data and COO at Bina Technologies (acquired by Roche Sequencing). Before Bina, Todd was Vice President and General Manager for Enterprise Data Integration at Informatica, where he was responsible for their $200MM PowerCenter software product line. Todd has also held marketing and leadership roles at both start-ups and large organizations including Nlyte, Exeros (acquired by IBM), ScaleMP, Netscape/AOL and HP.