Last year, Julien Ereth wrote a great article titled “DevOps in Business Intelligence and Analytics,” in which he explored applying DevOps to the data world with the idea of driving an agile data engineering process.
I liked Julien’s article, but after recently re-reading it, I think the concept should be taken even further and given a term coined specifically for the data community: DataOps. The term was first brought up by Andy Palmer of Tamr, but it didn’t seem to catch on. Perhaps it was too far ahead of its time.
Time, it seems, has finally caught up.
Before I get into the DataOps definition, let’s start with DevOps. For those of you who don’t know what DevOps is, part of the definition from Wikipedia reads as follows:
The main characteristic of the DevOps movement is to strongly advocate automation and monitoring at all steps of software construction, from integration, testing, releasing to deployment and infrastructure management. DevOps aims at shorter development cycles, increased deployment frequency, and more dependable releases, in close alignment with business objectives.
So, what is DataOps for agile data engineering?
I won’t refer to the Wikipedia definition of DataOps because, as of this writing, I don’t think it is as good as the DevOps definition. And, frankly, it should have better leveraged the DevOps concepts. So as a simple exercise, if I start by replacing the software development aspects of the definition above with “data,” we get a pretty good initial definition for DataOps:
The main characteristic of DataOps is to strongly advocate automation and monitoring at all steps of data pipeline construction, from data integration, testing, releasing to deployment and infrastructure management. DataOps aims at shorter development cycles, increased deployment frequency, and more dependable releases of data pipelines, in close alignment with business objectives.
However, this definition is still missing some critical concepts that are unique to managing data. In the data world, we have to worry not just about the development of the pipelines, but also about the management of the data itself from the perspective of data quality, data security and data governance.
Since it is difficult to reduce the concept of DataOps to a single sentence, it’s better to understand it through these five key characteristics.
A critical capability of DataOps has to be moving from hand-coding data pipelines to automating as much of that process as possible. This should range from ingesting data into the data lake and transforming the data to implementing data pipelines for AI and ML and generating in-memory models.
A significant part of the DataOps job is properly promoting successful data experiments from the sandbox into full production and continuously integrating new pipelines into the production environment.
Automating this process is critical for DataOps because, if it isn’t automated, the organization is limited in the number of data pipelines it can run. It will end up spending more time moving pipelines from dev to prod than creating new value-added pipelines.
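To make the idea of automated promotion a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `promote` function, the `clean_orders` experiment, the check names and the in-memory `production_pipelines` registry are hypothetical, and a real setup would sit on top of whatever orchestration and deployment tooling the organization already uses.

```python
from typing import Callable, Dict, List

Record = dict
Pipeline = Callable[[List[Record]], List[Record]]
Check = Callable[[List[Record]], bool]


def promote(name: str, pipeline: Pipeline, sample: List[Record],
            checks: Dict[str, Check], production: Dict[str, Pipeline]) -> List[str]:
    """Run the candidate on sample data and register it only if every check passes.

    Returns the names of the failed checks; an empty list means it was promoted.
    """
    output = pipeline(sample)
    failures = [check_name for check_name, check in checks.items() if not check(output)]
    if not failures:
        production[name] = pipeline
    return failures


# Hypothetical sandbox experiment: drop orders that have no amount.
def clean_orders(rows: List[Record]) -> List[Record]:
    return [row for row in rows if row.get("amount") is not None]


sample_rows = [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": None}]
gate_checks = {
    "output_not_empty": lambda rows: len(rows) > 0,
    "no_missing_amounts": lambda rows: all(row["amount"] is not None for row in rows),
}

production_pipelines: Dict[str, Pipeline] = {}
promotion_failures = promote(
    "clean_orders", clean_orders, sample_rows, gate_checks, production_pipelines
)
```

The point of the sketch is that the gate between sandbox and production is code, not a manual hand-off, so the number of pipelines an organization can promote is not capped by how many people are available to move them.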
As with DevOps, it isn’t enough just to get your data pipeline running. You have to keep it running by monitoring the ongoing operational flow of data, identifying problems and remediating issues before they escalate and affect business operations.
One area of DataOps that has no direct analogue in DevOps is governance and security. Data management has had to deal with issues around the governance of data for years.
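As a rough illustration of what monitoring the flow of data can mean in practice, the sketch below flags pipelines whose last run is too old or wrote too few rows. The `PipelineRun` record, the `find_problems` function and the thresholds are all assumptions made up for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class PipelineRun:
    pipeline: str
    finished_at: datetime
    rows_written: int


def find_problems(runs: List[PipelineRun], now: datetime,
                  max_age: timedelta, min_rows: int) -> List[str]:
    """Return one human-readable alert per freshness or volume problem."""
    problems = []
    for run in runs:
        if now - run.finished_at > max_age:
            problems.append(f"{run.pipeline}: last run is stale")
        if run.rows_written < min_rows:
            problems.append(f"{run.pipeline}: only {run.rows_written} rows written")
    return problems


# Hypothetical run history; in practice this would come from run metadata.
now = datetime(2024, 1, 2, 12, 0)
latest_runs = [
    PipelineRun("orders", datetime(2024, 1, 2, 11, 30), 5000),
    PipelineRun("customers", datetime(2024, 1, 1, 6, 0), 0),
]

alerts = find_problems(latest_runs, now, max_age=timedelta(hours=6), min_rows=1)
```

Here the `orders` pipeline raises no alert, while `customers` is flagged twice: once for being stale and once for writing no rows. Checks like these are what let a team catch a problem before the business does.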
How do you manage the lifecycle of the data?
When should it be retired?
What are the definitions of the data?
Who should have access to data?
And so on. DataOps IS NOT responsible for the governance of the data, but it IS responsible for implementing the systems so the business can properly govern the data.
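One way to picture this split of responsibilities is that the business supplies the policy as data and DataOps supplies the code that applies it. The sketch below is only an assumption about how that might look: `DatasetPolicy`, `can_read`, `is_expired` and the `customer_emails` policy are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import FrozenSet


@dataclass(frozen=True)
class DatasetPolicy:
    """Governance decisions owned by the business, expressed as data."""
    allowed_roles: FrozenSet[str]
    retention_days: int


def can_read(policy: DatasetPolicy, role: str) -> bool:
    """Answer 'who should have access to the data?' from the policy."""
    return role in policy.allowed_roles


def is_expired(policy: DatasetPolicy, created_on: date, today: date) -> bool:
    """Answer 'when should it be retired?' from the policy."""
    return today - created_on > timedelta(days=policy.retention_days)


# Hypothetical policy table maintained by data owners, not by the DataOps team.
POLICIES = {
    "customer_emails": DatasetPolicy(
        allowed_roles=frozenset({"marketing", "support"}),
        retention_days=365,
    ),
}

email_policy = POLICIES["customer_emails"]
support_can_read = can_read(email_policy, "support")
old_extract_expired = is_expired(email_policy, date(2022, 1, 1), date(2024, 1, 1))
```

The DataOps team never decides who belongs in `allowed_roles` or how long the retention period should be. It only makes sure those decisions are actually enforced.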
DataOps IS NOT responsible for defining data quality and what determines good quality or good enough quality data for different aspects of decision making within a business. It IS responsible for implementing the infrastructure and automation systems that will enforce data quality rules defined by the business.
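The same division of labor can be sketched for data quality. In this minimal example the rules stand in for definitions handed over by the business, and the `enforce` function is the part DataOps would own. The rule names, the `QualityRule` class and the sample records are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Record = dict


@dataclass(frozen=True)
class QualityRule:
    name: str
    check: Callable[[Record], bool]


def enforce(records: Iterable[Record],
            rules: List[QualityRule]) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into those passing every rule and those failing at least one."""
    passed, rejected = [], []
    for record in records:
        failed = [rule.name for rule in rules if not rule.check(record)]
        if failed:
            rejected.append((record, failed))
        else:
            passed.append(record)
    return passed, rejected


# Hypothetical rules, standing in for definitions supplied by the business.
RULES = [
    QualityRule("customer_id_present", lambda r: bool(r.get("customer_id"))),
    QualityRule(
        "amount_non_negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]

incoming = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "", "amount": 5.0},
    {"customer_id": "c3", "amount": -2.0},
]

good_records, bad_records = enforce(incoming, RULES)
```

Rejected records keep the names of the rules they failed, so the business can see why data was held back and adjust its definitions without touching the enforcement code.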
Bear in mind, this blog post is a humble first cut at refining the DataOps definition. As we receive feedback, we will refine and grow our definition to include what I’m sure are some missing key concepts.
In the meantime, let’s hope that the world of DataOps tools and agile data engineering continues to grow as organizations become more interested in driving value from their data assets.