As the title of this post notes, there is no Gartner Magic Quadrant for DataOps. However, this is a space of growing interest. In fact, last year I wrote a blog post about the five characteristics that define DataOps. Almost 12 months have passed since that post, so it is time to revisit those characteristics and update them to take into account new factors like:
With that as background, the core definition of DataOps that appeared last year has not really changed:
The main characteristic of DataOps is to strongly advocate automation and monitoring at all steps of data pipeline construction, from data integration, testing, releasing to deployment and infrastructure management. DataOps aims at shorter development cycles, increased deployment frequency, and more dependable releases of data pipelines, in close alignment with business objectives.
Note that DataOps isn’t just technology. It is also a process for connecting business needs for data analytics with the rapid creation and ongoing management of data pipelines in support of those needs. That said, there are technologies and capabilities you should look for in a DataOps platform that can support the creation of effective DataOps processes, including the following concepts.
In last year’s blog post, we talked about the importance of moving from hand coding to automating as much as possible of the work of data ingestion, transformation, data quality and generation of in-memory models as part of a true end-to-end data pipeline for business intelligence or machine learning. Automation of the development process has become even more critical in the last year as organizations move towards hybrid cloud/on-premises and multi-cloud environments. In addition, they are deploying a wide variety of distributed compute engines to run the pipelines, including Hadoop, Spark and serverless. The implication is that the logic of a data pipeline needs to be portable from one cloud to another and from one data engine and storage layer to another. In an ideal world, the logic of the data pipeline could be moved from one environment to another without recoding, and would be automatically “recompiled” to run optimally in the new environment.
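To make the portability idea concrete, here is a minimal sketch, not how any particular product does it. It assumes a toy pipeline made of two hypothetical step types (`Filter` and `Select`) and shows the same logical definition being "recompiled" for two targets: a SQL string for a SQL engine and a plain in-memory run. All names here are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Filter:
    """Keep rows where `column` is at least `min_value` (hypothetical step type)."""
    column: str
    min_value: float


@dataclass(frozen=True)
class Select:
    """Keep only the named columns (hypothetical step type)."""
    columns: Tuple[str, ...]


def to_sql(table: str, steps: list) -> str:
    """Render the logical steps as one SQL statement.

    Toy assumption: filters are declared before the projection,
    so step order does not change the result.
    """
    columns, predicates = "*", []
    for step in steps:
        if isinstance(step, Filter):
            predicates.append(f"{step.column} >= {step.min_value}")
        elif isinstance(step, Select):
            columns = ", ".join(step.columns)
    sql = f"SELECT {columns} FROM {table}"
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    return sql


def run_in_memory(rows: List[Dict], steps: list) -> List[Dict]:
    """Execute the same logical steps directly on a list of dict rows."""
    for step in steps:
        if isinstance(step, Filter):
            rows = [r for r in rows if r[step.column] >= step.min_value]
        elif isinstance(step, Select):
            rows = [{c: r[c] for c in step.columns} for r in rows]
    return rows


# One logical definition, two execution targets.
steps = [Filter("amount", 100), Select(("id", "amount"))]
print(to_sql("orders", steps))
# SELECT id, amount FROM orders WHERE amount >= 100
print(run_in_memory([{"id": 1, "amount": 50}, {"id": 2, "amount": 150}], steps))
# [{'id': 2, 'amount': 150}]
```

The point of the design is that the pipeline author only touches `steps`; moving to a new engine means adding another renderer, not rewriting the pipeline.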
This point is partially an extension of the portability of pipelines noted above. But the important addition here is the concept that data pipelines should be built so that they can be easily operationalized. This means you need the ability to take a raw data pipeline that contains only the basic logic for moving data and extend it into a more complete, production-quality workflow that adds error handling. The framework must also support teams of people who work on workflows and move them from dev to test to production, with different people taking responsibility for different aspects of hardening a data pipeline. This is necessary if you want to take what a data scientist creates, which is a pipeline that runs in a development sandbox, and get it ready to run continuously in an enterprise-class production environment.
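As a rough illustration of what "adding error handling" to a raw pipeline can look like, here is a small sketch that wraps an unmodified pipeline step with retries and logging. The helper name `with_retries` and its parameters are assumptions made for this example, not part of any specific framework.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("pipeline")


def with_retries(step: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Run `step`, retrying on any exception up to `attempts` times.

    The raw step stays untouched; the hardening lives in this wrapper.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as error:
            last_error = error
            log.warning("attempt %d/%d failed: %s", attempt, attempts, error)
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: surface one clear error, keeping the original cause.
    raise RuntimeError(f"step failed after {attempts} attempts") from last_error
```

A data scientist's sandbox function can then be promoted with something like `with_retries(load_orders, attempts=5, delay_seconds=30)`, where `load_orders` stands in for whatever the raw step is.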
As with DevOps, it isn’t enough just to get your data pipeline running. You have to keep it running by monitoring the ongoing operational flow of data, identifying problems and remediating issues before they escalate and affect business operations. This means you need to be able to start, stop and pause data pipelines, avoid pipeline “collisions” and alert users when there are issues.
Also note that while this is listed as a separate capability, it really is a capability that should be integrated at development time. Adding monitoring code after the fact is much harder than instrumenting your data pipelines to be monitored and managed during the development process. So this point ties back to the automation that happens in point 1 above when you first build your data pipelines.
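The start, stop and pause controls, collision avoidance and alerting described in this section can be sketched in a few lines. This is a simplified, single-process illustration under my own assumptions: a "collision" is taken to mean two pipelines writing to the same target, and `alert` is any callable that delivers a message. The class and method names are hypothetical.

```python
from enum import Enum
from typing import Callable, Dict


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"


class PipelineController:
    """Tracks pipeline state and blocks two pipelines from writing to one target."""

    def __init__(self, alert: Callable[[str], None]):
        self._alert = alert
        self._states: Dict[str, State] = {}   # pipeline name -> state
        self._targets: Dict[str, str] = {}    # target -> pipeline holding it

    def start(self, name: str, target: str) -> bool:
        owner = self._targets.get(target)
        if owner is not None and owner != name:
            # Collision: another pipeline already holds this target.
            self._alert(f"collision: {name} blocked, {owner} is writing to {target}")
            return False
        self._targets[target] = name
        self._states[name] = State.RUNNING
        return True

    def pause(self, name: str) -> None:
        # A paused pipeline keeps its target so it can resume safely via start().
        if self._states.get(name) is State.RUNNING:
            self._states[name] = State.PAUSED

    def stop(self, name: str) -> None:
        self._states[name] = State.STOPPED
        # Release every target held by this pipeline.
        self._targets = {t: o for t, o in self._targets.items() if o != name}

    def state(self, name: str) -> State:
        return self._states.get(name, State.STOPPED)
```

In practice the alert callable might post to a chat channel or paging system; passing it in keeps the controller independent of how users are notified.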
Even in relatively small organizations, there are different people with different roles. The data scientists identify algorithms that can be used to automate decision making and use data for competitive advantage. The data engineers take those algorithms and build data pipelines that will run on a regular basis, and the DataOps engineers make sure that those pipelines run smoothly. Team-based development demands that data engineering development platforms allow users to share data sets, data pipelines and workflows with one another.
For years, data management solutions like ETL and MDM software have had to deal with issues around the governance of data, so these concepts are not new. Organizations have long been dealing with questions about how to manage and govern data, like:
As we noted last time, DataOps IS NOT responsible for the governance of the data, but it IS responsible for implementing the systems so the business can properly govern the data. As a result, capabilities like data lineage should be automatically generated when the pipelines are built. The team-based development environment noted above has to be created in a manner that controls who has access to sensitive data sets as well as who has access to data pipelines built using those data sets. Finally, a well-governed process must also track which data engineers make changes to any aspect of a data pipeline, keeping an audit trail that can be reviewed.
Once again, this all puts pressure on the development platform used for building these pipelines in the first place. But one thing we know for sure based on the past 20 years of experience in data management is that it is much easier to build governance into the data pipeline during the development process than it is to bolt it on after the fact.
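Here is one way to picture governance being built in at development time rather than bolted on. In this sketch, simply registering a pipeline records its lineage, checks access to the input data sets and writes an audit entry. The `GovernedCatalog` class and its methods are hypothetical names for illustration only, and a real platform would persist this information rather than keep it in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass
class GovernedCatalog:
    lineage: Dict[str, List[str]] = field(default_factory=dict)    # output -> inputs
    permissions: Dict[str, Set[str]] = field(default_factory=dict)  # data set -> users
    audit_log: List[dict] = field(default_factory=list)

    def grant(self, user: str, dataset: str) -> None:
        self.permissions.setdefault(dataset, set()).add(user)

    def _audit(self, user: str, action: str, pipeline: str) -> None:
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user, "action": action, "pipeline": pipeline,
        })

    def register_pipeline(self, user: str, name: str, inputs: List[str], output: str) -> None:
        """Record lineage and an audit entry as a side effect of building a pipeline."""
        denied = [d for d in inputs if user not in self.permissions.get(d, set())]
        if denied:
            self._audit(user, "denied", name)
            raise PermissionError(f"{user} may not read {denied}")
        self.lineage[output] = list(inputs)
        self.grant(user, output)  # assumption: the builder may read what they built
        self._audit(user, "registered", name)

    def upstream(self, dataset: str) -> Set[str]:
        """Return every data set that `dataset` is derived from, directly or indirectly."""
        seen: Set[str] = set()
        stack = list(self.lineage.get(dataset, []))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.lineage.get(current, []))
        return seen
```

Because lineage and audit records are produced by the same call that creates the pipeline, nobody has to remember to document them afterwards.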
As with last year’s post on this subject, this is not the final word: DataOps continues to evolve, and we are also learning more about what it means to provide a DataOps platform. As we continue to receive feedback and successfully deploy more customers, we will refine and grow our definition to include what I’m sure are some missing key concepts.
In the meantime, let’s hope that the world of DataOps tools and agile data engineering continues to grow as organizations become more interested in accelerating the pace at which they drive value and competitive advantage from their data assets and data analytics use cases.
Learn how Infoworks.io data engineering and DataOps software products address the challenges discussed in this blog post.