
Extending Agile (and DataOps) to Big Data

Posted by Todd Goldman

“I am going to describe my personal views about managing large software developments.” With that sentence, Winston Royce opened his article, “Managing the Development of Large Software Systems,” published nearly 50 years ago. His paper is best remembered as the first formal description of the waterfall model, yet Royce himself warned that a strict single-pass approach invited failure and argued for iterative development. Thirty years later, a group of software developers officially gave that iterative style of engineering the name “Agile” in their “Agile Manifesto”. What they came up with frames the Agile methodologies that now help software organizations deliver higher-quality products, respond faster to change, reduce risk and realize ROI sooner.

This then gave rise to DevOps, an extension of the Agile method focused on improving communication and collaboration between development (the ‘dev’ in DevOps) and IT operations (‘ops’), with the aim of accelerating the systems development and deployment lifecycle. A core component of DevOps, automation, is also a key enabler of DataOps, an even newer process-oriented methodology that shares many tenets of DevOps. While DevOps brings application development and IT together, DataOps unites data analytics and data engineering teams with IT operations for faster time to deployment of data analytics use cases, workflows and pipelines.

DataOps covers the entire data lifecycle: the Agile data engineering processes used to develop analytics use cases, plus the ongoing scaling, monitoring and verification of the resulting data pipelines and workflows after they have been put into operation. The idea is that as you develop new data analytics pipelines and workflows, operational concepts like monitoring, lineage and governance are built in from the start. That way, pipelines built using an agile data engineering methodology are also ready to be put into production and scaled for broader use.
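To make “built in” concrete, here is a minimal Python sketch of a pipeline step that emits monitoring metrics and records lineage as part of its normal execution. The `Step` wrapper and its fields are hypothetical illustrations, not the API of any particular tool:

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Callable, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

@dataclass
class Step:
    """One pipeline step; monitoring and lineage come from the framework."""
    name: str
    fn: Callable[[list], list]
    upstream: List[str] = field(default_factory=list)  # lineage: where input comes from

    def run(self, data: list) -> list:
        start = time.monotonic()
        out = self.fn(data)
        elapsed = time.monotonic() - start
        # Every run emits row counts and timing, so the pipeline is
        # monitorable in production without extra developer effort.
        log.info("step=%s rows_in=%d rows_out=%d secs=%.3f upstream=%s",
                 self.name, len(data), len(out), elapsed, self.upstream)
        return out

# Lineage is declared while the pipeline is being developed, not
# reconstructed after it ships.
clean = Step("clean", lambda rows: [r for r in rows if r is not None])
dedupe = Step("dedupe", lambda rows: list(dict.fromkeys(rows)), upstream=["clean"])

result = dedupe.run(clean.run(["a", "b", "b", None, "c"]))
print(result)  # ['a', 'b', 'c']
```

Because each step declares its upstream dependencies, lineage exists from the first development iteration, and the same log stream can feed production monitoring once the pipeline is deployed.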

Note that it is possible to be Agile without being easy to operationalize. This is what happens when you implement Agile without automating the DataOps process along with it. The consequence is that although your development agility lets you crank out new data analytics use cases, over time you find more and more of your development effort going to “break-fixing” existing data analytics pipelines instead of building new ones.

This is where combining Agile with DataOps comes in. Bringing Agile and DataOps automation to big data means eliminating unnecessary time, expense and complexity. Data teams can add new use cases in days instead of months, moving relevant data to the front lines now, not later. Data-driven organizations can innovate and compete more efficiently and effectively. This is why, over the past couple of years, you’ve heard talk about bringing the power of Agile to big data. But up until now, it’s been just that: talk.

So why hasn’t Agile been applied to data before? It has… only not very well. The difference is that in the past few years, automation, along with the ability to leverage machine learning, has evolved to the point where it can truly enable Agile and DataOps for big data. Until recently, setting up the movement and transformation of data still involved manual steps that interrupted what little automation did exist and made the process time-consuming, rigid and brittle.

For example, updating your data warehouse or relational data sources tends to be cumbersome and time-consuming, requiring up-front planning to ensure good performance. Simply adding a column to the data warehouse often triggers long discussions and debates about where that column belongs in the schema. Meanwhile, the assumptions behind the project’s original value may change. Decisions in this environment are made in a way that doesn’t allow ad hoc or iterative experiments to see what makes the most sense. As a result, your data operation remains a large ship in competitive markets that require a speedboat to survive.
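To see why this is so rigid, consider a minimal Python sketch (the table, columns and records here are hypothetical): a schema-first load has nowhere to put a new attribute until the schema itself changes, while a schema-on-read approach picks the new field up immediately:

```python
import json

# Schema-first: the warehouse table has a fixed set of columns, so a new
# attribute has nowhere to land until the schema itself is changed.
WAREHOUSE_COLUMNS = ("customer_id", "region")

def to_warehouse_row(record: dict) -> tuple:
    extra = set(record) - set(WAREHOUSE_COLUMNS)
    if extra:
        # In practice, this is where the "where does the column go?"
        # debate begins.
        raise ValueError(f"no column for new fields: {sorted(extra)}")
    return tuple(record[c] for c in WAREHOUSE_COLUMNS)

# Schema-on-read: the raw record keeps every field, and consumers can use
# (or ignore) new attributes immediately. This is what makes ad hoc,
# iterative experiments cheap.
raw = '{"customer_id": 7, "region": "EMEA", "loyalty_tier": "gold"}'
record = json.loads(raw)
print(record.get("loyalty_tier", "unknown"))  # "gold", no schema change needed
# to_warehouse_row(record)  # raises ValueError until the schema is updated
```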

Faster experimentation and iteration like this is one of the big advantages of some of the newer architectures. In addition, most of the steps involved in building data pipelines in support of analytics are now being automated from end to end. Through automation, organizations get not just self-service but fully governed self-service, because governance is built into the data pipelines at development time rather than bolted on after the fact. IT still maintains control and can ensure regulatory compliance, yet users get the self-service and agility they need to move data throughout the organization to wherever it’s needed, all in an automatically governed way.
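As a simplified illustration of governance being built in at development time, here is a small Python sketch; the `governed` decorator and the restricted-column policy are hypothetical stand-ins for what a real platform would pull from a central catalog:

```python
from functools import wraps

# Hypothetical policy: in a real platform this list would come from a
# central governance catalog maintained by IT, not a module constant.
RESTRICTED_COLUMNS = {"ssn", "salary"}

def governed(build_fn):
    """Reject pipeline definitions that request restricted columns.

    The check runs when the pipeline is built, so compliance is enforced
    during development instead of audited after deployment.
    """
    @wraps(build_fn)
    def wrapper(columns, *args, **kwargs):
        violations = RESTRICTED_COLUMNS & set(columns)
        if violations:
            raise PermissionError(f"restricted columns requested: {sorted(violations)}")
        return build_fn(columns, *args, **kwargs)
    return wrapper

@governed
def build_extract(columns):
    # Placeholder for real pipeline construction (e.g. generating a query).
    return "SELECT " + ", ".join(columns) + " FROM customers"

print(build_extract(["customer_id", "region"]))  # allowed: self-service proceeds
# build_extract(["customer_id", "ssn"])          # blocked: raises PermissionError
```

The point of the design is that the policy check runs when the pipeline is defined, so a non-compliant pipeline never reaches production in the first place.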

Now, organizations that adopt today’s automation capabilities aren’t just building data pipelines faster, and they aren’t just doing it with governance and operationalization in place. They’re doing it with the agility to scale the number of analytics use cases they support and to change their data pipelines quickly when new insights or needs dictate change.

DataOps and Agile Data Engineering, what took you so long? Never mind. We’re just happy to see you’ve finally arrived.
