In last week's blog, we introduced the concept of Enterprise Data Operations and Orchestration (EDO2) and provided a definition and an overview of what EDO2 is all about. As a reminder, the short definition that we presented was:
Enterprise Data Operations and Orchestration (EDO2) refers to the systems and processes that enable businesses to organize and manage data from disparate sources and process the data for delivery to analytic applications.
This week we will discuss the benefits of EDO2 and dig more into the details of the capabilities of an EDO2 system.
An EDO2 system includes core capabilities that automate the essential functions required to build and manage data pipelines from source to consumption, independent of the underlying compute and data storage system. Because developing these pipelines with modern distributed computing technologies like Hadoop and Spark tends to be complex, EDO2 systems provide code-free development paradigms that reduce the complexity of developing and managing data pipelines.
The core capabilities of an EDO2 system are:
Self-Service End-to-End Data Pipeline Development: The core requirement of an EDO2 system is the ability to build data pipelines that connect to data sources of all types, ingest the data into a computing platform for processing, integrate and transform data, apply machine learning and AI algorithms, and then prepare the data for high-speed consumption by business intelligence and data visualization tools. Data pipelines must be able to run batch, incremental and streaming data flows. What distinguishes EDO2 software from typical data integration software is that it combines historically separate data pipeline development components (batch vs. streaming, for example) into a single unified system for the user.
Pipeline development in modern EDO2 systems requires little to no hand-coding and uses automation to hide from the developer the underlying complexity of the distributed computing engines (e.g. Hadoop, Spark) used to process the data. Modern EDO2 systems provide self-service graphical visualization to aid in the development of data pipelines by non-experts.
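To make the idea of unifying batch and streaming concrete, here is a minimal Python sketch. It is only an illustration, not how any particular EDO2 product works: the function name enrich and the price and qty fields are assumptions made up for this example. The point is that one transformation definition, written against a generic iterable of records, can serve both a bounded batch and a record-at-a-time source.

```python
from typing import Dict, Iterable, Iterator


def enrich(records: Iterable[Dict]) -> Iterator[Dict]:
    """One transformation definition, written against any iterable of records."""
    for rec in records:
        # Add a derived field; the input record is left unmodified.
        yield {**rec, "total": rec["price"] * rec["qty"]}


# Batch mode: the whole data set is available up front.
batch = [{"price": 2.0, "qty": 3}, {"price": 1.5, "qty": 4}]
batch_result = list(enrich(batch))


def stream() -> Iterator[Dict]:
    """Stand-in for a streaming source; yields records one at a time."""
    yield {"price": 2.0, "qty": 3}
    yield {"price": 1.5, "qty": 4}


# Streaming mode: the same function consumes records as they arrive.
stream_result = list(enrich(stream()))
```

Both calls run the same logic, so the two results are identical. A real system would also have to deal with windowing, late data and state, which this sketch ignores.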
Data Pipeline Operationalization: Traditional data prep tools separate the development of pipelines from the process of placing those pipelines into full production. In fact, the hardening of experimental pipelines as they move from dev to test to production is an exercise that is most often left to the user. EDO2 systems are intended to standardize these processes and provide automatic promotion of data pipelines from dev to test to production. EDO2 systems also provide workflow management capabilities that take raw data pipelines, which contain only the basic logic for moving and transforming data, and extend them into complete production-quality workflows that include monitoring, error handling, fault notification and the ability to restart failed jobs.
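As a rough illustration of what "hardening" a raw pipeline can mean, the Python sketch below wraps a pipeline step with retries, error handling and fault notification. Everything here is hypothetical: run_with_hardening, flaky_load and the notify callback are names invented for this example, and a production system would add backoff, logging and checkpointing.

```python
from typing import Callable, List


def run_with_hardening(
    step: Callable[[], str],
    max_attempts: int,
    notify: Callable[[str], None],
) -> str:
    """Run a raw pipeline step, restarting it on failure and reporting each fault."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:  # broad on purpose: any failure triggers a restart
            last_error = exc
            notify(f"attempt {attempt} failed: {exc}")
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error


# A made-up step that fails twice before succeeding.
calls = {"n": 0}


def flaky_load() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("source unavailable")
    return "loaded"


alerts: List[str] = []
result = run_with_hardening(flaky_load, max_attempts=5, notify=alerts.append)
```

The raw step contains only the data movement logic; the wrapper supplies the production concerns, which is the separation the paragraph above describes.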
Data Pipeline Orchestration: Orchestration is a complex technique that coordinates multiple data pipelines that have been placed into operation. It captures how pipeline processes relate to one another and manages them together so that data flows do not collide or interfere with one another. In modern data environments that operate on-premises and across multiple cloud environments, EDO2 systems must be able to orchestrate data pipelines that flow across multiple environments.
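One small piece of orchestration, ordering pipelines by their dependencies, can be sketched with Python's standard graphlib module (Python 3.9 or later). The pipeline names below are invented for illustration, and this covers only ordering; it does not address cross-environment scheduling or resource contention.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipelines, each mapped to the pipelines it depends on.
dependencies: Dict[str, Set[str]] = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "join_orders_customers": {"ingest_orders", "ingest_customers"},
    "build_sales_cube": {"join_orders_customers"},
}


def execution_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group pipelines into waves; every pipeline in a wave has all dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
```

Pipelines in the same wave could run in parallel, while later waves wait for earlier ones, which is one simple way to keep dependent data flows from interfering with each other.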
Team-Based Development: Data pipeline development tools are typically built for a single user. EDO2 systems require that users can share data sets, data pipelines and workflows with each other. The technical core of team development is a data catalog where all users, with proper access rights, can easily search for and find information about pre-existing data sources, data pipelines and workflows, and subsequently derive more value from their present investments.
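A catalog search that respects access rights might look roughly like the Python sketch below. The CatalogEntry fields, the group names and the sample entries are all assumptions made for this example rather than the schema of any real catalog.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    kind: str  # "dataset", "pipeline" or "workflow"
    description: str
    allowed_groups: FrozenSet[str]


def search(catalog: List[CatalogEntry], term: str, user_groups: FrozenSet[str]) -> List[str]:
    """Return names of entries matching the term that the user is allowed to see."""
    term = term.lower()
    return [
        entry.name
        for entry in catalog
        if (term in entry.name.lower() or term in entry.description.lower())
        and entry.allowed_groups & user_groups  # at least one shared group
    ]


catalog = [
    CatalogEntry("raw_orders", "dataset", "Orders landed from the ERP system",
                 frozenset({"engineers"})),
    CatalogEntry("orders_by_region", "pipeline", "Aggregates orders per region",
                 frozenset({"engineers", "analysts"})),
    CatalogEntry("customer_master", "dataset", "Cleaned customer records",
                 frozenset({"analysts"})),
]

analyst_hits = search(catalog, "orders", frozenset({"analysts"}))
engineer_hits = search(catalog, "orders", frozenset({"engineers"}))
```

The same query returns different results for different users, so existing assets are discoverable without exposing entries a user has no rights to.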
Data and Process Governance: EDO2 systems, unlike legacy data governance approaches, do not provide a separate standalone data governance infrastructure. The philosophy of EDO2 is for governance to be implemented as part of the development, operationalization and orchestration process. As a result, capabilities like data lineage should be generated automatically when the data pipelines are built. The system must control who has access to sensitive data sets, mask data appropriately, and control who has access to data pipelines built using those data sets. EDO2 systems must also track which users make changes to any aspect of a data pipeline, keeping an audit trail that can be reviewed.
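The idea of governance as a by-product of building the pipeline can be sketched in a few lines of Python. In this hypothetical example, adding a step records both a lineage edge and an audit entry, and a separate helper masks sensitive fields. The class, method and field names are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class GovernedPipeline:
    name: str
    lineage: List[Tuple[str, str]] = field(default_factory=list)  # (source, target)
    audit_log: List[str] = field(default_factory=list)

    def add_step(self, user: str, source: str, target: str) -> None:
        """Building a step records lineage and an audit entry as a side effect."""
        self.lineage.append((source, target))
        self.audit_log.append(f"{user} added step {source} -> {target}")

    def upstream_of(self, dataset: str) -> Set[str]:
        """Walk lineage edges backwards to find every data set feeding `dataset`."""
        found: Set[str] = set()
        frontier = [dataset]
        while frontier:
            current = frontier.pop()
            for src, tgt in self.lineage:
                if tgt == current and src not in found:
                    found.add(src)
                    frontier.append(src)
        return found


def mask_row(row: Dict[str, str], sensitive: Set[str], can_see_sensitive: bool) -> Dict[str, str]:
    """Return a copy of the row with sensitive fields hidden from unauthorized users."""
    if can_see_sensitive:
        return dict(row)
    return {k: ("***" if k in sensitive else v) for k, v in row.items()}


pipeline = GovernedPipeline("sales")
pipeline.add_step("alice", "raw_orders", "clean_orders")
pipeline.add_step("bob", "clean_orders", "sales_summary")
sources = pipeline.upstream_of("sales_summary")
masked = mask_row({"name": "Ann", "ssn": "123-45-6789"}, {"ssn"}, can_see_sensitive=False)
```

Nobody had to document lineage or changes separately; both fall out of the act of building the pipeline.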
Data Pipeline Portability: A critical differentiator of modern EDO2 when compared with legacy approaches is the ability to build logical data flows that can be run on a wide variety of compute and storage platforms and are not technically tied to a specific environment. For example, data pipelines that were initially run on Hive must be able to be refactored to run on Spark without recoding and with only minimal configuration changes.
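A toy Python sketch of this separation between logical flow and execution engine follows. It makes no real Hive or Spark calls: LogicalStep, build_plan and the plan strings are invented for illustration, and the "plan" is just a labeled description. What it shows is that the logical pipeline stays untouched while only a configuration value changes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LogicalStep:
    op: str      # engine-neutral operation, e.g. "filter"
    detail: str  # engine-neutral description of what the step does


# The logical pipeline is defined once, with no reference to any engine.
LOGICAL_PIPELINE = [
    LogicalStep("read", "orders"),
    LogicalStep("filter", "amount > 100"),
    LogicalStep("aggregate", "sum(amount) by region"),
]

SUPPORTED_ENGINES = {"hive", "spark"}  # assumption: the engines this sketch knows about


def build_plan(steps: List[LogicalStep], config: Dict[str, str]) -> List[str]:
    """Bind the same logical steps to whichever engine the configuration names."""
    engine = config["engine"]
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine}")
    return [f"{engine}: {step.op}({step.detail})" for step in steps]


# Moving from Hive to Spark is a configuration change, not a rewrite.
hive_plan = build_plan(LOGICAL_PIPELINE, {"engine": "hive"})
spark_plan = build_plan(LOGICAL_PIPELINE, {"engine": "spark"})
```

In a real system the binding step would generate engine-specific jobs, which is where most of the engineering effort lies.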
Automation Depth and Context Sharing: Many people build out their data analytics pipelines by taking a best-in-class approach and automating each individual step in the process, one at a time. So they will use one automation tool for ingesting data, a second for building transformation pipelines, a third for creating high-performance views and yet another for managing the operationalization of multiple data pipelines. They then stitch the components together themselves. The concept of EDO2 is not only to have depth of automation, but also to pass context about what is happening at each step in the process on to the next step automatically. So if something changes in the data ingestion process, the data transformation process is informed and can automatically adjust. This makes data pipelines less brittle. It also means that a key characteristic of EDO2 systems is that the individual components must communicate with one another.
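Context sharing between steps can be illustrated with a small Python sketch. Here the ingestion step reports which columns it actually observed, and the transformation step adjusts to that instead of failing. The function names, the context dictionary and the sample columns are all assumptions for this example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Context = Dict[str, List[str]]


def ingest(rows: List[Row]) -> Tuple[List[Row], Context]:
    """Return the rows plus a context describing what ingestion observed."""
    columns = sorted({key for row in rows for key in row})
    return rows, {"columns": columns}


def transform(rows: List[Row], context: Context, wanted: List[str]) -> Tuple[List[Row], Context]:
    """Select the wanted columns, adapting to what ingestion actually delivered."""
    available = [c for c in wanted if c in context["columns"]]
    missing = [c for c in wanted if c not in context["columns"]]
    out = [{c: row.get(c) for c in available} for row in rows]
    # Pass updated context on to the next step, including what was missing.
    return out, {**context, "columns": available, "missing": missing}


# The source has dropped the "region" column since the pipeline was designed.
source_rows = [{"order_id": 1, "amount": 50}, {"order_id": 2, "amount": 75}]
rows, ctx = ingest(source_rows)
result, ctx = transform(rows, ctx, wanted=["order_id", "amount", "region"])
```

Because the transformation reads the context rather than assuming a fixed schema, the upstream change is surfaced in ctx["missing"] and the pipeline keeps running, which is one way to make it less brittle.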
EDO2 offers a wide variety of benefits, most of which come from two things: the depth of automation for each step in the data pipeline management process, and the sharing of context about how data is manipulated across that process, from ingestion to transformation to performance tuning to operationalization and “hardening” of the pipeline into production. Because EDO2 components can share context about the underlying data pipelines across each step more easily than disparate systems that are stitched together after the fact, they make it easier to manage, on a daily basis, the deployment of large numbers of ML, AI and BI data pipelines in support of data analytics use cases.
In addition, EDO2 software:
The bottom line benefit of EDO2 is to allow all companies to harness data more effectively by building and managing hundreds to thousands of data analytics pipelines without requiring an army of data engineering experts.
Learn more about Infoworks’ Enterprise Data Operations and Orchestration System, Infoworks DataFoundry.