Transformed data is valuable for more than just reporting and enterprise analytics. The insights available through artificial intelligence and machine learning are moving quickly into the enterprise.
From the perspective of a unified data platform, though, AI/ML is just another approach to transforming your data into a more useful form: in this case, trained models.
There is a meta level of complexity involved here. The deeper complexity is baked into the ML libraries that data analysts rely on to derive valuable insights. But the broader complexity involves the multiple roles and activities that must be governed and orchestrated for the potential of this approach to move out of the lab and into production. Business knowledge must be captured to drive the acquisition and ingestion of data; that data must be prepared and transformed so that models can be trained; trained models must then be integrated and deployed to productive use by your apps, analytics, and reporting processes; and all of it requires ongoing support that feeds back into your growing business understanding. Done right, it's a virtuous cycle. But point tools touch only parts of the whole.
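The cycle above can be sketched in miniature. This is purely illustrative, not any platform's actual API: every stage name and return value here is hypothetical, and a real orchestrator would add scheduling, retries, and persistent lineage on top.

```python
# Illustrative sketch of the AI/ML lifecycle as an orchestrated sequence
# of stages. All names are hypothetical; the point is that each stage's
# output feeds the next, and lineage is recorded end to end.

def ingest():
    return {"rows": 1000}                      # acquire raw data

def prepare(data):
    return {**data, "cleaned": True}           # clean and transform

def train(data):
    return {"model": "v1", "trained_on": data["rows"]}

def deploy(model):
    return {**model, "deployed": True}         # hand off to apps/analytics

def support(model):
    return {**model, "monitored": True}        # feed learnings back

def run_cycle():
    """Run one pass of the cycle, recording lineage at each step."""
    lineage = []
    data = ingest();      lineage.append("ingest")
    data = prepare(data); lineage.append("prepare")
    model = train(data);  lineage.append("train")
    model = deploy(model); lineage.append("deploy")
    model = support(model); lineage.append("support")
    return model, lineage

model, lineage = run_cycle()
```

The fragility the next paragraphs describe shows up exactly where these stages are split across disconnected tools: each hand-off breaks the single `lineage` record into pieces no one system can see.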
It's true there is a whole zoo of powerful point tools you could use to acquire, prepare, train, deploy, and support your AI/ML efforts: tooling to help you discover, structure, sequence, clean, and enrich data pulled from many sources, in what often seems an endless cycle. But once data sets are available, they must be handed off to another tool set so that models can be trained. Once trained, those models must be handed off again for testing with various apps and eventual deployment. At that point, support for the deployed models is handed off yet again. That puts a lot of hands in the process, with fragile interactions and no centralized, end-to-end data lineage, much less governance.
The source of complexity in machine learning solutions is deceptive. The large majority of ML libraries themselves are highly efficient, well managed, and driven through well-defined APIs; you drop them in and skilled data engineers put them to work. The complexity comes in the 95% of related plumbing code, built and rebuilt as technical ecosystems evolve. Experimentation is easy in such environments, leading over time to complex code graveyards and pipeline jungles. Testing becomes a major challenge when each pipeline is built just a bit differently. Further, the continuing lack of standardized abstractions – akin to the deeply established table, row, and query conventions of purely relational data models – leads to ad hoc design and the resulting long-term technical dependencies.
But, again, from the standpoint of a unified data platform, AI/ML is just another aspect of data transformation, leading to another form of target. DataFoundry provides a centralized way to ingest virtually any data source and transform it into any form for modeling, all with end-to-end governance, including tagging and cataloging by user domains. To support data modeling, machine learning libraries can be dragged, dropped, and configured as part of a transformation pipeline: SparkML is built in, H2O and TensorFlow are available, and SDK support is available for integrating additional libraries. Normalization is also built in, providing multiple features. Training data splits, cross validation, and error analysis are all automated, and we provide both feature decomposition and transformation support. Trained models expressed in either PMML or Spark formats can be imported or exported, and it is as easy to migrate models from Dev and QA to Production as any other transformation we support. More broadly, all these processes can be orchestrated and scheduled directly in, or alongside, any other workflow. Proven workflows and pipelines can easily be iterated and derived, either by export or re-deployment.
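For readers unfamiliar with what "training data splits" and "cross validation" automate, here is a minimal standard-library sketch of the underlying mechanics. This is not DataFoundry code; the function names and parameters are invented for illustration, and a platform would wire these steps into the pipeline for you.

```python
# Illustrative sketch: held-out test splits and k-fold cross validation,
# using only the standard library. Names and defaults are hypothetical.
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows deterministically and carve off a held-out test set."""
    rows = rows[:]                             # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def k_folds(rows, k=5):
    """Yield (train, validation) pairs for k-fold cross validation."""
    for i in range(k):
        validation = rows[i::k]                # every k-th row, offset by i
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, validation

data = list(range(100))
train_rows, test_rows = train_test_split(data)           # 80 / 20 split
fold_sizes = [(len(t), len(v)) for t, v in k_folds(train_rows, k=5)]
```

Each of the five folds trains on 64 rows and validates on the remaining 16, so every row is used for validation exactly once; error analysis then aggregates the per-fold results.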
To look at this a bit differently, compared to model management point tools: we have ingestion, they don't; we have data synchronization, production-grade transformations, and data modeling built inline with automated transformation pipelines, for zero-code data engineering. Like them, we support model evaluation, deployment, and model management, though it's true they currently provide a bit more automation in this area. Yet we also provide production orchestration and automated resource management. And it's unclear how point tools will scale as demand accelerates, while we scale across most any on-premise, hybrid, or cloud cluster you may need.
So, what have you learned? AI/ML processes are complex, not just for the code and analysis involved, but for the orchestration needed among several distinct, complex processes. Point tools touch only parts of the process, leaving hand-off gaps, fragile integrations, and a lack of end-to-end data lineage and governance, limiting the potential of this powerful new analytical resource. Infoworks provides the orchestration and governance needed to effectively manage AI/ML processes inline, and directly in pipeline, with your broader analytical and reporting requirements. We also provide feature parity and far beyond, by comparison with data modeling point tools and the process chains they enable.
AI/ML holds great potential, but it's still a bit of a wild frontier out there. A unified data platform can help you bring things under governance.