Most big data automation solutions focus on helping data analysts and data scientists identify new insights more easily in an ad hoc analytics environment. They do not, however, deliver similar capabilities for running BI and machine learning analytics reliably and repeatably in a production environment.
The Infoworks agile data engineering platform provides security integration for user authentication and data security policies. It supports single sign-on (SSO) and LDAP integration, as well as Kerberos-based authorization. It also supports encryption for data in motion and at rest.
A drag-and-drop interface makes it possible to build analytics and machine learning workflows that combine subtasks such as data ingestion, transformation pipelines, and OLAP cube generation, and that issue notifications when tasks complete or fail. Infoworks Orchestrator also supports distributed, parallel execution of tasks.
Infoworks Orchestrator automatically retries and restarts failed workflow jobs when possible. Orchestrator also makes it possible to pause, resume and dynamically control production data workflows.
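The idea of a workflow made of dependent subtasks with automatic retries can be illustrated with a minimal Python sketch. This is not Infoworks Orchestrator's API or implementation; the function `run_workflow`, the task names, and the `max_retries` parameter are all hypothetical, and the tasks run sequentially rather than in parallel for simplicity.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, deps, max_retries=2):
    """Run tasks in dependency order, retrying each failed task.

    tasks: dict mapping task name -> zero-argument callable (hypothetical subtasks)
    deps:  dict mapping task name -> set of task names it depends on
    Returns a dict mapping task name -> "ok", "failed", or "skipped".
    """
    graph = {name: set(deps.get(name, ())) for name in tasks}
    status = {}
    for name in TopologicalSorter(graph).static_order():
        # Skip a task when any of its upstream tasks did not succeed.
        if any(status.get(dep) != "ok" for dep in graph[name]):
            status[name] = "skipped"
            continue
        for _attempt in range(max_retries + 1):
            try:
                tasks[name]()
                status[name] = "ok"
                break
            except Exception:
                status[name] = "failed"
    return status


# Illustrative subtasks: ingestion fails once with a transient error,
# cube generation fails permanently.
calls = {"ingest": 0}


def ingest():
    calls["ingest"] += 1
    if calls["ingest"] == 1:
        raise RuntimeError("transient failure")


def transform():
    pass


def build_cube():
    raise ValueError("permanent failure")


def notify():
    pass


tasks = {"ingest": ingest, "transform": transform,
         "build_cube": build_cube, "notify": notify}
deps = {"transform": {"ingest"}, "build_cube": {"transform"},
        "notify": {"build_cube"}}
status = run_workflow(tasks, deps, max_retries=1)
```

In this sketch the transient ingestion failure is absorbed by a retry, while the permanently failing step is marked as failed and its downstream step is skipped. A production orchestrator would add backoff, parallel execution, and pause/resume controls.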
Infoworks supports the creation of users with different levels of user access, as well as domains, so administrators can control which users have access to specific data sets. Users within a domain can share data, pipelines, and workflows to enable team-based development of end-to-end data workflows and pipelines.
Infoworks provides audit logs that track who created or changed data pipelines, cubes, and workflows, what changes were made, and when.
Data lineage is automatically generated and tracked from the source all the way to the cubes and in-memory models where the data is consumed in reports or dashboards.
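As a rough illustration of what tracking lineage from a report back to its sources involves, the sketch below models lineage as a mapping from each data set to its direct inputs and walks it upstream. The data set names and the `upstream` function are hypothetical and are not taken from Infoworks.

```python
def upstream(lineage, target):
    """Return every data set that `target` depends on, directly or indirectly.

    lineage: dict mapping a data set name -> list of its direct inputs
    """
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in lineage.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical lineage: raw tables -> cleaned tables -> cube -> dashboard.
lineage = {
    "sales_dashboard": ["sales_cube"],
    "sales_cube": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
}
sources = upstream(lineage, "sales_dashboard")
```

Here `sources` contains the cube, both cleaned tables, and both raw tables, which is the set of assets a change to the dashboard's inputs could trace back to.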
Data ingestion, transformation, cube generation, and workflows built in the Infoworks designer can run in any execution environment supported by Infoworks without re-coding. Infoworks takes advantage of each environment’s native capabilities (e.g., Impala on Cloudera or Google Cloud Dataproc) so data engineering processes are automatically tuned to run optimally.
Process and performance monitoring “watches” data jobs and tracks the time it takes to complete each subtask without requiring any configuration of the monitoring tasks.
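Per-subtask timing of this kind can be sketched in Python with a small context manager. This is only an illustration of the concept, not how Infoworks implements monitoring; the names `timed` and `timings` are hypothetical.

```python
import time
from contextlib import contextmanager

timings = {}  # subtask name -> elapsed seconds


@contextmanager
def timed(name, record=timings):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record[name] = time.perf_counter() - start


# Stand-ins for real subtasks.
with timed("ingest"):
    time.sleep(0.01)

with timed("transform"):
    total = sum(range(1000))
```

After the two blocks run, `timings` holds one elapsed-time entry per subtask, which could then be compared against historical runs to flag slowdowns.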
Infoworks rapidly synchronizes data and metadata across different execution environments to support high availability (HA) and disaster recovery (DR) scenarios. It delivers high-performance data synchronization, in batch and incremental modes, that handles the petabyte-scale data replication required to deliver HA and DR for modern data architectures.
Many organizations are not standardizing on a single data execution environment. Instead, they use multiple environments that can span from on-premises (Cloudera, MapR, Hortonworks) to the cloud (AWS, Azure, GCP), or that include multiple cloud environments. As a result, Infoworks supports the capability to run end-to-end data pipelines across execution environments.