After data is ingested into your on-premises or cloud data lake, it must be transformed in preparation for downstream use. Quickly designing analytics, AI, and machine learning data pipelines via a self-service GUI is only one aspect of data preparation. As part of an enterprise data operations and orchestration system, data pipelines must be designed to be promoted into production automatically at enterprise scale and to run anywhere: in the cloud, on-premises, or in a hybrid environment.
Infoworks provides a drag-and-drop environment for developing analytics and machine learning data pipelines. No Hadoop or Spark expertise is required to create scalable, highly tuned data pipelines.
Migrating legacy data warehouse jobs is significantly accelerated through automatic conversion of SQL into easily maintained, optimized, portable, visual data transformation pipelines.
Data pipelines built in Infoworks can be integrated immediately into end-to-end workflows and run in production environments, either on-premises or in the cloud, without any code changes. Pipelines are built to be fully controlled, started, stopped, or paused as part of a workflow, and they automatically scale with the size of the execution environment. Data pipelines are also automatically monitored for performance by the Infoworks orchestration layer.
Infoworks data pipelines provide portability across all major distributed data execution environments, on-premises and in the cloud. Infoworks abstracts the underlying execution engine and automatically optimizes for different backends (Hive, Spark, etc.). As a result, Infoworks' visual data pipelines are both portable and high-performing without requiring re-coding.
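To make the idea of engine abstraction concrete, here is a minimal, purely illustrative sketch: one logical pipeline step is compiled differently for each backend. The class and method names are hypothetical and do not reflect Infoworks' actual internals.

```python
from abc import ABC, abstractmethod

# Hypothetical engine-abstraction sketch: a single logical filter step
# is compiled to whichever backend is configured. Illustrative only.

class ExecutionEngine(ABC):
    @abstractmethod
    def compile_filter(self, table: str, predicate: str) -> str: ...

class HiveEngine(ExecutionEngine):
    def compile_filter(self, table, predicate):
        # Emit HiveQL for the Hive backend.
        return f"SELECT * FROM {table} WHERE {predicate}"

class SparkEngine(ExecutionEngine):
    def compile_filter(self, table, predicate):
        # Emit a PySpark-style expression for the Spark backend.
        return f'spark.table("{table}").filter("{predicate}")'

def build(engine: ExecutionEngine) -> str:
    # The same logical step runs on either engine without re-coding.
    return engine.compile_filter("orders", "amount > 100")
```

The pipeline author only defines the logical step; choosing `HiveEngine` or `SparkEngine` changes what is executed, not what is designed.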
Classic ETL tools require users to code data pipelines separately for full and incremental data loads. Infoworks makes adding an incremental pipeline as simple as clicking a button: the incremental data pipeline is generated automatically behind the scenes.
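A common way such tooling derives an incremental variant from a full load is a watermark on a timestamp column. The sketch below is an assumption about the general technique, not Infoworks' generated code; the `updated_at` column name is illustrative.

```python
# Illustrative watermark-based loading: the same logical load serves both
# modes. A None watermark means a full load; otherwise only rows newer
# than the watermark are processed, and the watermark advances.

def load(rows, watermark=None, ts_key="updated_at"):
    """Return (rows to process, new watermark)."""
    if watermark is None:
        batch = list(rows)                         # full load
    else:
        batch = [r for r in rows if r[ts_key] > watermark]  # incremental
    new_watermark = max((r[ts_key] for r in batch), default=watermark)
    return batch, new_watermark

rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
full, wm = load(rows)               # full load: both rows, watermark 20
incr, _ = load(rows, watermark=wm)  # incremental: no new rows yet
```

Hand-coding this distinction for every pipeline is the repetitive work that automatic generation removes.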
Infoworks manages end-to-end data dependencies, from ingestion of data sources all the way to the generation of high-performance data models. Once tables are loaded from a source, dependent pipelines start building. Pipelines automatically wait for dependent tables to finish loading and resume once those tables are available.
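The dependency behavior described above can be sketched as a simple readiness check: a build proceeds only when every upstream table reports loaded, and otherwise waits. This is a conceptual sketch, not the orchestration layer's actual logic.

```python
# Hedged sketch of dependency-driven builds: a pipeline waits until all
# of its upstream tables are loaded, then starts building.

def ready(status: dict, deps: list) -> bool:
    # True only when every dependency has finished loading.
    return all(status.get(t) == "loaded" for t in deps)

def build_when_ready(status: dict, deps: list) -> str:
    return "building" if ready(status, deps) else "waiting"

status = {"orders": "loaded", "customers": "loading"}
build_when_ready(status, ["orders", "customers"])  # -> "waiting"
status["customers"] = "loaded"
build_when_ready(status, ["orders", "customers"])  # -> "building"
```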
Developing data pipelines is significantly simplified by the Infoworks visual designer. However, data pipelines must also be tuned to perform, scale, and meet service level agreements (SLAs). Infoworks automatically optimizes pipeline builds to meet SLAs without requiring a big data tuning expert. As a result, data pipelines created with Infoworks perform 25-40% faster than hand-coded SQL or HQL.
Infoworks allows developers to include advanced analytics directly in their data pipelines. The platform is directly integrated with advanced analytics libraries for decision trees, clustering (k-means), classification, and more. Users can import trained models from other applications via PMML (Predictive Model Markup Language) into the data transformation pipeline, integrating machine learning and analytics in a single pipeline.
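Conceptually, an imported model becomes just another stage in the pipeline: transformed rows flow into its scoring function. In this hedged sketch, `score` is a stand-in for a model that would really be loaded from a PMML file; the threshold and field names are invented for illustration.

```python
# Conceptual sketch: a trained model imported (e.g. via PMML) acts as one
# more transformation stage. `score` stands in for the imported model's
# predict function; a real pipeline would load it with a PMML library.

def score(row):
    # Stand-in for an imported classifier (hypothetical decision rule).
    return "high" if row["amount"] > 100 else "low"

def pipeline(rows):
    # A cleaning stage followed by a scoring stage, in one pipeline.
    cleaned = [r for r in rows if r["amount"] is not None]
    return [{**r, "segment": score(r)} for r in cleaned]

pipeline([{"amount": 150}, {"amount": 50}, {"amount": None}])
```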
Users can work interactively with the data without needing to materialize intermediate steps for testing. They simply click into an intermediate step of the data pipeline, and the Infoworks platform intelligently samples data so users can see the changes that occur as the pipeline is built. Automated data sampling allows data analysts and engineers to work interactively with data regardless of the size of the overall data environment.
Infoworks automates structural, syntactic, and semantic validation of data pipelines. Interactive, immediate validation during development significantly shortens development and testing times.
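The three validation levels can be illustrated on a single record: structural (required fields present), syntactic (values have the right type), and semantic (values make sense in the domain). This is a minimal sketch with an invented `age` field, not Infoworks' validation engine.

```python
# Sketch of the three validation levels on one record. Each level only
# runs if the previous one passed.

def validate(row):
    errors = []
    if "age" not in row:                       # structural: field exists
        errors.append("missing field: age")
    elif not isinstance(row["age"], int):      # syntactic: correct type
        errors.append("age is not an integer")
    elif not 0 <= row["age"] <= 130:           # semantic: plausible value
        errors.append("age out of range")
    return errors

validate({"age": 34})   # -> []
validate({"age": 200})  # -> ["age out of range"]
```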
Infoworks DataFoundry significantly accelerates the creation of high-performing data pipelines. The table below provides real-life examples of the effort required using Infoworks, compared with actual efforts and third-party estimates for similar work using alternative approaches.