Data Preparation and Transformation

Take a Free Test Drive

After data is ingested into your on-premises data lake, cloud data lake, or Databricks Delta Lake, it must be transformed for downstream use. Quickly designing analytics, AI, and machine learning data pipelines in a self-service GUI is only one aspect of data preparation. As part of an enterprise data operations and orchestration system, data pipelines must also be designed so that they can be promoted automatically into production at enterprise scale and run anywhere: in the cloud, on premises, or in a hybrid environment.

Infoworks DataFoundry automates the development of production-ready analytics and machine learning data pipelines that are manageable, monitorable, and scalable.

No-Code Environment for Developing Data Pipelines

Infoworks provides a drag-and-drop environment for developing analytics and machine learning data pipelines. No Hadoop or Spark knowledge is required to create scalable, highly tuned data pipelines.

Automatic Conversion of SQL into Big Data Pipelines

Infoworks significantly accelerates the migration of legacy data warehouse jobs by automatically converting SQL into visual data transformation pipelines that are optimized, portable, and easy to maintain.

Promotion of Data Pipelines from Dev to Production

Data pipelines built in Infoworks can be integrated immediately into end-to-end workflows and run in production environments, either on premises or in the cloud, without any code changes. Pipelines can be fully controlled (started, stopped, or paused) as part of a workflow, and they scale automatically with the size of the execution environment. The Infoworks orchestration layer also monitors pipeline performance automatically.

Seamless Portability Across Execution Engines

Infoworks data pipelines are portable across all major distributed data execution environments, on premises and in the cloud. Infoworks abstracts the underlying execution engine and automatically optimizes for different backends (Databricks, Hive, Spark, etc.). As a result, Infoworks’ visual data pipelines are both portable and high performing without requiring re-coding.

Automatic Generation of Incremental Data Pipelines

Classic ETL tools require users to code separate data pipelines for full and incremental data loads. With Infoworks, adding an incremental pipeline is as simple as clicking a button: the incremental data pipeline is generated automatically behind the scenes.
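To make the difference between full and incremental loads concrete, here is a minimal sketch of the kind of logic that otherwise has to be written by hand. It is not Infoworks code. It assumes each source row carries an `id` key and an `updated_at` timestamp, and the function names `full_load` and `incremental_load` are hypothetical.

```python
from datetime import datetime

def full_load(source_rows):
    """Rebuild the target from every source row, keyed by id."""
    return {row["id"]: row for row in source_rows}

def incremental_load(target, source_rows, watermark):
    """Merge only rows changed after the watermark; return the new watermark."""
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = row  # insert a new row or overwrite the old one
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark

# Illustrative data only (assumed column names).
source = [
    {"id": 1, "amount": 10, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "amount": 20, "updated_at": datetime(2024, 1, 2)},
]
target = full_load(source)
watermark = max(row["updated_at"] for row in source)

# Later, one row changes and one row is added at the source.
source[0] = {"id": 1, "amount": 15, "updated_at": datetime(2024, 1, 3)}
source.append({"id": 3, "amount": 30, "updated_at": datetime(2024, 1, 4)})
watermark = incremental_load(target, source, watermark)
```

The incremental path touches only the two rows that changed after the stored watermark, which is why it is usually maintained as a separate pipeline from the full load.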

Dependency Management

Infoworks manages end-to-end data dependencies, from the ingestion of data sources all the way to the generation of high-performance data models. Once tables are loaded from a source, the pipelines that depend on them start building. Pipelines automatically wait until the tables they depend on have been loaded, and then resume automatically.
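The general idea behind this kind of dependency handling can be sketched with a topological sort. The example below uses Python's standard `graphlib` module (Python 3.9 or later). The table and pipeline names are made up for illustration, and this is not a description of how Infoworks implements it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each entry lists what it must wait for.
dependencies = {
    "orders_table": set(),
    "customers_table": set(),
    "orders_enriched": {"orders_table", "customers_table"},
    "revenue_model": {"orders_enriched"},
}

def run_in_dependency_order(deps):
    """Process each node only after everything it depends on is done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    completed = []
    while sorter.is_active():
        # get_ready() returns only nodes whose dependencies are all done.
        for node in sorted(sorter.get_ready()):
            completed.append(node)  # stand-in for loading a table or building a pipeline
            sorter.done(node)       # unblocks the nodes waiting on this one
    return completed

order = run_in_dependency_order(dependencies)
```

In this sketch `orders_enriched` is never attempted until both source tables are marked done, which mirrors the wait-and-resume behavior described above.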

Data Pipeline Optimization

The Infoworks visual designer significantly simplifies the development of data pipelines. However, data pipelines must also be tuned to perform and scale so that they meet service level agreements (SLAs). Infoworks automatically optimizes pipeline builds to meet SLAs without requiring a big data tuning expert. As a result, data pipelines created with Infoworks perform 25-40% faster than hand-coded SQL or HQL.

Direct Integration of Machine Learning Algorithms

Infoworks allows developers to include advanced analytics directly in their data pipelines. The platform is directly integrated with advanced analytics libraries for decision trees, clustering (k-means), classification, and more. Users can import models trained in other applications into the data transformation pipeline via PMML, combining machine learning and analytics in a single pipeline.

Interactive Data Transformation and Preparation

Users can work interactively with the data without needing to materialize intermediate steps for testing. They simply click into an intermediate step of the data pipeline, and the Infoworks platform intelligently samples the data so that users can see the changes that occur as pipelines are built. Automated data sampling allows data analysts and engineers to work interactively with data regardless of the size of the overall data environment.
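The text does not say which sampling method the platform uses. As one illustration of how a fixed-size preview can be drawn from data of any size in a single pass, here is a sketch of reservoir sampling (Algorithm R). The function name `reservoir_sample` and the example transformation are assumptions made for this example.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return up to k rows chosen uniformly at random from an iterable of any length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = row      # replace an existing entry with probability k / (i + 1)
    return sample

# Preview an intermediate step on the sample instead of the full data set.
rows = ({"id": n, "amount": n * 2} for n in range(100_000))
preview_input = reservoir_sample(rows, k=5, seed=42)
preview_output = [{**row, "amount_with_tax": row["amount"] * 1.1} for row in preview_input]
```

Because memory use depends only on `k`, the preview costs the same whether the underlying table holds thousands of rows or billions.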

Validation of Transformation Logic

Infoworks automates structural, syntactic, and semantic validation of data pipelines. Immediate, interactive validation during development significantly shortens development and testing times.
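As a rough illustration of what validating transformation logic before execution can look like, the sketch below checks that every step of a pipeline refers only to columns that exist at that point. The step format (`select` and `derive` dictionaries) and the function name `validate_pipeline` are hypothetical and are not the Infoworks pipeline format.

```python
def validate_pipeline(source_columns, steps):
    """Return a list of error messages; an empty list means the pipeline is valid."""
    errors = []
    available = set(source_columns)
    for index, step in enumerate(steps):
        kind = step.get("type")
        if kind == "select":
            missing = [c for c in step["columns"] if c not in available]
            if missing:
                errors.append(f"step {index}: unknown columns {missing}")
            # Only the selected columns that actually exist flow downstream.
            available = set(step["columns"]) - set(missing)
        elif kind == "derive":
            missing = [c for c in step["inputs"] if c not in available]
            if missing:
                errors.append(f"step {index}: unknown columns {missing}")
            available.add(step["name"])  # the derived column becomes available
        else:
            errors.append(f"step {index}: unknown step type {kind!r}")
    return errors

good = [
    {"type": "select", "columns": ["id", "amount"]},
    {"type": "derive", "name": "amount_with_tax", "inputs": ["amount"]},
]
bad = [
    {"type": "select", "columns": ["id"]},
    {"type": "derive", "name": "amount_with_tax", "inputs": ["amount"]},
]
good_errors = validate_pipeline(["id", "amount", "region"], good)
bad_errors = validate_pipeline(["id", "amount", "region"], bad)
```

The second pipeline fails because `amount` was dropped by the earlier `select` step, an error that can be reported at design time without running anything.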

Data Transformation Metrics

Infoworks DataFoundry significantly accelerates the creation of high-performing data pipelines. The table below provides real-life examples of the effort required with Infoworks, compared with actual efforts and third-party estimates for similar work using alternative approaches.
