Data ingestion and synchronization into a big data environment is harder than most people think. Loading large volumes of data at high speed, and managing the incremental ingestion and synchronization of data at scale into an on-premises or cloud data lake or Databricks Delta Lake, can present significant technical challenges. And data ingestion is just the first step of a complete enterprise data operations and orchestration system.
Our data ingestion tools provide a no-code environment for configuring the ingestion of data from a wide variety of data sources. Infoworks also uses native connectors when available to provide the highest possible speed of data ingestion.
Data types on relational sources map differently depending on the Hadoop or cloud data storage environment you select. Infoworks automatically handles data type conversions, reducing the errors typical of manual data type conversion. In addition, this automation makes it easy to move data from the data lake or Databricks Delta Lake to other consuming systems without having to recode data types.
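Conceptually, this kind of conversion amounts to a lookup from source column types to target types. The sketch below is purely illustrative: the type pairs and the default fallback are assumptions for the example, not Infoworks' actual conversion rules.

```python
# Hypothetical mapping from Oracle column types to Delta Lake / Spark SQL
# types. The pairs shown are illustrative assumptions, not the product's
# actual rules.
ORACLE_TO_DELTA = {
    "NUMBER": "DECIMAL(38,10)",
    "VARCHAR2": "STRING",
    "DATE": "TIMESTAMP",
    "CLOB": "STRING",
    "RAW": "BINARY",
}

def map_column(source_type: str) -> str:
    """Resolve a source type (ignoring length/precision) to a target type,
    falling back to STRING for anything unrecognized."""
    base = source_type.split("(")[0].strip().upper()
    return ORACLE_TO_DELTA.get(base, "STRING")
```

Because the mapping is applied once, centrally, every downstream system consuming the lake sees the same consistent target types.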
Infoworks’ automated process parallelizes the ingestion of data into your data lake or Databricks Delta Lake and significantly accelerates the loading of large tables with small ingestion windows, without requiring code development.
When new columns are added to source systems, data ingestion processes often break if they aren't manually updated before the change lands. Infoworks automatically detects source-side schema changes, adjusts for them, and ingests the new columns into the data lake or Databricks Delta Lake.
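The core of schema-drift handling can be pictured as a diff between the source's current schema and the last-known target schema, followed by additive evolution. This is a minimal sketch under that assumption, not Infoworks' implementation:

```python
# Hypothetical sketch of source-side schema-drift detection.
# Schemas are modeled as {column_name: column_type} dicts.

def detect_new_columns(source_schema: dict, target_schema: dict) -> dict:
    """Return columns present at the source but missing from the target."""
    return {col: typ for col, typ in source_schema.items()
            if col not in target_schema}

def evolve_target(target_schema: dict, new_columns: dict) -> dict:
    """Apply additive schema evolution; existing columns are untouched."""
    evolved = dict(target_schema)
    evolved.update(new_columns)
    return evolved
```

An additive-only policy like this is what keeps existing downstream consumers working: new columns appear, but nothing they already read is renamed or dropped.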
Infoworks automates log- and query-based change data capture and also manages slowly changing dimensions (Type 1 and Type 2). Infoworks reconciles and merges incremental data at ingestion time with the previously ingested base data. Our data ingestion tool's continuous merge capability supports fast ingestion and continuously fresh data availability while keeping the data optimized for downstream query performance.
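The reconcile-and-merge step is essentially an upsert of CDC records into the base data, keyed by primary key. The sketch below illustrates that idea with an assumed record shape (`id`, `op`, `row` fields are hypothetical), not the product's actual merge logic:

```python
# Hypothetical sketch of merging incremental CDC changes into base data.
# Base rows are keyed by primary key; each change record carries an
# operation marker ('upsert' or 'delete').

def merge_incremental(base: dict, changes: list) -> dict:
    """Apply CDC change records to the base rows; the last change
    for a given key wins."""
    merged = dict(base)
    for change in changes:
        key = change["id"]
        if change["op"] == "delete":
            merged.pop(key, None)
        else:  # insert or update
            merged[key] = change["row"]
    return merged
```

Doing this merge continuously at ingestion time is what avoids the classic trade-off between data freshness and query performance: readers always see a compacted, merged view rather than a pile of raw change files.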
Infoworks supports both batch and streaming use cases. Configuration of a streaming data flow is done via a simple menu-based interface with no coding required. Infoworks uses Kafka as the underlying streaming engine and can connect to any data source to stream large amounts of data in real time.
As part of the synchronization and merge process, Infoworks tracks slowly changing dimensions (SCD) and automatically maintains a history table of prior-state data, recording the date of each change as well as any errors that occur during SCD processing.
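Type 2 SCD history keeping follows a standard pattern: when a tracked attribute changes, the current row is closed out with an end date and a new current row is appended. This is a generic sketch of that pattern with an assumed row layout, not Infoworks' internal table structure:

```python
from datetime import date

# Hypothetical sketch of Type 2 slowly-changing-dimension tracking.
# Each history row: {"key", "attrs", "start_date", "end_date"};
# end_date is None for the current row.

def apply_scd2(history: list, key, new_attrs: dict, change_date: date) -> list:
    """Record a change: close the prior current row for this key and
    append a new current row, preserving full history."""
    for row in history:
        if row["key"] == key and row["end_date"] is None:
            if row["attrs"] == new_attrs:
                return history          # no change: nothing to record
            row["end_date"] = change_date  # close out the prior state
    history.append({"key": key, "attrs": new_attrs,
                    "start_date": change_date, "end_date": None})
    return history
```

With the dates retained on every row, point-in-time queries ("what did this dimension look like last quarter?") fall out naturally from the history table.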
Infoworks automatically validates data ingested into the data lake, for both full loads and incremental loads arriving through change data capture. For all data sources loaded, Infoworks provides:
Infoworks DataFoundry has been proven in production customer deployments to significantly outperform alternatives. The examples in the table below illustrate the level of performance improvement Infoworks' customers have obtained.