DataFoundry for Databricks

Written by Amar Arsikere - February 24, 2020 | Category: AI and Machine Learning

Use Infoworks DataFoundry to Rapidly Onboard Data Sources Into Databricks

Data onboarding is the critical first step in operationalizing your data lake. DataFoundry automates data ingestion as well as the key functionality that must accompany ingestion to establish a complete foundation for analytics. Data onboarding with DataFoundry automates:

  1. Data Ingestion — From all enterprise and external data sources
  2. Data Synchronization — CDC (Change Data Capture) to keep data synchronized with the source
  3. Data Governance — Cataloging, data lineage, metadata management, audit and history

Data Ingestion

Infoworks DataFoundry automatically crawls data sources, ranging from relational databases and data warehouses such as Oracle, Teradata, SQL Server, and others to flat files, XML, and JSON. It learns the metadata and infers data relationships, both for data ingested from external sources and for data sets created using Infoworks, and makes that metadata searchable through a metadata repository.

Infoworks DataFoundry ingests source data in a high-performance, parallel process and automatically creates type mappings that preserve source data precision. DataFoundry provides a no-code environment for configuring the ingestion of data into your Delta Lake via batch, change data capture, and data streaming.
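
To make the idea of precision-preserving type mapping concrete, here is a minimal sketch in Python. It is not DataFoundry's actual mapping logic; the function name `map_source_type` and the small set of handled types are assumptions chosen for illustration. The one firm constraint it relies on is that Spark's decimal type holds at most 38 digits of precision.

```python
import re

# Spark's DecimalType supports a maximum precision of 38 digits.
MAX_DECIMAL_PRECISION = 38

# Hypothetical lookup for a few simple source types (illustrative, not exhaustive).
SIMPLE_TYPES = {
    "INTEGER": "int",
    "BIGINT": "bigint",
    "DATE": "date",
    "TIMESTAMP": "timestamp",
}

_DECIMAL_PATTERN = re.compile(
    r"(?:NUMBER|NUMERIC|DECIMAL)\s*\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\)"
)


def map_source_type(source_type: str) -> str:
    """Map a source column type string to a Spark SQL type string."""
    normalized = source_type.strip().upper()

    match = _DECIMAL_PATTERN.fullmatch(normalized)
    if match:
        precision = int(match.group(1))
        scale = int(match.group(2) or 0)
        if precision > MAX_DECIMAL_PRECISION:
            # Fall back to a string rather than silently dropping digits.
            return "string"
        # Carry precision and scale over unchanged so no digits are lost.
        return f"decimal({precision},{scale})"

    if normalized in SIMPLE_TYPES:
        return SIMPLE_TYPES[normalized]

    # Character types and anything unrecognized are kept as strings.
    return "string"
```

For example, `map_source_type("NUMBER(12,2)")` yields `decimal(12,2)`, so a currency column keeps its two decimal places instead of being widened to a floating-point type.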

Data Synchronization

Infoworks DataFoundry continuously synchronizes source data from enterprise databases, data warehouses, and file sources. Changing data is captured from the source systems using log-based and query-based methods. The changed data is merged with the base data in a high-performance continuous merge process.

  • Automatically handles slowly changing data and schema changes, and creates both current and historical tables
  • Supports incremental export to other consumption data warehouse systems such as BigQuery, SQL Data Warehouse, and others
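
The merge of changed data into a current table and a historical table can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Infoworks merge process: the change-record shape (`key`, `op`, `ts`, `data`) and the function name `apply_changes` are hypothetical, and real change feeds carry more detail.

```python
def apply_changes(current, history, changes):
    """Apply CDC records to a current table (dict) and a history table (list).

    Each change is a dict with 'key', 'op' ('insert', 'update' or 'delete'),
    'ts' (a sortable timestamp) and 'data' (the new row, unused for deletes).
    """
    # Process changes in timestamp order so later changes win.
    for change in sorted(changes, key=lambda c: c["ts"]):
        key, op, ts = change["key"], change["op"], change["ts"]

        # Close the currently open history version for this key, if any.
        for row in history:
            if row["key"] == key and row["valid_to"] is None:
                row["valid_to"] = ts

        if op == "delete":
            current.pop(key, None)
        else:
            # Inserts and updates both replace the current row
            # and open a new history version.
            current[key] = change["data"]
            history.append(
                {"key": key, "data": change["data"], "valid_from": ts, "valid_to": None}
            )
```

After a run, `current` holds only the latest live row per key, while `history` keeps every version with its validity interval, which is what makes point-in-time queries possible.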

Governance and Lineage

Infoworks DataFoundry creates and synchronizes a Data Catalog that can be tagged and searched using the UI. It tracks end-to-end data lineage so users can trace data elements back to the original source systems and perform downstream impact analysis. It also provides audit logs that track who has created or changed raw data and semantic data, and it tracks changes to the data pipelines and workflows that operate on the raw data.
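
Tracing lineage back to a source and analyzing downstream impact are both graph traversals, in opposite directions. The sketch below shows that idea with a small in-memory graph; the `LineageGraph` class and the data set names used with it are hypothetical and do not reflect how DataFoundry stores lineage.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph where an edge source -> target means target is derived from source."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source, target):
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start, edges):
        # Breadth-first traversal collecting every node reachable from start.
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def upstream(self, node):
        """Everything the node was derived from (trace back to sources)."""
        return self._walk(node, self._upstream)

    def downstream(self, node):
        """Everything that depends on the node (impact analysis)."""
        return self._walk(node, self._downstream)
```

With edges from a source table to a lake table and from the lake table to a reporting table, `upstream` on the reporting table returns both ancestors, and `downstream` on the source table lists everything a schema change there could affect.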

DataFoundry supports the creation of users with different levels of access, as well as domains, so administrators can control which users have access to specific data sets. Users within a domain can share data, pipelines, and workflows.

Ingestion Automation Built Into DataFoundry Is the Fastest Way to Onboard a Data Source

Infoworks automates many of the data operations tasks required to onboard a data source, which makes onboarding an order of magnitude faster than other approaches. The data operations that Infoworks handles automatically include:

  • Schema Discovery and schema change handling
  • Data type discovery & mapping
  • Duplicate record handling
  • High-speed, parallelized data ingestion
  • Change Data Capture (including log-based)
  • Data Validation and reconciliation
  • Time Axis/SCD2 at record level
  • Native paths to data sources (e.g., TPT for Teradata) for efficient data ingestion
  • Connectors for databases, data warehouses, NoSQL, SaaS apps, APIs, files (CSV, JSON, etc.), and more
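
As one illustration of the duplicate record handling listed above, a common approach is to keep only the most recent record per key before merging a batch. The sketch below assumes each record carries a key field and an ordering field such as a timestamp; the function name `dedupe_latest` and the field names are hypothetical, and this is not a description of Infoworks internals.

```python
def dedupe_latest(records, key_field, order_field):
    """Return one record per key, keeping the one with the highest order_field.

    If two records for the same key tie on order_field, the later one in the
    input wins. Output order follows the first appearance of each key.
    """
    latest = {}
    for record in records:
        key = record[key_field]
        kept = latest.get(key)
        if kept is None or record[order_field] >= kept[order_field]:
            latest[key] = record
    return list(latest.values())
```

Deduplicating the incoming batch this way means each key appears at most once when the batch is merged into the target table.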

Benefits of DataFoundry native integration with Databricks

Infoworks DataFoundry integrates natively with Databricks, which brings a number of benefits:

  • All ingestion processes run on Databricks Runtime rather than over JDBC, which is a much more efficient approach to ingestion and CDC
  • Automatically handles duplicate records in Delta merge
  • Automated deployment of auto-scaled clusters tailored for individual jobs and data sizes
  • Data is natively stored in Delta tables
  • Augments Delta merge by automatically maintaining record versioning for any time period
  • All Delta tables created by Infoworks data onboarding are registered in the metastore, allowing easy SQL access via notebooks
  • Auto-optimizes Delta tables to overcome fragmentation caused by many small files
  • Onboarded data is automatically cataloged
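
Fragmentation from many small files is usually addressed by compaction, that is, rewriting groups of small files as fewer, larger ones. The sketch below illustrates only the planning step with a simple first-fit-decreasing grouping. It is not the algorithm used by Delta Lake or by Infoworks; the function name `plan_compaction` and the sizes and target used with it are assumptions for illustration.

```python
def plan_compaction(file_sizes, target_size):
    """Group small files into bins whose combined size does not exceed target_size.

    Files already at or above target_size are left alone. Only bins holding
    more than one file are returned, since rewriting a single file gains nothing.
    """
    # Consider only files smaller than the target, largest first.
    small_files = sorted((s for s in file_sizes if s < target_size), reverse=True)

    bins = []
    for size in small_files:
        for group in bins:
            # Place the file in the first bin that still has room.
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            bins.append([size])

    return [group for group in bins if len(group) > 1]
```

Given file sizes of 10, 20, 30, 40, 128, and 200 (say, in MB) and a target of 128, the plan combines the four small files into one group and leaves the two larger files untouched.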

See how you can rapidly onboard a data source using Infoworks DataFoundry.

About this Author
Amar Arsikere
Prior to co-founding Infoworks, Amar built the first data warehousing platform on Bigtable at Google, then built one of the world’s largest in-memory database infrastructures at Zynga. To deal with the complexity of these environments, Amar created an automation layer to simplify the entire data workflow. Amar started Infoworks to create a commercially available product based on his experiences. Infoworks represents the third time he has eliminated the complexity of big data.