It all starts by connecting rivers to your lake. Let's look at how this works.
Ingestion is one of our core feature areas. It is the process of efficiently loading an initial data set, then configuring how that set will remain current with its external source.
DataFoundry goes far beyond mere data loading. 1 It actively queries and crawls the data sources you target, identifying candidates to onboard. From this crawling process, 2 you choose how you want to build and maintain your data lake. 3 DataFoundry then ingests your chosen sources, centrally organizing all of its loads along time axes to ensure robust data lineage. 4 In the field, we've seen customers ingest over one hundred thousand source tables in less than two months.
We're a bit of a Swiss Army knife when it comes to ingestion. While the specifics vary by product version and your technical ecosystem, we support seven general ingestion categories. 1 From the RDBMS and SQL category, we support [read list]. 2 We also ingest numerous flat file formats [read list]. 3 We support ingestion from external Hive clusters. 4 We support ingestion from MongoDB and MapR-DB. 5 From cloud stores, we ingest from [read list]. 6 We ingest streams from [read list]. 7 And we can attach to and ingest from REST APIs. If you need more, let us know.
A fundamental goal behind DataFoundry is to break down your silos by pouring your many streams into a common data lake. 1 We do so by first ingesting a historical load from a designated source. If the source is relatively small, or provides no way to support change data capture, we truncate and load the whole source. If the source is large, or has operational access constraints, parallelized ingestion can be configured to run within time windows and connection limits. 2 Once historical data is loaded, a continuing connection ingests changing data at a rate you specify. We can automatically track and capture changing data by referencing log and journal entries, by querying timestamps, batch IDs, and similar metadata, by referencing stream metadata, and by using Oracle GoldenGate. We also use queries to monitor schema and structural changes in your targeted sources. When detected, we propagate those changes into the data lake and backfill data as specified. 3 Our ingestion technology can also be configured for ongoing incremental export to cloud targets managed outside DataFoundry.
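For the technically inclined, here is a minimal PySpark sketch of the timestamp-based capture strategy just described. The watermark helper, table names, paths, and connection details are illustrative assumptions for the example, not DataFoundry's actual API.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cdc-sketch").getOrCreate()

# Hypothetical helper: the highest change timestamp already in the lake.
def get_last_watermark(table_name):
    row = spark.read.parquet(f"s3://lake/{table_name}") \
               .agg({"updated_at": "max"}).first()
    return row[0]

jdbc_url = "jdbc:oracle:thin:@//source-host:1521/ORCL"  # assumed source
props = {"user": "ingest", "password": "secret"}

# Pull only rows changed since the last ingested watermark, then append
# them to the lake along a time axis so lineage and history are preserved.
last_seen = get_last_watermark("orders")
changes = (spark.read.jdbc(jdbc_url, "orders", properties=props)
           .filter(col("updated_at") > last_seen))
changes.write.mode("append").parquet("s3://lake/orders")
```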
To ensure the validity of ingested data, 1 we track and compare source data with the data ingested into its data lake target. 2 Row counts are compared automatically. You can also configure additional automatic reconciliation formulas, such as aggregations calculated and compared between your source and target. 3 The results of these calculations are reported for investigation.
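As an illustration, a reconciliation pass of this kind might look like the following PySpark sketch; the table, column, and connection names are assumptions for the example, not our internal implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reconcile-sketch").getOrCreate()

jdbc_url = "jdbc:oracle:thin:@//source-host:1521/ORCL"  # assumed source
props = {"user": "ingest", "password": "secret"}

source = spark.read.jdbc(jdbc_url, "sales", properties=props)
target = spark.read.parquet("s3://lake/sales")  # assumed lake path

# Row counts plus a configurable aggregate (here, a column sum) computed
# on both sides; mismatches are flagged for investigation.
checks = {
    "row_count": (source.count(), target.count()),
    "sum_amount": (
        source.agg({"amount": "sum"}).first()[0],
        target.agg({"amount": "sum"}).first()[0],
    ),
}
for name, (src_val, tgt_val) in checks.items():
    status = "OK" if src_val == tgt_val else "MISMATCH"
    print(f"{name}: source={src_val} target={tgt_val} -> {status}")
```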
As you know, when handling large data volumes, 1 the physical distribution of records relative to query plans significantly affects ongoing performance. 2 During ingestion and transformation, DataFoundry helps you manage data skew by partitioning your data appropriately, physically segmenting it to support parallel query execution. For example, a population table could be partitioned by region, or even hierarchically partitioned by region and then state. 3 Alternatively, or in addition to partitioning by field, DataFoundry can divide data sets equally into buckets, accessed virtually by hashed keys, distributing the physical data to 4 support maximum parallelization based on your expected query plans. If you need to change your partitioning and bucketing later, DataFoundry automates that for you, too. Both strategies are sketched in the example below.
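Here is a brief PySpark sketch of both layout strategies, hierarchical partitioning and hash bucketing; the paths, table names, bucket count, and key columns are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("layout-sketch")
         .enableHiveSupport().getOrCreate())

# Assumed lake path; in practice this is whatever store backs your lake.
population = spark.read.parquet("s3://lake/population")

# Hierarchical partitioning: region, then state. Queries that filter on
# these fields prune whole directories instead of scanning the full table.
(population.write
    .partitionBy("region", "state")
    .mode("overwrite")
    .parquet("s3://lake/population_by_region_state"))

# Bucketing: hash household_id into 32 buckets so rows spread evenly
# across files, supporting parallel scans and skew-free joins on that key.
(population.write
    .bucketBy(32, "household_id")
    .sortBy("household_id")
    .mode("overwrite")
    .saveAsTable("lake_db.population_bucketed"))
```

Partitioning prunes whole directories for queries that filter on the partition fields, while bucketing spreads rows evenly by a hashed key, which is why the two are often combined.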
Ultimately, data is of no use unless your analysts can find it. 1 We provide a robust cataloging capability, letting you name, tag, and describe all your sources, targets, models, and even cubes, as relevant, and later perform fast lookups against this metadata catalog.
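As a toy illustration of the idea, the Python sketch below searches a small in-memory catalog by name, description, and tag; the entry structure is an assumption for the example, not DataFoundry's catalog schema.

```python
# A tiny in-memory stand-in for a metadata catalog; each entry carries
# the name, description, and tags the narration mentions.
catalog = [
    {"name": "sales_orders", "kind": "source", "tags": ["finance", "daily"],
     "description": "Order extracts from the ERP system"},
    {"name": "revenue_cube", "kind": "cube", "tags": ["finance"],
     "description": "Aggregated revenue by region and quarter"},
]

def lookup(term):
    """Return catalog entries whose name, description, or tags match term."""
    term = term.lower()
    return [entry for entry in catalog
            if term in entry["name"].lower()
            or term in entry["description"].lower()
            or any(term == tag.lower() for tag in entry["tags"])]

print(lookup("finance"))  # both entries carry the 'finance' tag
```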
To provide perspective on the ROI of automating your data engineering with Infoworks, 1 here are some metrics. One hundred million data files that consumed one hour and forty minutes of ingestion time using point tools were ingested in ten minutes. 2 A nine-hundred-twenty-five-million-record table that took fourteen hours to ingest using point tools was ingested in six hours by a small Infoworks cluster, and in one hour using a right-sized cluster. 3 Our continuous merge process can reduce hours spent merging data sets using point tools to minutes, while providing near-real-time data access throughout the merge process. 4 We delivered a nearly 9x improvement in CPU impact relative to a string of point tools ingesting the same fifty million rows of five hundred five columns from Teradata. 5 The coding and maintenance labor to onboard new data sources is reduced from days or weeks to a few hours of configuration. 6 The coding and maintenance labor to build your own CDC, Merge, Time Axis, and History processes is also reduced from weeks to a few hours of configuration. 7 And the validation, type management, schema management, and system-wide orchestration you could spend months coding and maintaining are just there for you, built right in.
So, what have you learned? 1 Infoworks actively crawls over 28 source formats, dramatically speeding up the data source onboarding process. 2 Once onboarded, ingestion involves a one-time full or incremental historical load, configurable by time window and connection limits to reduce impact on operational systems, 3 followed by ongoing ingestion of your changing data and changing schema, 4 with incremental export to external cloud stores also available, if relevant. 5 Ingested data is configurably validated and reconciled with its source. 6 Your data is continuously available throughout ingestion, up to and including near-real-time access. 7 Data skews are managed by hierarchical partitioning and/or bucketing, aligned to your anticipated query plans, to maximize speed via parallelization. 8 Metadata cataloging allows you to search your artifacts by name, description, and/or a tagging scheme. 9 And we have metrics available to demonstrate the dramatic ROI available from automating data ingestion processes.
Yes, you could always keep doing this all by hand. But come on back and we'll dive into transforming your ingested data.