For years, the big data world has been talking about how Hadoop and cloud would provide cost and flexibility advantages that would eliminate data warehouse appliances like Netezza and Teradata. For those companies who have made the move, the results are amazing. Analytics that used to take 30 hours covering a month’s worth of historical data now take 30 minutes analyzing an entire decade’s worth of historical data. Unfortunately, very few companies have fully made the move to modern data architectures. In fact, according to Gartner, over 85% of big data projects fail to deploy to production.
The reason this mass migration never occurred is that modern distributed data platforms, whether on premise or in the cloud, are still too hard to use & deploy, and the expertise is expensive and extremely hard to find. No wonder so many organizations have never made the jump. Now thanks to the fact that IBM has announced the Netezza end of life, there are lots of organizations who are forced to think about a Netezza migration. In addition, Teradata offload continues to be top of mind for cost-conscious CIOs.
The answer is very simple. The Netezza product was old technology whose time had come to be sunset so they are replacing it with newer technology. This is normal for any technology and no one should be shocked. Most people aren’t still using flip phones and migrated to smartphones. This is more or less the same thing. As a result, IBM is sunsetting Netezza Twinfin and Striper models and are replacing them with a new system called Sailfish.
IBM is promising free migration to Sailfish, but it is clear that Netezza customers are using this forced migration and an opportunity to evaluate other kinds of technologies like Hadoop, cloud data warehouses or other big data environments. Why is it clear? Because we are getting lots of requests to use the Infoworks.io software to help with these kinds of data warehouse migrations.
So given the current market reality, it is time to revisit this topic. First of all, migrating from your legacy EDW to a modern architecture isn’t as simple as just copying over the data. You have to also rewrite the workflow logic to run in the new environment which means you need an expert who knows both the old and the new technology. You then need another expert who knows how to import data in bulk and then make sure that the data stays in synch while you complete the transition from the old system to the new system. Yet another expert is required to convert and optimize your old transformation logic to the new system. Then finally, all of this work has to be written in a manner so it is easily manageable while running on top of a distributed infrastructure.
Fortunately, this process can be automated and the risk and the time involved can be significantly reduced. In one real-world example, an Infoworks customer migrated a 300 table data warehouse from Teradata to a Hadoop cluster, including all of the data transformation and workflow logic in less than a month. A system integrator who bid on the project offered up a proposal to complete the project in 9 months. The solution to enable this 9x time savings is illustrated below.
First, you have to rewrite the high-level workflow logic. This is the only somewhat manual step in the process but this is a pretty small effort because most of the overall logic that has to be migrated is the transformation logic and that step is automated (see below).
For this step, all that happens is business logic contained in Teradata BTEQ scripts, for example, are recreated using a drag and drop environment. A real case study is shown at the end of the blog showing how much time this takes in a real example
|Legacy Teradata BTEQ ( in the red box)||Graphical Representation of BTEQ in Infoworks|
The result is a nice graphical representation of a data process flow which is much easier to maintain than legacy code. Now that this step is done, the automation really kicks in!
The next step is to sync the data and metadata from the legacy data warehouse into the new environment. This is easier than it sounds, especially for large EDWs. First, you have to extract the data from the legacy environment at high speed. This means that in the Teradata instance, for example, you don’t want to use JDBC or ODBC, you need to use TPT (Teradata Parallel Transport) and then you need to be able to load your Hadoop or cloud data warehouse at high speed as well.
It doesn’t end there, because once you have completed the initial bulk load of data, you have to continue to merge and sync new data from the legacy environment into the new platform until you completely cutover. If you are loading the data into Hadoop, this is particularly problematic if you aren’t familiar with all of the issues of data ingestion into a Hadoop environment like data type discovery and conversion as well as handling issues around change data capture. Fortunately, Infoworks is a data ingestion platform and automates away any issues as well.
Infoworks deals with this by reading the SQL or Teradata BTEQ, stripping out the SQL from the BTEQ, and then generating data pipelines from the legacy SQL. Infoworks handles the conversion of syntax and semantic differences and also handles functionality that the EDW SQL supports, but is not available on the big data or cloud platforms. The result is a nice graphical representation of a data flow which is not only automatically generated, but much easier to maintain.
|Complex Teradata SQL (could also be Netezza SQL)||Data Pipeline Automatically Generated from SQL|
In addition, data lineage is also automatically tracked in the new environment. In this case, Infoworks converted the transformation logic as part of the first step. But it also then optimizes the performance of running that logic on the new platform. This step is critical to the transformation logic that may not behave the same way on the new platform as on the legacy platform. As a result, Infoworks automatically tunes the transformation pipeline performance optimizing it for the user. In one recent example, Infoworks was able to recreate transformation logic that had been manually tuned over a period of months on a legacy environment to optimally run in 4 hours. After one day of set up and ingestion of the transformation logic, the Infoworks system running on a Hadoop cluster auto-tuned the performance to run in 10 minutes. This example reflects the power of a distributed big data system when combined with the embedded knowledge the Infoworks system has to tune the performance of the transformation logic to run with optimal performance.
The final step is to manage workloads in production. Legacy EDW platforms like Teradata and Netezza do a great job of providing tools and interfaces to manage production processes. Unfortunately, newer environments like Hadoop and modern cloud-based infrastructures like Azure, AWS or GCP leave operational management as an exercise for the user. Note that this is one of the reasons why many organizations have never made the leap to modern architectures. While there are lots of tools for generating code for these new environments, that only addresses data pipeline development, they don’t address ongoing management of those pipelines and if you are trying to run analytics on a daily, hourly, minute by minute or second by second basis, you need to have a bullet-proof operational environment.
Infoworks addresses this issue by taking the visual code that was generated in the earlier steps and managing the orchestration of running these pipelines in production. It handles starting, pausing and restarting of jobs. It detects when jobs fail and automatically restarts them and also notifies users of any failed jobs. In a nutshell, it isn’t just a code generator, but a full platform for both the creation AND operationalization of data engineering pipelines. This second point is critical and is worth repeating because when you start running 100s or 1000s of data pipelines, you will spend more time managing the operational environment than creating new data pipelines IF you have not automated the operational aspects of your DW migration.
In real-world examples, migration from legacy data warehouses to modern architectures can be done at much lower cost and risk than ever before. In the example below, an actual project that was expected to take 10 months based on the estimates of several systems integrators took only 26 days, a 9x savings in time and effort.
|Amount of data/metadata||
5 BTEQ Files
Promote to Production
|Traditional Process||2 Months||1.5 Months||4.5 Months||1mth|
Infoworks Automated Process
|7 Days||6.5 Days||11.5 Days||1 Days|
The bottom line is that Netezza Migration, Teradata Migration and enterprise data warehouse migration now is possible and makes more sense than ever before.