It has been 20+ years since ETL tools came out, and we are still revisiting the same argument in a newer context: for big data, whether on-premises or in the cloud, should I hand-code my data pipelines, or should I build them with agile data engineering automation and visual data integration tools and platforms?
First, I would note that those who do not learn from history are doomed to repeat it, which means that graphical development platforms will deliver completed projects with higher performance in less time than hand coding. The issue has been that, until recently, there were no big data automation tools available to replace hand-coded data integration. So the only choice in the big data space until about 2-3 years ago was to hand-code. Today's reality is that there is plenty of good automation infrastructure to reduce the amount of hand coding required.
Now, I know there are some “bro-grammers” out there who think they can out-code modern automated big data integration environments. So for those of you who think your hand coding will be faster and better than a point-and-click data integration environment, my first message to you is…it’s time to get over yourself. You’re not that good.
For those of you who are more humble, I apologize for the digression, but I must admit my frustration at having to repeat an argument that was already proven true in a highly similar and relevant market.
Big data is complicated, and the Hadoop and cloud platform vendors leave a lot of work for the user. It is one thing to hand-code an ad hoc analytical trial. But even then, I would argue there are some amazing tools that look and behave like super-smart spreadsheets with a Hadoop or Spark back end and ML algorithms, and they are much faster than hand coding for ad hoc analytics.
What I know for sure, however, is that for writing enterprise-class, production-quality code that is tested and has error handling and high availability built in, automated development and operational platforms are orders of magnitude faster. This is because they capture the knowledge of people who have successfully implemented scalable big data projects and build that knowledge into the platforms.
One real example is an Infoworks customer who spent 1 month building a data ingest engine to load terabytes of data into their data lake. That same task was completed in 4 hours using an agile data engineering platform, which automated away the coding effort and replaced it with drag-and-drop simplicity.
For those of you who think this is a one-off example, and who have a real big data project you want to complete, give me a chance and I will prove it to you.
Big data environments are very scalable. But, once again, the developer has to know things like how many mappers to run, how to handle change data capture (CDC) so you don’t have to reload all the data every time, and how to sync and merge data into the big data cluster when you do take advantage of CDC.
Coding all of the above takes time and significant expertise. Heck, the expertise itself takes time to develop. In an automated agile big data platform, loading data at scale and handling CDC is literally drag and drop, because the expertise of a super-knowledgeable data engineer with years of experience implementing scalable big data projects has been coded into the platform.
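To see why CDC is more than a simple append, here is a minimal sketch of the merge step it requires: applying only the changed rows to an existing table instead of reloading everything. The record layout (an "op" field marking inserts, updates, and deletes) is an illustrative assumption, not any specific platform's format.

```python
def merge_cdc(base, changes, key="id"):
    """Merge a batch of CDC records into a base table keyed by `key`."""
    merged = {row[key]: row for row in base}
    for change in changes:
        if change.get("op") == "delete":
            merged.pop(change[key], None)           # drop deleted rows
        else:
            row = {k: v for k, v in change.items() if k != "op"}
            merged[change[key]] = row               # upsert inserts/updates
    return list(merged.values())

base = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
changes = [
    {"id": 2, "name": "bobby", "op": "update"},
    {"id": 3, "name": "carol", "op": "insert"},
    {"id": 1, "op": "delete"},
]
print(merge_cdc(base, changes))
```

A real platform does the same thing at scale, with partitioning, conflict resolution, and schema handling layered on top, which is exactly the part that takes a month to hand-code well.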
In the same example mentioned above, the 1 month of hand-coding resulted in a data load time of 18 hours. Using an agile data engineering automation platform, that load time was reduced to 4 hours. Why? Because the expertise is built into the ML algorithms of the platform, and it knows how to tune the performance.
This point was true for ETL, and it is true for the new generation of big data solutions. Data integration has some challenges that application coding does not. In many industries, there are requirements to know where data came from and where it is going. Also, if you make a change, you need to know the downstream effects of that change.
Trying to figure out the implications of a code modification by reading someone else’s handwritten code is incredibly difficult. Tracing those changes through a visual paradigm that shows you exactly where the data is used is several orders of magnitude easier. The result is that automated visual environments make it trivial to trace the impact of changes, make it easier to trace the impact of errors, and because they track who made “code” changes, they also make it easy to audit change history.
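The impact analysis a visual environment gives you for free can be illustrated as a walk over a lineage graph: given the datasets built from other datasets, find everything downstream of a change. The dataset names here are invented for the example.

```python
from collections import deque

def downstream(graph, changed):
    """Breadth-first walk of the dependency graph starting at `changed`."""
    impacted, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# edges point from a source dataset to the datasets built from it
lineage = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["daily_revenue", "customer_360"],
    "customer_360": ["churn_model"],
}
# everything that must be revalidated if raw_orders changes
print(downstream(lineage, "raw_orders"))
```

With hand-coded pipelines, this graph exists only in people's heads; a tool that records it can answer the "what breaks if I change this?" question instantly.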
The bottom line is that the visual development of data integration projects is much easier to support than hand-coding. This was true for ETL tools, and it is still true today.
Organizations have only a few superstar big data programmers, and not enough of them to handle the number of data projects most organizations want to complete. The solution is self-service, but you can’t put Hadoop in the hands of most data analysts. Even if they have the raw intelligence, most of them are just not interested in learning how to make a big data platform scale for them.
They are interested in using the data to answer business questions. This means you can’t give them a raw platform that requires programming. To scale, you have to enable managed self-service that allows a data analyst without big data engineering skills to leverage the power of big data without having to be an expert.
Otherwise, IT becomes a bottleneck and the list of projects that don’t even get started begins to pile up. Hand coding doesn’t help here, but simple point and click big data integration tools can.
Data is important to your business, and it is a tool that you use to run your business, but it isn’t your business. A lot of data engineers don’t want to use automated graphical tools because hand-coding is more fun for them. But if you are in retail, manufacturing, oil and gas, etc., data isn’t your business; it is a support function of your business.
The problem is that once you go down the hand-coding route, you not only have to write the code, but as the code base grows, a larger percentage of your IT organization gets allocated to just fixing bugs. That means fewer people are available to implement new data analytics projects.
In the end, IT becomes the bottleneck. One major computer manufacturer I know of built its own data ingestion engine for Hadoop years ago. It took 5 people to build and maintain this environment on top of Hadoop. Keep in mind, they had tens of thousands of data ingestion pipelines, so this was a complex environment. At the time, they had no choice but to hand-code their own tooling because there were no automation tools available on the commercial market. However, once the market caught up with them and they could move from their own code to a commercially available tool, they were able to free up 4 of their 5 engineers to work on projects more core to their business.
In their case, they were originally ahead of the market for automation, but once the market caught up, they embraced the new tools and platforms because doing so let them refocus their energy on the projects that provide a competitive advantage.
So what argument do people who want to hand-code make when there is an automated alternative? They always resort to the same one: “Hand-coding gives more control.” It’s a bogus argument.
Unfortunately for this argument, most automated environments provide integration points that let you call out to hand-coded algorithms at any point in your process. So 90%+ of your development could be accelerated, you could hand-code the 10% or less that isn’t directly supported by an agile data integration platform, and you would still gain all of the advantages I noted above.
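The escape-hatch pattern those integration points implement looks roughly like this: a pipeline of built-in steps with one hand-coded step plugged in where the platform doesn't cover you. The step names and the business rule are hypothetical, not any vendor's actual API.

```python
def lowercase_names(rows):
    """Stand-in for a built-in, drag-and-drop transformation."""
    return [{**r, "name": r["name"].lower()} for r in rows]

def my_custom_score(rows):
    """The hand-coded 10%: a bespoke rule the platform doesn't ship."""
    return [{**r, "score": len(r["name"]) * 2} for r in rows]

def run_pipeline(rows, steps):
    """Apply each step in order; custom steps slot in like built-ins."""
    for step in steps:
        rows = step(rows)
    return rows

rows = [{"name": "Alice"}, {"name": "BOB"}]
out = run_pipeline(rows, [lowercase_names, my_custom_score])
```

You keep the control where it actually matters and let the platform handle the rest.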
For those of you who read this blog and think, of course, I should use modern data integration tooling that runs on a big data environment, hides complexity, and makes my team more productive, I encourage you to explore the plethora of big data automation vendors out there, including Infoworks.io.
For those of you who read this and still think that you are John Henry and can beat the automated pile driving machine, I offer you this challenge:
If you have a real big data project and you are in the process of hand-coding it, give us a fair chance to prove that an automated data integration platform is faster and better than hand-coding. I promise you won’t regret it.