Uber, Airbnb & Netflix are Driven by Big Data Analytics Using Spark & Hadoop

Written by Todd Goldman | Category: Big Data

There is quite a bit of chatter about companies wanting to digitally transform themselves. Companies that are 100 years old want to learn how to be driven by big data analytics using Spark or Hadoop, just like Uber, Airbnb, and Netflix, which were all born in the digital age.

In some ways, these businesses are data analytics companies that also provide a specialized service based on those analytics: analytics plus car transportation, analytics plus room reservations, or analytics plus movie rentals. More fundamentally, they are analytics plus a matching service that connects people to drivers, people to rooms, or people to movies.

Once each of them mastered those basics, they used their analytics to expand into related businesses, such as food delivery in the case of Uber, or, in the case of Netflix, using data about what films people watch to figure out what kind of films should be created. I don’t know if people have noticed, but it seems like the rom-com, so popular in the ’80s and ’90s, disappeared until Netflix recently decided to bring it back.

That is no accident. Netflix clearly used its knowledge of its audience’s love for movies like “When Harry Met Sally” to determine that if it created its own rom-coms, it could increase viewership and subscriptions. This was a great use of data to make better business decisions.

All of these companies have published various blog posts about how they scaled their big data analytics using Spark and Hadoop. More importantly, in those same posts, they discuss the additional investment they have made to solve the problems they ran into while making big data solutions actually work in production.

Common Problems When Scaling Big Data Analytics

While these companies operate at a scale that most organizations have yet to reach, the problems they ran into are ones that most companies will face very quickly when they try to put big data analytics use cases into production. For example:

  • Incremental data ingestion – Uber built a solution to the very common problem of how to load incremental changes to data without having to reload the entire data set. This is an extremely common issue if you are trying to use Hadoop for transactional data.
  • Workflow management – Airbnb built a system to define and manage data workflows so that its engineers could avoid writing scripts and having scripts call other scripts.
  • Data catalogs – Netflix built a system to make it easier to search for data existing in various data sources so they could more easily put that data to use.
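
To make the incremental ingestion item above more concrete, here is a minimal Python sketch of the general idea: merge only the rows that changed since the last run instead of reloading the whole data set. It is not Uber's implementation. The function name `apply_incremental`, the `id` and `updated_at` fields, and the sample trip data are all assumptions made up for illustration, and a real system on Hadoop would have to handle files, partitions and failures as well.

```python
from datetime import datetime


def apply_incremental(target, changes, watermark):
    """Merge rows changed after `watermark` into `target` (a dict keyed by id).

    Returns the new watermark, i.e. the latest `updated_at` that was applied.
    All names here are hypothetical and exist only for this illustration.
    """
    new_watermark = watermark
    for row in changes:
        if row["updated_at"] <= watermark:
            continue  # already ingested in an earlier run, skip it
        target[row["id"]] = row  # insert a new row or overwrite the old version
        new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark


# Existing data set, keyed by primary key (made-up sample data).
trips = {
    1: {"id": 1, "fare": 10.0, "updated_at": datetime(2019, 1, 1)},
    2: {"id": 2, "fare": 12.5, "updated_at": datetime(2019, 1, 1)},
}

# Only the changes since the last load: one update and one new row.
changes = [
    {"id": 2, "fare": 14.0, "updated_at": datetime(2019, 1, 2)},
    {"id": 3, "fare": 8.0, "updated_at": datetime(2019, 1, 3)},
]

watermark = apply_incremental(trips, changes, datetime(2019, 1, 1))
```

The untouched row for trip 1 is never read or rewritten, which is the whole point: the cost of a load depends on how much changed, not on how large the data set is.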

The fact that these companies invested in projects to solve these very fundamental issues points to the reality that if you want to deploy a big data solution into a production environment, you are going to have to invest time, money and, most importantly, people into solving real problems that you will face. And some of those problems are frankly ones that most organizations thought were already solved.

This isn’t an issue when you are Uber, Airbnb or Netflix, because one thing all three of these companies have in common is a wealth of big data engineers with amazing big data experience. So they have the luxury of being able to invest the time in automating solutions to what are pretty fundamental issues. Most organizations do not have that same luxury.

The good news is that in all of the cases I mention above, the companies have open-sourced their projects. The bad news is that, for the most part, these projects were designed to automate a solution in a relatively specific manner that works well for them but won’t work well out of the box for the general use case without some additional coding.

That is because when you solve a problem for your own company, the approach you take is very different from building commercial software intended to be used by lots of different companies in lots of different environments. In addition, those solutions require yet more coding on top of them, so their users are limited to other developers and data engineers.

If your company is hoping to enable self-service analytics, more open-source code won’t get you to the finish line because your typical data analyst or business analyst won’t be able to use this code themselves. Ease of use is definitely not a criterion in these projects, which were built by engineers, for engineers.

That said, the broader point can’t be ignored. If you want to be like Uber, Airbnb or Netflix and implement big data analytics using Spark or Hadoop, you should be aware that there is a lot of automation you will need to add on top of your underlying big data fabric to make it work in a production environment.


If you want to learn more about how the Infoworks.io agile data engineering platform can automate your on-premises or cloud data architecture and accelerate the creation of data pipelines in a code-free, production-ready environment, feel free to contact us.

About this Author
Todd Goldman
Todd is the VP of Marketing at Infoworks and a Silicon Valley veteran with more than 20 years of experience in marketing and general management. Prior to Infoworks, Todd was the CMO of Waterline Data and COO at Bina Technologies (acquired by Roche Sequencing). Before Bina, Todd was Vice President and General Manager for Enterprise Data Integration at Informatica, where he was responsible for its $200MM PowerCenter software product line. Todd has also held marketing and leadership roles at both start-ups and large organizations including Nlyte, Exeros (acquired by IBM), ScaleMP, Netscape/AOL and HP.
