So You Want to do Big Data Analytics like Uber, Airbnb and Netflix

Posted by Todd Goldman

There is quite a bit of chatter about companies wanting to digitally transform themselves. Hundred-year-old companies want to learn how to be data-analytics-driven businesses like Uber, Airbnb and Netflix, which were all born in the digital age. In some ways, those businesses are data analytics companies that also provide a specialized service based on those analytics: analytics plus car transportation, analytics plus room reservations, or analytics plus movie rentals. More fundamentally, they are analytics plus a matching service that matches people to drivers, people to rooms or people to movies.

Once each of those businesses had mastered those basics, it used its analytics to expand into related businesses: food delivery in the case of Uber, or, in the case of Netflix, using data about what films people watch to figure out what kinds of films should be created. I don’t know if people have noticed, but it seems like the rom-com, so popular in the ’80s and ’90s, disappeared until Netflix recently decided to bring it back. That is no accident. Netflix clearly used its knowledge of its audience’s love for movies like “When Harry Met Sally” to determine that if it created its own rom-coms, it could increase viewership and subscriptions. This was clearly a great use of data to make even better business decisions.

All of these companies have published blog posts about how they used Hadoop and Spark to scale their data analytics. More importantly, in those same posts, they discuss the additional investment they have made to solve the various problems they ran into while making big data solutions actually work in production. And while these companies are dealing with scaling problems that most organizations have yet to face, the reality is that the problems they ran into are ones most companies will face very quickly when they try to put big data into real production. For example:

  • Incremental data ingestion – Uber built a solution to deal with the very common problem of how to load incremental changes to data without having to reload the entire data set. This is an extremely common issue if you are trying to use Hadoop for transactional data.
  • Workflow management – Airbnb built a system to avoid writing scripts and having scripts call other scripts.
  • Data catalogs – Netflix built a system to make it easier to search for data existing in various data sources so they could more easily put that data to use.
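
To make the first item concrete, here is a minimal sketch in Python of one common way to approach incremental ingestion: keep a high-water mark on a last-modified timestamp and only apply rows that changed after it. This is an illustration only, not Uber's implementation. The function name `incremental_load`, the `id` and `updated_at` fields, and the plain dict standing in for the target table are all assumptions made for the example.

```python
from datetime import datetime


def incremental_load(source_rows, target, watermark):
    """Upsert rows modified after `watermark` into `target`.

    `source_rows` is an iterable of dicts with hypothetical `id` and
    `updated_at` keys; `target` is a dict keyed by `id` that stands in
    for the destination table. Returns (new_watermark, rows_applied).
    """
    new_watermark = watermark
    applied = 0
    for row in source_rows:
        # Skip anything already loaded by a previous run.
        if row["updated_at"] <= watermark:
            continue
        # Insert a new row or overwrite the older version of it.
        target[row["id"]] = row
        applied += 1
        if row["updated_at"] > new_watermark:
            new_watermark = row["updated_at"]
    return new_watermark, applied


# First run: nothing has been loaded yet, so every row is applied.
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1), "status": "requested"},
    {"id": 2, "updated_at": datetime(2024, 1, 2), "status": "requested"},
]
table = {}
mark, count = incremental_load(source, table, datetime.min)

# Second run: only the row that changed since the first run is applied.
source[0] = {"id": 1, "updated_at": datetime(2024, 1, 3), "status": "completed"}
mark, count = incremental_load(source, table, mark)
```

Even this toy version hints at why the real thing takes engineering effort: it assumes every source row carries a reliable modification timestamp, it ignores deletes and late-arriving data, and the watermark itself has to be stored somewhere that survives between runs.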

The fact that these companies invested in projects to solve these very fundamental issues points to the reality that if you want to deploy a big data solution into a production environment, you are going to have to invest time, money and, most importantly, people into solving the real problems you will face. And some of those problems are, frankly, ones that most organizations thought were already solved. But that isn’t an issue when you are Uber, Airbnb or Netflix, because one thing all three of these companies have in common is a wealth of big data engineers with amazing big data experience. So they have the luxury of investing the time to automate what are pretty fundamental tasks. Most organizations do not have that same luxury.

The good news is that in all of the cases I mention above, the companies have open sourced their projects. The bad news is that, for the most part, these projects were designed to automate a solution in a relatively specific manner that works well for their creators but won’t work well out of the box for the general use case without some additional coding. That is because the approach you take when you solve a problem for your own company is very different from the one you take when building commercial software intended to be used by lots of different companies in lots of different environments. In addition, those solutions require yet more coding on top of them, so their users are limited to other developers and data engineers. So if your company is hoping to enable self-service analytics, more open source code won’t get you to the finish line, because your typical data analyst or business analyst won’t be able to use this code themselves. Ease of use is definitely not a criterion in these projects, which were built by engineers, for engineers.

That said, the broader point can’t be ignored. If you want to be like Uber, Airbnb or Netflix and do analytics just like them, you should be aware that there is a lot of automation you will need to add on top of your underlying big data fabric if you want to make it work in a production environment.

 


 

If you want to learn more about how the Infoworks.io agile data engineering platform can automate your on-premises or cloud data architecture and accelerate the creation of data pipelines in a code-free, production-ready environment, feel free to contact us.

 
