With 2018 coming to a close, it’s time to engage in that long-standing New Year’s tradition of putting on our sorcerer hats and predicting what’s ahead for big data in the coming months. After a bit of a shakeup in the market, with Cloudera and Hortonworks merging and IBM exiting the Hadoop market, big data is beginning to prove there’s still great promise as it starts to break out of the “for developers only” mold in a big way.
People have been talking about digital transformation for years without ever really knowing what it meant. Very often the term was used to describe finding ways to sell the data a company was generating in order to create new revenue streams. These were generic ideas not specific to any particular business, which is why they went nowhere for most organizations. What organizations are now coming to realize is that digital transformation is really about taking a data-driven approach to every aspect of their business in an effort to create competitive advantage. If you’re a retailer, it might be about making real-time “next best offer” recommendations while customers are in your physical stores, or getting more out of your inventory to provide a better online and in-store experience. If you are an oil exploration and production company, it is about using data to make wellhead drilling adjustments hourly instead of daily to maximize the yield from an oil field. These are discussions that cut to the core of even very traditional businesses, which are now beginning to correctly identify digital transformation as a means of investing in an agile data platform that reflects the state of the business and can pivot to support new business models as quickly as they emerge. In the same way that you wouldn’t start a company without an ERP or CRM system, you now wouldn’t run one without a data platform. To see evidence of this evolution in 2019, look for mentions of data in yearly and quarterly reports, and for mentions of data and analytics in very business-specific use cases on earnings calls.
One thing we’ve started to see toward the end of 2018 is the shift toward operationalization of big data pipelines that everyone’s been clamoring for but few have been able to achieve. While many companies succeed in getting a few big data pipelines for machine learning and analytics use cases into production, it takes an incredible amount of effort to get them running and to keep them running. After the initial success and the excitement that comes with it, all of their resources go to maintaining the existing use cases. The result is that they are not able to scale their organization to take on the backlog of new business use cases that arises after that initial success.
If anything, it’s the constant newness of the technology, as well as the focus on ad hoc data analytics, that has hindered progress toward production-quality big data implementations. Key to this evolution is the recognition that there is a difference between developing a machine learning (ML) algorithm or running a data pipeline once and running it over and over again. The introduction of automation frameworks specifically designed to operationalize big data workflows makes going from initial development of a new analytic use case to putting it into production much simpler. Also, with more CDOs now available to drive change (thanks in no small part to GDPR, which requires many organizations to appoint a data protection officer and has raised the profile of data leadership more broadly), 2019 will see an increasing number of organizations get their companies on board with a single data vision and move from ad hoc analytics to full operationalization of enterprise-wide big data analytics at scale.
Organizations seem to be having a hard time standardizing on a single big data environment as technologies advance and the economics continue to change. They are implementing both on-premises and cloud-based Hadoop and Spark infrastructures and are evaluating serverless technologies as well. In fact, while the trend over the past few years has been to embrace the cloud for big data analytics with machine learning, in 2018 we started hearing about companies beginning a “repatriation” of data back on premises. At several recent CDO Summits (organized by Evanta, a Gartner Company), many organizations commented that they were moving back on premises because the cloud, even with the additional automation, was significantly more expensive than on-premises infrastructure, especially for non-dynamic workloads. Even Amazon seems to be acknowledging this phenomenon with the introduction of AWS Outposts.
We are also seeing many of our own customers implement some aspect of their data analytics environment both on-premises and in the cloud. In fact, they are sometimes going with multiple cloud vendors, using different providers for different capabilities. Note that Infoworks customers may represent a leading edge here, because our platform makes it possible for end users to redeploy data pipelines on different big data infrastructures without recoding. The result is that they can more aggressively manage data pipelines across clouds and on-premises. That said, organizations are becoming more savvy in how they deploy their data solutions and will likely look to use on-premises platforms for relatively stable workloads and the cloud for more dynamic workloads.
The bottom line is that there is a lot of choice, and organizations will avail themselves of that choice for time to market and economic value. That means that data ecosystems will likely be a tangled mess of different environments for the next few years.
Next year will bring greater alignment between traditional analytics and ML. We’ll see more organizations using ML to augment everyday operational analytics pipelines and normal line-of-business activities. In the past, ML was somewhat restricted to what data scientists could evaluate and test before a data engineering team could deploy it into production. In fact, most organizations have a traditional BI/analytics team, a separate team of data scientists, and yet another team of data engineers. Those groups and skill sets will begin to merge, or at least start overlapping to a greater degree.
Two trends are accelerating the use of ML. The first is the introduction of the “citizen data scientist,” who can use some basic ML algorithms within their data pipelines as those capabilities begin to show up in more traditional BI and data integration platforms. The second is the ability for data scientists to use more automated tools to put advanced ML algorithms into production. Currently, data scientists create clever ML algorithms and then wait for a data engineering team to create production-ready data pipelines. In 2019, automation frameworks will allow data scientists to create their own data pipelines that are close to production-ready. This combination of bringing data engineering to data scientists and data science to data analytics will drive an increase in the number of ML algorithms that actually go into enterprise-level production.
The consolidation of big data vendors (as measured by the dwindling number of vendors at trade shows like Strata) will level off. Soon, the number of companies getting acquired or simply disappearing will match the number of new entrants. Fewer vendors will get funded, but more of them will be delivering a greater level of innovation and value. The market is no longer in the mood to tolerate crowds of vendors with little differentiation, or providers that aren’t delivering real value or generating revenue. Let’s face it: the first wave of big data vendors included many organizations that weren’t building businesses; they were building features. As more of them get rolled up into an integrated stack, it will be up to the next generation of players to transform big data into something that’s truly big.
That wraps it up for 2018. Both Ramesh and I wish you all a great holiday season and a very happy new year!