Databricks Is Going To Be The Next Platform For Many Enterprises

The hyperscalers, cloud builders, and HPC centers control the design and manufacturing of their own AI infrastructure. They have big bucks, and they can afford to get exactly what they want. The rest of the world, and particularly large enterprises that cannot afford to start from scratch and that want to be able to play the clouds and their own infrastructure against each other to mitigate risks and drive down costs, needs an AI platform that just works.

And even more precisely, they need a single platform with commercial support that includes a data lakehouse with data analytics, traditional AI, and generative AI capabilities all rolled into one and available on Amazon Web Services, Microsoft Azure, and Google Cloud.

And for many, that will mean they need something like the Databricks platform, which is based on the open source Apache Spark in-memory database and its various analytics and machine learning extensions.

That, in a nutshell, is why Databricks has been able to talk private equity investors into a stunning ten rounds of funding, with the latest announced this week weighing in at a record-breaking $10 billion, bringing the company’s total funding to date to just under $14 billion since it was founded way back in 2013.

Sparking A Data Analytics Revolution

When we started The Next Platform in late 2014, the Apache Spark in-memory database, which was designed to run across large clusters of commodity X86 servers, was very much on our minds, and particularly as a replacement for the batch-oriented, clunky, and slow MapReduce method created by Google that was cloned by Yahoo and commercialized as Hadoop. There was a lot of hype around Hadoop but it ultimately turned into cheap and deep storage with a painfully slow SQL interface. Spark emerged as a very fast platform that was aimed at less expansive datasets but one which could drive real-time applications and therefore, in that sense, was more immediately useful.

Back in 2009, when Hadoop was mainstreaming, Spark was a research project started by Matei Zaharia, who was working at the AMPLab at the University of California at Berkeley. Hadoop itself was three years old at the time, and was the first of the large scale analytics platforms that was conceived of by a hyperscaler and cloned or directly open sourced.

Google did not open source the code for the Google File System or the MapReduce method of chewing on massive amounts of unstructured data stored in it, but Doug Cutting and Mike Cafarella recreated what they read about in the famous Google MapReduce paper, and Yahoo, which hired Cutting in 2006, put the resulting Hadoop into production. And almost immediately, everyone wanted something that was faster and smarter. MapReduce spread data across the storage of a cluster and ran a function against each chunk on the node where it lived, emitting intermediate results (that’s the Map part), and then grouped and collated those results across the cluster into a final answer (that’s the Reduce part).
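To make the two phases concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce steps, using the customary word-count example. The function names are ours and are not part of the Hadoop API, and a real cluster would run the map step in parallel on the nodes that hold each chunk of data rather than in one loop.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Illustration only: on a cluster, each split is mapped where it is stored.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["spark is fast", "hadoop is batch", "spark is in memory"])
print(counts)
```

Every stage here hands its full output to the next one, which on a Hadoop cluster meant writing to disk between steps. That is a big part of why the approach felt slow for interactive work.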

By contrast, Spark implements a distributed shared memory for data sitting in a fault tolerant datastore, called a resilient distributed dataset, and allows all manner of algorithms – including standard statistics as well as machine learning algorithms – to quickly retrieve and to analyze, to manipulate, or to learn from that data. The original Spark was written in Scala and open sourced under a BSD license back in 2010. The co-founders of Databricks, who wanted to commercialize Spark, formed a company in 2013, and given the difficulties of running a startup and not wanting to boil the ocean, they decided to only sell Spark running in the cloud, which they called Databricks. If you wanted to run Spark in your datacenter, you could, and if you poked around, you might be able to get some technical support through a Hadoop distributor or a Linux distributor.
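The idea behind a resilient distributed dataset can be sketched in a few lines of plain Python. This is a toy of our own making, not the Spark API: the `ToyRDD` class and its `lose_cache` method are hypothetical, everything runs in one process, and unlike Spark (which keeps results in memory only when asked to) the toy caches every result it computes. What it does show is the core trick: each dataset remembers the recipe that produced it, so a lost in-memory copy can be rebuilt from its parent.

```python
class ToyRDD:
    """A toy, single-process stand-in for a resilient distributed dataset."""

    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = partitions  # materialized data, or None if not computed
        self._parent = parent          # lineage: the dataset this one derives from
        self._fn = fn                  # lineage: how to derive one partition

    def map(self, f):
        # Transformations are lazy: record the recipe, compute nothing yet.
        return ToyRDD(None, parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(None, parent=self,
                      fn=lambda part: [x for x in part if pred(x)])

    def _compute(self):
        if self._partitions is None:
            # Rebuild from lineage and keep the result in memory for reuse.
            self._partitions = [self._fn(p) for p in self._parent._compute()]
        return self._partitions

    def lose_cache(self):
        """Simulate losing the in-memory copy, as if a node had failed."""
        if self._parent is not None:
            self._partitions = None

    def collect(self):
        # An action: forces computation and flattens the partitions.
        return [x for part in self._compute() for x in part]


base = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = evens_squared.collect()
evens_squared.lose_cache()          # pretend the cached partitions vanished
second = evens_squared.collect()    # recomputed from the recorded lineage
print(first, second)
```

Because the second `collect` rebuilds the data from lineage rather than from a replica on disk, the dataset can live in memory and still survive a failure.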

In 2015, when Hadoop was still important, but Spark was on the rise because of the low speed of running SQL queries against HDFS, Databricks told us that there were over 500 companies running Spark in production and there were on the order of 2,000 Hadoop installations worldwide. You could run Spark natively on its in-memory database, but you could also use Hadoop YARN, Kubernetes, Mesos (remember that?), or other schedulers to manage the cluster that Spark runs upon.

The use of Spark kept doubling and doubling and doubling, and today over 10,000 companies are using the Databricks implementation of Spark in production, and no one knows for sure how many are using the open source Apache Spark in-memory database in production. It could be 2X or it could be 10X that number of Databricks users. It could be a lot more.

That business around Spark in-memory databases for data analytics and for traditional machine learning was doing alright, but the advent of generative AI has sent its revenues and the amount of funds that Databricks can raise skyrocketing – and its valuation ahead of an anticipated initial public offering, too.

Incidentally, this quarter, which ends the Databricks 2025 financial year in January, will be the first one where Databricks will be cash flow positive.

The watershed event that is contributing to this wealth is, without question, the acquisition in June 2023, for $1.3 billion, of MosaicML, a developer of machine learning models and the tools to customize them and turn them into applications. MosaicML was founded in early 2021 by Naveen Rao, one of the co-founders of AI chip startup Nervana Systems, and Hanlin Tang, who was an algorithm engineer at Nervana. You will remember that Nervana was acquired by Intel in the summer of 2016 for a rumored $350 million, a deal we think was done to try to sell accelerators to Google, which was rolling out its initial Tensor Processing Unit (TPU) accelerators at the time. Three years later, in December 2019, Intel shelled out $2 billion more to acquire rival AI chipmaker Habana Labs for its Goya and Gaudi lines of AI accelerators.

At the time that Databricks acquired MosaicML, the latter company had just rolled out its MPT-30B model and had over 3.3 million downloads of its MPT-7B model. The secret sauce at MosaicML is automatic optimization of model training, which takes a long time when experts do it by hand. With a shortage of AI experts in the world, and this being a very difficult task to begin with, such automation is the key to mainstream deployment of AI. In March of this year, the combined Databricks and MosaicML teams created DBRX, a mixture-of-experts model with 132 billion parameters trained against 12 trillion tokens with a context window that is 32,000 tokens in size. This DBRX model stands toe to toe with similar models from OpenAI, Anthropic, Google, and Mistral, and was trained on a cluster of Nvidia “Hopper” H100 GPU accelerators interlinked with 400 Gb/sec InfiniBand ports.

Databricks runs the control plane for its data management and compute platform on AWS, but customers can deploy compute and storage nodes on AWS, Microsoft Azure, or Google Cloud infrastructure. In October, Databricks and AWS “strengthened” their partnership, with Databricks consuming more than $1 billion this year on various AWS services to support its customers. We presume that figure also includes AWS capacity used by those customers, most of whom we presume deploy their storage and compute on the Amazon cloud rather than on Azure or GCP. (Apache Spark customers can deploy anywhere, including on their own iron.)

Under that partnership, Databricks is joining Anthropic in using the Trainium AI accelerators, designed by the Annapurna unit of Amazon, to train its models. Both companies are going to work together to move companies to the cloud, and use GenAI as additional bait to do so.

A Series J funding round is unusual in the tech sector, and such a big up-round is unusual, too. If you look at the numbers, Databricks will grow revenues by 1.9X this year to around $3 billion and has boosted its fundraising by 2.5X, but its valuation has only risen by 1.4X to $62 billion as this whopping $10 billion round closed. The company apparently could have raised nearly twice as much money in this Series J round but chose not to, leaving room for a Series K or perhaps an IPO in 2025.
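For those who want to check the math, here is a back-of-the-envelope sketch in Python that works backward from the figures quoted in this story. The variable names are ours. Two assumptions are baked in: total funding is rounded up to $14 billion, and the 2.5X fundraising figure is read as the size of this round divided by everything raised before it. The implied prior-year numbers are derived values, not reported ones.

```python
# Figures quoted in the article, in billions of US dollars.
series_j_round = 10.0
total_raised = 14.0      # "just under $14 billion", rounded up (assumption)
revenue_now = 3.0        # "around $3 billion"
valuation_now = 62.0

# Growth multiples quoted in the article.
revenue_growth = 1.9
valuation_growth = 1.4

# Assumed reading of "2.5X": this round versus all prior funding.
raised_before = total_raised - series_j_round
fundraising_multiple = series_j_round / raised_before

# Work backward to the implied starting points.
implied_prior_revenue = revenue_now / revenue_growth
implied_prior_valuation = valuation_now / valuation_growth

# How richly the business is priced relative to its revenue.
valuation_to_revenue = valuation_now / revenue_now

print(f"raised before Series J:  ${raised_before:.1f}B")
print(f"fundraising multiple:    {fundraising_multiple:.1f}X")
print(f"implied prior revenue:   ${implied_prior_revenue:.2f}B")
print(f"implied prior valuation: ${implied_prior_valuation:.1f}B")
print(f"valuation / revenue:     {valuation_to_revenue:.1f}X")
```

Under those assumptions, the 2.5X falls out of the funding totals directly, and the valuation works out to a little over 20 times revenue.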

Thrive Capital was the lead investor in this Series J round, with Andreessen Horowitz, DST Global, GIC, Insight Partners, and WCM Investment Management kicking in representative amounts and ICONIQ Growth, MGX, Sands Capital, and Wellington Management being new investors.

Interestingly, Databricks is going to help current and former employees cash in on their Databricks shares ahead of an IPO, giving them cash both to buy shares and to pay taxes on their earnings from options. This seems very generous, but is not unheard of.

After those liquidity and tax issues are covered on behalf of employees, the rest of the money will be used to invest in new AI products, to do acquisitions, and to expand its go to market activities with the clouds and the systems integrators of the world.

And maybe now that it is rich, Databricks will roll up its software and create a supportable variant of its Databricks platform, including the version of Spark called “Photon” that has been ported from Scala to C++ and that we presume runs like a bat out of hell compared to the open source Apache Spark. Databricks is now big enough to go commercial with its full platform both on the cloud and on-premises. Not everyone wants to pay the Amazon, Microsoft, and Google cloud premium for infrastructure.
