It would be hard to find something that is growing faster than the Nvidia datacenter business, but there is one contender: OpenAI.
OpenAI is, of course, the creator of the GPT generative AI model and chatbot interface that took the world by storm this year. It is also a company that has a certain amount of first-mover advantage in the commercialization of GenAI, thanks only in part to its massive $13 billion partnership with Microsoft.
And given OpenAI’s very fast growth rate in terms of both customers and revenues and its gut-wrenching costs for the infrastructure to train and run its ever-embiggening AI models, it is no surprise at all that rumors are going around that OpenAI is looking to design its own AI chips and have them fabbed and turned into homegrown systems so it is less dependent on GPU systems based on Nvidia. That holds whether it rents the Nvidia A100 and H100 GPU capacity from Microsoft’s Azure cloud or has to build or buy GPU systems based on these GPUs and park them in a co-lo or, heaven forbid, its own datacenter.
Given the premium that the cloud builders are charging for GPU capacity, companies like OpenAI are certainly looking for cheaper alternatives and they are certainly not big enough during their startup phase to move to the front of the line where Microsoft, Google, Amazon Web Services, and increasingly Meta Platforms get first dibs on anything they need for their services. The profits from GPU instances are staggering, and that is after the very high costs for GPU system components in the first place. To prove this point recently, we hacked apart the numbers for the P4 and P5 instances based on the Nvidia A100 and H100 GPUs at Amazon Web Services as well as their predecessors, showing the close to 70 percent operating margin that AWS commands for A100 and H100 for three-year reserved instances. If instances are reserved for less time, or bought under on demand or spot pricing, then the operating income on the iron would be even higher still.
There is some variation in cloud pricing and configuration of GPU systems, of course, but the principle is the same. Selling GPU capacity these days is easier than selling water to people living in a desert with no oasis in sight and no way to dig.
Nobody wants to pay the cloud premium – or even the chip maker and system builder premium – if they don’t have to, but anyone wanting to design custom chippery and the systems that wrap around it has to be of a certain size to warrant such a heavy investment in engineers and foundry and assembly capacity. It looks like OpenAI is on that track, and separately from the deal it has with Microsoft where it sold a 49 percent stake in itself to the software and cloud giant in exchange for an exclusive license to use OpenAI models and to have funds that are essentially round tripped back to Microsoft to pay for the GPU capacity on the Azure cloud that OpenAI needs to train its models.
According to another report in Reuters, which broke the story about OpenAI thinking about building its own AI chips or acquiring a startup that already has them, OpenAI booked $28 million in sales last year and Fortune wrote in its report that the company, which is not public, booked a loss of $540 million. Now you know one reason why OpenAI had to cozy up to Microsoft, which is arguably the best way to get AI embedded in lots of systems software and applications. Earlier this year, OpenAI was telling people that it might make $200 million in sales this year, but in August it said that looking out twelve months, it would break $1 billion selling access to its models and chatbot services. If this is true, there is no reason to believe that OpenAI can’t be wildly profitable, especially if Microsoft is paying it to use Azure, which means there is a cost that nets out to zero.
Let’s say OpenAI might have $500 million to play with this year and maybe triple that next year if its growth slows down to just tripling and its costs don’t go haywire. If this is the scenario, this is good for Sam Altman & Co because we don’t think the OpenAI co-founders and owners want their stake to go below 51 percent ownership right now because that would be a loss of control over the company. OpenAI might have enough money to do AI chips without seeking further investors.
So, again, no surprise that OpenAI is looking around for ways to cut costs. Considering the premium that Nvidia is charging for GPUs and the premium that clouds are charging for access to rented GPU system capacity, OpenAI would be a fool if it was not looking at the option of designing compute and interconnect chips for its AI models. It would have been a fool to do it before now, but now is clearly the time to start down this road.
The scuttlebutt we heard earlier this year from The Information was that Microsoft had its own AI chip project, code-named “Athena” and started in 2019, and apparently some test chips have been made available to researchers at both Microsoft and OpenAI. (It is important to remember that these are separate companies.) While Microsoft has steered the development of all kinds of chips, importantly the custom CPU-GPU complexes in its Xbox game consoles, developing such big and complex chips is still increasingly expensive with each manufacturing process node and risky in that any delays – and there will always be delays – could put Microsoft behind the competition.
Google was first out there with its homegrown Tensor Processing Units, or TPUs, which it co-designs and manufactures in partnership with Broadcom. AWS followed with its Trainium and Inferentia chips, which are shepherded by its Annapurna Labs division through manufacturing by Taiwan Semiconductor Manufacturing Co, which is also the foundry for Google’s TPUs. Chip maker Marvell has helped Groq get its GroqChip and interconnect out the door. Meta Platforms is working on its homegrown MTIA chip for AI inference and is also working on a variant for AI training. The AI training chip field also includes devices from Cerebras Systems, SambaNova Systems, Graphcore, and Tenstorrent.
The valuations on these AI startups are probably too high – multiple billions of dollars – for OpenAI to acquire them, but Tenstorrent is unique in that the company is perfectly willing to license its IP to anyone who wants to build their own AI accelerator or own its RISC-V CPU. Given the importance of the GPT models in the field of AI, we think that any AI startup would do a similar IP licensing deal to be the platform of choice for OpenAI, which almost certainly has the ability to shift to homegrown hardware should it find the Microsoft Azure prices a bit much.
Let’s have some fun with math. Buying a world-class AI training cluster with somewhere around 20 exaflops of FP16 oomph (and not including sparsity support for the matrices that are multiplied) costs north of $1 billion using Nvidia H100 GPUs these days. Renting capacity in a cloud for three years multiplies that cost by a factor of 2.5X. That’s all in, including the network and compute and local storage for the cluster nodes but not any external, high capacity and high performance file system storage. It costs somewhere between $20 million and $50 million to develop a new chip that is pretty modest in scope. But let’s say it is a lot more than that. And there is a lot more to building an AI system than designing a matrix engine and handing it to TSMC.
It probably costs the cloud builders close to $300,000 to buy an eight-GPU node based on Hopper H100s with its portion of the InfiniBand network (NICs, cables, and switches) apportioned to it. That assumes NVSwitch interconnects across the nodes. (That’s a lot cheaper than you can buy it with single-unit quantities.) You can have a smaller node with only two or four GPUs and use direct NVLink ports between those GPUs. This has the virtue of being cheaper, but the shared memory domain is smaller, and that affects model training performance and scale.
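To see how the cluster and node figures hang together, here is a rough back-of-the-envelope sketch. It assumes about 1 petaflops of dense FP16 per H100 GPU, which is our assumption for illustration and not a number from the reporting above, and it treats peak flops as if they were deliverable, which they never fully are. The function and variable names are ours.

```python
import math

# Figures quoted in the article.
CLUSTER_FP16_EXAFLOPS = 20.0      # dense FP16, no sparsity
CLUSTER_PRICE_USD = 1.0e9         # "north of $1 billion", treated here as a floor
CLOUD_BUILDER_NODE_USD = 300_000  # eight-GPU H100 node with its network share
GPUS_PER_NODE = 8

# ASSUMPTION (not from the article): roughly 1 petaflops of dense FP16
# peak per H100 GPU. Change this to test other accelerators.
H100_DENSE_FP16_PETAFLOPS = 1.0


def nodes_needed(exaflops, petaflops_per_gpu, gpus_per_node=GPUS_PER_NODE):
    """Number of whole nodes needed to reach a peak exaflops target."""
    gpus = exaflops * 1000.0 / petaflops_per_gpu
    return math.ceil(gpus / gpus_per_node)


nodes = nodes_needed(CLUSTER_FP16_EXAFLOPS, H100_DENSE_FP16_PETAFLOPS)

# What a buyer paying $1 billion is implicitly paying per node.
implied_node_price = CLUSTER_PRICE_USD / nodes

# What the same node count costs at the cloud builders' estimated price.
cloud_builder_cluster = nodes * CLOUD_BUILDER_NODE_USD

print(f"nodes needed:            {nodes:,}")
print(f"implied price per node:  ${implied_node_price:,.0f}")
print(f"cluster at builder cost: ${cloud_builder_cluster:,.0f}")
```

Under that assumption the cluster works out to 2,500 nodes, an implied price of about $400,000 per node for an ordinary buyer, and roughly $750 million at the cloud builders' estimated $300,000 per node. The gap is in line with the volume discount described above, but it moves with whatever per-GPU throughput you plug in.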
That same eight-GPU node will rent for $2.6 million on demand and for $1.1 million reserved over three years at AWS and probably in the same ballpark at Microsoft Azure and Google Cloud. Therefore, if OpenAI can build its systems for anything less than $500,000 a pop – all-in on all costs – it would cut its IT bill by more than half and take control of its fate at the same time. Cutting its IT bill in half doubles its model size. Cutting it by three quarters quadruples it. This is important in a market where model sizes are doubling every two to three months.
It is important to remember that OpenAI may also suffer its own fate if things go wrong with an AI chip design or its manufacturing and at that point, OpenAI would be moved to the back of the line for GPU access from Nvidia and certainly further down the line with Microsoft, too.
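The arithmetic behind those claims can be sketched in a few lines. The dollar figures are the ones quoted above, and the helper names are ours. The assumption that capacity, and with it model size, scales linearly with spending is a simplification made for illustration.

```python
# Three-year cost of one eight-GPU H100 node, from the figures quoted above.
ON_DEMAND_USD = 2_600_000
RESERVED_3YR_USD = 1_100_000
HOMEGROWN_CEILING_USD = 500_000   # the "anything less than" target


def fraction_saved(old_cost, new_cost):
    """Share of the old bill eliminated by paying new_cost instead."""
    return 1.0 - new_cost / old_cost


def capacity_multiplier(cut):
    """Capacity a fixed budget buys after unit cost falls by `cut`.

    ASSUMPTION: capacity scales linearly with spending.
    """
    return 1.0 / (1.0 - cut)


saved_vs_reserved = fraction_saved(RESERVED_3YR_USD, HOMEGROWN_CEILING_USD)
saved_vs_on_demand = fraction_saved(ON_DEMAND_USD, HOMEGROWN_CEILING_USD)

print(f"saved vs three-year reserved: {saved_vs_reserved:.1%}")
print(f"saved vs on demand:           {saved_vs_on_demand:.1%}")
print(f"bill cut in half:             {capacity_multiplier(0.50):.0f}x capacity")
print(f"bill cut by three quarters:   {capacity_multiplier(0.75):.0f}x capacity")
```

At the $500,000 ceiling the saving against three-year reserved pricing is a bit under 55 percent, which is where "more than half" comes from, and against on demand pricing it is a bit over 80 percent.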
So there is that to consider. And that is why all of the clouds and most of the hyperscalers will buy Nvidia GPUs as well as design and build their own accelerators and systems. They can’t afford to be caught flat-footed, either.
The novelty of AI processing devices, specialized to skin that Schrödinger’s cat, seems to me to be waning a bit. They are mostly dataflow-oriented systolic array processors, with differences in granularity, or in scale of integration, between chiplet and wafer. The device’s memory may be almost entirely distributed within individual computational units (like a large cache), or mostly available through HBM and LPDDR. The winner of that space should be whoever manages to produce and sell such a device at the least cost. With its claims of “super-intelligence”, OpenAI should certainly be able to do exactly that (ahem!)!
The most interesting thing that I’ve heard in this field may well be the idea of performing computations in a logarithmic scale (like decibels, dB). The sensory response of our nervous system is rather nonlinear, with a sensitivity that decreases as signal intensity increases. The linearly scaled controls commonly implemented in digital audio and video systems are accordingly quite a mismatch to our physiology, and should really be replaced with logarithmic knobs (for volume, light intensity, color mixing, etc …), as found in former analog devices (much knowledge was lost in the digital translation).
Log space converts multiplication into sums, and sums into essentially a “max” operation (comparison plus selection, as in the CMP + CSEL/CSET fused MOp of the Issue/Execute slide of the “Who will use ARM” TNP piece of 09/13/23). Chips for AI processing in log-space would accordingly be ultra simple (and ultra low-power), with no floats, no mul/div, and implementable essentially as networked in-memory micro ALUs. Piece of cake! The main challenge may be how to convert the smooth-gradient-oriented horror-backpropagation training algorithm plague to the discontinuous gradients caused by the max function.
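A minimal sketch of that log-space arithmetic, assuming base-2 logarithms and strictly positive values. The function names are made up for illustration and this is not any vendor's design. It shows that multiplication becomes an exact addition, while addition becomes a max plus a correction term that the max-only shortcut throws away.

```python
import math


def log_mul(a, b):
    """log2(x * y) given a = log2(x) and b = log2(y): a plain addition."""
    return a + b


def log_add_exact(a, b):
    """log2(x + y): the max plus a correction term between 0 and 1."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


def log_add_max(a, b):
    """The shortcut from the comment: drop the correction, keep the max."""
    return max(a, b)


x, y = 8.0, 0.5
a, b = math.log2(x), math.log2(y)   # 3.0 and -1.0

print(2.0 ** log_mul(a, b))         # product recovered from a sum of logs
print(2.0 ** log_add_exact(a, b))   # sum recovered with the correction term
print(2.0 ** log_add_max(a, b))     # sum approximated by the larger operand

# Worst case for the shortcut: equal operands, where it is off by one
# in log2, meaning a factor of two in linear terms.
print(log_add_exact(2.0, 2.0) - log_add_max(2.0, 2.0))
```

The shortcut works well when one operand dominates and worst when the operands are equal, which is one way to see where the discontinuous gradients mentioned above come from.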
Right smack on! As long as you’re not looking to dual-purposing with HPC, your AI log-space in-memory chippery can deal with just 4-bit signed integers (approx 1e-2 to 1e2 dynamic range in log2) or 8-bit if pushing it (1e-38 to 1e38 maximum overkill). Biological signaling has neither negative nor perfect zero values in the living brain, making the rhythm of logarithms a logical algorithm (many thanks to Al-Khwârizmî, Napier, Boole, and Jimi Hendrix!)!
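Those dynamic range figures can be checked with a short sketch. It assumes the stored integer is a two's-complement exponent of 2 with no fractional bits, which is one reading of the comment and not a documented format. The helper name is hypothetical.

```python
def log2_dynamic_range(bits):
    """Smallest and largest magnitudes 2**e for a signed `bits`-wide exponent e.

    ASSUMPTION: e is a two's-complement integer with no fractional bits.
    """
    e_min = -(2 ** (bits - 1))      # most negative exponent
    e_max = 2 ** (bits - 1) - 1     # most positive exponent
    return 2.0 ** e_min, 2.0 ** e_max


for bits in (4, 8):
    smallest, largest = log2_dynamic_range(bits)
    print(f"{bits}-bit exponent: {smallest:.1e} to {largest:.1e}")
```

For 4 bits this spans 2^-8 to 2^7, or about 0.004 to 128, and for 8 bits it spans 2^-128 to 2^127, which reaches slightly below 1e-38 at the low end and slightly above 1e38 at the high end.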
Interconnect complexity, to represent the synaptic connectome, should be the critical component here. Apart from that, just evenly sprinkle 4-bit adders and max units, like pepperoni, in a bed of shredded SRAM or SDRAM cheeses, atop that tasty interconnect secret sauce, and presto: a most satisfying, and inexpensive, AI dataflow pizza, right in time for the weekend! For a more chunky texture, pack the 4-bit add/max units into vector meatballs, combine with chunks of HBM or DDR5, spread over a lighter interconnect sauce, and voila!
Quite obliviously, none of these 4-bit cookbook log recipe shenanigans will do much to solve HPC’s memory-wall sticky spaghetti problem, nor the bedbugpocalypse rampaging through Paris and points South (looking at you, Bruyères-le-Châtel!) as we speak. For those, we need some real cauliflower! b^8
“It would hard to find something that is growing faster than the Nvidia”…,
There is one minor grammatical error in the first sentence.
It should be,” It would be hard to find something”…,
Your analysis is impressive.
Given the very clear almighty advantage that successful development of novel AI processed, processing chippery would/could effortless unilaterally deliver, ….. and for the very good, prime directive reason of ensuring present and future guaranteed maintenance and failsafe stealthy secret COSMIC* national and international and internetional security facility and utility, ….. that first sentence most definitely would be better revised to ask such a question as ….. “Should it be next to impossibly hard to find something that is growing faster than the Nvidia datacenter business?” ……. with even that question being further revised to accommodate/mitigate very possible, and therefore highly probable, existential threat exploit vulnerabilities/features, ……. “Can it be made next to impossible for IT and AI to find everything and anything growing faster than themselves.” ……. with the answer to both of those questions best being accurately realised and accepted as Yes and No.
COSMIC* .. Control Of Secret Materiel in an Internetional Command
“Can any part of life be larger than life?” one asks.
Well Yes, of course, …… they be all of those parts self-actuated to practically remotely realise with physical abandon, virtualisable phorms, absolutely fabulous dreams and rogue renegade nightmares in reciprocal agreement with both Einstein and Rushian thoughts, and without the worryisome fear of unwarranted doubt forbidding and preventing total information awareness access providing ITs AWEsome Utility for Enhancing Abilities …… “I am enough of an artist to draw freely upon my imagination. Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world, and all there ever will be to know and understand” …… “And the knowledge that they fear is a weapon to be used against them …. And the things that he fears are a weapon to be held against him”
And the future really holds no fears for masters and mistresses of ultimate weapons with almighty arsenals of smart ammunition and even smarter intelligence leads and feeds.
🙂 Naturally though, will many a soul imagine all of that to be completely wrong …. and miss out catastrophically on all of the available fun. That be their great loss, Timothy Prickett Morgan. And of that there is no shadow of doubt.
Maybe that 2x part of life, also known as lifefefe? 8^b
Like we French HPC always say, unless giddily rush the weapon, there is nothing to fear … but AI/ML processed cheese itself (and logarithms!)! It is old Franklin proverb: fear not, want not. (hé-hé-hé!)
Maybe they should buy Tachyum?
hahahahahahaha
I think you might need to add a zero to your custom ASIC cost estimate… The article you cited is almost a decade old now and maxes out at a pretty old technology node. Just the mask costs for a 4nm ASIC (similar to what Tenstorrent is targeting in their partnership with Samsung: https://www.prnewswire.com/news-releases/tenstorrent-selects-samsung-foundry-to-manufacture-next-generation-ai-chiplet-301943705.html) will be close to your low end number. Chip Architecture, Design and Verification costs, IP Costs for SerDes and standard SoC interface IP (e.g. PCIe, HBM/(LP)DDR), and SoC infrastructure, P&R costs, firmware, software… eek! That adds up to hundreds of millions of dollars even for “modest” ASICs in advanced nodes these days… Otherwise, great article!