A year ago, at its Google I/O 2022 event, Google revealed to the world that it had eight pods of TPUv4 accelerators, with a combined 32,768 of its fourth generation, homegrown matrix math engines, running in a machine learning hub located in its Mayes County, Oklahoma datacenter. It had another TPUv4 pod running in another datacenter, probably in close proximity to Silicon Valley. And in the ensuing year, for all we know, it may have installed many more TPUv4 pods.
And even though Google is using TPUv4 engines to do inference for its search engine and ad serving platforms, the fact remains that Google is among the largest buyers of Nvidia GPUs on the planet, and if it is not already doing so, it will be buying AMD Instinct GPU accelerators in volume because any GPU is better than too few GPUs in an AI-driven IT sector. And that is because Google is a cloud provider and it has to sell what customers want and expect, and for the most part, enterprises expect to be running AI training on Nvidia GPUs.
Generative AI features across the Google software portfolio were the center of this week's Google I/O 2023 event, which was no surprise at all, and the consensus today is that maybe Google is not as far behind the OpenAI/Microsoft dynamic duo as might have seemed the case when Google's Bard chatbot front end for its search engine was released in a limited public beta back in March. Which means OpenAI and Microsoft might not end up being a duopoly in AI software and hardware much like Microsoft and Intel were a duopoly in the PC business four decades ago, one that got extended into the datacenter starting three decades ago.
Ironically, OpenAI is the software vendor and Microsoft Azure is the hardware vendor in this possibly emerging duopoly. Microsoft is said to have used 10,000 Nvidia A100 GPUs to train the GPT-4 large language model from OpenAI and is rumored to be amassing 25,000 GPUs to train its GPT-5 successor. We presume this will be on a mix of Nvidia A100 and H100 GPUs, because getting their hands on 25,000 H100 GPUs could be a challenge, even for Microsoft and OpenAI.
Customers outside of Microsoft and OpenAI using the Azure cloud are more limited in what they can get their hands on. What we do know, from recently talking to Nidhi Chappell, general manager of Azure HPC and AI at Microsoft, is that Azure is not doing anything funky when it comes to building out its AI supercomputers. Microsoft is using the standard eight-way HGX-H100 GPU boards from Nvidia, paired with a two-socket Intel "Sapphire Rapids" Xeon SP host node, as well as Nvidia's 400 Gb/sec Quantum 2 InfiniBand switches and ConnectX-7 network interfaces to link the nodes to each other, to build its Azure instances, which scale in 4,000 GPU – or 500 node – blocks.
Google is referring to the A3 GPU instances as "supercomputers," and given that they are going to be interconnected using the same "Apollo" optical circuit switching (OCS) networking that is the backbone of the Google network, why not call a bunch of A3s a supercomputer? The Apollo OCS network is reconfigurable for different topologies and, among its other datacenter interconnect jobs, is used to link the TPUv4 nodes to each other in those 4,096 TPU pods. The OCS layer replaces the spine layer in a leaf/spine Clos topology. (We need to dig into this a little deeper.)
The A3 instances are based on the same HGX-H100 system boards and the same Sapphire Rapids host systems that come directly from Nvidia as a unit and that are used by other hyperscalers and cloud builders to deploy the "Hopper" GH100 SXM5 GPU accelerators. The eight GPUs on the HGX-H100 card use a non-blocking NVSwitch interconnect with 3.6 TB/sec of bisection bandwidth, which effectively links the GPUs and their memories into a single, NUMA-like GPU compute complex that shares memory across its compute. The host node runs a pair of the 56-core Xeon SP-8480+ Platinum CPUs from Intel running at 2 GHz, which is the top bin, general purpose part for two-socket servers. The host machine is configured with 2 TB of DDR5 memory running at 4.8 GHz.
The Google hosts also make use of the "Mount Evans" IPU that Google co-designed with Intel, which has 200 Gb/sec of bandwidth as well as a custom packet processing engine that is programmable in the P4 language and 16 Neoverse N1 cores for auxiliary processing on this big bump in the wire. Google has its own "inter-server GPU communication stack" as well as NCCL optimizations, and we presume that at least parts of these are running on the Mount Evans IPU.
Google says that an A3 supercomputer can scale to 26 exaflops of AI performance, which we presume means either FP8 or INT8 precision. If that is the case, then with an H100 GPU accelerator rated at 3,958 peak teraflops, a 26 exaflops A3 supercomputer has 6,569 GPUs, which works out to 821 HGX nodes. That is about 60 percent bigger than what Microsoft and Oracle are offering commercially, at 500 nodes and 512 nodes, respectively.
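For those who want to check our math, here is the back-of-the-envelope arithmetic as a quick Python sketch. The per-GPU number is Nvidia's peak FP8 rating with sparsity for the H100 SXM5, and the assumption that Google's 26 exaflops figure is quoted at that same precision is ours.

```python
# Quick sanity check on the A3 "supercomputer" scale claim.
# Assumption: the 26 exaflops figure is peak FP8 (with sparsity) throughput.
H100_FP8_TFLOPS = 3_958      # Nvidia peak FP8 teraflops per H100 SXM5, with sparsity
CLAIMED_EXAFLOPS = 26        # Google's claimed aggregate AI performance for one A3 cluster
GPUS_PER_NODE = 8            # one HGX-H100 board per Sapphire Rapids host

total_teraflops = CLAIMED_EXAFLOPS * 1_000_000
gpus = total_teraflops / H100_FP8_TFLOPS     # ~6,569 GPUs
nodes = gpus / GPUS_PER_NODE                 # ~821 HGX nodes

print(f"GPUs:  {gpus:,.0f}")
print(f"Nodes: {nodes:,.0f}")
print(f"Versus Oracle's 512 nodes:    {nodes / 512 - 1:.0%} bigger")
print(f"Versus Microsoft's 500 nodes: {nodes / 500 - 1:.0%} bigger")
```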
Thomas Kurian, chief executive officer of Google Cloud, said in the opening keynote for Google I/O that the existing TPUv4 supercomputers were 80 percent faster for large scale AI training than prior Google machinery and 50 percent cheaper than any alternatives on the cloud. (We originally thought he was talking about the A3 setups.) So the A3 machines have some intense internal competition.
“Look, when you nearly double performance at half the cost, amazing things can happen,” Kurian said, and had to get the crowd going a bit to get the applause he wanted.
As for scalability and pricing, we shall see how this all shakes out, both for the A3 instances on their own and compared to the prior A2 instances, which had 8 or 16 GPUs in a single host when they debuted in March 2021. For AI training, the A100 could only go down to FP16, where it delivered 624 teraflops, so that was 9,984 aggregate teraflops max for a 16-way A100 machine versus 31,664 teraflops for an eight-way H100 running at FP8 resolution. At the same node count, the new A3 supercomputer will offer 3.2X the throughput of the A2 supercomputer, provided your data and processing can downshift to FP8. If not, then it is a 60 percent bump.
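Here is that per-node comparison worked out, again as a rough Python sketch using Nvidia's peak ratings with sparsity; the 1,979 teraflops FP16 figure for the H100 (half of its FP8 rate) is the number we are assuming for the case where the workload cannot drop down to FP8.

```python
# Per-node throughput behind the 3.2X and 60 percent figures (peak, with sparsity).
A100_FP16_TFLOPS = 624        # A100 has no FP8 mode, so FP16 is its floor for training
H100_FP8_TFLOPS = 3_958
H100_FP16_TFLOPS = 1_979      # assumed: half of the H100 FP8 rate

a2_node_fp16 = 16 * A100_FP16_TFLOPS    # 9,984 teraflops for a 16-way A2 node
a3_node_fp8 = 8 * H100_FP8_TFLOPS       # 31,664 teraflops for an 8-way A3 node at FP8
a3_node_fp16 = 8 * H100_FP16_TFLOPS     # 15,832 teraflops if the workload stays at FP16

print(f"A3 vs A2 at FP8:  {a3_node_fp8 / a2_node_fp16:.1f}X")    # ~3.2X
print(f"A3 vs A2 at FP16: {a3_node_fp16 / a2_node_fp16:.2f}X")   # ~1.59X, the 60 percent bump
```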
As far as we know, Google is not offering anything like the scale we have seen being used internally at Microsoft for OpenAI. We also know that Google runs at a much larger scale to train its PaLM 2 large language model – probably well above 10,000 devices, but no one has been specific as far as we know. PaLM 1 was trained on a pair of TPUv4 pods, each with 3,072 TPUs and 768 CPU hosts. It is not clear why it did not use the full complement of 4,096 TPUs per pod, but Google did claim a computational efficiency of 57.8 percent on the PaLM 1 training run.
Google previously launched the C3 machine series based on the Mount Evans IPU and the Sapphire Rapids Xeon SPs back in October 2022 and they were available for public preview in February of this year. And the G2 instances, based on Nvidia's "Lovelace" L4 GPU accelerators for inference, have been in public preview since March of this year, scaling from one to eight of the L4 GPU accelerators in a single virtual machine. Like the H100, the L4 supports FP8 and INT8 processing as well as higher precisions (with a corresponding decrease in throughput as the precision goes up).
Pricing for the A3 and G2 instances is not yet available, but will be when they are generally available, which we reckon will be later this year. We will keep an eye out and compare pricing when we can.
One last thing. We still think that Google has many more GPUs than TPUs in its fleet, and that even today, at best it might have one TPU for every two, three, or four GPUs that it deploys. It is hard to say, but the Google GPU fleet is probably 2X to 3X the size of the TPU fleet. Even if the TPU is used for a lot of internal workloads at Google, and even if the ratio is shifting ever so slowly toward the TPU, there are still a lot more GPUs. Luckily, with the AI craze, there won't be any trouble finding those GPUs some work to do.
Still, the TPU doesn’t support the Nvidia AI Enterprise software stack, and that is what a lot of the AI organizations in the world use to train models. Google has to support GPUs if it wants to attract customers to its cloud, and only after they are there will it be able to show them the benefits of the TPU. Amazon Web Services has exactly the same issue with its homegrown Trainium and Inferentia chips, and while Microsoft is constantly rumored to be doing custom silicon, we have yet to see any heavy duty compute engines coming out of Azure.
> Still, the TPU doesn’t support the Nvidia AI Enterprise software stack, and that is what a lot of the AI organizations in the world use to train models.
I don't quite follow this bit. Devices like the TPU, Trainium, and Inferentia have no reason to support any NVIDIA-specific stack. They do have to support the ML frameworks (PyTorch, TF, JAX, DeepSpeed, etc) or one of the compiler-based intermediate representation runtimes (OpenXLA, IREE, etc) to gain traction with organizations that are already using them to implement and deploy their models. NVIDIA's enablement story for any of these frameworks will be no different than how any other compute accelerator enables support for them.
Here is what I meant. Yes, the hyperscalers and cloud builders have their own frameworks and models. They’re all different. But the most portable software stack that runs on the Nvidia GPU hardware, which is everywhere, and that has commercial support — and that runs on all of the clouds — is Nvidia’s AI Enterprise.
I think you are pretty wrong on this, since more and more people are moving to basically just use Hugging Face models and infrastructure, which completely separates them from device-specific code. Most people do not care as long as it runs reliably and fast, and they do not write any nVidia-specific CUDA kernels; that's maybe 5 percent of what's out there (which is negligible).
I wouldn't underestimate the importance of a well-integrated software stack, down to the bottom layers of hardware-specific libraries, even when developing at the much higher level of PyTorch or huggingface. In "HPC" (for example) there's quite a difference in MATLAB performance (high-level programming) based on whether one uses generic BLAS, GotoBLAS, rocBLAS (AMD), or MKL (Intel). In particular, using MKL on AMD CPUs, or rocBLAS on Intel chips, can be most entertaining … This being said, today, both PyTorch and huggingface seem to favor (support) the nVidia GPU ecosystem a bit more than others ( https://pytorch.org/docs/stable/backends.html , https://huggingface.co/pricing#spaces ) and so I would expect developers to have the better experience with that HW, or something compatible-ish 8^p. This is a situation that one hopes would improve over time, with PhDs from more HW vendors contributing highly-tuned code libraries to those high-level programming frameworks (but I could be missing something…).
I like that PaLM 2 is splitting math, coding, and essay writing into different AI models, as they correspond to different skillsets (or arts, as in that of motorcycle repair, and associated Zen). Maple and Mathematica (Wolfram Alpha) already do great math, except for the eventual simplifications, where AI could help. Also, in their description of the coding model, Google explicitly mentions Fortran and Verilog (and Prolog), where needs and opportunities are great (as discussed in TNP; eg. Verilog is useful for RISC-V/VI design).
When it comes to GPU vs TPU, I think that a startup may want to do initial developments in the (elastic) cloud, then move to Co-Lo, and eventually On-Prem, as income starts flowing in, and hopefully overflowing in. TPUs, and other cloud-corp-specific hardware, while very tempting and performant, may not easily transfer to such Co-Lo and On-Prem futures, and may thus need to be avoided (if possible). Henceforthwise, a startup (and even stately elders), may be wise to develop their AI solutions on HW targets that are commodifiable — to avoid the Xeon Phi-like syndrome of dead-end vendor lock-in (NEC vector engines may be another unfortunate example of this). This leaves us with GPUs, and their NVIDIA-compatible software stack, which includes ROCm, and OneAPI (I think).
As usual, I agree.
According to the PaLM 1 paper it was trained on 6144 TPUv4 chips for 1200 hours – https://arxiv.org/abs/2204.02311.
I've read that the limiting factor for Nvidia's H100 production is TSMC's CoWoS packaging capacity. If so, the situation with AI datacenter GPUs now is much the same as with gaming GPUs during the crypto craze in 2021. When everyone is affected by the exact same production bottleneck, the inability of the market leader to service the demand does not mean that others are able to pick up the slack. AMD was not able to make and sell particularly many extra PC gaming GPUs during the crypto craze because they just couldn't get extra wafers and components. Nvidia is going to scoop up every drop of spare CoWoS capacity TSMC has and AMD is not going to be able to significantly increase their production. Nvidia can afford to pay more for the previously untapped capacity as they can charge higher prices to customers since their GPUs have far greater utility due to the maturity of the software ecosystem that surrounds them. The same goes for any AI ASIC that relies on TSMC's CoWoS packaging, which, correct me if I'm wrong, is just about any of them that use HBM, including Intel's Habana Gaudi 2, Intel's datacenter GPUs, and Google's TPUv4.
So is that what’s holding back Grace? Or is it that customers don’t want to put all their eggs into the same admiral?
It’s very odd that they were co-designed and now only the GPUs and DPUs are showing up in numbers…
Grace was supposed to be a 2023 part while Hopper and Bluefield-3 were to be 2022 parts. Actually Hopper and Bluefield-3 seem to have appeared more slowly than originally promised. Perhaps AMD’s and especially Intel’s delays with PCIe 5 capable platforms are the main reasons for that but Bluefield-3 especially still seems a bit scarce.
Grace seems to have been delayed by about 6 months. Just the fact that it’s Nvidia’s first foray into server CPUs makes that unsurprising. Who knows why it’s been delayed. But the CoWoS capacity issue likely only materialized after the demand surge related to ChatGPT hype. Any 1H 2023 plans for Grace should have been in place well before that, so my guess is that the delay is unrelated to CoWoS capacity. It’s likely also unrelated to demand for Grace. Los Alamos National Laboratory and the Swiss National Supercomputer Center likely wanted their Grace chips in H1 2023 and not H2 2023.
In TSMC’s latest earnings conference call, on April 20th, their CEO, CC Wei, said “…just recently in these 2 days, I received a customer’s phone call requesting a big increase on the back-end capacity, especially in the CoWoS. We are still evaluating that.” Whoever that customer is has made the request rather recently. I would guess it’s Nvidia. The rumors are that TSMC are unable to increase their capacity this year by much more than they already have.
I think they may be prioritizing the opportunity of pairing with Sapphire Rapids (nicely performant), as a substitute for Ponte Vecchio (possibly a bit lackluster at this time).
Grace wasn't planned to ship until 2H 2023. Certainly Nvidia has planned and booked packaging capacity for Grace and Grace+Hopper SKUs, but those are probably (typically) conservative projections.
You can research the Alps and Venado supercomputers as well as discussions/announcements of Grace itself from March 2022 to see that Grace was planned for “early 2023”. Here’s something from Nvidia’s website published August 23, 2022 which is still saying “first half of 2023”: https://developer.nvidia.com/blog/inside-nvidia-grace-cpu-nvidia-amps-up-superchip-engineering-for-hpc-and-ai/
I believe Intel does their own advanced packaging on their data center GPUs, so those are not necessarily limited by the TSMC packaging bottleneck.
I do not think their manufacturing yields on Ponte Vecchio were very high. I have heard some horror stories. But they will get better, and IBM will help them; they know packaging, too.