If you thought it took a lot of compute and storage to build Facebook’s social network, you ain’t seen nothing yet. The immersive and AI-enhanced experiences of the Metaverse will make all of the technology that Facebook has created to date, and largely open sourced through the Open Compute Project, seem like child’s play.
To build the Metaverse, as Facebook co-founder and Meta Platforms chief executive officer Mark Zuckerberg has pledged to do, is going to take an absolutely enormous amount of supercomputing power, and today Meta announced the Research Super Computer, or RSC for short, which will be a 2,000 node system with a total of 4,000 processors and 16,000 GPU accelerators when it is fully operational in July of this year.
What is added to the RSC machine as it is expanded will determine its performance, and there is no reason to believe that the configuration of the initial nodes will be the same as that of the final nodes later this year. In fact, given that the full RSC machine is not being announced today, but only the first phase of its construction, that would seem to guarantee that more powerful components will be used in the second phase. Or so you might think, anyway.
That RSC has a phased rollout, given so many product transitions in compute and networking that are underway here in 2022, is no surprise. (We will get into the feeds and speeds of RSC in a second.) But what is a surprise, given Facebook's more than decade-long focus on forcing disaggregated components from its vendors and opening up its hardware designs through the Open Compute Project, is that the RSC system is built from servers, storage, and switches that are commercially available from traditional IT suppliers. Facebook is not, for instance, taking HGX-class system boards from Nvidia and building custom machine learning systems like it was showing off back in 2019. The first phase of the RSC machine looks like someone read through The Next Platform, picked the most popular and least risky commercial iron available, and plugged it all together.
But again, do not assume that phase two will look exactly like phase one. We strongly believed that it would not, as a matter of principle, until we did some math. But hold on a second for that.
Facebook bought a supercomputer?
Let’s think about that for a second. The reason that Meta decided to buy rather than source components and build a custom supercomputer probably comes down to component supplies – which are pretty tight right now – and working a deal with some vendors who want the Facebook publicity and are willing to work the price and the parts availability to make this all happen a little faster than it otherwise might.
This is not the first time that Facebook has done this. Back in early 2017, the social network did install a cluster of semi-custom DGX-1 machines, which had 124 nodes based on a pair of 20-core Intel Xeon E5 processors and four Tesla P100 accelerators per node, linked by 100 Gb/sec InfiniBand switching and built by Penguin Computing. This machine, which was never given a nickname, had 4.9 petaflops of peak 64-bit floating point performance and was rated at 3.31 petaflops on the Linpack linear algebra benchmark commonly used to rank traditional HPC systems. It was noteworthy in that it was not based on the Open Compute “Big Sur” and “Big Basin” GPU-accelerated node designs that the company was in fact deploying to run its AI stack.
“One does not simply buy and power on a supercomputer,” George Niznik, sourcing manager at Meta, explained in a brief video talking about the machine. “RSC was designed and executed under extremely compressed timelines and without the benefit of a traditional product release cycle. Additionally, the pandemic and a major industry chip shortage hit at precisely the wrong moment in time. We had to fully utilize all of our collective skills and experiences to solve these difficult constraints. Fundamentally, we have leveraged the best everyone has to offer across people, technology, and partnerships to deliver and light up the ultimate in high performance computing.”
That paints such a pretty picture, especially with the calming background music. But when we turn the volume down and think for a minute, here is what we think actually happened.
Meta needed a lot more machine learning performance, and it also wanted to make a splash because that always helps with employee recruitment. It could be that Nvidia was not keen on supporting the OAM (OCP Accelerator Module) form factor and signaling, because it is different from Nvidia's own SXM4 interface and does not support NVLink and NVSwitch.
Intel supports OAM with its future “Ponte Vecchio” Xe HPC GPU accelerator, but that doesn’t do Facebook any good today. And maybe not in the future, depending on how many Intel can make, what they cost, and how hot they run – and the fact that the “Aurora” supercomputer at Argonne National Laboratory has dibs on the first 54,000 or more of these Ponte Vecchio chips when they come out sometime this year.
AMD supports the OAM module in the “Frontier” supercomputer at Oak Ridge National Laboratory with the “Aldebaran” Instinct MI200 series GPU accelerators, which beat the pants off of the Nvidia “Ampere” A100 GPU on a lot of raw benchmarks. But we think AMD is using all of the Instinct MI200s that it can make – there are more than 36,000 of them in Frontier – to build this one machine right now, just as Intel is doing with Argonne’s Aurora system.
What is a hyperscaler to do? Buy whole, current-generation DGX A100 systems from Nvidia, including InfiniBand networking; buy all-flash storage from Pure Storage; and buy caching servers from Penguin Computing, which helps these companies make a splash and helps Meta argue for the best discounts that it can get. And then see what phase two of RSC will look like.
In phase one of the RSC machine, Facebook is installing 760 DGX A100 systems from Nvidia, which each have eight A100 GPU accelerators, and we presume the 80 GB HBM2e memory variants given that those have been out for more than a year now. The DGX A100 servers have a pair of AMD "Rome" 64-core Epyc 7742 processors, which Nvidia picked because of the processor's support for PCI-Express 4.0 peripheral controllers to link the CPU boards to the GPU boards. The DGX A100 has 1 TB of main memory and usually has eight 200 Gb/sec HDR InfiniBand ports for cross-coupling nodes, and the Facebook system uses all eight of those 200 Gb/sec HDR InfiniBand controllers (one for each GPU in the box). The InfiniBand network is a two-tier Clos fabric, the topology commonly used in hyperscale datacenters. With the Clos fabric, all of the nodes are connected to all of the other nodes rather than being podded in rows and then aggregated through another network layer, which adds latency and cost.
In terms of storage, Meta has worked with Penguin Computing to transform a cluster of its Altus systems into a caching cluster that has 46 PB of capacity. This caching layer is backed by 10 PB of FlashBlade object storage and 175 PB of FlashArray block storage, both of which are all-flash storage appliances that come from Pure Storage. Over time, the tier one storage and caching on the front of it will be expanded to hold exabytes of data, and will be able to feed data to the RSC machine at 16 TB/sec. We don’t know what the storage bandwidth for phase one of the RSC system is today, but it is probably on the order of 3 TB/sec.
In terms of peak theoretical performance, phase one of the RSC machine is rated at 59 petaflops of double precision floating point (FP64) on the CUDA cores and 118.6 petaflops of FP64 using the Tensor Core matrix engines; for single precision, which is what matters for AI training, double those numbers for the aggregate compute. With INT8 processing on the Tensor Cores, which is needed for AI inference, the phase one machine is rated at 3.79 exaops.
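To show where those ratings come from, here is a minimal back-of-the-envelope sketch; the per-GPU figures (9.7 teraflops FP64 on the CUDA cores, 19.5 teraflops FP64 on the Tensor Cores, 624 teraops of dense INT8) are Nvidia's published A100 peaks, and the assumption that Meta's totals are simply those peaks multiplied across all 6,080 GPUs is ours:

```python
# Back-of-the-envelope check on the phase one RSC peak ratings, assuming
# Nvidia's published per-GPU A100 peaks (our assumption, not Meta's numbers).
NODES = 760                  # DGX A100 systems in phase one
GPUS_PER_NODE = 8
A100_FP64_TFLOPS = 9.7       # FP64 on the CUDA cores
A100_FP64_TC_TFLOPS = 19.5   # FP64 on the Tensor Cores
A100_INT8_TOPS = 624         # INT8 on the Tensor Cores (dense)

gpus = NODES * GPUS_PER_NODE                        # 6,080 GPUs
fp64_pf = gpus * A100_FP64_TFLOPS / 1_000           # ~59 petaflops
fp64_tc_pf = gpus * A100_FP64_TC_TFLOPS / 1_000     # ~118.6 petaflops
int8_exaops = gpus * A100_INT8_TOPS / 1_000_000     # ~3.79 exaops

print(f"{gpus:,} GPUs: {fp64_pf:.1f} PF FP64, "
      f"{fp64_tc_pf:.1f} PF FP64 Tensor Core, {int8_exaops:.2f} exaops INT8")
```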
We estimate that at list price this machine would cost around $400 million, with $303 million of that just being for the DGX A100 machines. (You begin to see why Nvidia doesn’t care too much that it lost out on the CORAL2 HPC procurements for the US Department of Energy …)
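For what it is worth, here is a hedged sketch of that estimate; the roughly $399,000 list price per DGX A100 system is our assumption, backed out of the $303 million figure for 760 servers, and the remainder of the $400 million covers the InfiniBand networking, Pure Storage arrays, and Penguin caching servers:

```python
# Rough list price math behind our ~$400 million estimate for phase one.
# The ~$399,000 per DGX A100 system is an assumption backed out of the
# $303 million server figure above.
NODES = 760
DGX_A100_LIST = 399_000                       # assumed list price per system
PHASE1_LIST_TOTAL = 400_000_000               # our overall estimate

dgx_total = NODES * DGX_A100_LIST             # ~$303 million for the servers
other_gear = PHASE1_LIST_TOTAL - dgx_total    # ~$97 million for everything else
print(f"DGX A100 servers: ${dgx_total / 1e6:.0f} M, "
      f"networking/storage/caching: ${other_gear / 1e6:.0f} M")
```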
That brings us to phase two, where the machine will have a total of 16,000 GPUs. There is a rosy scenario and one that is more of a daffodil.
Meta's techies never said all of the GPUs would be A100s, and given supply chain constraints, they could be even if the "A100 Next" GPU from Nvidia is launched in March as we expect. Let's call it the A200 just to keep it simple. As we talked about a few weeks ago, we think it is possible that the A100 Next GPU will have four slightly improved A100 motors on a package, yielding 48 teraflops of proper FP64 performance, which puts it in line with the 45 teraflops for Ponte Vecchio from Intel and the 47.9 teraflops for the Instinct MI200 from AMD. If that happens – and that is a very big if, we realize – then phase two of the RSC machine could be based on a DGX A200 server, with eight A200 accelerators in each node and very likely a pair of either the 64-core "Milan" Epyc 7003 processors shipping since last year or the 96-core "Genoa" Epyc 7004 processors due later this year. The choice of CPU depends on Nvidia's and AMD's respective timings.
Whatever the CPU, Meta would need Nvidia to get its hands on 2,480 of them, and the machine would have 9,920 of the A200 accelerators if this played out as we initially expected. The resulting phase two portion of the machine would have a stunning 476.2 petaflops of aggregate FP64 performance on the CUDA cores, doubling to 952.3 petaflops using the Tensor Core FP64 math. On the INT8 front, the RSC phase two machine could add another 15.45 exaops, if the Nvidia GPUs looked like we had hoped. If Nvidia keeps the prices for DGX machines the same, this second piece would cost $494.8 million at list price.
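To spell out the speculative math, assuming a hypothetical A200 rated at 48 teraflops of FP64 per GPU and the same roughly $399,000 per eight-GPU DGX node as the A100 generation (both of which are our guesses, not anything Nvidia or Meta has said):

```python
# Speculative phase two math, assuming a hypothetical "A200" GPU rated at
# 48 teraflops FP64 and the same ~$399,000 list price per eight-GPU DGX node
# as the A100 generation. None of this is confirmed by Nvidia or Meta.
TOTAL_GPUS = 16_000
PHASE1_GPUS = 6_080
GPUS_PER_NODE = 8
A200_FP64_TFLOPS = 48          # our guess for the "A100 Next" part
DGX_LIST = 399_000             # assumed list price per DGX node

phase2_gpus = TOTAL_GPUS - PHASE1_GPUS                # 9,920 new GPUs
nodes = phase2_gpus // GPUS_PER_NODE                  # 1,240 nodes
cpus = nodes * 2                                      # 2,480 Epyc processors
fp64_pf = phase2_gpus * A200_FP64_TFLOPS / 1_000      # ~476.2 petaflops
fp64_tc_pf = 2 * fp64_pf                              # ~952.3 petaflops
list_cost = nodes * DGX_LIST                          # ~$494.8 million

print(f"{nodes:,} nodes, {cpus:,} CPUs, {fp64_pf:.1f} PF FP64, "
      f"{fp64_tc_pf:.1f} PF FP64 Tensor Core, ${list_cost / 1e6:.1f} M at list")
```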
If, if, if …
Add the phase one and prospective phase two machines together and you would get 535.1 petaflops of base FP64 throughput (Tensor Core FP64 is twice that) and 19.2 exaops of INT8 inference performance. Probably for somewhere around $900 million at list price. Heaven only knows what Meta would pay for such a staged machine, but our guess is 50 percent of list price for the phase one machine and a whole lot closer to list price for the phase two machine, given the expected performance of the A200 GPU. Call it a cool $700 million, not counting any expanded storage from Pure Storage or Penguin Computing. The storage is expected to scale up to 1 EB, as we said above, so there may be another $100 million or so on top of that.
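Adding the two phases up, with the discount guesses just described applied, looks like this; the specific discount percentages are pure speculation on our part:

```python
# Combined RSC totals for the two phases, using the unrounded figures computed
# above; the 50 percent discount on phase one and near-list pricing on phase
# two are guesses on our part.
phase1_fp64_pf, phase2_fp64_pf = 58.98, 476.16      # petaflops of base FP64
phase1_int8_eo, phase2_int8_eo = 3.79, 15.45        # exaops of INT8
phase1_list_m, phase2_list_m = 400.0, 494.8         # list prices, $ millions

total_fp64 = phase1_fp64_pf + phase2_fp64_pf        # ~535.1 PF base FP64
total_int8 = phase1_int8_eo + phase2_int8_eo        # ~19.2 exaops INT8
total_list = phase1_list_m + phase2_list_m          # ~$895 M at list
street_guess = phase1_list_m * 0.5 + phase2_list_m  # ~$695 M, call it $700 M

print(f"{total_fp64:.1f} PF FP64, {total_int8:.1f} exaops INT8, "
      f"${total_list:.0f} M at list, roughly ${street_guess:.0f} M our guess")
```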
But maybe the A200 as we have conceived of it, after reading some Nvidia Research papers, is too aggressive or too unattainable or too pricey for Meta.
In its statement, Meta said that with phase two of RSC, it would boost the GPU count from 6,080 to 16,000, a 2.63X increase, and that this would deliver an AI training performance increase of more than 2.5X. Which stands to reason, and which sure sounds like Meta is sticking with DGX A100 servers using A100 GPU accelerators even for phase two. Which is boring. Meta said further that the RSC machine would have close to 5 exaflops of mixed precision capacity, and if you multiply the 312 teraflops of Bfloat16 or FP16 math that the A100 GPU is rated at by 16,000 GPUs, you get 4.99 exaflops.
So, it doesn't look like RSC is getting any A100 Next or what we were calling A200 GPU accelerators after all. But, interestingly, even at half list price for the loaded DGX A100 servers plus another $100 million to $150 million for storage and networking (not including the expanded storage Meta talked about), you are in the ballpark of $500 million for the complete RSC system based on the existing gear, which is rated at 155.2 petaflops of base FP64. Intel and AMD are offering much more generous terms for the Aurora and Frontier machines, of course, but those labs always get pricing terms that no enterprise can dream of getting. RSC looks to be something like ten times more expensive per unit of compute, if our guesses are right. Even hyperscalers like Meta don't get the deals that HPC labs get. Besides, it is not like Facebook isn't generating lots of cash.
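Those numbers check out against Nvidia's published A100 ratings, and the pricing math lands in the same ballpark; here is a minimal sketch, with the half-off-list discount and the midpoint of the storage and networking range being our assumptions:

```python
# Sanity check on the all-A100 reading of phase two, using Nvidia's published
# A100 peaks (312 TF BF16/FP16 Tensor Core, 9.7 TF FP64) and our assumed
# half-off-list pricing on the ~$399,000 DGX A100 systems.
TOTAL_GPUS = 16_000
PHASE1_GPUS = 6_080
A100_BF16_TFLOPS = 312
A100_FP64_TFLOPS = 9.7
DGX_LIST = 399_000
DISCOUNT = 0.5                     # our guess at Meta's street price
STORAGE_NET_M = 125                # midpoint of the $100 M to $150 M range

ratio = TOTAL_GPUS / PHASE1_GPUS                            # ~2.63X more GPUs
mixed_ef = TOTAL_GPUS * A100_BF16_TFLOPS / 1_000_000        # ~4.99 exaflops
fp64_pf = TOTAL_GPUS * A100_FP64_TFLOPS / 1_000             # ~155.2 petaflops
servers_m = (TOTAL_GPUS // 8) * DGX_LIST * DISCOUNT / 1e6   # ~$399 M
total_m = servers_m + STORAGE_NET_M                         # ballpark of $500 M

print(f"{ratio:.2f}X the GPUs, {mixed_ef:.2f} EF mixed precision, "
      f"{fp64_pf:.1f} PF FP64, ~${total_m:.0f} M all in")
```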
As for this being the world’s most powerful AI supercomputer, it will take a backseat bigtime to Aurora and Frontier, and then possibly a number of other larger machines that will emerge in 2022 and 2023.
Facebook is not disclosing the location of the datacenter that is housing the RSC system, but it gets its own dedicated datacenter and we think it is outside of the Forest City, North Carolina facility that Facebook opened in 2012 and has been expanding since then.
I wonder if meta cares one iota about FP64 performance of its AI supercomputer.
I would think for a research AI supercomputer the rich development library available in NVIDIA’s ecosystem far outweighs a desire for OAM. From a business perspective, it’s easy to imagine that for high volume operations a company will focus its large resources on a narrow range of development relating to its operations to get a financially-beneficial architecture, such as OAM or Open Compute Project, up and running. But for research it will want to allow its scientists to be as productive and unrestricted as possible, else it risks falling behind the industry while trying to serve two masters. I doubt meta expects much parallelization using compiler directives for code running on its new supercomputer, and something like Tesla’s Dojo is more a steam hammer for churning out a product than a lab bench for doing research.
As far as I know, AI training still takes a fair amount of FP64 every so often, even if it is also using a lot of FP32 and FP16. So in some way, it still matters. I was using FP64 as a proxy for heavy duty performance, just as I was using INT8 as a proxy for high throughput on small data like that used for AI inference.
Agree with what you said, obviously. Easy is better. But at some point, cheaper is better than easier.
I am sure it will be used to distort elections and implement mass mind control.