Today is the ribbon-cutting ceremony for the “Venado” supercomputer, which was hinted at back in April 2021 when Nvidia announced its plans for its first datacenter-class Arm server CPU and which was talked about in some detail – but not really enough to suit our taste for speeds and feeds – back in May 2022 by the folks at Los Alamos National Laboratory where Venado is situated.
Now we can finally get more details on the Venado system and a little more insight into how Los Alamos will put it to work, and more specifically, why a better balance between memory bandwidth and the compute that depends upon it is perhaps more important to this lab than to other HPC centers of the world.
Los Alamos was founded back in 1943 as the home of the Manhattan Project that created the world’s first nuclear weapons. We did not have supercomputers back then, of course, but plenty of very complex calculations have always been done at Los Alamos; sometimes by hand, sometimes by tabulators from IBM that used punch cards to store and manipulate data – an early form of simulation. The first digital computer to do such calculations at Los Alamos was called MANIAC and was installed in 1952; it could perform 10,000 operations per second and ran Monte Carlo simulations, which use randomness to simulate what are actually deterministic processes. Los Alamos used a series of supercomputers from IBM, Control Data Corporation, Cray, Thinking Machines, and Silicon Graphics over the following four decades, and famously was home of the petaflops-busting “Roadrunner” system built by IBM out of AMD Opteron processors and Cell accelerators, which was installed in 2008 and which represented the first integration of CPUs and accelerators at large scale.
More recently, Los Alamos installed the $147 million “Trinity” system in 2015, composed of Intel Xeon and Xeon Phi CPUs with a combined 2 PB of memory and Intel’s 100 Gb/sec Omni-Path interconnect. Trinity was noteworthy because offloading the results of calculations from that large amount of memory required a burst buffer so the machine could keep on calculating. The replacement for Trinity, which was installed in August 2023, is the “Crossroads” supercomputer, which is based on Intel’s “Sapphire Rapids” Xeon SP processors with HBM2e stacked memory and HPE’s Slingshot interconnect.
Both Los Alamos and its neighbor Sandia National Laboratories have been eager to foster the creation of Arm servers that can be clustered into supercomputers, and Los Alamos had been pushing for more memory bandwidth per core with the ill-fated ThunderX4 Arm server project at Cavium (and then Marvell). Neither the “Triton” ThunderX3 nor the ThunderX4 ever saw the light of day, and so Los Alamos cajoled Intel into creating the HBM variant of Sapphire Rapids, and also did its part in convincing Nvidia to create the “Grace” CG100 Arm server chip that is paired with its current “Hopper” GH100 and GH200 GPU accelerators and, soon, its future “Blackwell” GB100 and GB200 accelerators.
On The Hunt For The Memory Stag
Venado means deer or stag in Spanish and is also the name of a peak in New Mexico’s Sangre de Cristo mountains, which is where the new machine gets its name. As you might expect, Hewlett Packard Enterprise is the prime contractor on the system, and as we expected, the system does not make use of the NVLink Switch shared memory interconnect for GPUs that Nvidia has created to make superpods of shared memory GPUs.
There was some talk two years ago, when the Venado system architecture was being formalized, that Los Alamos might want to use InfiniBand, not HPE’s Slingshot variant of Ethernet, inside of a Cray “Shasta” EX supercomputer system with Grace-Grace and Grace-Hopper compute engines, but as it turns out, Los Alamos is deploying 200 Gb/sec Slingshot 11 interconnects. Our guess? HPE Slingshot 11 with 200 Gb/sec per port is a whole lot cheaper than Nvidia Quantum 2 InfiniBand with 400 Gb/sec ports.
The new Venado system is not a workhorse machine in the Los Alamos fleet, but an experimental one that is funded from the lab’s own budget for the express purpose of doing hardware and software research. Most of the machines that Los Alamos acquires are proposed, built, accepted, and put immediately to work on workloads for the National Nuclear Security Administration, which manages the nuclear weapons stockpile for the US military.
Back in May 2022, what we knew about Venado officially came from that announcement.
We also learned from talking to Jim Lujan, HPC program manager, and Irene Qualters, associate laboratory director of simulation and computation at Los Alamos, that the basic idea was to do an 80/20 split on compute cycles between the two architectures, meaning 80 percent of the flops would come from the GPUs (and we assumed this was done at FP64 precision to carve the compute pie up) and 20 percent would come from the CPUs.
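That split is worth a quick sanity check. Here is the simple arithmetic, ours rather than the lab’s, that turns a GPU share of total flops into a GPU-to-CPU compute ratio:

```python
# Back-of-the-envelope math: the GPU:CPU flops ratio implied by giving the
# GPUs a fixed share of a machine's total FP64 flops.

def gpu_to_cpu_ratio(gpu_share: float) -> float:
    """Ratio of GPU flops to CPU flops for a given GPU share of total flops."""
    return gpu_share / (1.0 - gpu_share)

for share in (0.80, 0.95, 0.98):
    print(f"GPU share {share:.0%} -> {gpu_to_cpu_ratio(share):.0f}:1 GPU:CPU flops")

# GPU share 80% -> 4:1 GPU:CPU flops
# GPU share 95% -> 19:1 GPU:CPU flops
# GPU share 98% -> 49:1 GPU:CPU flops
```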
Considering that in regular GPU-accelerated machines, 95 percent to 98 percent of the flops come from the GPUs, this Venado machine looked quite a bit heavier on the CPU than you might otherwise expect. For good reason, as Gary Grider, HPC division leader at Los Alamos for the past two and a half decades and, among many other things, the inventor of the burst buffer, tells The Next Platform.
“Our applications are these incredibly complex multiphysics, multi-length scale, extreme multi-resolution, incredibly complex packages with millions of lines of code that run on half the machine for six months to get an answer,” Grider explains. “That’s normal for us, which is not true of many other Department of Energy labs, if any. They might do it as an unusual thing, but not normally. The reason that it takes six months to run those applications is because the access to memory is so incredibly sparse and irregular because of the complexity of what they are trying to do – the applications are trying to run a problem that is 50 times bigger than the machine. And so that ends up being the driver for the kinds of machines we buy for our simulation environment because we have these needs. And it ends up being the reason why we don’t buy very many GPUs because GPUs are really good if you can do dense linear algebra. But if you are doing everything sparse and irregular, and everything is an index lookup and the like, then they are no better than the CPU, really. It is truly more about how much memory bandwidth you can buy per dollar than it is about how many flops you can buy.”
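Grider’s point about sparse, irregular access is easy to demonstrate for yourself. Here is a minimal sketch (our illustration with arbitrary sizes, not a Los Alamos benchmark) showing that when every read is an index lookup into an array far larger than cache, throughput is set by the memory system, not by the flops on hand:

```python
# Minimal illustration of dense streaming access versus the index-lookup,
# gather-style access pattern that dominates multiphysics codes. The array
# is sized to be far larger than any cache, so the gather pays full memory
# latency on nearly every element.
import time
import numpy as np

N = 50_000_000                         # ~400 MB of float64
data = np.ones(N)
idx = np.random.randint(0, N, size=N)  # random gather indices

t0 = time.perf_counter()
dense = data.sum()                     # sequential, prefetch-friendly access
t1 = time.perf_counter()
sparse = data[idx].sum()               # random, gather-style access
t2 = time.perf_counter()

print(f"dense streaming sum: {t1 - t0:.3f} sec")
print(f"random gather sum:   {t2 - t1:.3f} sec")  # typically several times slower
```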
It was convenient, perhaps, that the hyperscalers and cloud builders running deep learning recommendation models (DLRMs) on GPUs also needed a way to cache larger amounts of embeddings for those recommenders than could be held in the GPU HBM memory, and the answer Nvidia came up with was Grace, a glorified computational memory controller fronting an additional 480 GB of LPDDR5 memory.
What is not a coincidence, perhaps, is that two 72-core Grace chips connected by NVLink ports and formed into a superchip have 960 GB of memory capacity and 1 TB/sec of memory bandwidth feeding the Arm “Demeter” V2 cores etched into each Grace chip. And thanks to four 128-bit SVE2 vector engines per V2 core, a Grace-Grace superchip can deliver 7.1 teraflops of aggregate peak FP64 compute all by itself. Under normal circumstances, when the GPUs are doing most of the compute, you would have thought Nvidia would use the less heavily vectored “Perseus” N2 cores from Arm. We think Los Alamos in the US and CSCS in Switzerland, with its “Alps” system, pushed Nvidia to go with the V2 cores. And thanks to the relatively few cores in the Grace CPU, the relatively cheap and low-power LPDDR5 memory, and the relatively fat 480 GB of available memory, Grace has a good balance of memory bandwidth per core and a low cost per unit of memory bandwidth.
The Grace CPUs, which we detailed here, each have sixteen LPDDR5 memory controllers, for a total of 546 GB/sec of memory bandwidth and 512 GB of capacity per chip. The delivered version of Grace has only 480 GB of memory and only 500 GB/sec of bandwidth. Go figure. The two CPUs in a Grace-Grace superchip are linked to each other coherently over a 900 GB/sec NVLink chip-to-chip implementation (C2C in the chiplet lingo). That same NVLink C2C interconnect is used to hook a Grace CPU to a Hopper GPU with 80 GB or 96 GB of HBM3 or 141 GB of HBM3E capacity, depending on which model you buy.
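Working backward from those numbers is instructive. The sketch below reconstructs the superchip’s peak FP64 rating and its bandwidth per core from the figures cited above; the per-core pipeline accounting is our assumption, and the clock speed simply falls out of the arithmetic:

```python
# Reconstructing the Grace-Grace superchip's 7.1 teraflops FP64 peak and its
# bandwidth per core from the figures cited above. The clock speed is implied
# by the arithmetic, not a published spec.

cores_per_grace = 72
pipes_per_core  = 4      # 128-bit SVE2 vector engines per Demeter V2 core
fp64_per_pipe   = 2      # two 64-bit lanes in a 128-bit vector
flops_per_fma   = 2      # a fused multiply-add counts as two flops

superchip_fp64 = 7.1e12  # peak FP64 for two Grace chips
ops_per_clock = 2 * cores_per_grace * pipes_per_core * fp64_per_pipe * flops_per_fma
print(f"implied clock: {superchip_fp64 / ops_per_clock / 1e9:.2f} GHz")  # ~3.08 GHz

bw_per_grace = 500e9     # delivered LPDDR5 bandwidth per Grace chip
print(f"bandwidth per core: {bw_per_grace / cores_per_grace / 1e9:.1f} GB/sec")  # ~6.9 GB/sec
```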
Anyway, based on what we knew about Grace and Hopper back in May 2022 and doing an 80/20 split on compute, we did some back of the envelope math and figured you would need 3,125 Grace-Hopper nodes and around 1,500 Grace-Grace nodes. The Grace CPUs have more FP64 oomph than many expected – again, we think this is intentional and driven by HPC customers, not AI customers – and the upshot is that the actual Venado system has 2,560 Grace-Hopper nodes and 920 Grace-Grace nodes.
If you do the math on that, there are a total of 316,800 Grace cores with a total of 15.62 petaflops of peak FP64 performance. The Grace CPUs in the Venado nodes have a total of 2 PB of main memory. (Huh, do you think that is a coincidence? We don’t. It is the same amount of memory as the Trinity system.) That LPDDR5 memory has an aggregate of 2.1 PB/sec of bandwidth.
There are 2,560 Hopper GPUs in the box, with a combined 85.76 petaflops of FP64 performance on the vector cores and 171.52 petaflops of FP64 performance on the tensor cores. That’s 92 percent of FP64 on the Hoppers and 8 percent on the Graces if you use the tensor cores on the H100s, but it is 85 percent on the Hopper if you only use the vector cores. We presume that those Hopper GPUs have 96 GB of HBM3 memory each, for a total of 240 TB of HBM3 memory and 9.75 PB/sec of aggregate bandwidth. If you do the math further, then 81 percent of the memory bandwidth of the machine is on the Hopper GPUs but an amazing 19 percent is on the Grace CPUs.
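For those who want to check our math, the whole tally fits in a few lines. The per-device figures below (33.5 teraflops and 67 teraflops of FP64 per Hopper, 3.55 teraflops per Grace) are inferred by dividing the totals above across the device counts:

```python
# Tallying Venado from the node counts and the per-part numbers implied by
# the totals cited above (33.5 / 67 teraflops FP64 vector / tensor per
# Hopper, and 3.55 teraflops FP64, 480 GB, and 500 GB/sec per Grace).

gh_nodes = 2560                       # Grace-Hopper nodes: one Grace, one Hopper
gg_nodes = 920                        # Grace-Grace nodes: two Graces

graces   = gh_nodes + 2 * gg_nodes    # 4,400 Grace CPUs
cpu_fp64 = graces * 3.55e12           # 15.62 petaflops FP64
cpu_mem  = graces * 480e9             # LPDDR5 capacity
cpu_bw   = graces * 500e9             # LPDDR5 bandwidth

gpu_fp64_vec    = gh_nodes * 33.5e12  # 85.76 petaflops FP64 on vector cores
gpu_fp64_tensor = gh_nodes * 67.0e12  # 171.52 petaflops FP64 on tensor cores
gpu_bw          = 9.75e15             # aggregate HBM3 bandwidth, per the text

print(f"Grace cores: {graces * 72:,}")           # 316,800
print(f"CPU memory:  {cpu_mem / 1e15:.2f} PB")   # ~2.1 PB, same as Trinity
print(f"GPU flops share, tensor: {gpu_fp64_tensor / (gpu_fp64_tensor + cpu_fp64):.0%}")  # ~92%
print(f"GPU flops share, vector: {gpu_fp64_vec / (gpu_fp64_vec + cpu_fp64):.0%}")        # ~85%
print(f"CPU bandwidth share:     {cpu_bw / (cpu_bw + gpu_bw):.0%}")                      # just under a fifth
```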
As part of the contract with HPE, the Venado system will be equipped with a Lustre parallel storage cluster that resides on the Slingshot network, and Grider says that Los Alamos is looking to also try out the DeltaFS file system and perhaps others with the machine.
Grider says that Venado is installed and running now, and acceptance should come in the next two months “unless there’s some gotcha that happens sometimes,” and there should be lots of applications running on the experimental machine by July or so.