There are many interpretations of the word venado, which means deer or stag in Spanish, and this week it gets another one: a supercomputer based on future Nvidia CPU and GPU compute engines and, quite possibly, Nvidia’s interconnect as well, if Los Alamos National Laboratory can convince Hewlett Packard Enterprise to support InfiniBand in its capability class “Shasta” Cray EX machines. And maybe Nvidia’s BlueField DPUs, too, while we are thinking about it.
We first caught wind of the supercomputer now named Venado, which is to be installed at Los Alamos sometime in early 2023, when Nvidia announced its plans for the “Grace” Arm server chip back in April 2021. At the time, we didn’t know much about the Grace CPU and the “A100-Next” GPU was not yet officially known as “Hopper,” but we did know how the NVLink interconnect was going to be used to very tightly couple the CPUs and the GPUs together. We also knew that Los Alamos was going to buy an AI supercomputer using Grace and Hopper, and so was the Swiss National Supercomputing Center, whose machine in Switzerland, nicknamed “Alps,” would come in early 2023 rated at 20 exaflops.
It is not clear if Venado or Alps will be started up first, and all Jim Lujan, HPC program manager at Los Alamos, said to us was that Venado would be the first Grace-Hopper supercomputer installed in the United States. So there is a race there, and we figure that CSCS, as the larger system, demanded in its deal that the Alps machine be installed first. That’s what we would have done.
A year ago, when these deals were announced, we figured that the 20 exaflops was gauged using mixed precision floating point because there is no way anyone is installing a 20 exaflops machine as measured in 64-bit floating point precision. We have had enough trouble getting machines into the field with 2 exaflops peak, to be frank. Nvidia has confirmed that the 10 exaflops rating on Venado comes from using eight-bit FP8 floating point math with sparsity support turned on – we didn’t know Hopper had FP8 math support a year ago when the CSCS and Los Alamos deals were announced – and, based on the statement that Alps would be 7X faster than Nvidia’s own “Selene” supercomputer, we guessed that Alps would come in at 445 petaflops with sparsity support. (We are going to revisit this in a second as we talk about the scale and performance of both Alps and Venado.)
Since that time, the salient characteristics of the Grace CPU and Hopper GPU have been announced. At the GPU Technology Conference this year, the Hopper GPU was unveiled, and we did a deep dive on its technology here as well as a price/performance estimate there based on historical trends for Nvidia GPU compute engine pricing. We did a deep dive on the Grace CPU here, too.
As far as we can tell, neither the Alps nor Venado systems are using the NVSwitch 3 extended switch fabric in the new DGX H100 SuperPODs as the basis of their machines, unlike what Nvidia is doing with its future “Eos” kicker to the Selene supercomputer, which will be rated at 275 petaflops at FP64 on its vector engines without sparsity and 18 exaflops at FP8 on its matrix engines with sparsity. By picking the Shasta design, CSCS and Los Alamos get more traditional clustering at the node level, not at the GPU memory level; the NVSwitch fabric tops out at 256 GPUs, which is not enough scale for either Alps or Venado, and it is too different from the machines that CSCS and Los Alamos are used to building. To be fair, Eos is using big fat InfiniBand pipes to lash together 18 DGX H100 SuperPODs, which both CSCS and Los Alamos could have done but decided not to.
Both Alps and Venado are being built by HPE using the Shasta design, and for capability class machines, that means using the Slingshot 11 interconnect because the HPC market wants and needs multiple high end interconnects to ameliorate risk and because HPE has pretty much demanded it as part of these high end machines so it can cover current and future development for Slingshot. Which is fair, up to a point.
Venado may be at that inflection point, particularly since Nvidia and Los Alamos have a longer-term deal aimed at researching the use of DPUs to create what Nvidia calls “cloud native supercomputing,” which uses less expensive DPUs to offload storage, virtualization, and certain routines such as MPI protocol processing from CPUs and put them on specialized accelerators so that the more expensive CPUs can do more algorithmic work in a simulation. The Los Alamos and Nvidia deal looks to accelerate HPC applications by 30X over the next several years, and as far as we know from Nvidia, the DPUs are not necessarily part of the Venado system. But they could be.
The trouble is that a BlueField-3 DPU from Nvidia bundles together a 400 Gb/sec ConnectX-7 network interface card that speaks both InfiniBand and Ethernet plus 16 “Hercules” Armv8.2+ A78 cores and, as an option for even further acceleration, an Nvidia GPU. It does not, strictly speaking, speak Slingshot 11 Ethernet in the same way that HPE’s own “Cassini” NIC ASIC does. (Although we do know that the Slingshot 10 interface cards were actually ConnectX-6 cards from Nvidia.) There probably is some way to make Slingshot 11 work with 400 Gb/sec BlueField-3 DPUs, but that would also mean gearing down their throughput to 200 Gb/sec, which is the fastest speed that Slingshot currently supports. (HPE has 400 Gb/sec and 800 Gb/sec switch ASICs and adapters in the pipeline, as we have previously reported.) With BlueField-4, due in early 2023 according to the roadmap, the ConnectX-7 ASIC, the Arm DPU cores, and the Nvidia GPU will all be put into one package. That is the one that Los Alamos probably wants to play with in Venado.
Given all of this, we strongly suspect that Los Alamos wants to have more options for Venado than were given by HPE for many of the capability class supercomputer deals that it has taken down, notably the 93.8 petaflops “Perlmutter” supercomputer at Lawrence Berkeley National Laboratory (using AMD CPUs and Nvidia GPUs), the new 2 exaflops “Frontier” supercomputer at Oak Ridge National Laboratory (using AMD CPUs and GPUs), the future 2 exaflops “Aurora” supercomputer going into Argonne National Laboratory (using Intel CPUs and GPUs), the future 2 exaflops plus “El Capitan” supercomputer going into Lawrence Livermore National Laboratory next year, and the 500 petaflops or so Alps supercomputer going into CSCS. All of those machines, and quite a number of smaller ones, have Slingshot interconnects.
It is understandable that Los Alamos, which has hosted some of the most important supercomputers ever created – notably the petaflops-busting “Roadrunner” system built by IBM out of AMD Opteron processors and Cell accelerators, but also more recently the “Trinity” system, made up of Xeon and Xeon Phi CPUs from Intel and Intel’s former 100 Gb/sec Omni-Path interconnect, and Trinity’s “Crossroads” replacement, which is based on Intel’s “Sapphire Rapids” Xeon SP processors and HPE’s Slingshot interconnect – would want to explore a different path with Venado.
“We are institutionally making this as an investment, which means we are looking at the broadest possible application space that we need,” Irene Qualters, associate laboratory director of simulation and computation at Los Alamos, tells The Next Platform. “So not only the Advanced Simulation and Computing program for nuclear weapons management, but we are also looking more broadly across our portfolio for the work that we do in climate, applied energy, and global security more broadly. Because they were doing this as an institutional investment, it really does allow us to explore the space so as we make choices for the future we are really well informed on a broad array of applications and scientific challenges.”
That is, after all, one of the core purposes of the national HPC labs. They get paid to try different things and push them to their limits.
Neither Qualters nor Lujan would comment directly about the choice of network and storage for Venado, but when we explained our thinking, as outlined above, they both smiled. When we nudged, this is what Lujan could say: “We are still evaluating the interconnect technology that we want to look at this part of the mission for this system, which is to look at research areas in high performance computing along system, along network and storage vectors. And so we are closely communicating with HPE and Nvidia on what may be the best interconnect technology to facilitate that part of the role.”
What we have heard through the grapevine ahead of our chat with Qualters and Lujan is that the networking and storage choices have not been made as yet, to keep the options open as long as possible, and there is a chance that InfiniBand could be chosen for the interconnect or the storage or both. Lujan did say that storage would be directly integrated into the network as Oak Ridge is doing with Frontier but did not do with its “Summit” predecessor, and added that it would very likely be an all-flash storage system based on NVM-Express flash and very likely running the open source Lustre parallel file system.
So it sounds like Los Alamos wants one network for both compute and storage, even if Nvidia is pushing to get any part of the interconnect deal it can inside of a large-scale Shasta installation.
The exact compute ratio of CPUs to GPUs and the node count for the Venado system have not been divulged, but Lujan did give us some hints.
“We are looking at an 80/20 split, where 80 percent of the cycles in the nodes come from the GPUs and then the other 20 percent are coming from the CPUs only,” Lujan says. “That way we can facilitate a significant amount of research on the CPU forefront, yet still provide significant amount of cycles that we can get out of the GPUs as well. But we are mindful that we have a unique capability with Grace-Hopper because of this close coupling of the CPU and the GPU.”
That is a statement we can work with. In modern GPU-accelerated systems that are completely loaded up on each node with GPUs, the ratio of CPU compute to GPU compute in FP64 precision – which is what HPC cares about most – is on the order of 5 percent to 95 percent.
If you do the math backwards, getting to 10 exaflops at FP8 precision with sparsity would require 3,125 Hopper GH100 accelerators. Work that back to FP64 on the Hopper’s vector cores without sparsity, at 30 teraflops per H100, and that is a total of 93.75 petaflops from the Hopper units. Using an 80/20 ratio of GPU to CPU compute, that implies another 23.45 petaflops coming from the CPU side of Venado, based on what Lujan said.
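For those who want to check our arithmetic, here is that back-of-the-envelope math as a quick Python sketch. The 3.2 petaflops FP8-with-sparsity figure per Hopper is simply what Nvidia’s 10 exaflops claim implies when spread across the GPUs, the 30 teraflops FP64 vector figure comes from the Hopper launch specs, and the 80/20 split is Lujan’s target; none of these numbers is an official Venado spec.

```python
# Back-of-the-envelope sizing of Venado's GPU side, using the figures above.
# Assumptions: 3.2 petaflops FP8 (with sparsity) and 30 teraflops FP64 vector
# (without sparsity) per Hopper GPU, plus Lujan's 80/20 GPU-to-CPU cycle split.

venado_fp8_exaflops = 10.0      # Nvidia's FP8-with-sparsity rating for Venado
hopper_fp8_petaflops = 3.2      # implied per-GPU FP8 sparse rating
hopper_fp64_teraflops = 30.0    # FP64 vector rating per H100, no sparsity

gpu_count = venado_fp8_exaflops * 1000 / hopper_fp8_petaflops
gpu_fp64_petaflops = gpu_count * hopper_fp64_teraflops / 1000

# If the GPUs supply 80 percent of the FP64 cycles, the CPUs supply the other 20.
cpu_fp64_petaflops = gpu_fp64_petaflops * (20 / 80)

print(f"Hopper GPUs needed: {gpu_count:,.0f}")         # ~3,125
print(f"GPU FP64 total: {gpu_fp64_petaflops:.2f} PF")  # 93.75 PF
print(f"CPU FP64 target: {cpu_fp64_petaflops:.2f} PF") # ~23.44 PF, call it 23.45 PF
```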
We don’t know what the vector units in the Grace CPU look like, but we do know that a Grace chip has 72 working cores. If Nvidia uses the Neoverse “Zeus” V1 core, it has a pair of 256-bit SVE vector units. The Neoverse V2 cores in the “Poseidon” generation could have wider vectors (we don’t know) and could be used in the Grace chip. (Again, we don’t know.) We can’t speculate much on clock speeds, either.
Now, here is where it is going to get a little weird, but let’s go for it and see where it takes us.
Nvidia has shown benchmark tests with a pair of Grace CPUs with 144 threads beating out a pair of 36-core “Ice Lake” Xeon SP-8360Y Platinum processors running at 2.4 GHz by 2X on the WRF weather modeling code. We don’t have gigaflops ratings for the Xeon SP-8360Y Platinum processor, but we do have gigaflops ratings for a lot of other Intel chips in this document, and one of them is the Xeon SP-8360H processor, which has 24 cores running at 3 GHz and is rated at 1,536 gigaflops. Add more cores and slow down the clocks, and an Ice Lake Xeon SP-8360Y Platinum processor should be rated at around 1,920 gigaflops. Which implies that a Grace CPU is going to be rated at around twice that, or 3,840 gigaflops.
So, given this, how many Grace CPUs will be needed in the Venado system to yield 23.45 petaflops? That works out to 6,107 in total. With rounding errors, let’s call it 6,100.
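Again as a sketch rather than anything confirmed, and assuming the 3,840 gigaflops (3.84 teraflops) FP64 per Grace estimate we just derived from the WRF comparison holds up, the Grace count falls out like this:

```python
# How many Grace CPUs does it take to hit the 23.45 petaflops FP64 CPU target?
# Assumes our own estimate of 3.84 teraflops FP64 per Grace CPU.

cpu_fp64_target_petaflops = 23.45
grace_fp64_teraflops = 3.84

grace_count = cpu_fp64_target_petaflops * 1000 / grace_fp64_teraflops
print(f"Grace CPUs needed: {grace_count:,.0f}")  # ~6,107, call it 6,100
```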
Now, the Venado system can use a one-to-one pairing of Graces and Hoppers on a single package, but if you look closely at the possible configurations that Nvidia envisions, it could be a coupling of one Grace and two Hoppers, one Grace-Grace superchip and two Hopper SXM5s, one Grace-Grace superchip and four Hopper SXM5s, or even one Grace-Grace superchip and eight Hopper SXM5s.
Now you can see why the way the CPUs and GPUs are paired matters. If you hang too many Hoppers off of a Grace, then you will need a lot more Grace-Grace modules on the CPU-only side to reach the 20 percent of total FP64 performance that Los Alamos is shooting for.
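To put some numbers on that, here is a hedged sketch of the FP64 share the CPUs would contribute under each of the pairings Nvidia has floated, again using our own 3.84 teraflops per Grace and 30 teraflops per Hopper estimates rather than any published Venado figures:

```python
# FP64 share contributed by the CPUs in each Grace/Hopper pairing Nvidia has
# floated, using our estimates of 3.84 teraflops per Grace and 30 teraflops per
# Hopper. None of these pairings alone gets near the 20 percent CPU share Los
# Alamos is targeting, which is why CPU-only Grace-Grace nodes are needed.

GRACE_TF, HOPPER_TF = 3.84, 30.0

pairings = {
    "1 Grace + 1 Hopper": (1, 1),
    "1 Grace + 2 Hoppers": (1, 2),
    "Grace-Grace + 2 Hoppers": (2, 2),
    "Grace-Grace + 4 Hoppers": (2, 4),
    "Grace-Grace + 8 Hoppers": (2, 8),
}

for name, (cpus, gpus) in pairings.items():
    cpu_tf = cpus * GRACE_TF
    total_tf = cpu_tf + gpus * HOPPER_TF
    print(f"{name}: CPUs supply {100 * cpu_tf / total_tf:.1f}% of FP64")
# Output ranges from about 11.3 percent (1:1) down to about 3.1 percent (2:8).
```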
So we think Venado will have 3,125 Grace-Hopper nodes, which will sometimes be run without firing up their GPUs, plus around 1,500 Grace-Grace nodes to provide the rest of the CPU-only performance. Call it somewhere north of 4,600 nodes in total.
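Here is the node-count arithmetic behind that guess, again treating the per-chip flops as our own estimates rather than anything Nvidia or Los Alamos has confirmed:

```python
# Reconstructing the node-count guess. The Grace CPUs inside the 3,125
# Grace-Hopper nodes chip in toward the 23.45 petaflops FP64 CPU target, and
# Grace-Grace nodes (two CPUs each) make up the remainder.

GRACE_TF = 3.84          # our estimated FP64 teraflops per Grace CPU
cpu_target_pf = 23.45    # CPU share of FP64 implied by the 80/20 split
gh_nodes = 3125          # Grace-Hopper nodes, one Grace apiece

gh_cpu_pf = gh_nodes * GRACE_TF / 1000            # ~12 PF from the GH nodes
remainder_pf = cpu_target_pf - gh_cpu_pf          # ~11.45 PF still needed
gg_nodes = remainder_pf * 1000 / (2 * GRACE_TF)   # two Graces per GG node

print(f"Grace-Grace nodes needed: {gg_nodes:,.0f}")     # ~1,491, call it 1,500
print(f"Total nodes: {gh_nodes + round(gg_nodes):,}")   # ~4,616, north of 4,600
```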
That is a guess, but it is a reasonable one, especially looking at the ratio of Xeon and Xeon Phi compute engines in the Trinity machine.