The competition for the compute engines in hybrid HPC and AI supercomputer systems is heating up, and it is beginning to look a bit like back to the future with Cray on the rise and AMD also revitalized.
Nothing is better proof of confidence in AMD’s future CPU and GPU roadmaps, and in Cray’s homegrown “Slingshot” interconnect and “Shasta” systems design, than the fact that the two vendors have just partnered to take down the “Frontier” exascale system to be installed at Oak Ridge National Laboratory. Frontier is the successor to the current “Summit” pre-exascale system, which was built by IBM in conjunction with GPU maker Nvidia and InfiniBand switch maker Mellanox Technologies.
Nvidia’s sudden desire back in March to shell out $6.9 billion to acquire Mellanox now makes a little bit more sense.
The Frontier system, to be installed in 2021, is part of the CORAL-2 procurement from the US Department of Energy, which is putting exascale systems at Oak Ridge as well as at Argonne National Laboratory – in that case the “Aurora A21” system, which pairs Intel Xeon CPUs and Xe GPU accelerators in a Cray Shasta system with Cray’s Slingshot interconnect between the nodes – and at Lawrence Livermore National Laboratory, home to the “Sierra” system that, like Summit at Oak Ridge, is based on the combination of IBM’s “Nimbus” Power9 processors, Nvidia “Volta” Tesla GPU accelerators, and Mellanox 100 Gb/sec EDR InfiniBand networks.

Lawrence Livermore has yet to announce the winning bid for its CORAL-2 system, but we strongly suspect that the DOE will want to spread the money around – and its risk – by awarding IBM the contract to build the “El Capitan” follow-on to Sierra, using Big Blue’s Power10 processors as well as future GPU accelerators from Nvidia and future interconnects from Mellanox. We talked about the $1.8 billion that the US government is ponying up for exascale systems a year ago, and not much was known at the time about Aurora A21 (the successor to the failed pre-exascale machine based on Intel’s “Knights Hill” many-core processors and Omni-Path 200 interconnect, which never saw the light of day and was supposed to be installed last year), Frontier, or El Capitan.
With the awarding of the CORAL-2 contract for the Oak Ridge machine to Cray – which is the prime contractor, as the system maker, rather than the chip maker, should be – we are learning a little more about how the Frontier supercomputer will be built and about the value of co-design.
In a press conference previewing the CORAL-2 award to Cray and AMD for Frontier, Pete Ungaro, Cray’s chief executive officer, said that the Frontier system would comprise more than 100 cabinets of machinery and would deliver in excess of 1.5 exaflops of raw double precision floating point math capability. Considering that a Cray system architecture generally scales to 200 cabinets, that seems to imply that Cray could, in theory, build a 3 exaflops system if someone wanted to pay for it. To give Frontier some physicality, Ungaro said that the machine will be about the size of two basketball courts, will weigh more than 1 million pounds, and will have more than 90 miles of cabling.
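For those who want to check the math, a quick back-of-the-envelope calculation using the figures above bears that out – the cabinet counts here are round numbers we are assuming, not disclosed specifications:

```c
#include <stdio.h>

int main(void) {
    /* Round numbers from the announcement, not disclosed specs */
    const double system_exaflops = 1.5;    /* promised peak, double precision    */
    const double cabinets        = 100.0;  /* "more than 100 cabinets"           */
    const double max_cabinets    = 200.0;  /* typical Shasta scaling limit cited */

    double pf_per_cabinet = system_exaflops * 1000.0 / cabinets;     /* petaflops */
    double max_exaflops   = pf_per_cabinet * max_cabinets / 1000.0;

    printf("~%.0f petaflops per cabinet\n", pf_per_cabinet);         /* ~15 PF */
    printf("~%.0f exaflops at 200 cabinets\n", max_exaflops);        /* ~3 EF  */
    return 0;
}
```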
Three things are immediately astounding about the Frontier system that Cray is building with the substantial assistance of AMD. The first is that the Shasta racks in the Frontier system will be able to deliver up to 300 kilowatts of power density per cabinet. It doesn’t take a supercomputer to figure out that this is a liquid-cooled Shasta system, and even if it is not using a standard 19-inch rack (as we think it will not be), Frontier is going to be setting a very high bar for compute density all the same. Hyperscale datacenters can do maybe 15 kilowatts to 30 kilowatts per rack, by comparison. That compute density in Frontier is being enabled in part by a new heterogeneous CPU-GPU blade design that Cray and AMD worked on together. Frontier will be sitting in a 40 megawatt power envelope, which is about half of what everyone was worrying an exascale system might consume five years ago.
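The power math is just as easy to run, and it shows how much headroom that 40 megawatt envelope leaves – again, we are rounding “more than 100 cabinets” down to an assumed 100:

```c
#include <stdio.h>

int main(void) {
    const double cabinets       = 100.0;  /* "more than 100 cabinets"        */
    const double kw_per_cabinet = 300.0;  /* peak power density per cabinet  */
    const double envelope_mw    = 40.0;   /* total facility power envelope   */

    double compute_mw = cabinets * kw_per_cabinet / 1000.0;
    printf("Compute cabinets at full tilt: ~%.0f MW of a %.0f MW envelope\n",
           compute_mw, envelope_mw);
    /* Roughly 30 MW, leaving room for storage, networking, and overhead */
    return 0;
}
```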
People are clever, aren’t they? That’s why we keep them around. . . .
The second thing that is striking about Frontier is that AMD ran the table on compute, winning the deal for both the CPUs and the GPU accelerators. Given that Thomas Zacharia, director of Oak Ridge, was bragging that the current Summit system built by IBM was completed nine months ahead of schedule and came in $30 million under the budget anticipated way back when the CORAL-1 contracts were bid, and given that IBM and Nvidia did a very good job helping the scientists working at or with Oak Ridge get their codes ported from Titan to Summit, you would have thought that Frontier would almost assuredly pair IBM’s future Power10 processors with a future Nvidia GPU accelerator. But that didn’t happen, and we can only speculate as to why.
This win appears to be about performance and price/performance.
Everyone was expecting Frontier to crest above 1 exaflops at double precision, and with an expected budget of between $400 million and $600 million per exascale system, very few people thought the costs would not go right up to the top of that range. Nonetheless, promising in excess of 1.5 exaflops even at the top end of the budget range – $500 million for the Frontier system itself plus $100 million in non-recurring engineering (NRE) costs relating to the development of the compute, storage, networking, and software technologies that make the Frontier machine a true system – was more than many expected. AMD is coming through with bang for the buck, which, historically speaking, is precisely AMD’s job.
Welcome back, AMD.
As we have pointed out before with the Summit system, the GPU accelerators, which deliver the bulk of the raw compute in the machine, also dominate the cost of the machine. This is reasonable since, in this case, the Nvidia Volta Tesla GPU accelerators are among the densest and most sophisticated computing devices ever invented, with a huge GPU, high bandwidth stacked memory, and packaging to bring it all together. But AMD knows how to make GPU cards with HBM, too, and it has apparently cooked up a future GPU that can do the heavy lifting that both traditional HPC simulation and modeling and new-fangled AI workloads require – and at a relatively compelling price.
“When Nvidia was the only game in town, they could charge a premium price for their accelerators,” Jeff Nichols, associate laboratory director for Computing and Computational Sciences at Oak Ridge, tells The Next Platform, putting it extremely delicately. “The high bandwidth memory and the accelerator costs dominate the costs of these systems.”
That has, without a doubt, not changed with the Frontier machine. But AMD seems to have pushed the envelope on price/performance. When we did some prognosticating a year ago, we were expecting Frontier to come in at 1.3 exaflops double precision on a $500 million budget, which works out to about $385 per teraflops at the system level (including NRE and support costs); our pricing was a little low, and so were the flops. Frontier is coming in at about $400 per teraflops, but its performance is 50 percent higher than the baseline 1 exaflops needed to break through the exascale barrier and, importantly for political and economic reasons, possibly large enough for Frontier to rank as the most powerful system in the world when it is operational in late 2021 and accepted sometime in 2022. Summit, by comparison, delivers 207 petaflops at a cost of $214 million, or about $1,034 per teraflops. So this is a big change in bang for the buck with Frontier, and presumably a number that IBM and Nvidia could not – or would not – hit.
To put that incremental performance into perspective: The difference between what we expected from Frontier and what Cray and AMD are promising to deliver is roughly an entire Summit system worth of raw performance, and the difference between where Frontier will end up and the 1 exaflops barrier is nearly two and a half Summits. The percentages alone make these leaps look smaller than they are.
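Here is the arithmetic behind those comparisons, using the published system prices and peak flops cited above – the Frontier figures assume the full $600 million, NRE included, and exactly 1.5 exaflops:

```c
#include <stdio.h>

int main(void) {
    /* Summit: $214 million for 207 petaflops of peak double precision */
    const double summit_cost   = 214e6, summit_pf   = 207.0;
    /* Frontier: $600 million all-in ($500M system + $100M NRE) for 1.5 exaflops */
    const double frontier_cost = 600e6, frontier_pf = 1500.0;

    printf("Summit:   $%.0f per teraflops\n", summit_cost / (summit_pf * 1000.0));
    printf("Frontier: $%.0f per teraflops\n", frontier_cost / (frontier_pf * 1000.0));

    /* The increments, measured in Summits */
    printf("1.5 EF over the 1.3 EF we expected: %.0f PF, or %.2f Summits\n",
           frontier_pf - 1300.0, (frontier_pf - 1300.0) / summit_pf);
    printf("1.5 EF over the 1.0 EF barrier:     %.0f PF, or %.2f Summits\n",
           frontier_pf - 1000.0, (frontier_pf - 1000.0) / summit_pf);
    return 0;
}
```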
The exact feeds and speeds of the AMD CPUs and GPUs that are at the heart of the system were not divulged, but Forrest Norrod, senior vice president and general manager of the Enterprise, Embedded, and Semi-Custom group at AMD, told The Next Platform what it isn’t, which is almost as useful. The CPU is a unique, custom device that is based neither on the impending “Rome” second generation Epyc processor nor on the future “Milan” follow-on. Lisa Su, AMD’s chief executive officer, said that the processor used in the Frontier machine was “beyond Zen 2,” the core that is being used in the Rome chips. Norrod joked that when this custom Epyc chip is divulged, it will be named after an Italian city. . . . The Radeon Instinct GPU accelerators in Frontier are not derivative of the current “Vega” or “Navi” GPU designs, but are a custom part as well. In both cases, the chips have had special instructions added to them for goosing the performance of both HPC and AI workloads, according to Su, but the exact nature of those enhancements is not being revealed.
The other secret sauce that AMD brought to bear in Frontier is an enhanced Infinity Fabric interconnect between the CPUs and the GPUs that will offer coherent memory access across the devices, much as IBM and Nvidia have done across the Power9 CPUs and Volta GPUs through NVLink interconnects. In fact, keeping this fat node approach for compute and coherency was critical for Oak Ridge, so AMD and Cray really had no choice but to deliver this capability. The Frontier design will lash four Radeon Instinct GPUs to each Epyc processor – a more aggressive ratio than was used with Summit, which had six Volta GPUs for every pair of Power9 processors. And it looks, at first blush, like Frontier will be based on a single-socket server node, too, which is interesting indeed.
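To make the fat node idea a little more concrete, here is a toy sketch comparing the two node layouts – the CPU and GPU counts come from the descriptions above, while everything else about the Frontier node remains undisclosed:

```c
#include <stdio.h>

/* Toy model of a "fat node": one coherent memory domain spanning CPUs and GPUs */
struct fat_node {
    const char *name;
    int cpus;
    int gpus;
};

int main(void) {
    struct fat_node nodes[] = {
        /* Summit's layout is documented; Frontier's is as described above */
        { "Summit   (2x Power9 + 6x Volta over NVLink)",                  2, 6 },
        { "Frontier (1x Epyc + 4x Radeon Instinct over Infinity Fabric)", 1, 4 },
    };

    for (int i = 0; i < 2; i++) {
        printf("%-64s %.1f GPUs per CPU\n",
               nodes[i].name, (double)nodes[i].gpus / nodes[i].cpus);
    }
    return 0;
}
```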
As to networking outside of the nodes, Frontier will of course use Cray’s homegrown, HPC-style Slingshot interconnect, which is a superset of Ethernet that has dynamic routing and congestion control features like Cray’s previous “Gemini” and “Aries” interconnects, used in the XE and XC supercomputer lines, respectively, while at the same time maintaining compatibility with the enterprise and campus networks that are based on Ethernet. Some of the NRE work being done by Cray and AMD is to integrate Slingshot with the Infinity Fabric such that the cluster network can enable direct addressing of the CPU and GPU memories across nodes, according to Steve Scott, chief technology officer at Cray.
The software stack for AMD compute is also going to be significantly advanced as part of the Frontier system, with substantial enhancements to the ROCm open source tool chain from AMD and its integration with the Cray Programming Environment. The system software stack will, according to Scott, keep the customer-facing software – the same MPI, OpenMP, and OpenACC libraries, for instance – looking the same to users and programmers as it did on both Titan and Summit, but the Cray Linux platform and its related system software will be less monolithic and, as Scott put it, will have a “well-defined, clean, open, and documented API stack that allows the mixing and matching of software at a bunch of different levels and that is much more modular and containerized.”
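To illustrate what “looking the same to users and programmers” means in practice, here is a minimal sketch of the kind of directive-based code that scientists already run on Summit and would expect to carry over unchanged. This is plain OpenMP target offload, not any disclosed Frontier-specific API, and it assumes the ROCm and Cray compilers handle the directives much as the Nvidia-targeted compilers do today:

```c
#include <stdio.h>

#define N 1000000

/* A simple axpy kernel offloaded to an accelerator with standard OpenMP
   directives; the source code does not name a GPU vendor, which is the point. */
int main(void) {
    static double x[N], y[N];
    const double a = 2.0;

    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* The same pragma compiles for Volta GPUs on Summit today and, in
       principle, for the custom Radeon Instinct GPUs in Frontier tomorrow. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
    }

    printf("y[0] = %.1f (expected 4.0)\n", y[0]);
    return 0;
}
```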
As for storage, we presume that this will be integrated directly into the Frontier cabinets, which is also one of the selling points of the Slingshot interconnect. How much of that $500 million hardware budget is for storage is not clear.