Supercomputer interconnects have not been this exciting in a long time, and a confluence of forces has come together to put high speed, low latency networks at the forefront of systems architecture. And thanks to the flood of money being spent on AI training and inference systems, which is an order of magnitude larger than traditional HPC systems spending, there has never been a better time to try to get a new network into the field and generate some revenues.
Which is a convenient coincidence for Eviden, the supercomputing unit of European systems and services supplier Atos, which has been deploying its own supercomputing interconnect, called Bull Exascale Interconnect, or BXI for short, for several years.
The other serendipitous bit of timing is that the hyperscalers and cloud builders that are driving the revenues in supercomputing (in the broadest sense of that word) have tolerated InfiniBand in their networks but are growing increasingly intolerant (if not allergic) to InfiniBand as they try to extend beyond its scaling limits with Ethernet. They would rather have a special version of Ethernet for their AI and HPC backend networks than a totally alien InfiniBand, which is why they have either created their own interconnects, as Google has done, or banded together in July 2023 to form the Ultra Ethernet Consortium to create a better Ethernet together, as Microsoft, Meta Platforms, Oracle, Alibaba, Baidu, Tencent, and ByteDance have done formally. (We strongly suspect that Amazon Web Services and Google have joined the UEC behind the scenes.)
The UEC has the explicit goal of pushing the scale of relatively flat Ethernet networks up to 1 million endpoints over the next several years, giving Ethernet more scale than InfiniBand or even proprietary networks of years gone by such as Cray’s “Aries” XC interconnect or Bull’s BXI before that French company was acquired by Atos and its core systems business renamed Eviden.
Along with switch ASIC makers Broadcom, Cisco Systems, and Hewlett Packard Enterprise, Eviden is a founding member of the UEC. Cornelis Networks, which bought the Omni-Path variant of InfiniBand from Intel, subsequently joined up, and so did Nvidia, which obviously controls InfiniBand and which also has a very respectable high-end Spectrum Ethernet switch business as well as multi-protocol ConnectX SmartNICs and BlueField DPUs to use at the endpoints. Marvell Technology, which has its own Prestera ASICs as well as the TeraLynx ASICs it picked up with its Innovium acquisition, also joined last year.
Eviden was already thinking about using Ethernet as a transport for BXI nearly four years ago, Eric Eppe, group vice president of the HPC, AI, and quantum portfolio and strategy at Eviden, tells The Next Platform. More than three years ago, long before there was a UEC, Eviden had decided to move the BXI protocol from a proprietary switch ASIC to one based on Ethernet so BXI could be more widely adopted in the existing HPC and burgeoning AI markets. The explosion in the scale of AI clusters, which utterly dwarf the cost and size of exascale supercomputers designed predominantly to support HPC simulation and modeling workloads, is why that 1 million endpoint stake is in the ground at UEC.
We’re not there yet, of course. But Eviden wants to get there with the combination of BXI and Ethernet.
BXI, which we first drilled down into back in November 2022, is a commercialized version of the Portals protocol, which has been under development at Sandia National Laboratories for the past three decades. BXI v1 and BXI v2 have been able to stand toe-to-toe with InfiniBand networks, and with BXI v3, which was recently announced, the switch and NIC ASICs are moving to an Ethernet transport. And to be very specific, Eviden has worked with a merchant silicon maker to create a custom version of an existing Ethernet switch ASIC that can do some of the special BXI magic that makes it suitable for HPC and AI workloads.
The BXI v2 switch ASIC was homegrown and etched using TSMC’s N12 12 nanometer process node, and to make the BXI v3 switch ASIC, Eviden was looking at having to use the N4 4 nanometer or N3 3 nanometer nodes. “That first step in making the chip is tens of millions of dollars,” says Eppe. “And if you want to keep on the advanced process nodes, you need to have a very large volume of customers over which to amortize this investment. Which is why more than three years ago, we looked at the Ethernet market and wondered if we could reuse some of the technologies that are out there, make our changes, and benefit from the Ethernet ecosystem. So we basically found one and brought it up to our specifications.”
The port to port hop on this modified Ethernet ASIC running the BXI protocol is roughly 200 nanoseconds, which is a factor of ten better than the 2 microseconds that a very fast Ethernet switch ASIC takes on a port to port hop because it has to go through the entire TCP/IP stack to move data.
We asked Eppe if Eviden had partnered with Broadcom or Cisco or Marvell or Nvidia for the BXI v3 switch ASIC, but all he confirmed with a smile is that it is one of them and said no more about it. With Broadcom and Marvell having extensive experience in packaging as well as shepherding things through the foundries of Taiwan Semiconductor Manufacturing Co, these are the obvious choices. Nvidia might ask too much money to license a Spectrum ASIC, but Mellanox did have a history of licensing InfiniBand technology to HPC customers. (Oracle is one, and the national supercomputing centers in China that host the Tianhe family of supercomputers, built with the TH-Express interconnect, are another.) It is hard to game theory this one. It is probably not Cisco, but it could be.
(If we were doing the choosing, we would go with Innovium TeraLynx from Marvell because it is already a stripped down Ethernet aimed at the very large, 100,000-node Clos networks used by hyperscalers and cloud builders.)
The other reason to go with Ethernet and to partner for a semi-custom ASIC is the ongoing investment it takes to keep pushing up bandwidth.
“What is important for us is that we start to keep up the cadence of doubling the bandwidth every two years,” says Eppe with a smile. “So our partner has to be someone who can afford the release of the new switch ASIC every two years.”
Marvell just announced a broad, five-year technology partnership with Amazon Web Services, and switching (including Innovium and possibly Prestera ASICs) is part of that deal. The other three ASIC makers we mentioned – Broadcom, Cisco, and Nvidia – have plenty of customers – and therefore plenty of money – to keep Ethernet on the Moore’s Law curve. But Marvell needs the deal more, and has ASIC customization services as part of what it does. Hmmmmmm. . . .
One more thing: Whatever modifications were made to the Ethernet switch ASIC that Eviden picked as the foundation for BXI v3, those customizations are its own and no one else can use them.
With this generation of HPC and AI backend networks, congestion control and adaptive routing, enabled by packet spreading and multipathing, are critical. Packet spreading is great for generic Ethernet networks, where the workloads are relatively simple and do not care what order packets arrive in from endpoints. But HPC and AI applications – and importantly the MPI protocol that underpins these loosely coupled applications, which is, in effect, doing a kind of global shared memory – need packets to arrive in order so they can be processed in order.
The BXI v3 switch is matched to a companion BXI v3 SmartNIC, and that network interface card is designed wholly and completely by Eviden and is really the secret sauce in how you build a strong HPC and AI network. First, a slew of stuff is offloaded from the endpoint host or the switch to these SmartNICs, including work done on behalf of the Open MPI message passing interface (MPI) implementation, the NCCL and RCCL collective communication libraries, and the Open Fabrics Interfaces (OFI) communication API they use to talk to the fabric.
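As a concrete illustration of what that OFI layer looks like from the software side, here is a minimal libfabric sketch that opens an endpoint and posts a tagged receive. The “bxi” provider name is an assumption on our part, and error handling, completion polling, and address exchange are omitted; the point is that on a NIC with matching offload, the search for the posted buffer when a message arrives happens in hardware rather than in the MPI library on the host CPU.

```c
/*
 * Minimal libfabric (OFI) sketch: open a reliable endpoint and post a tagged
 * receive. Provider name "bxi" is hypothetical; error handling, completion
 * polling, and address exchange are omitted for brevity.
 */
#include <stdio.h>
#include <stdlib.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    hints->caps = FI_TAGGED;              /* ask for tag-matching capability */
    hints->ep_attr->type = FI_EP_RDM;     /* reliable, unconnected endpoint  */
    /* hints->fabric_attr->prov_name = strdup("bxi");  <- hypothetical name  */

    struct fi_info *info;
    if (fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info) != 0)
        return EXIT_FAILURE;

    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_ep *ep;
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    /* Bind a completion queue and address vector, then enable the endpoint. */
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_TAGGED };
    struct fid_cq *cq;
    fi_cq_open(domain, &cq_attr, &cq, NULL);
    fi_ep_bind(ep, &cq->fid, FI_SEND | FI_RECV);

    struct fi_av_attr av_attr = { .type = FI_AV_MAP };
    struct fid_av *av;
    fi_av_open(domain, &av_attr, &av, NULL);
    fi_ep_bind(ep, &av->fid, 0);
    fi_enable(ep);

    /* Post a tagged receive; with matching offload, the NIC hardware finds
     * this buffer when a matching message arrives. */
    char buf[4096];
    fi_trecv(ep, buf, sizeof buf, NULL, FI_ADDR_UNSPEC, 42 /* tag */, 0, NULL);

    printf("tagged receive posted via provider %s\n",
           info->fabric_attr->prov_name);
    fi_freeinfo(info);
    return EXIT_SUCCESS;
}
```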
Second, the use of packet spraying essentially turns the synchronous process of collective operations – all-reduce and all-gather are the common ones – into asynchronous ones on the wire to get around traffic bottlenecks, and the NIC then reorders all of the packets and presents them in order to the host node and its CPUs and GPUs, making everything look synchronous again, as if things were never mixed up in the first place.
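To make the reordering idea concrete, here is a toy sketch of how a receive side can use per-flow sequence numbers and a small reorder window to deliver sprayed packets in order. The window size, the sequence numbering, and the printf standing in for delivery are illustrative assumptions, not a description of the actual BXI v3 hardware.

```c
/*
 * Toy reorder buffer: packets of one flow arrive out of order over multiple
 * paths and are delivered to the host in sequence. Assumes the sender never
 * runs more than WINDOW packets ahead.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WINDOW 8                          /* reorder window, in packets */

struct reorder_buf {
    uint32_t next_seq;                    /* next sequence number to deliver      */
    bool     present[WINDOW];             /* slot occupancy, indexed seq % WINDOW */
    char     data[WINDOW][64];            /* toy payload slots                    */
};

/* Called for every arriving packet, in whatever order the paths deliver them. */
static void on_packet(struct reorder_buf *rb, uint32_t seq, const char *payload)
{
    snprintf(rb->data[seq % WINDOW], sizeof rb->data[0], "%s", payload);
    rb->present[seq % WINDOW] = true;

    /* Drain from the head of the window as long as it stays contiguous. */
    while (rb->present[rb->next_seq % WINDOW]) {
        printf("deliver seq %u: %s\n", rb->next_seq, rb->data[rb->next_seq % WINDOW]);
        rb->present[rb->next_seq % WINDOW] = false;
        rb->next_seq++;
    }
}

int main(void)
{
    struct reorder_buf rb = { .next_seq = 0 };
    on_packet(&rb, 2, "packet two");      /* arrives early, gets buffered   */
    on_packet(&rb, 0, "packet zero");     /* delivered immediately          */
    on_packet(&rb, 1, "packet one");      /* unblocks buffered packet two   */
    return 0;
}
```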
Matching of reception buffers is also offloaded to the BXI v3 SmartNIC, which can register up to 32 million potential reception buffers and can manage over 2 million contexts in the NIC. (A BlueField-3 DPU from Nvidia, by comparison, has enough oomph to manage 250,000 contexts, says Eppe.) The BXI v3 network adapter has enough packet buffer space to manage packets that are up to 9 MB in size, which is useful for the so-called “elephant flows” that are common in AI applications.
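The buffer matching being offloaded works, conceptually, like the classic MPI and Portals matching rules: an incoming header is compared against posted receives by source and tag, wildcards included, in posted order. The toy sketch below shows that logic at host scale; the real NIC tracks millions of such entries in hardware, and the structures here are simplified for illustration.

```c
/*
 * Toy illustration of MPI/Portals-style receive matching: an incoming
 * message header is matched against posted receive buffers by (source, tag),
 * with wildcards, in the order the buffers were posted.
 */
#include <stddef.h>
#include <stdio.h>

#define ANY (-1)                       /* wildcard source or tag */

struct posted_recv {
    int   source, tag;                 /* what the application asked for */
    void *buffer;                      /* where the payload should land  */
    int   consumed;                    /* set once a message matches it  */
};

static struct posted_recv *match(struct posted_recv *list, size_t n,
                                 int src, int tag)
{
    for (size_t i = 0; i < n; i++) {
        struct posted_recv *r = &list[i];
        if (!r->consumed &&
            (r->source == ANY || r->source == src) &&
            (r->tag    == ANY || r->tag    == tag)) {
            r->consumed = 1;           /* first match wins, in posted order */
            return r;
        }
    }
    return NULL;                       /* no match: parked on an unexpected queue */
}

int main(void)
{
    char a[64], b[64];
    struct posted_recv posted[] = { { 3, 7, a, 0 }, { ANY, ANY, b, 0 } };

    /* A message from rank 3 with tag 7 lands in the specific buffer first. */
    struct posted_recv *hit = match(posted, 2, 3, 7);
    printf("%s\n", hit == &posted[0] ? "matched the (source 3, tag 7) buffer"
                                     : "matched the wildcard buffer");
    return 0;
}
```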
The neat bit about these much-improved Ethernet networks and their SmartNICs is how they have tons of telemetry coming off the NICs and switches, and when congestion starts, the network automagically reorganizes the routes for packets. Eppe says that BXI v3 can do a full route update to all of the switches in a network in less than 200 milliseconds, and in a lot of cases that is done in less than 50 milliseconds. That seems like an eternity compared to a 200 nanosecond port hop, but this is very fast for congestion control and adaptive routing.
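The adaptive routing side of this is conceptually simple, even if the hardware is not: when telemetry shows one path backing up, steer the next chunk of traffic onto a less loaded equal-cost path. Here is a generic sketch of that selection step; it is not Eviden’s algorithm, which has not been published.

```c
/*
 * Conceptual sketch of telemetry-driven adaptive routing: among several
 * equal-cost next hops, send the next flowlet out the port whose egress
 * queue, as reported by switch telemetry, is currently shallowest.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct next_hop {
    int      port;         /* egress port on the switch             */
    uint32_t queue_depth;  /* bytes queued there, fed by telemetry  */
};

/* Return the index of the least-congested next hop. */
static size_t pick_next_hop(const struct next_hop *hops, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (hops[i].queue_depth < hops[best].queue_depth)
            best = i;
    return best;
}

int main(void)
{
    struct next_hop hops[] = { { 1, 9000 }, { 2, 1200 }, { 3, 4800 } };
    printf("route the next flowlet out port %d\n",
           hops[pick_next_hop(hops, 3)].port);
    return 0;
}
```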
What may not be so obvious is why low latency, high bandwidth, and congestion control are so important. Eppe says that in a typical GPU cluster running at any appreciable scale – meaning thousands or tens of thousands of GPUs sharing an AI training workload – the GPUs are actually active, computing stuff, about a third of the time. The other two thirds of the time, they are waiting for other GPUs to finish computing and communicating over the network, and most of that wait is networking because the computation is synchronized.
The effect of this, as we are fond of pointing out, is that the effective price/performance of an AI cluster is three times worse than you think based on raw speeds and feeds. If you think you are paying $500 per teraflops of performance, you are actually paying $1,500 per teraflops. Anything you can do to make the interconnect between the GPU nodes work faster and get the GPUs chewing on data sooner gets you closer to the ideal bang for the buck that you thought you were paying for. A better network might even mean you can have a smaller cluster, and thus spend less money in the first place.
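To make that arithmetic explicit (treating the one-third utilization figure as the only source of waste, which is obviously a simplification):

\[
\text{effective \$/teraflops} \;=\; \frac{\text{nominal \$/teraflops}}{\text{utilization}} \;=\; \frac{\$500}{1/3} \;=\; \$1{,}500
\]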
In terms of scalability, BXI v1 topped out at 64,000 endpoints, compared with 11,664 for EDR InfiniBand (that is for a two-tier leaf/spine network). BXI v2 topped out at 64,000 endpoints, and BXI v3 does as well. You can always build bigger BXI or InfiniBand networks, but they require another layer of switching, which adds an extra hop between endpoints and increases latency. This is why the UEC members want a two-tier leaf/spine network that supports 1 million endpoints – they want a flatter, more deterministic, cheaper way to link lots of accelerators together. BXI v3 supports fat tree and dragonfly+ topologies as well as route-optimized fat tree, which is especially popular for training large language models.
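For a rough sense of where endpoint counts like these come from, the textbook capacity of a non-blocking fat tree built from switches of radix \(k\) is:

\[
N_{\text{two-tier}} \;=\; \frac{k^{2}}{2}, \qquad N_{\text{three-tier}} \;=\; \frac{k^{3}}{4}
\]

By that generic arithmetic, hitting the UEC’s 1 million endpoints in just two tiers at full port speed would take an effective radix of roughly \(\sqrt{2 \times 10^{6}} \approx 1{,}415\) ports per switch, which is why port splitting, oversubscription, and topologies such as dragonfly+ are part of the conversation. (These are textbook formulas, not vendor-published scaling tables.)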
The BXI v3 switch has 64 ports running at 800 Gb/sec, which can be split down into 128 ports running at 400 Gb/sec if you want to use fewer switches in the network. (But you sacrifice per-port bandwidth if you do that.) For now, that is probably the smart thing to do since the BXI v3 SmartNICs run at 400 Gb/sec. The BXI v3 SmartNICs come in PCI-Express 5.0 and OCP-3 form factors and take up two PCI-Express 5.0 slots on the hosts; they have two SmartNICs implemented side-by-side on a single board, yielding two 400 Gb/sec ports across two network interfaces.
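Note that splitting the ports does not change the aggregate switching capacity, only how it is carved up:

\[
64 \times 800~\text{Gb/sec} \;=\; 128 \times 400~\text{Gb/sec} \;=\; 51.2~\text{Tb/sec per switch}
\]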
BXI v3 will ship in 2025, and Eppe says that many of the features that will eventually be in the UEC specification (which is still not ratified) will be part of this Eviden networking stack.
In the fall of 2027, BXI v4 will come out, which will double up the speed of the ports on the NICs and switches to 1.6 Tb/sec. The NICs will plug into PCI-Express 6.0 ports and have CXL memory sharing support and also add more AI offload capabilities to the NIC. Eviden will work with its switch ASIC partner and its SmartNIC ASIC design team to try to push down latencies in these devices as much as possible, but the end to end latency is a much more important thing to worry about in the network.
And sometime in 2029, with BXI v5, the port speeds on the switch will double up to 3.2 Tb/sec and the NICs will move on to PCI-Express 7.0 slots while staying at 1.6 Tb/sec speeds on the ports. By this time, the AI stack will probably settle down quite a bit and Eviden hopes to be able to say that its BXI v5 SmartNIC has full HPC and AI offload support.
It looks like both Eviden and Cornelis Networks (see What If Omni-Path Morphs Into The Best Ultra Ethernet?) are going to give HPE Slingshot a run for the high-end Ethernet interconnect money, and all three are going to give Arista Networks, Cisco, and Marvell some healthy competition on the UEC front as well. The game’s afoot, as they say.
Cool to see some competition on the Ethernet side of HPC interconnects. The three exafloppers on the current Top500 are HPE Slingshot-11 machines, and, with JEDI and JETI, it seems that Jupiter will be a quad-rail InfiniBand job … BXI brings in much needed diversity and competition in this space IMHO! I hope Alice Recoque can benefit from the BXI v3 innovations, as a step up from the CEA-HE/HF motors that run BXI v2.
Meanwhile, BXI v4 (2027), with UEC, PCIe6 and CXL (hopefully 3.x+) should be the real game changer!
Could more networking standards be what the world of AI and HPC needs? Omnipath is out at 400 Gbps. Where is that next generation token ring network that messages directly with the ring bus connecting CPU cores?
Oh, I liked Token Ring. But IBM said it did not scale really well for large rings even back in the 16 Mb/sec days….
Both BXI and HPE’s Slingshot use Sandia’s Portals messaging library as the basis for their MPI implementations.
Does UEC define Portals layers, or is that vendor specific? One wonders if the Portals layer will be the same between Eviden and HPE, with only the MPI library differing?
Interesting question.
I guess, while the API (Portals) would be the same (up to possible version numbers), a differentiator would come in at the effectiveness of SmartNIC offload provided by the respective systems (including provisioning of data structure space for zero copy pipelining), and how much of the related gains in efficiency and performance eventually propagate over to UEC as well (possibly through evolved hardware, like BXI v5). I imagine that Cornelis/Libfabric faces similar opportunities and challenges for performance and flexibility in their HW/SW design optimizations to “minimize impedance mismatch” between apps and fabric — if I understand well …
I remember my time with my Bull colleagues very fondly, when my employer was still part of Atos.
And I also remember them being extremely proud of BXI, while my personal interest at the time was more in data plane processing (PIN?) on networks e.g. via P4 and PIM on RAM. BXI was all about better bang for the buck and my focus was on advancing architectures for value driven computing.
If I was a betting man, I’d say Marvell for both technical and cultural reasons.
When I read what Marvell is up to in trying to create a cheaper proprietary alternative to HBM for AI ASICs, that is very similar to what drove Bull to create BXI.