The relentless need for bandwidth is probably something that all of us are well aware of these days in our home lives thanks to the coronavirus outbreak. Now we can all appreciate at a fundamental level what it feels like in the datacenter most days, and why Ethernet switch ASIC makers are all trying to push the bandwidth envelope.
Mellanox Technologies, which is still independent from Nvidia and which has gotten a certain amount of traction among hyperscalers and cloud builders for its Spectrum line of Ethernet switch ASICs and switches, has recently unveiled its third generation of chips and is pushing the bandwidth envelope like its rivals Broadcom, Intel/Barefoot Networks, and Innovium, which all play in the upper echelon of the Ethernet switch space.
The pace of Ethernet ASICs coming out of Mellanox has been pretty steady. The company got into the Ethernet market back in April 2011 with the SwitchX chip, a 1.4 billion transistor monster implemented in 40 nanometer processes from foundry partner Taiwan Semiconductor Manufacturing Corp. The SwitchX ASIC delivered Ethernet ports running at 10 Gb/sec and 40 Gb/sec and InfiniBand ports running at 56 Gb/sec, topping out at 4 Tb/sec of aggregate switching bandwidth with port-to-port hop latencies on the order of 170 nanoseconds to 220 nanoseconds depending on the protocol and workload. The SwitchX-2 follow-ons from October 2012 had the same basic feeds and speeds, but added some software-defined networking (SDN) functionality to the chip and enabled Fibre Channel capability. While the converged SwitchX approach was good in terms of cutting down the number of chips Mellanox had to tape out, it had the effect of raising the latencies for InfiniBand customers, and hence the company reforked the switch ASIC lines with the launch of the initial Spectrum-1 chips in June 2015.
The company came out swinging with those initial Spectrum-1 chips, which were etched in 28 nanometer processes from TSMC and which delivered 3.2 Tb/sec of aggregate non-blocking Ethernet switching bandwidth. The Spectrum-1 chips could implement 64 ports running at 50 Gb/sec (negotiable down to 10 Gb/sec, 20 Gb/sec, 25 Gb/sec, and 40 Gb/sec) or 32 ports running at 100 Gb/sec (negotiable down to 40 Gb/sec and 56 Gb/sec). The Spectrum-1 had 128 SerDes using non-return-to-zero (NRZ) encoding running at 25 GHz (with 3 GHz of encoding bandwidth overhead on that which the networks cannot use). It could handle 4.76 billion packets per second with a port-to-port hop of just under 300 nanoseconds, and it did it all in a 135 watt power envelope.
Mellanox launched the Spectrum-2 chips in July 2017, shrinking down to 16 nanometer processes at TSMC for the chip etching and moving to PAM4 encoding on the 128 SerDes on the chip, which gets two bits per signal instead of one, doubling the effective bandwidth per lane up to 50 GHz. So it took half as many lanes to get up to 50 Gb/sec or 100 Gb/sec ports, and 200 Gb/sec and 400 Gb/sec ports could also be supported. With the same number of SerDes on the device, the Spectrum-2 chip topped out at 6.4 Tb/sec of aggregate switching bandwidth, double that of the Spectrum-1, and could be carved up into 16 ports running at 400 Gb/sec, 32 ports running at 200 Gb/sec, 64 ports running at 100 Gb/sec, or 128 ports running at 50 Gb/sec native, with negotiation down to lower speeds where appropriate. The Spectrum-2 ASIC had a port-to-port latency of 300 nanoseconds to 400 nanoseconds, depending on where you look in the Mellanox specs.
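The jump from one generation to the next is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes a symbol rate of 25 GBd per lane for both chips (the 25 GHz figure cited for the Spectrum-1), ignores encoding overhead, and uses a function name of our own invention.

```python
import math

def aggregate_tbps(serdes: int, gbaud: float, levels: int) -> float:
    """Aggregate bandwidth in Tb/sec: lanes x symbol rate x bits per symbol."""
    bits_per_symbol = math.log2(levels)  # NRZ has 2 signal levels, PAM4 has 4
    return serdes * gbaud * bits_per_symbol / 1000.0

# Assumed 25 GBd per lane on both generations; only the modulation changes.
spectrum1 = aggregate_tbps(serdes=128, gbaud=25, levels=2)  # NRZ
spectrum2 = aggregate_tbps(serdes=128, gbaud=25, levels=4)  # PAM4

print(f"Spectrum-1: {spectrum1} Tb/sec, Spectrum-2: {spectrum2} Tb/sec")
```

With the lane count held at 128, moving from two signal levels to four is what takes the aggregate from 3.2 Tb/sec to 6.4 Tb/sec.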
Interestingly, the SerDes in the Spectrum-2 chip could all act independently to run at lower speeds, or be ganged up in groups of 2, 4, or 8 so the ASIC could support a mix of port speeds at the same time, as switch makers needed for their uplinks and downlinks. The Spectrum-2 chip could support 9.52 billion packets per second and had an extended access control list (ACL) that was implemented with an FPGA hooked to DRAM on the switch motherboard, giving it 2.5 million routes.
Because of all the extra components, this Spectrum-2 chip came in at 220 watts – still less than the 295 watts of the Broadcom Tomahawk-2 that it competed against.
With the Spectrum-3, Mellanox is sticking with 16 nanometer processes from TSMC, but it is moving from monolithic designs to chiplet designs, just as Intel’s Barefoot Networks division is doing with its Tofino-2 switch ASIC. Specifically, the Spectrum-3 design has a large digital switch surrounded by eight analog SerDes blocks that interface with the ports and therefore the outside world. Mellanox has an ultra-short, high bandwidth interconnect that links the SerDes to the digital switch. The chiplet approach allows Mellanox to design the digital and analog components separately and to push the aggregate switch ASIC beyond the reticle limits of a monolithic design to scale up ASIC bandwidth and functions.
“It is not necessarily immediately obvious why the decoupling between the digital and analog chips is so important,” Kevin Deierling, vice president of marketing at Mellanox, tells The Next Platform. “But when developing an ASIC for a new process, getting the analog SerDes ported is usually the gating item to volume shipments. Because these SerDes chips are mostly analog, the transistors and other devices don’t really scale as well as digital circuits to take advantage of the new process to shrink the die. So a monolithic die results in slower time to market without any benefit from the analog circuit shrinking. De-coupling these means you can shrink the digital piece and still use SerDes chiplets that are fully validated on a previous process node. Further, it means that you can independently develop different I/O technology – 25G NRZ, 50G PAM4, 100G PAM8, or even optical interfaces – that connect to the same core ASIC device. This of course allows a larger number of device proliferations and reduces tape out costs by using older process nodes for the analog bits, which don’t benefit as much from being on the bleeding edge node.”
At the moment, both the digital and analog portions of the Spectrum-3 chip complex are etched using 16 nanometer processes, but in the future it is reasonable to guess that the SerDes will stay at 16 nanometers or maybe shrink to 12 nanometers while the digital switch shrinks to 7 nanometers and then 5 nanometers, and that the number of SerDes chiplets will go up to add more ports or higher speeds or both.
The Spectrum-3 switch has 12.8 Tb/sec of aggregate switching bandwidth across 256 SerDes that run at 50 GHz with PAM4 encoding. Mellanox has not released packet throughput statistics on the Spectrum-3, but it is probably on the order of 19 billion packets per second if history is any guide. All Deierling could say at this time was that it would be about 70 percent more packet throughput than the Ethernet competition will be able to deliver.
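The packet rates Mellanox has published for the earlier chips, 4.76 billion and 9.52 billion packets per second, line up with what you get by dividing aggregate bandwidth by the on-the-wire size of a minimum Ethernet frame. The sketch below works through that arithmetic. Treating the published numbers as minimum-frame line rate is our assumption, not something Mellanox has confirmed, and the Spectrum-3 result is an extrapolation on the same basis.

```python
PREAMBLE_SFD_BYTES = 8     # 7-byte preamble plus 1-byte start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # minimum idle time between frames, counted in bytes
MIN_FRAME_BYTES = 64       # smallest legal Ethernet frame

def line_rate_pps(bandwidth_tbps: float, frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Packets per second needed to saturate the given bandwidth."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return bandwidth_tbps * 1e12 / wire_bits

for name, tbps in [("Spectrum-1", 3.2), ("Spectrum-2", 6.4), ("Spectrum-3", 12.8)]:
    print(f"{name}: {line_rate_pps(tbps) / 1e9:.2f} billion packets/sec")
```

Each minimum-size frame occupies 84 bytes, or 672 bits, on the wire, so 12.8 Tb/sec works out to just over 19 billion packets per second.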
As with the Spectrum-2, the SerDes can be set independently to drive port speeds, or ganged up in groups of 2, 4, or 8 to create Ethernet pipes of specific bandwidths. It natively supports 128 ports running at 100 Gb/sec, 64 ports running at 200 Gb/sec, and 32 ports running at 400 Gb/sec.
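To see how ganging lanes turns into port counts, here is a small Python sketch. The function names and the example uplink/downlink mix are hypothetical, and the model only counts lanes; real switch designs have other constraints that are not captured here.

```python
def port_options(serdes: int, lane_gbps: int, gangs=(1, 2, 4, 8)) -> dict:
    """Map port speed (Gb/sec) to port count when every port uses the same gang size."""
    return {lane_gbps * g: serdes // g for g in gangs}

def fits(mix: dict, serdes: int, lane_gbps: int) -> bool:
    """Check whether a mix of {port speed: port count} fits in the available lanes.

    Assumes every port speed is a whole multiple of the lane speed.
    """
    lanes_needed = sum(count * (speed // lane_gbps) for speed, count in mix.items())
    return lanes_needed <= serdes

spectrum2 = port_options(serdes=128, lane_gbps=50)
# Only gangs of 2, 4, and 8 are used here, matching the Spectrum-3 port speeds listed above.
spectrum3 = port_options(serdes=256, lane_gbps=50, gangs=(2, 4, 8))

print(spectrum2)
print(spectrum3)

# Hypothetical mix: 96 downlinks at 100 Gb/sec plus 8 uplinks at 400 Gb/sec.
print(fits({100: 96, 400: 8}, serdes=256, lane_gbps=50))
```

The hypothetical mix uses 192 lanes for the downlinks and 64 for the uplinks, which exactly consumes the 256 SerDes on the Spectrum-3.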
The Spectrum-3 chip has 10X the VXLAN network virtualization scale of the Spectrum-2 chip, and has a 64 MB fully-shared, monolithic packet buffer, a 52 percent increase over the Spectrum-2 chip. This doubling of bandwidth and throughput comes at a price, of course, and so does the chiplet architecture in terms of port-to-port hop latency. The power consumption of the Spectrum-3 chip complex, running full out, is 375 watts. The port latency is sub-500 nanoseconds, which is quite a bit higher than the 300 nanoseconds to 400 nanoseconds with the Spectrum-2.
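One way to put that power figure in context is to normalize it by bandwidth across the three generations. The sketch below does that with the wattages and bandwidths cited in this story. It is a rough comparison only: the Spectrum-2 figure includes extra components beyond the ASIC itself, and the Spectrum-3 figure is for the chip complex running full out.

```python
# (watts, aggregate Tb/sec) as cited in this story
chips = {
    "Spectrum-1": (135, 3.2),
    "Spectrum-2": (220, 6.4),
    "Spectrum-3": (375, 12.8),
}

watts_per_tbps = {name: watts / tbps for name, (watts, tbps) in chips.items()}

for name, value in watts_per_tbps.items():
    print(f"{name}: {value:.1f} watts per Tb/sec")
```

On these numbers, the watts burned per Tb/sec of switching falls from roughly 42 to 34 to 29 across the three generations, even as the absolute power draw climbs.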
But as we have discussed before, consistent, predictable latency is more important than low latency that is inconsistent and that holds up only in ideal conditions and not under load.
As always, Mellanox is not just selling Spectrum-3 switch ASICs to those who build their own switches, but is also selling whole switches for those who don’t want to go through that hassle. There are three fixed port switches and one modular switch that will be available from Mellanox, as shown below:
Here is how Mellanox stacks up its SN4800 modular switch against two modular switches based on the Tomahawk-3 and Jericho2 ASICs from Broadcom:
At the moment, in conjunction with the ConnectX-6 adapters, LinkX transceivers and cables, and the Spectrum-3 switches, Mellanox can deliver end-to-end 200 Gb/sec connectivity, from server port to switch and back again. The 200 Gb/sec wave is beginning, and it won’t be long before people start thinking about putting 400 Gb/sec switching into more than the backbones of the networks. Especially if we all try to watch Netflix at the same time.