
Arista Banks On The AI Network Double Whammy

Current AI training and some AI inference clusters have two networks. The back-end network links banks of compute engines – usually GPU accelerators, and usually those from Nvidia – to each other so they can share work and data. The front-end network links the compute engines' CPU hosts to each other, to storage, and to the outside world.

InfiniBand has come to dominate the back-end network, and Ethernet tends to be used in the front-end network. Companies like Arista Networks and Cisco Systems want to get a piece of the back-end action while upgrading the front-end network, too.

In its recent financial report, Cisco said that it is seeing some uptake of its Silicon One ASICs in trials for back-end AI cluster networks, but it is also seeing companies upgrade their systems and their front-end networks as they get ready to deploy AI training and inference systems.

Arista is now seeing a similar, but somewhat different, pattern among its customers as it awaits an Ethernet boom in AI networking in 2025. Arista has been rejiggering its product line in anticipation of that boom for the past two years, and, like Cisco, it is a founding champion of the Ultra Ethernet Consortium, which has the explicit purpose of having Ethernet displace InfiniBand in AI clusters – and perhaps, as an unintended side effect, in traditional HPC simulation and modeling and in data analytics and storage clusters, too.

This being Ketchup Week at The Next Platform – we had lots of medical stuff in the past several weeks, including a vacation where we caught COVID – we circled back to Arista's most recent financial results to have a looksee. We also wanted to walk you briefly through the Etherlink family of switches that Arista has built to take on InfiniBand on its home AI turf. Arista is one of our Thundering Thirteen of public suppliers into the datacenter that we use as a proxy to gauge the financial health of this ecosystem.

First, let's talk about numbers, and then we will talk about the transitions underway at Arista customers as they move from trials with the Etherlink products late last year and earlier this year, to pilots more recently, and to expected production late this year and into next year.

In the second quarter, Arista's product revenues were up 12.8 percent to $1.42 billion, while its services business grew by 35.3 percent to $267.1 million. Software subscriptions rose 24.7 percent to $30.4 million. Add it up, and software and services – mostly tech support for hardware and software products – together comprised $297.5 million in sales, or 17.6 percent of revenues, up 34.2 percent from the year-ago period.
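As a quick sanity check on those figures, here is a back-of-the-envelope calculation in Python. It is a minimal sketch: the year-ago base is backed out from the growth rates stated above, not pulled from Arista's filings.

```python
# Sanity check on Arista's Q2 figures as stated above.
services_q2 = 267.1e6   # services revenue, up 35.3 percent
software_q2 = 30.4e6    # software subscriptions, up 24.7 percent
total_q2 = 1.69e9       # total revenues, up 15.9 percent

soft_and_svcs = services_q2 + software_q2
print(f"software + services: ${soft_and_svcs / 1e6:.1f} M")    # ~$297.5 M
print(f"share of revenues:   {soft_and_svcs / total_q2:.1%}")  # ~17.6%

# Implied year-ago base, derived from the 34.2 percent combined growth:
print(f"implied year-ago:    ${soft_and_svcs / 1.342 / 1e6:.1f} M")
```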

All told, Arista brought in $1.69 billion in sales, up 15.9 percent. Operating income rose by 32 percent to $699.6 million, and net income was up 35.3 percent to $665.4 million, which works out to a very impressive and better-than-usual 39.4 percent of revenues. Arista had tended to keep net income in the range of 30 percent of revenues until recent quarters, when it took off. All of that cash, after taxes of course, helped Arista boost its cash hoard to $6.27 billion, which is 1.6X the level of this time last year.

Arista has the coffers to take on Cisco as well as Nvidia's Spectrum-X Ethernet and Quantum InfiniBand in AI networking, and we fully expect there to be an aggressive fight here. As we previously reported, we think Nvidia has landed the back-end network – with its Spectrum-X Ethernet – for the 100,000-strong Nvidia H100 GPU cluster that Elon Musk's xAI startup is building in Memphis, Tennessee after xAI decided not to use rented GPU capacity on the Oracle Cloud Infrastructure cloud. Juniper Networks, which is in the process of being acquired by Hewlett Packard Enterprise, got the deal for the front-end Ethernet network for that xAI cluster. So Arista didn't get any piece of that action.

But there are other big fish in the AI waters. Arista has previously said that it has the back-end network on one of the pair of giant GPU clusters at Meta Platforms, each of which has 24,576 of Nvidia's H100 GPUs doing the matrix math of AI training runs. One cluster has 400 Gb/sec InfiniBand interconnects for the back end, and the other uses a mix of modular and fixed switches based on Broadcom's Jericho 2c+, Tomahawk 4, and Tomahawk 3 ASICs. Meta Platforms and Microsoft are expected to be greater than 10 percent customers in 2024 and 2025, according to Arista chief executive officer Jayshree Ullal, and we think both will be buying lots of Arista iron for AI clusters as well as for the Clos networks that connect their other servers together to run applications or that, in the case of Microsoft, are sold as cloudy slices on the Azure cloud.

“Let me just remind you of how we are approaching 2024, including Q4,” Ullal told Wall Street analysts on a call going over the Q2 financials. “Last year, trials. So small – it was not material. This year, we are definitely going into pilots. Some of the GPUs – and you’ve seen this in public blogs published by some of our customers – have already gone from tens of thousands to 24,000 and are heading towards 50,000 GPUs. Next year, I think there will be many of them heading into tens of thousands aiming for 100,000 GPUs. So I see next year as more promising. Some of them might happen this year. But I think we are very much in – going from trials to pilots, with trials being hundreds. And this year, we’re in the thousands. But I wouldn’t focus on Q4. I’d focus on the entire year and say, yes, we’ve gone into the thousands. So we expect to be single-digit small percentages of our total revenue in AI this year. But we are really, really expecting next year to be the $750 million a year or more.”

Arista won four out of five deals to do trials at unnamed hyperscalers and cloud builders, and is hoping to convert that to five out of five next year, according to Ullal.

In the meantime, Arista is seeing action among large enterprises and Tier 2 service providers that need to upgrade their networks. Those who have 100 Gb/sec Ethernet gear now are looking at 200 Gb/sec, 400 Gb/sec, and even 800 Gb/sec gear. Those who are on legacy 10 Gb/sec and 40 Gb/sec networks are looking at 100 Gb/sec or 200 Gb/sec.

Another interesting area of business for Arista is among enterprises that moved workloads to the clouds, are now "disillusioned with the public cloud," as Ullal put it, and want to build out new infrastructure to repatriate those workloads in their own datacenters or co-lo facilities, presumably to save money. And the Arista campus switch business just keeps growing, causing a certain amount of heartburn for rival Cisco.

But with AI – and a smattering of HPC – driving somewhere north of $3 billion a quarter in networking sales, with about 85 percent of that being InfiniBand, this is the real target that Arista is so eager to attack. And it wants to span from the kind of modest clusters we expect to be deployed at enterprises, with tens of GPUs, all the way up to those 100,000-strong GPU clusters – with its sights set on the 1 million endpoint design goal that the UEC set for itself when it launched a year ago.

Here is how Arista sees the AI landscape from a networking point of view. First, it contends that its Etherlink 7060X switches can deliver 64 ports running at 800 Gb/sec or 128 ports running at 400 Gb/sec, and these fixed-port devices are fine for interconnecting a few racks of compute engines, which can be GPUs or other kinds of motors that are often referred to generically as XPUs:

The most current 7060X switches are based on Broadcom Tomahawk 4 and Tomahawk 5 ASICs.
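Those port counts fall straight out of the ASICs' aggregate switching bandwidth. Here is a minimal sketch of that arithmetic, assuming the commonly cited 25.6 Tb/sec for Tomahawk 4 and 51.2 Tb/sec for Tomahawk 5:

```python
# Fixed-switch radix math: ports = ASIC switching bandwidth / port speed.
# Bandwidth figures are the commonly cited ones for these Broadcom ASICs.
ASIC_TBPS = {"Tomahawk 4": 25.6, "Tomahawk 5": 51.2}

for asic, tbps in ASIC_TBPS.items():
    for port_gbps in (400, 800):
        print(f"{asic}: {int(tbps * 1000) // port_gbps} x {port_gbps} Gb/sec")
# Tomahawk 5 works out to 64 x 800 Gb/sec or 128 x 400 Gb/sec ports,
# matching the top-end 7060X configurations named above.
```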

Now, for bigger clusters with hundreds of XPUs running AI training, Arista suggests a single modular 7800R4 switch, which has 576 ports running at 800 Gb/sec or 1,152 ports running at 400 Gb/sec, providing a single hop between the XPUs that doesn't require congestion control or adaptive routing on the GPU ports. (You may need those if you use the same switch for the front-end network.) Like this:

The graphic above does not show the proper number of XPUs, obviously. The current 7800R switches are based on the Jericho 2c+ and Jericho 3-AI switch chips from Broadcom.
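The same arithmetic holds for the modular chassis: both quoted 7800R4 configurations imply the same aggregate fabric capacity, and with one port per XPU the chassis radix caps single-hop scale. A quick check – the eight-XPUs-per-node figure is our illustrative assumption, not an Arista number:

```python
# Both 7800R4 port configurations imply the same aggregate capacity,
# and with one port per XPU the chassis radix caps single-hop scale.
PORTS_800G, PORTS_400G = 576, 1152
XPUS_PER_NODE = 8  # illustrative assumption: eight XPUs per node

print(PORTS_800G * 800 == PORTS_400G * 400)            # True
print(f"aggregate: {PORTS_800G * 800 / 1000} Tb/sec")  # 460.8 Tb/sec
print(f"max XPUs at 800G: {PORTS_800G} "
      f"({PORTS_800G // XPUS_PER_NODE} x {XPUS_PER_NODE}-XPU nodes)")
```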

If you want to scale further than this, Arista launched the 7700R Distributed Etherlink Switch in June of this year, which takes a leaf/spine network, collapses it down into a quasi-modular form factor, and allows it to be managed as a single device with a single hop. (We strongly suspect those "virtual" single hops have a couple of actual physical ones inside the DES box, as they do inside of modular switches.)

The Etherlink DES setup has leaves and spines, as you can see, and offers the scale of a leaf/spine architecture while being managed as a single switch. The 7700R is based on the Jericho 3-AI chip.

Arista says that a single-tier Etherlink topology can support over 10,000 XPUs, and that a two-tier topology can support over 100,000 XPUs in a single fabric. And when UEC-compatible DPUs are available in the future, Arista says that it will support them as they offload congestion control and adaptive routing functions from the switches and cluster host nodes to boost the performance of the network – just like Nvidia is already doing with Spectrum-X. The 7800R4-AI and 7700R4 DES switches have been in customer testing since their June announcement and are expected to be available sometime in the second half of this year. But, as Ullal says, expect the rollout for AI clusters next year.
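To see why adding a tier buys roughly an order of magnitude in scale, here is a sketch of textbook nonblocking leaf/spine arithmetic – generic Clos math with illustrative radices, not Arista's published Etherlink topologies:

```python
# Nonblocking two-tier leaf/spine built from R-port switches: each leaf
# splits its ports evenly between XPU downlinks and spine uplinks, and a
# spine can reach R leaves, so endpoints scale as R^2 / 2.
def one_tier(ports: int) -> int:
    return ports  # a single switch (or a DES managed as one hop)

def two_tier(ports: int) -> int:
    return ports * (ports // 2)  # R leaves x R/2 downlinks per leaf

for r in (64, 512, 576):
    print(f"radix {r:>3}: one tier = {one_tier(r):>7,} XPUs, "
          f"two tiers = {two_tier(r):>7,} XPUs")
# A radix in the mid-hundreds pushes a two-tier fabric past the
# 100,000-XPU mark that Arista cites.
```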

Don’t be surprised if Arista either buys a DPU maker or creates its own DPUs from scratch.

And in the meantime, Arista can work with potential AI customers on upgrading their front-end networks to prepare for AI, just as Cisco is doing.
