There was a time – and it doesn’t seem like that long ago – when the datacenter chip market was a big-money but relatively simple landscape: CPUs from Intel and AMD, with Arm looking to muscle its way in, and GPUs mostly from Nvidia, with some from AMD and Intel trying to do the same. And a slew of AI startups that didn’t really sell much into the datacenter.
That has changed drastically in recent years.
There’s still Intel, AMD, Nvidia, and Arm, but there also are a lot more choices when it comes to silicon. There is a massive – and growing – amount of data being generated and analyzed, and the more recent emergence of generative AI and large language models is giving rise to myriad chip startups looking to gain a foothold.
Then there are hyperscalers like Amazon Web Services, Microsoft with its upcoming Maia 100, and Google Cloud with its Tensor Processing Units, all making their own homegrown processors.
There is a plethora of silicon options in the market, and cloud infrastructure providers will play a significant role in how all this falls together. About 70 percent of AI workloads are in the cloud now, and that promises to grow as enterprises adopt the technology and expand their workloads.
AWS has its own Trainium (for training AI workloads, obviously) and Inferentia (for AI inference, obviously) chips – not to mention its Graviton CPUs and Nitro DPUs – all thanks to its 2015 acquisition of Israeli chip designer Annapurna Labs. AWS has a lot of Nvidia GPUs, too, which are cornerstones of AI compute. But the rise of AI – and most recently, the accelerating innovation and adoption of generative AI – is creating a fluid processor environment that the company and other cloud providers will have to navigate.
AWS is set for the moment with Nvidia GPUs, Trainium, and Inferentia, but how this plays out in the future is a wait-and-see game, according to Chetan Kapoor, director of Amazon EC2 product management.
“We’re in the very early stages of understanding how this might settle,” Kapoor tells The Next Platform. “What we do know is that, based on the rapid growth that you’re seeing in this space, there is a lot of headroom for us to continue to grow our footprint of Nvidia-based products and at the same time, we’re going to continue to grow our fleet of Trainium and Inferentia capacity. It’s just too early to call how that market is going to be. But it’s not a zero-sum game, the way we see it. Because of this exponential growth, there will continue to be phenomenal growth in our fleet of Nvidia GPUs, but at the same time, we’ll continue to find the opportunistic way to land Trainium and Inferentia for external and internal use.”
Like its competitors, AWS is all in on AI, both in what it builds internally and in what it invests in the market. Late last month, AWS invested another $2.75 billion in AI company Anthropic – bringing its total investment to $4 billion – a move that came weeks after the cloud provider said Anthropic’s Claude 3 family of models was running on its Amazon Bedrock managed AI service. It echoes Microsoft’s partnership with OpenAI (which includes more than $10 billion in investments) and Google’s with Anthropic (more than $2 billion invested).
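For a sense of what “running on Amazon Bedrock” looks like from the developer side, here is a minimal sketch of invoking a Claude 3 model through Bedrock’s runtime API with boto3. The region, model ID, and prompt are illustrative assumptions on our part, not details from AWS or Anthropic.

```python
# Illustrative sketch: calling a Claude 3 model hosted on Amazon Bedrock via
# boto3's bedrock-runtime client. Region, model ID, and prompt are assumptions.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Anthropic's messages-style request body as accepted by Bedrock.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Summarize the AWS Trainium value proposition."}
    ],
}

response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed model ID
    body=json.dumps(body),
)

# The response body is a stream; parse it and print the first text block.
result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```

The point of the managed service is exactly this: the model runs behind an API call, with no accelerator provisioning on the customer’s side.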
To run all this, AWS is sticking with what it has now with Nvidia and its own chips, but Kapoor, who essentially calls the shots for the EC2 hardware acceleration business, says the company is “going to continue to stay engaged with other providers in this space, and if other providers like Intel or AMD have a really compelling offering that we think can complement our Nvidia-based solutions, I’m more than happy to collaborate with them in that market.”
AWS doubled down on Nvidia at the recent GTC 2024 show, saying – as Microsoft Azure, Google Cloud, and Oracle Cloud Infrastructure did – that it is adopting the accelerator maker’s new Blackwell GPUs, including the massive GB200 Grace Blackwell superchip, which has two B200 GPUs attached to a single Grace CPU via a 600 GB/sec NVLink interconnect.
Whether other AI chips can muscle their way into the AWS environment is unclear. Companies like Groq, Mythic, and SambaNova Systems are putting together processors for AI workloads, but Kapoor says it’s about more than the accelerators themselves. OpenAI chief executive officer Sam Altman has floated the idea of the company designing its own AI training and inference chips to supplement a tight market in which demand for Nvidia GPUs is skyrocketing as organizations race to run AI workloads.
“It’s really hard to build chips,” he says. “It’s even harder to build servers, and manage and deploy a fleet of tens of thousands, if not hundreds of thousands, of these accelerators. But what is even more challenging is building a developer ecosystem that takes advantage of this capability. In our experience, it’s not just about silicon. Silicon is part of the offering. But then, how do we provision it as a compute platform? How do we manage and scale it? It matters, but what is paramount? How easy to use is that solution? What developer ecosystem is available around your offering? Basically, how quickly can customers get their job done?”
The accelerating adoption of generative AI doesn’t give organizations the luxury of spending months learning and using new hardware architectures. What they use needs to be a holistic architecture that is both easy to use and cost-effective.
“It has to have a developer community around it for it to have traction in the space,” Kapoor says. “If there’s a startup that is able to accomplish that feat, then great, they’ll be successful. But it’s important to really view it from that lens: it needs to be performant, needs to be cheap, it needs to be broadly available, and really easy to use, which is really, really hard for even large corporations to actually get right.”
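As a rough illustration of what that “easy to use” bar means in practice, below is a minimal sketch of compiling an off-the-shelf PyTorch model for AWS’s Inferentia2/Trainium silicon using the Neuron SDK’s torch-neuronx package. The model, input shape, and instance setup (an Inf2 or Trn1 instance with Neuron drivers installed) are assumptions for the example, not something Kapoor described.

```python
# Minimal sketch: compiling a PyTorch model for AWS Inferentia2 / Trainium
# with the Neuron SDK (torch-neuronx). Assumes an Inf2/Trn1 instance with the
# Neuron drivers and the torch-neuronx package already installed.
import torch
import torch_neuronx
from torchvision import models

# Any traceable model will do; ResNet-50 is just a placeholder here.
model = models.resnet50(weights=None).eval()
example = torch.rand(1, 3, 224, 224)

# Ahead-of-time compile the model for the NeuronCore accelerators.
neuron_model = torch_neuronx.trace(model, example)

# Save the compiled artifact and run inference like any other TorchScript module.
neuron_model.save("resnet50_neuron.pt")
output = neuron_model(example)
print(output.shape)
```

If a few lines like these are all it takes to move an existing PyTorch workload onto alternative silicon, the ecosystem argument Kapoor makes becomes much easier to win; if it takes a rewrite, it doesn’t.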
Organizations are under a lot of pressure to adopt AI to keep competitive against rivals. For companies, running those AI workloads typically comes down to performance vs. costs when considering the infrastructure they use.
“We’re going to see this trend where there’ll be some customers who are just focused on time-to-market, and they’re less focused on making sure they’re optimizing their spend,” he says. “They will tend to prefer a Nvidia-based solution because that gives them the ability to get to market as quickly as possible. On the other hand, we’re starting to see this trend already where some of these customers are going to look at this cost and say, ‘Well, I don’t have the budget to support this,’ and they’re going to look for alternative solutions that provide them the performance they’re looking for, but at the same time give them a way out to save 30 percent or 40 percent over the total cost it takes for them to train and deploy these models. That’s where some of these alternative solutions from us or from other silicon partners would come into play.”
That said, there will continue to be sustained demand for Nvidia products. Many of the new foundation models are being built on the vendor’s GPUs because the research and scientific communities have a lot of experience building and training AI models with Nvidia hardware and software, Kapoor says. Nvidia also will continue to push the edge of the raw performance a system can provide. The GPU maker is “really, really good at not only building silicon but these systems, and they’re also phenomenal at optimizing performance to make sure that their customers are getting the most out of these really, really expensive accelerators,” he says.
So hyperscalers are going to have to listen closely to what organizations are telling them, because while some 70 percent of AI workloads are in the cloud now, that share will grow in the coming years. The systems AWS and others run atop Nvidia’s A100 or H100 GPUs already are highly complex and operate at scale, and that will only increase with Blackwell, which calls for rack-integrated offerings with technologies like liquid cooling and even more density.
“There’s just a lot more complexity in what it takes to design, build, and actually deploy these kinds of machines, so we expect that customers that previously were okay with deploying systems on-prem will see a lot of challenges there,” Kapoor says. “They may not have liquid cooling infrastructure. They may not have rack positions that supply enough power, and they’re going to gravitate toward cloud because we’ll have done all this hard work for them and these resources will be just available via an API for them to consume and fire up. The same thing applies on the security side. Today we have a really, really strong posture when it comes to enabling our customers to feel confident that their IP – which is typically the parameters of the model, the weights and biases – is fully accessible to them.”
They soon also will have AI supercomputers to handle these AI and machine learning workloads. AWS is working with Nvidia on “Project Ceiba” to build such a system, which will now include Blackwell GPUs and NVLink Switch 4 interconnects, as we have outlined. In addition, Microsoft and OpenAI reportedly are planning the “Stargate” supercomputer – or, as we noted, possibly multiple datacenters that together make up a supercomputer.