Intel Brings A Big Fork To A Server CPU Knife Fight

With Intel’s foundry still trying to catch up with the process and packaging offered by archrival Taiwan Semiconductor Manufacturing Co, Intel’s server CPU product line has to “make do” with what the foundry has and create products that offer the right mix of performance and price to compete with CPU rival AMD in the X86 space and with the Arm collective that is creating a new CPU tier in the datacenter.

And so, Intel has decided to fork its product line into machines that use true Xeon cores – what are known as P-cores, short for performance cores – and gussied up Atom cores – what are known as E-cores, short for energy efficient cores. This is not so much a new bifurcation of the Intel Xeon product line as it is a hardening of the principles that Intel has had for more than a decade. (The “Knights Landing” Xeon Phi processor was a muscle car version of the Atom CPU, sporting the first implementation of the AVX-512 vector math unit pinned to a server core that would have barely been able to run your cell phone.)

We grew up in Appalachia, and we live back in the mountains again after living three decades in New York City, and we understand that under the right circumstances – or more accurately, the wrong ones – a fork can be just as dangerous as a knife. (Look at how well Aquaman does with a golden fork.) You have to sharpen a spoon on the stone wall for a long while, but you can make that dangerous, too. . . .

This time around, Intel is not creating a toy server CPU based on Atom-style cores, limiting main memory and I/O expansion, and hoping companies will buy lots of them and cram them into racks like canned goods for the winter. Rather, Intel is ganging up much larger numbers of Atom cores inside a real server socket, with real memory and I/O capacities, one that plugs into standard Xeon server platforms to offer excellent price/performance and thermals for high throughput workloads where a standard Xeon P-core with HyperThreading just does not do the trick.

In the long run – meaning over the next five years or so – the market will decide if having two radically different cores with nearly the same instruction set can compete against two more similar cores with different layouts and half the L3 cache per core. The latter is the strategy of AMD, which draws a subtler distinction between its standard Zen cores, like the Zen 4 cores used in the “Genoa” variants of the Epyc 9000 series, and the high core count “Bergamo” and low thermal “Siena” server CPUs that are based on the Zen 4c cores.

The thing to remember is that while AMD now has a 33 percent shipment share of X86 server CPUs, as Lisa Su pointed out in her keynote address at Computex 2024 yesterday, Intel still has the other 67 percent – and that is with its foundry arm tied behind its back, a self-imposed handicap. But it is wiggling out of the ropes.

Intel will get its foundry sorted by 2025 or so, and it has plenty of good architects who can deliver excellent CPU designs and maybe even a competitive GPU with its “Falcon Shores” effort. It is working on getting better yield on its packaging. Intel will compete, and life will get harder for AMD and the Arm-y. The two flavors of the Xeon 6 lineup – the initial “Sierra Forest” E-core chips that are starting to roll out at Computex and the initial “Granite Rapids” P-core chips that are coming out in the third quarter – are the first steps toward closing the server CPU gap for Intel. In a year and a half, this will be a real knife fight, and we expect that the market shares in the X86 space could be just about even. And it won’t be long before Arm has 20 percent share of overall server shipments and RISC-V starts to get a few adherents here and there.

This CPU fight in the datacenter is far from settled.

Two Targets, One-ish Architecture

Intel has been talking about this E-core and P-core strategy for a while, but it bears going into some of the central tenets before getting into the first batch of Sierra Forest chips that Intel is talking about. There will be others. Intel is not just doing a big bang launch of the entire product line all at once, and we have a suspicion that it is capacity constrained on the Intel 7 and Intel 3 processes that are used to create the Sierra Forest chips.

The chart above, which is an amalgam we made of two Intel charts, says that the P-core variant of the Xeon 6 is aimed at AI workloads, but it is also aimed at HPC simulation and modeling and indeed any kind of workload where a stronger core is a better option than a weaker one. AI is but one kind of compute intensive workload, and admittedly it might be the most interesting one for enterprises that are considering using pretrained generative AI models and retraining them with their own data to run AI workloads on premises in their CPU fleets.

With the E-core chips not having AVX-512 vector units or AMX matrix math units, they cannot really do much in the way of AI or HPC processing. They are really designed for application, print, file, and web serving, and in some cases the E-core variants might be good at other kinds of microservices applications where chunks of code are fairly modest. Video streaming, media transcoding, and other kinds of data streaming are ideal for the E-core machines, says Intel.

With both the E-core and P-core designs, the memory and I/O controllers as well as the UltraPath Interconnect (UPI) links for NUMA shared memory clustering of CPUs are separated out from the cores, which reside on one, two, or three banks of chiplets. The “Sapphire Rapids” Xeon SP v4 launched in January 2023 had everything on each chiplet and integrated four of them to make a socket. With the “Emerald Rapids” Xeon SP v5 launched in December 2023, Intel backed down to two chiplets with slightly more aggregate cores, but all of the controllers were still on the same chiplets as the cores. There were also single chiplet, monolithic implementations of the Sapphire Rapids and Emerald Rapids chips for low and middle core count devices.

The core complexes on the Sierra Forest Xeon 6 processors are etched with the 7 nanometer Intel 3 process and the I/O and memory chiplets are etched with a further refined 10 nanometer Intel 7 process similar to the one used for Sapphire Rapids and Emerald Rapids.

The Xeon 6 processors will come in two package families, dubbed the 6700 and the 6900, which will be further differentiated by the use of E-core and P-core tiles. There is no Xeon 6 that will mix E-core and P-core chiplets in the same package, but presumably if someone wanted such a beast, Intel would build it.

Here are the specs on the 6700 series and the 6900 series:

In essence, the 6700 series socket is a “virtual” low core count (LCC), high core count (HCC), or extreme core count (XCC) chip stitched together with EMIB packaging. There doesn’t seem to be a middle core count (MCC) variant.

Here is what the Xeon 6 6700 series die packages look like:

And here is what the 6900 series die packages look like:

The rollout of the Xeon 6 family of server CPUs is going to be staggered, and this, according to Intel, is based on feedback from customers. The lower end Sierra Forest E-core chips come out first, followed by the Granite Rapids higher end P-core chips in the third quarter:

In the first quarter of next year, the fatter Sierra Forest chips with up to 288 cores come out, and so will lower-bin Granite Rapids chips in the 6300, 6500, and 6700 series. There is also going to be an SoC variant of the Granite Rapids chip, most likely for edge use cases where beefy cores and vector and matrix math units are used for AI inference processing.

There has never been a beefy Atom machine from Intel before, so it is difficult to make comparisons to the current Xeon SPs and the future beefy-cored Xeon 6 machines. In its presentations, Intel compares the Sierra Forest Xeon 6 6700 chips to the second generation Xeon SP processors, which most people know by their codename “Cascade Lake,” processors that were launched in April 2019. Based on Intel’s benchmarks and our own analysis, we concur that the instructions per clock of the Atom-based E-core is about the same for integer work as the Cascade Lake Xeon SPs. If you do the math, an E-core in Sierra Forest has about 65 percent of the performance of an Emerald Rapids P-core. It all matches.

We will be doing a deeper architectural dive on the Xeon 6 6700E family, but in the meantime, here is the fairly modest SKU stack, which only has seven variants:

In Q1 2025, Intel will double up the performance of the Sierra Forest chips with two compute tiles and two I/O and memory controller tiles to create the Xeon 6 6900E, which is known as a ZCC package and which will have up to 288 cores.

Obviously, if you pay for your software on a per-core basis, the E-core variants might be a tough sell. But if you write your own microservices software or pay by the socket, then software pricing is not an issue and an E-core Xeon 6 might be the answer in terms of lowering thermals and costs while getting acceptable throughput.
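To put rough numbers on the licensing point, here is a back-of-the-envelope sketch in C. The $500 per core license fee is a hypothetical figure we made up for illustration, not a quote from any software vendor; the core counts come from the SKU stacks discussed here.

```c
// Hypothetical per-core licensing math: a made-up $500/core fee applied
// to a 64-core P-core socket versus a 144-core E-core socket.
#include <stdio.h>

int main(void) {
    const double fee_per_core = 500.0;  // hypothetical license fee
    const int p_cores = 64;             // top-bin Emerald Rapids P-core socket
    const int e_cores = 144;            // top-bin Sierra Forest 6700E socket

    printf("P-core socket license: $%.0f\n", p_cores * fee_per_core);
    printf("E-core socket license: $%.0f\n", e_cores * fee_per_core);
    printf("E-core licensing premium: $%.0f per socket\n",
           (e_cores - p_cores) * fee_per_core);
    return 0;
}
```

At those made-up rates, the extra 80 cores on the E-core socket add $40,000 per socket in license fees, which is why homegrown or per-socket-licensed software is where the E-core machines shine.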

Here is our usual performance comparison and pricing chart, which provides a raw performance metric relative to the four-core “Nehalem” Xeon E5500 from March 2009. These performance metrics take into account cores, clocks, and IPC across generations.

The “performance general purpose” high-bin parts of the Emerald Rapids Xeon SP v5 processors range from 8 to 64 cores and from 16 to 128 threads, and they have a relative performance that ranges from 5.85 to 27.78 by our methodology. Prices range from $1,099 to $11,600 in 1,000-unit tray quantities from Intel. The Sierra Forest chips do not have HyperThreading, and range from 64 to 144 cores (which means you only have from 64 to 144 threads). Prices range from $2,749 to $11,350, but relative performance ranges from 22.89 to 47.20, and that means the bang for the buck is anywhere from 19 percent to 43 percent better. For a given wattage, the performance is twice as high, or for a given performance, the wattage is half as much. Speaking very generally, of course.

The comparison with Cascade Lake Xeon SP v2 server CPUs is fun. The top bin Cascade Lake from 2019 had 56 P-cores and 112 threads running at 2.6 GHz and delivered 21.69 units of oomph at a cost of more than $946 per unit of performance. The low-end Sierra Forest CPU from 2024 has 64 E-cores running at 2.4 GHz and a relative performance of 22.89, but the cost per unit of performance is just a smidgen over $120. That is a factor of 7.9X improvement in price/performance over the past five years. That top bin Cascade Lake part consumed 400 watts, nearly twice that of the 205 watt low-bin Xeon 6 6710E processor in the Sierra Forest lineup.

The top bin Sierra Forest 6700E part does more than twice the work of the low-bin part, and the cost per unit of performance is double as well, so the gap with the Cascade Lake top bin part is half as great. But even 3.95X is pretty good.
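If you want to check our arithmetic, here is the same math in a few lines of C, using the prices and relative performance figures quoted above; small differences from the 7.9X and 3.95X in the text come down to rounding.

```c
// Sanity check of the price/performance comparison above, using the
// dollar-per-unit-of-performance figures quoted in the text.
#include <stdio.h>

int main(void) {
    const double cascade = 946.0;               // top-bin Cascade Lake, $/perf, 2019
    const double sierra_low  = 2749.0 / 22.89;  // Xeon 6 6710E, $/perf
    const double sierra_high = 11350.0 / 47.20; // top-bin 6700E, $/perf

    printf("Sierra Forest low bin:  $%.2f per unit of performance\n", sierra_low);
    printf("Sierra Forest top bin:  $%.2f per unit of performance\n", sierra_high);
    printf("Improvement over Cascade Lake: %.2fX (low bin), %.2fX (top bin)\n",
           cascade / sierra_low, cascade / sierra_high);
    return 0;
}
```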

Up next, we will do a deeper architecture dive on Sierra Forest and what we can surmise about Granite Rapids.


15 Comments

  1. What happens when a binary compiled with AVX512 is launched on a system without? I guess not much.

    What has characterised x86 for me is backward compatibility, in which new CPUs can run code written for old CPUs. The Sierra Forest processors break with this tradition.

    The i486SX was an i486DX without the built-in FPU. Though integer performance was the same and the FPU could be emulated, my recollection is the SX was not a big seller.

    Since AI inference is getting deployed in small amounts everywhere, workloads are changing. Maybe more vector and matrix math units are desired rather than fewer.

    Said another way, the high core count AMD processors seem much more marketable to me.

    • Throws an illegal instruction exception (SIGILL) and the process is killed. At least that’s how it works on Linux.
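      A minimal sketch, assuming GCC or Clang on x86 Linux, of the runtime check that avoids that SIGILL; the builtin shown is real, while the dispatch around it is just an illustration:

```c
// Check for AVX-512 Foundation support before taking a code path
// compiled with AVX-512 instructions; without this check, the first
// 512-bit instruction raises SIGILL and the process dies on Linux.
#include <stdio.h>

int main(void) {
    // __builtin_cpu_supports() reads CPUID under the hood (GCC/Clang).
    // "avx512f" is the foundation subset every AVX-512 code path needs.
    if (__builtin_cpu_supports("avx512f")) {
        printf("AVX-512F present: safe to use the 512-bit path\n");
    } else {
        printf("No AVX-512F (e.g., a Sierra Forest E-core): "
               "use the AVX2 or scalar fallback\n");
    }
    return 0;
}
```

      This is the same runtime dispatch idea that glibc and most math libraries use to ship one binary that runs on both E-core and P-core machines.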

  2. Your comment Eric made me hearken back to an article I read back in the 80s about the so-called 5th Gen supercomputers that were coming out and the war for dominance in that domain between the US and Japan (funny how Japan has been replaced with China now). And the argument was pretty much the same, with proponents of vector machines like Cray sparring with proponents of massive parallelism. With today’s CPUs and even GPUs, although you have both massive parallelism and vector units working side by side in the same die and across multiple dies, the debate still rages about whether we buy hardware and write primarily for parallelism or take advantage of the vector units built in. Hopefully protocols like CXL, UXL, and UALink along with improved compilers and languages will make that choice a little less delineated and the grunt work involved a little less onerous.

  3. How many cores is too many? At what point does managing the numa memory hierarchy and cache coherency kill the advantage of adding more cores? I know SGI and Cray have put together numa machines with a couple of thousand processors, but those ran very specific, highly-tuned software. Can you put a thousand cores on a system, install linux and kubernetes, and turn the thing loose on a bunch of container software? At some point are we going to have so many cores on a socket that we have to run multiple hard partitions within the socket? If so, why bother putting everything in a socket?

    • At some point, it is just going to be a rack in a socket, a row in a chassis, and a datacenter in a rack. HA!

    • Just saw your question after asking similar one below. +1, LOL. I have the same issues and have wrestled with them in datacenter and cloud configurations. How do you go about loading these up with VMs and maintaining flexibility? Is Windows Server making major changes in operation and licensing, to fit? Or will it be left to boggled devops folks trying to make sense of it all after the fact? I respect Intel’s expertise in processors in the abstract but their ability to match that to reality has not always been great.

  4. AMD continues Brickland octa x# of processor(s) multiprocessing system across fabric as interconnect in MCM package. This is the traditional compute framework CS and IT are accustomed and that’s why it’s easy in terms of application extensions; adaptation. Intel attempts to disrupt reconfiguring what is a processing platform with innovations supporting new said wanted application utilities that require adoptive that may also be adaptive software development and compiling tools. mb

  5. Presumably Intel stock will rise as it gets its fabs working properly. Or, you know, spends the $25B from the taxpayer on stock buybacks. So I can make a buck that way.

    These E-cores though… the lying, uh, pardon me, the ‘marketing’ never stops. They are really only efficient at packing cores into small areas. They’re not particularly efficient at doing work per watt. They are efficient at being able to say a CPU has lots of cores, let’s put it that way. AMD’s full-fat-with-half-the-cache ‘C’ cores still make more sense. My $0.02 anyway.

  6. Sierra is a joke … just look at Bergamo perf/W numbers in a 1S configuration and compare this to Sierra. Not to mention that Sierra will compete with Turin, which obliterates whatever Intel will have in the next 2 years.

  7. Why do some people tend to associate Intel with the CHIPS Act? TSMC, Samsung, and Micron are each receiving almost as much taxpayer money as Intel.

    And Intel hasn’t made any share repurchases since Q1 2021.

    Regarding the E-cores, it’s odd to imply that Intel can “lie” to hyperscalers. I don’t know which makes more sense, Intel’s full dedication to the market segment with optimized cores or AMD’s less-expensive rejiggering of their single core type, but the hyperscalers are going to make their decisions on internal testing, not on anyone’s marketing. The usefulness for any particular enterprise may come down to the licensing terms available for the software applications they use.

    • No one mentioned the CHIPS Act, but Intel is clearly a big beneficiary. No one mentioned share repurchases, either, as far as I can see, but I am a little tired. And while AMD might have the upper hand, it can only make so many processors, and Intel will sell the remainder. AMD’s job is to make what it can sell and only make as many as it thinks it can sell. If AMD thought it could sell 100 percent of the X86 server market, it would do so. It does not. And hence, even it believes that Intel will continue to win sales.

      I, too, find it perplexing based on raw feeds and speeds. But that is what is happening.

  8. Very nice article, first time I’ve seen an attempt at a systematic explanation of wtf Intel is even up to! Several things I don’t understand, perhaps first is just what *is* Microsoft’s licensing now, for clients and servers? They’d gone to pure per-core licensing and that was stupid; it was counter to what I wanted to do with them on SQL Server, which is to use a high core count only sporadically. Another question is, doesn’t this cause “noisy neighbor” problems if you try to provision cloud servers with high core counts? I have a million questions about NUMA I can’t even begin to explain. Most of all, what specific loads are likely to be better on E-cores? You took a swing at that, but I’d like more inside dope explaining why Intel took these two specific design paths. Or was it just a decision over lunch, with independent teams assigned by Gelsinger on architectures, fab nodes, and market segments? Perhaps some further light on this would come from hearing how AMD’s similar fork has been doing in the market, who buys what and why. Thanks.

    • In my experience memory bandwidth and shared L3 cache make conventional multi-core CPUs less than ideal for multi-tenant use in cloud infrastructure.

      From what I understand, the idea behind a “cloud native” processor such as Sierra Forest, Altra Max, and Graviton is that the weaker cores tend not to interfere with each other as much. Since AVX-512 can be particularly demanding on shared resources, it’s clear why Intel avoided AVX-512.

      With Bergamo, reducing the per-core L3 cache might actually increase the noisy-neighbour effect, at least on average. On the other hand, creating VMs aligned with the compute dies should avoid the noisy neighbours. This suggests 8 cores with 16 threads is the smallest size for a good VM on Bergamo.

      It would be interesting if a reliable source would measure how much adjacent VMs interfere with each other on different architectures when running various kinds of workloads.
