While a lot of the applications in the world run on clusters of systems with a relatively modest amount of compute and memory compared to NUMA shared memory systems, big iron persists and large enterprises want to buy it. That is why IBM, Fujitsu, Oracle, Hewlett Packard Enterprise, Inspur, NEC, Unisys, and a few others are still in the big iron racket.
Fujitsu and its reseller partner – server maker, database giant, and application powerhouse Oracle – have made a big splash at the high end of the systems space with a very high performance processor, the Sparc64-XII, and a new line of Sparc M12 systems that employ it. Fujitsu is not just aiming the Sparc64-XII at those large enterprises that still want Unix-based systems for their mission critical applications, but also at its existing customers, and maybe every once in a while new customers, who are also looking at running deep learning, traditional HPC, and analytics workloads against the data stored in these big iron systems in a native fashion.
In The Moore’s Law Slow Lane, By Design
In many ways, the remaining chip makers who create the motors inside NUMA-style big iron are lucky. Their customers, who run their most critical transaction processing and database systems on this big iron, are extremely risk averse and therefore they are loath to change platforms unless absolutely necessary. Such decisions to change platforms are usually made for political rather than technical or economic reasons, and they are few and far between each year. These customers also have predictable workloads, which makes capacity planning easier; they tend to see transactions rise in concert with economic growth and as they add a few applications here and there to their vast portfolios of software, which can number hundreds to thousands of distinct applications.
This all makes it fairly easy for companies like Fujitsu to plan out their processor roadmaps, and frankly, to stretch out the time between processor generations. Back in the day, big iron systems had processor upgrades every two years or so, and then it stretched to three years, and now it is even longer. With so much memory and processing available in prior generations of machines, Fujitsu did not feel like it had to rush through its roadmap. On the commercial side of its Sparc64 server lineup, the four-core Sparc64-VII processors etched with 65 nanometer processes debuted in July 2008 and its kicker, the Sparc64-VII+, came out in December 2010 with some clock speed enhancements (up to 3 GHz) and larger L2 caches to boost performance; it was sold by Oracle as the Sparc M3 processor, by the way. There was no commercial-grade Sparc64-IX processor, but the sixteen-core Sparc64-X, implemented in 28 nanometer processes, made its debut in September 2012 as a kind of unified architecture for the HPC variants (the Sparc64-VIII and Sparc64-IXfx, which did not have on-chip NUMA interconnects and which had other HPC-specific instructions). The “Athena” systems that Fujitsu created with these chips were sold by itself and by Oracle. The sixteen-core Sparc64-X+ followed in August 2013 with the Athena+ systems and a clock speed burst to 3.5 GHz on that same 28 nanometer process, plus some other tweaks. There has not been so much as a peep out of Fujitsu since then, except a move to the ARM architecture and back towards custom HPC processors for the Post-K exascale machine.
You might note that there was no Sparc64-XI processor, just a Sparc64-X+ chip three and a half years ago, and if you want to be really honest about it, it has been four and a half years between process nodes for Fujitsu with regard to the commercial Sparc64 processors. That is a long time, but still better than the ill-fated “Kittson” Itanium 9700 processors, which were supposed to be implemented in 22 nanometers in a converged Itanium-Xeon E7 socket but four years ago Intel put the kibosh on that. The last Itanium chip, the eight-core “Poulson” Itanium 9500, was unveiled in August 2011, so it has been more than five years since there was an upgrade shipping in systems. Even IBM’s Power8 came to market after a longer than usual run for the Power7 and Power7+, which together spanned more than four years from 2010 through 2014, and the gap between Power8 and Power9 is going to be more than three years. (To its credit, the Power processor jumps have been fairly regular since the dual-core Power4 was launched way back in 2001.)
Alex Lam, vice president of Enterprise Business & Strategy at Fujitsu in North America, knows that the Japanese server maker has been quiet about its commercial Sparc64 chips in recent years, enough to make companies wonder if perhaps it might switch to Oracle M series and S series chips for its Sparc/Solaris platforms and call it a day. But that is not the case, and the jump to ARM for the PrimeHPC platform has not diminished the company’s commitment to Sparc64 for enterprises.
“The PrimeHPC supercomputers are a flagship product and obviously have a halo effect,” Lam tells The Next Platform. “But ultimately, if you look at where the real bread and butter is for the Sparc64 systems, it was on the commercial side. There is definitely a market and a valid use case for these kinds of systems.”
Oracle thinks so, too, and as far as we know is continuing development on its Sparc T8 and M8 processors for midrange and high-end systems, respectively, and is additionally reselling the Sparc64-XII in the new M12 systems under its own brand. Under the arrangement between Oracle and Fujitsu, Oracle peddles its own boxes all over the world, Fujitsu sells its own inside of Japan, and Fujitsu sells in North America and Europe in a way that minimizes bumping heads with Oracle; the two collaborate on Solaris Unix operating system development on the Sparc64 platforms. “There are clear swim lanes on where we sell the products,” says Lam. “It is a very healthy collaborative approach that promotes the Sparc platform in a way that is beneficial to both companies.”
Inside The New Sparc64-XII Processor
Fujitsu is changing up its strategy for big iron systems with the Sparc64-XII and the M12 systems that use them, and the choices it is making are interesting in that they are going in exactly the opposite direction of what IBM is doing with the Power8 and future Power9 chips.
First, the core count on the Sparc64-XII chip is going down, not up. Rather than add more cores to the chip, Fujitsu is adding brawnier cores, with lots more threading and other instruction per clock (IPC) tweaks. Lam says that based on its own internal benchmark test results, the core in the Sparc64-XII chip is delivering 2.5X the performance of the core used in the Sparc64-X+ processor that came out nearly four years ago. This is a big leap in per-core performance, and it will allow many customers who are paying per-core licensing fees for software to lower their budgets for databases, middleware, and applications while getting more throughput on those applications.
The Sparc64-XII cores run at 3.9 GHz, 11.4 percent higher than the 3.5 GHz clock speed on the predecessor Sparc64-X+ cores. Even with a drop from sixteen cores with the Sparc64-X+ to the twelve in the Sparc64-XII, the aggregate throughput per socket is going to rise by about a factor of 1.9X. Yes, that was a long time to wait for a near doubling of performance. But for many Sparc64 customers, the wait will be worth it. (The wonder is why Fujitsu did not show off this processor at Hot Chips last summer, as it normally would have.)
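The arithmetic behind those figures is easy to check. The sketch below is a back-of-the-envelope calculation, not a benchmark: it assumes the 2.5X per-core gain quoted by Lam already includes the clock speed increase and that socket throughput scales linearly with core count. The function names are ours, for illustration only.

```python
# Back-of-the-envelope check of the per-socket scaling quoted above.
# Assumptions (not from Fujitsu): throughput scales linearly with core
# count, and the 2.5X per-core figure already folds in the higher clock.

def clock_uplift_percent(new_ghz, old_ghz):
    """Percentage increase in clock speed from old_ghz to new_ghz."""
    return (new_ghz / old_ghz - 1.0) * 100.0

def socket_speedup(per_core_gain, new_cores, old_cores):
    """Per-socket throughput ratio given a per-core gain and core counts."""
    return per_core_gain * new_cores / old_cores

uplift = clock_uplift_percent(3.9, 3.5)   # Sparc64-XII vs Sparc64-X+
speedup = socket_speedup(2.5, 12, 16)     # twelve cores vs sixteen cores

print(f"Clock uplift: {uplift:.1f} percent")
print(f"Per-socket speedup: {speedup:.3f}X")
```

Under those assumptions the result is 1.875X per socket, which rounds to the 1.9X cited.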
The Sparc64-XII is implemented in 20 nanometer processes and is manufactured by Taiwan Semiconductor Manufacturing Corp. For the first time, Fujitsu is adding an L3 cache to its processor, something it has eschewed until now because of the large transistor budget it requires. The Sparc64-X+ weighed in at 2.99 billion transistors, but the Sparc64-XII comes in at 5.5 billion transistors, and a lot of that comes through the addition of that L3 cache. But that 32 MB L3 cache is vital in an architecture that is moving up to eight threads per core, up from two threads per core in the past four generations of Sparc64 commercial chips, and keeping all of those threads happily fed with data requires another layer of cache.
(It is interesting to note that IBM’s Power8 chips have L4 cache in the memory controllers on its homegrown memory cards, which are used in its biggest, baddest NUMA boxes.) IBM’s high-end Power9 chip will similarly have a dozen cores per chip when operating with eight threads, and there is a version with four threads per core that will scale to 24 cores per socket for systems that need lots of compute, less main memory, and limited NUMA expansion.
Each of the cores in the Sparc64-XII processor has 64 KB of L1 instruction cache and 64 KB of L1 data cache, plus a 512 KB L2 cache. The core implements various SIMD instructions and also has a new decimal floating point math unit, which comes in handy counting money. (The Power chip has had decimal math capability for several generations.) The chip has various accelerators for cryptographic and hashing algorithm processing, and supports AES, DES, 3DES, RSA, DSA, DH, and SHA protocols natively in hardware.
The Sparc64-XII chip has four DDR4 memory controllers, which support memory speeds of up to 2.4 GHz; the Sparc64-X and Sparc64-X+ chips supported older DDR3 memory and had only two memory controllers. The memory bandwidth per socket has risen by a factor of two along with the number of memory controllers, to 153 GB/sec, but the main memory capacity has been kept the same at 1 TB per chip using 64 GB memory sticks. (The prior Sparc64-X+ machines had 32 sticks at 32 GB to reach that capacity.) To balance out memory performance, the Sparc64-XII chip has two DDR4 memory channels per controller and two DIMMs per channel for a total of sixteen sticks per socket.
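Those memory numbers hang together if the "2.4 GHz" figure is read as DDR4-2400, meaning 2,400 million transfers per second over a standard 64-bit channel. That reading is our assumption, not something Fujitsu has stated, and the sketch below computes theoretical peak figures only.

```python
# Peak memory bandwidth and capacity per Sparc64-XII socket.
# Assumptions: "2.4 GHz" means DDR4-2400 (2.4 billion transfers/sec)
# and each channel is a standard 64-bit (8-byte) DDR4 channel.

CONTROLLERS = 4
CHANNELS_PER_CONTROLLER = 2
DIMMS_PER_CHANNEL = 2
TRANSFERS_PER_SEC = 2_400_000_000  # assumed DDR4-2400
BYTES_PER_TRANSFER = 8             # assumed 64-bit channel width
DIMM_CAPACITY_GB = 64

channels = CONTROLLERS * CHANNELS_PER_CONTROLLER
dimms = channels * DIMMS_PER_CHANNEL

# Theoretical peak bandwidth in GB/sec (decimal gigabytes).
peak_gb_per_sec = channels * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9
capacity_gb = dimms * DIMM_CAPACITY_GB

print(f"Channels per socket: {channels}")
print(f"DIMMs per socket: {dimms}")
print(f"Peak bandwidth: {peak_gb_per_sec:.1f} GB/sec")
print(f"Maximum capacity: {capacity_gb} GB")
```

This gives eight channels, sixteen DIMMs, 153.6 GB/sec of peak bandwidth, and 1,024 GB of capacity, in line with the 153 GB/sec and 1 TB per chip cited above.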
The Sparc64-XII has twice as many PCI-Express controllers, and has four of them integrated with eight lanes of traffic each for a total of 32 lanes; that is twice the peripheral bandwidth. The NUMA interconnect runs at 25 Gb/sec, the same speed that NUMA links ran at with the Sparc64-X+ chips, but the Sparc64-X chips ran at only 14.5 Gb/sec. (By the way, peripheral interconnects on the Power9 chip run at 25 Gb/sec, but the NUMA links only run at 16 Gb/sec.)
The Sparc64-X+ interconnect implemented a glueless four-socket M10 node and then allowed for up to sixteen of these to be linked together over routers for a total of 64 sockets and 64 TB of memory. With the Sparc64-XII machines, Fujitsu is backstepping to a two-socket glueless node, which is sold as the M12 server, and then allows up to sixteen of these to be linked together to create a system with a total of 32 sockets and 32 TB of main memory.
The funny bit is that Fujitsu has cut the socket count and maximum memory back in the high-end M12-S system by half moving from the Sparc64-X+ to the Sparc64-XII, just like IBM cut back from 32 sockets and 16 TB of memory with the Power7 in the Power795 to sixteen sockets and originally 16 TB with the Power8 in the Power E880. (IBM did boost this to 32 TB of maximum capacity with a very pricey 128 GB memory stick back in January 2016.) Oracle can scale its “Bixby” interconnect to 96 sockets and 96 TB or, with fatter DIMMs, even 192 TB, but the Sparc M5 and Sparc M6 systems topped out at 32 sockets and 32 TB, and the Sparc M7 machines launched in October 2015 expanded to 64 sockets but only with 32 TB of maximum memory. It is unclear what Oracle will do with the future Sparc M8 systems it is creating and which will compete against Power9 and Fujitsu M12-S.
Clearly, enterprises are not looking for machines that can do more than 32 TB in a single footprint, or IBM, Fujitsu, and Oracle would be selling them. Or maybe they are just too expensive.
The Sparc M12-2 “Athena++” server that Fujitsu is building and that Oracle is reselling has two of the 3.9 GHz Sparc64-XII processors, supports up to 2 TB of main memory, and has 11 PCI-Express 3.0 slots; with expansion units, it can fan out to a total of 71 PCI-Express slots sharing the bandwidth for peripherals. It has a four-port 10 Gb/sec Ethernet card embedded in the system board, and significantly, the M12-2 system has an aggregate memory bandwidth of 306 GB/sec, which is pretty good and absolutely competitive with the Power8 platform. The chassis has room for eight 2.5-inch drives, which can be 600 GB or 900 GB SAS disks or 400 GB eMLC SAS SSDs.
If you plan on doing NUMA scalability beyond two sockets, then Fujitsu cranks the clock speed on the cores to 4.25 GHz (that’s a 21.4 percent boost over the 3.5 GHz clock speed on the Sparc64-X+) and allows customers to pick configurations with 2, 8, or 32 sockets and 2 TB, 8 TB, or 32 TB maximum memory configurations using 64 GB memory sticks. (If you want to use less capacious 16 GB or 32 GB memory sticks, you can.) This beast can bring to bear 384 cores and 3,072 threads on a single application running against that big block of shared memory, and this Sparc64 interconnect will have much higher bandwidth and lower latency than an Ethernet or InfiniBand network – at least for a while, anyway.
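The core, thread, and memory totals for each M12-S configuration follow from the per-socket figures given earlier: twelve cores per socket, eight threads per core, and 1 TB of maximum memory per socket. The sketch below simply tabulates them; the `configuration` helper is a name we made up for illustration.

```python
# Totals for the M12-S configurations, from the per-socket figures
# quoted in the article. `configuration` is an illustrative helper,
# not anything from Fujitsu's tooling.

CORES_PER_SOCKET = 12
THREADS_PER_CORE = 8
MAX_MEMORY_TB_PER_SOCKET = 1  # sixteen 64 GB memory sticks

def configuration(sockets):
    """Return core, thread, and memory totals for a socket count."""
    cores = sockets * CORES_PER_SOCKET
    return {
        "sockets": sockets,
        "cores": cores,
        "threads": cores * THREADS_PER_CORE,
        "memory_tb": sockets * MAX_MEMORY_TB_PER_SOCKET,
    }

for sockets in (2, 8, 32):
    print(configuration(sockets))
```

The 32-socket case works out to 384 cores, 3,072 threads, and 32 TB, matching the figures above.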
With the Sparc M10, Fujitsu used water cooling on key components to increase the cooling efficiency of the system and to allow for the components to run a bit hot. With the Sparc M12 systems, Fujitsu has an improved evaporative liquid cooling system that is twice as effective, and which we presume lets it overclock components a bit to deliver that big performance bump.
The Fujitsu Sparc M12-2 and M12-S systems are available now. The Sparc M12-2 base machine with two 3.9 GHz processors, 64 GB of memory, and one 600 GB disk drive costs $49,660. A single node in a Sparc M12-S cluster with two of the 4.25 GHz processors, plus the same 64 GB of memory and a single disk drive plus the “XB” crossbar interconnect router, costs $64,284.
One last thing: Unlike Oracle’s own iron, Fujitsu keeps supporting the Solaris 10 operating system natively on its iron, and this includes the new M12-2 system and M12-S NUMA cluster. Fujitsu customers can run Solaris 10 on bare metal or in Solaris containers on Solaris 11 machines; Oracle only allows Solaris 10 in containers on newer Sparc iron of its own design.
50k for 64GB RAM and a 600GB hard drive? Are they laughing? That’s ridiculously expensive, these are moon prices
These servers are high end building blocks, offering Enterprise RAS etc. x86 does not have that. For instance, some SPARC cpus can back and replay an instruction if the cpu detects an error. To build in such tailor made stuff into x86 would vastly increase the price of x86. And also, as you can scale these servers out to 32-sockets – you must pay a premium for that. Of course, if Fujitsu wanted to release one 1-socket workstation without all the Enterprise stuff, the price would drop dramatically. But that low end market belongs to x86.
I see POWER9 as a more mature and flexible architecture, with scale out and scale up versions. I’d switch to POWER9 also because of the great IO and accelerator capabilities such as OpenCAPI and PCI-Express 4.0. Nobody else has that.
POWER9 is not faster than the current SPARC M7 cpu. The SPARC M7 is typically 2-3x faster than POWER8 and the fastest x86 – if you look at the 30ish world records in different areas. For business workloads such as databases, the SPARC M7 is up to 11x faster than POWER8 and x86. Here are some 30ish world records, some of them are official, such as SPECcpu_2006 benchmarks, databases, SAP, etc.
https://blogs.oracle.com/BestPerf/entry/201510_specpu2006_t7_1
The POWER9 will only be 2-2.5x faster than the current POWER8. That means a POWER9 might maybe catch up on current SPARC M7 performance. The next year SPARC M8 arrives, which again will be 2x as fast as the current SPARC M7. Oracle has always doubled the cpu performance every generation (except for the SPARC S7 which is one quarter of a crippled M7), and today Oracle have released six cpus in five years.
So, I dont see the point of IBM releasing a future POWER9 as IBM can not even beat the current crop of SPARCs?
SPEC CPU2006 Rate Results – One Chip
SPARC M7 (4.13 GHz, 32 cores) SPECint_rate2006 at peak 1200 base 1120
POWER8 (2.92 GHz, 10 cores) SPECint_rate2006 at peak 642 base 482
Yes, 10-cores POWER8 2.92Ghz slower than 32-cores SPARC M7 4.13Ghz
Who cares if SPARC has a 32-core chip? Why would you compare 10 cores vs 32 cores. Is software priced by chip or by core? (answer=core) Not a very good comparison. This one is better one:
SPARC M7 (4.13 GHz, 32 cores) SPECint_rate2006 at peak 1200 base 1120
POWER8 850C (4.22 GHz, 32 cores) SPECint_rate2006 at peak 2520 base 1990
Power8 wins by 78% – 110%
@Kurt,
I dont know how many times I have said this to IBMers, but if we talk about the worlds fastest cpu, you can not compare core vs core and then draw conclusions about the cpu. It is like comparing apples to apples, and then draw conclusions about rockets.
Again, typically a high end CPU might use 200 watt. POWER8 with 10 cores, means that each core is designed to use 20 watt. You always have a watt budget. And transistor budget. You can not exceed those for a cpu. All high end RISC cpus maxes out at 200-250 watt. Intel Xeons maxes out at 150 watt. You can not exceed your watt budget.
POWER8 cores are beefy and designed to use 20 watt each. If you collect 32 of them cores into a cpu, it would use 32 * 20 watt = 640 watt. Such a cpu would melt the server.
If you are going to use 32 cores then each core must be weaker to not exceed the 200-250 watt budget.
I dont know how many times I told IBMers that you can not extrapolate from cores to cpus, because you violate all kinds of constraints that can not be broken. This makes me wonder if IBMers live in a universe with “alternative facts” where 640 watt cpus exist in their imagination.
Again, you can not compare core to core and draw conclusions about cpus. This is not the last time I say this, I know. The FUD is strong with IBM.
Michael – where do you see 11x performance improvement of M7 over POWER8? The link you have had a SPECINT rating of 1120 for M7 (32 cores @ 4.13 GHz) and 482 for POWER8 (10 cores @ 2.92 GHz). So at chip level you get a 2.3x advantage. But if you look at per core performance, POWER8 has a 1.4x advantage. This would be even higher if we compare the performance against the faster POWER8 processors (4.15 & 4.35 GHz).
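The per-chip and per-core ratios in this exchange follow directly from the SPECint_rate2006 base scores quoted in the thread. A minimal sketch, using only the numbers posted above and assuming that dividing a rate score evenly by core count is a fair proxy for per-core performance (a simplification that SPEC itself does not endorse):

```python
# Per-chip vs per-core comparison from the SPECint_rate2006 base scores
# quoted in the comments. Dividing a rate score by core count is a rough
# proxy for per-core performance, not an official SPEC metric.

def per_core(score, cores):
    """Rate score divided evenly across cores."""
    return score / cores

m7_base, m7_cores = 1120, 32   # SPARC M7, one chip
p8_base, p8_cores = 482, 10    # POWER8, one chip

chip_ratio = m7_base / p8_base
core_ratio = per_core(p8_base, p8_cores) / per_core(m7_base, m7_cores)

print(f"M7 per-chip advantage: {chip_ratio:.1f}x")
print(f"POWER8 per-core advantage: {core_ratio:.1f}x")
```

Both readings come from the same pair of scores: the M7 leads by about 2.3x per chip and POWER8 leads by about 1.4x per core, so the disagreement is over which unit of comparison matters.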
@jkantony
I posted a link with 30ish world records. If you look there you will find lot of benchmarks in various domains. Here is one where SPARC M7 is 11x faster than Intel E5-2699 v3.
https://blogs.oracle.com/BestPerf/entry/20151025_imdb_t7_1
Here the SPARC M7 core is 15.6x faster than Intel E5v4 core
https://blogs.oracle.com/BestPerf/entry/accelerating_spark_sql_using_sparc
And as we all know, POWER8 is much slower than Intel Xeon. Just look at the different benchmarks, for instance my link with SPECcpu_2006.
And, when we activate compression and encryption, the SPARC M7 performance drops 3-5% according to benchmarks. OTOH x86 performance drops 50% or so. I dont know how much slower POWER8 gets when activating compression and encryption? Does POWER8 also drop 50%?
No SPARC64 XI ? What about https://en.wikipedia.org/wiki/SPARC64_V#SPARC64_XIfx ?
It is a SPARC chip but not for the same purpose or product line as the ones being discussed here. It’s for their PrimeHPC supercomputers, and the next generation of those will be using ARM.
That chip you mention has been in some of the best machines since 2015.