When it comes to supercomputing, you don’t only have to strike while the iron is hot; you also have to spend while the money is available. And that fact is what often determines the technologies that HPC centers deploy as they expand the processing and storage capacity of their systems.
A good case in point is the MareNostrum 4 hybrid cluster that the Barcelona Supercomputing Center, one of the flagship research and computing institutions in Europe, has just commissioned IBM to build with the help of partners Lenovo and Fujitsu. The system balances the pressing need for more general purpose computing with the desire to let researchers explore how applications might be sped up on hybrid CPU-GPU machines that mix processors from IBM and accelerators from Nvidia or, alternatively, on nodes based on Intel’s “Knights” family of Xeon Phi manycore processors. There is even a slice of the system that will be based on a baby version of the “Post-K” supercomputer that Fujitsu is building for the Japanese government, which uses Fujitsu’s own homegrown ARM processors with vector extensions that it is developing with ARM Holdings.
According to Sergi Girona, operations director of BSC, the request for proposal for the MareNostrum 4 system had three main requirements. It needed to have at least two different architectures that would have a foundation that could scale to exascale performance, it needed to have general purpose computing as the main chunk of its performance, and it needed to be operational by the middle of next year.
Those conditions obviously put restrictions on the architecture options available to BSC, and it is interesting to note that IBM, the primary contractor on the deal, ended up proposing that the bulk of the computing capacity of the MareNostrum 4 system consist of servers based on the next generation of Xeon processors from Intel. (Everyone in the HPC community who is pre-announcing contracts based on the “Skylake” Xeon E5 v5 processors is forbidden from using the code-name in public statements about the machines, but everyone knows it is Skylake.)
Intel’s top brass confirmed several weeks ago that Skylake Xeons are due around the middle of next year, and at the SC16 supercomputing conference in Salt Lake City, Intel announced that the AVX-512 vector instructions that were initially developed for the “Knights Landing” Xeon Phi manycore processors would be ported over to the Skylake Xeons. (We told you about that back in May 2015, of course.) Interestingly, as we reported elsewhere this week, Lenovo, which is building the Skylake Xeon systems for IBM as part of the BSC deal, says that thanks to the AVX-512 instructions it is getting similar HPC-style performance out of the Skylake processors as it is seeing out of the Knights Landing processors.
The desire for general purpose computing and the timing of the Skylake Xeon launch helped drive the choice that BSC made. “We specified that we wanted to have a machine installed by the middle of next year and that we wanted to have a general purpose processor with no accelerators because not all applications we have in our center are being ported to accelerators, and so all of the bids we received were based on this processor,” Girona tells The Next Platform. “There are other processors on the market, but the timing is not adequate and for this particular RFP, it looks like it was only possible with this Xeon generation. We were specifying limits on the footprint, the energy consumption and the timing because this machine will be used for accomplishing the commitments of BSC for PRACE.”
PRACE is the Partnership for Advanced Computing in Europe, which has 25 European countries working together to support five HPC centers in five of the biggest countries in the region: BSC represents Spain, CINECA represents Italy, CSCS represents Switzerland, GCS represents Germany, and GENCI represents France.
To be precise, BSC, which has been a strong proponent of the Power architecture in the past, did take a look at the possibility of using plain vanilla Power9 systems without GPU accelerators as the large partition at the heart of MareNostrum 4. “We would have considered for this to be proposed, but this did not fit on our time schedule and not within our footprint,” Girona says.
The word on the street is that Power9 won’t be delivered to the market until the second half of 2017, so there was not really an option for this machine at this time at BSC. Hence, the proposals for MareNostrum 4 all called for Skylake Xeons as the main compute element.
This core partition of MareNostrum 4 will consist of 3,400 two-socket Lenovo servers based on Skylake processors that will deliver a peak of 11 petaflops of double precision floating point performance. We don’t know how many cores each server socket will have or at what speed they will run, but if you do the math across the 6,800 sockets, each Skylake Xeon socket is delivering about 1.62 teraflops of peak performance. Assuming you can drive that up to 90 percent efficiency, as can be done with a fast network supporting the Linpack benchmark, call it an estimated 1.5 teraflops per socket.
The top-end Knights Landing Xeon Phi 7290, by comparison, has an impressive 3.46 teraflops delivered by its 72 cores running at 1.5 GHz, but it also costs $6,254 a pop when bought in 1,000-unit quantities at list price. The Xeon Phi 7230 runs at 1.3 GHz across 64 cores and delivers 2.66 teraflops for $3,710, which is considerably less performance but much better bang for the buck, and it still has very high memory bandwidth per socket. The interesting bit is that it seems to be hard to get Knights Landing code to run efficiently right now because it is such a new architecture, and this changes the effective teraflops ratings of the processor compared to Xeons. On the Oakforest-PACS system installed at the Joint Center for Advanced High Performance Computing in Japan, the 8,208 Knights Landing nodes using the 68 core Xeon Phi 7250 had a peak theoretical performance of 24.9 petaflops, but only rated 13.55 petaflops on the Linpack test. That means nearly 46 percent of the flops in the compute complex went up the chimney, and this implies that if customers used the top-end Xeon Phi part, the effective throughput might only be about 1.9 teraflops on Linpack. Assume that applications have similar efficiencies to Linpack (which we realize is dubious, but this is what we have to work with) and you can see why some customers are willing to wait for Skylake in 2017 instead of doing Knights Landing now.
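For readers who want to check the per-socket arithmetic, here is a minimal Python sketch. It uses only the figures quoted in this story, plus one assumption of ours: that each Knights Landing core retires 32 double precision flops per clock (two AVX-512 fused multiply-add units, eight lanes each, two flops per lane). The function and variable names are purely illustrative.

```python
# Assumption (not stated in the story): 32 double precision flops per core per
# clock on Knights Landing = 2 AVX-512 FMA units x 8 lanes x 2 flops.
DP_FLOPS_PER_CYCLE = 32


def peak_teraflops(cores, ghz, flops_per_cycle=DP_FLOPS_PER_CYCLE):
    """Peak double precision teraflops for a single socket."""
    return cores * ghz * flops_per_cycle / 1000.0


# Skylake partition: 11 petaflops peak across 3,400 two-socket nodes.
skylake_sockets = 3400 * 2
skylake_peak_tf = 11_000 / skylake_sockets        # teraflops per socket
skylake_linpack_tf = skylake_peak_tf * 0.90       # at 90 percent efficiency

# Knights Landing parts, using the core counts and clocks quoted above.
phi_7290_tf = peak_teraflops(72, 1.5)
phi_7230_tf = peak_teraflops(64, 1.3)

# Oakforest-PACS: Linpack result divided by theoretical peak.
oakforest_efficiency = 13.55 / 24.9
phi_7290_linpack_tf = phi_7290_tf * oakforest_efficiency

# List price per peak teraflop.
phi_7290_usd_per_tf = 6254 / phi_7290_tf
phi_7230_usd_per_tf = 3710 / phi_7230_tf

print(f"Skylake peak per socket:     {skylake_peak_tf:.2f} TF")
print(f"Skylake at 90% efficiency:   {skylake_linpack_tf:.2f} TF")
print(f"Xeon Phi 7290 peak:          {phi_7290_tf:.2f} TF")
print(f"Xeon Phi 7230 peak:          {phi_7230_tf:.2f} TF")
print(f"Oakforest-PACS efficiency:   {oakforest_efficiency:.1%}")
print(f"Xeon Phi 7290 at that rate:  {phi_7290_linpack_tf:.2f} TF")
print(f"Xeon Phi 7290 price per TF:  ${phi_7290_usd_per_tf:,.0f}")
print(f"Xeon Phi 7230 price per TF:  ${phi_7230_usd_per_tf:,.0f}")
```

The Skylake figures are derived purely from the aggregate numbers, since core counts and clock speeds have not been disclosed; the Oakforest-PACS efficiency is applied to the Xeon Phi 7290 only as a rough proxy, as the text notes.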
The choice between Skylake and Knights Landing comes down to the need for larger main memory footprints, which you get with Xeons, versus higher memory bandwidth, which you get with Knights Landing. Both the Skylake Xeons and the Knights Landing Xeon Phi chips will have integrated 100 Gb/sec Omni-Path interfaces as an option, lowering latency and cutting the cost of networking, too. It is not clear if the core MareNostrum partition uses the integrated Omni-Path links on the Skylake chips. It would make sense if it did, but Girona is not at liberty to say.
This core part of the MareNostrum 4 system will have an aggregate of 390 TB of “central memory,” which is an odd amount not evenly divisible by 3,400 nodes and which we think includes a mix of main DRAM memory and perhaps some other kind of memory such as Intel’s Optane 3D XPoint memory sticks or SSDs. The whole shebang will fit in 38 racks and will have ten times the compute capacity of the existing MareNostrum 3 system, which has 3,056 IBM iDataPlex nodes using “Sandy Bridge” Xeons that are more than four years old; the nodes are linked together using 56 Gb/sec InfiniBand. The MareNostrum 4 system will consume 1.3 megawatts of juice, 30 percent more than its predecessor.
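A quick Python sketch shows why 390 TB looks odd against 3,400 nodes and what the power figures imply. The story does not say whether the terabytes are decimal or binary, so both are tried; the efficiency gain assumes that the "ten times" and "30 percent more" figures describe the same scope of machine. Variable names are illustrative.

```python
NODES = 3400
CENTRAL_MEMORY_TB = 390

# Unknown whether "TB" means 1,000 GB or 1,024 GB, so compute both.
gb_per_node_decimal = CENTRAL_MEMORY_TB * 1000 / NODES
gb_per_node_binary = CENTRAL_MEMORY_TB * 1024 / NODES

# Does either reading split into a whole number of gigabytes per node?
divides_evenly = (
    (CENTRAL_MEMORY_TB * 1000) % NODES == 0
    or (CENTRAL_MEMORY_TB * 1024) % NODES == 0
)

# Ten times the compute for 1.3 times the power of MareNostrum 3.
COMPUTE_RATIO = 10.0
POWER_RATIO = 1.30
flops_per_watt_gain = COMPUTE_RATIO / POWER_RATIO

print(f"Memory per node (decimal TB): {gb_per_node_decimal:.1f} GB")
print(f"Memory per node (binary TB):  {gb_per_node_binary:.1f} GB")
print(f"Divides evenly across nodes:  {divides_evenly}")
print(f"Flops per watt improvement:   {flops_per_watt_gain:.1f}x")
```

Either way the per-node figure lands between 114 GB and 118 GB, which is not a whole number of gigabytes per node and is consistent with the guess that the total covers more than one kind of memory.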
While BSC is putting forth a massive Xeon cluster as the core of its flagship system, it is weaving lots of other elements into the system. MareNostrum 4 will have a partition that is based on future IBM Power9 processors coupled to future Nvidia “Volta” Tesla GPU coprocessors, just like the nodes that will eventually be used in the “Summit” and “Sierra” systems being built by Big Blue for the US Department of Energy. This part of MareNostrum will be linked using InfiniBand networks from Mellanox Technologies, presumably the 200 Gb/sec Quantum products that the company debuted at SC16. Girona tells us that BSC already has a baby cluster based on the current “Minsky” servers from IBM, pairing the NVLink-equipped Power8 processors with the “Pascal” Tesla accelerators, and that eventually the hybrid Power-Tesla system will be upgraded to the Power9-Volta combination and deliver around 1.5 petaflops of aggregate computing. That should be around one rack of gear, by the way, and cramming a lot of compute into a small space is a big deal for BSC, which runs its datacenter from the Chapel Torre Girona at the Polytechnic University of Catalonia in Barcelona, Spain, and which is arguably the most beautiful datacenter in the world.
BSC is investing in other emerging technologies as well. It has worked through IBM to add a partition of machines linked by Omni-Path networks that will be based on a mix of the current Knights Landing and future “Knights Hill” Xeon Phi manycore processors, echoing the US Department of Energy’s other architecture pick, the “Theta” testbed and “Aurora” pre-exascale system at Argonne National Laboratory. This Knights partition is expected to deliver in excess of 500 teraflops. Because of time to market differences, Fujitsu is providing the Knights Landing nodes, which are already on the floor, and Lenovo is providing the Knights Hill nodes, which presumably come next year. All of these machines will run SUSE Linux, as the current MareNostrum 3 system does.
Finally, as we pointed out above, BSC is also working through IBM to get a slice of the Post-K supercomputer architecture being developed by Fujitsu for RIKEN, which is based on its own ARM processors and which will also deliver more than 500 teraflops of double precision oomph.
Girona cannot reveal when the Power9-Volta, Knights Hill, or Post-K portions of the cluster will be delivered. But we expect Power9-Volta in the late summer or early fall next year, Knights Hill in 2018, and Post-K in maybe 2019 or 2020.
Add it all up across those different architectures and the MareNostrum 4 cluster will cost €29.97 million, or about $31.9 million at current exchange rates between the euro and the dollar. (We presume that the cost of each part of the MareNostrum 4 system is relative to its aggregate flops, more or less.) All of the elements of the MareNostrum 4 system will link to a single parallel file system based on IBM’s GPFS and using its Spectrum Elastic Storage appliances with flash embedded in them; this will weigh in at more than 10 PB and will all by itself cost another €4 million, or about $4.3 million.
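To put rough numbers on that presumption, the following Python sketch splits the compute budget in proportion to the peak flops quoted for each partition. This is only an illustration of the flops-proportional guess, not a disclosed price breakdown; the Knights and Post-K figures are the stated lower bounds of 500 teraflops, and the partition labels are ours.

```python
TOTAL_COST_EUR_M = 29.97   # compute budget, millions of euros
TOTAL_COST_USD_M = 31.9    # same budget in dollars, as quoted
STORAGE_COST_EUR_M = 4.0   # parallel file system, millions of euros

# Peak petaflops per partition as quoted in the story.
PEAK_PETAFLOPS = {
    "Skylake Xeon": 11.0,
    "Power9-Volta": 1.5,
    "Knights Landing/Hill": 0.5,   # "in excess of 500 teraflops"
    "Post-K ARM": 0.5,             # "more than 500 teraflops"
}

total_pf = sum(PEAK_PETAFLOPS.values())

# Hypothetical split: each partition's cost proportional to its peak flops.
cost_share_eur_m = {
    name: TOTAL_COST_EUR_M * pf / total_pf
    for name, pf in PEAK_PETAFLOPS.items()
}

# Exchange rate implied by the two quoted totals, applied to storage.
usd_per_eur = TOTAL_COST_USD_M / TOTAL_COST_EUR_M
storage_cost_usd_m = STORAGE_COST_EUR_M * usd_per_eur

for name, cost in cost_share_eur_m.items():
    print(f"{name:22s} {cost:6.2f} million euros")
print(f"Cost per peak petaflop: {TOTAL_COST_EUR_M / total_pf:.2f} million euros")
print(f"Implied exchange rate:  {usd_per_eur:.3f} dollars per euro")
print(f"Storage in dollars:     {storage_cost_usd_m:.1f} million")
```

Under this split the Skylake partition accounts for roughly four-fifths of the compute budget, which is what you would expect given that it supplies 11 of the roughly 13.5 petaflops.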
We have written many times about BSC and its experimentation with ARM-based parallel machines, particularly the Mont Blanc effort, and Girona wanted to assure us that this work continues.
“We are continuing the research we have been doing,” says Girona. “We are developing a complete software stack and we have a number of applications that we are porting to the ARM architecture. All of the research we have done so far will be applicable.”