The drive toward exascale computing is giving researchers the largest HPC systems ever built, yet key demands persist: more memory to accommodate larger datasets, persistent memory that keeps data on the memory bus instead of on drives, and the lowest possible power consumption.
One of the biggest challenges architects and operators face is the power these massive machines consume. A significant portion of that comes from DRAM DIMMs – as much as 25 percent, according to Antonio Peña, a researcher at the Barcelona Supercomputing Center (BSC) in Spain.
“Right now, HPC applications are constrained by the amount of DRAM in the nodes and cluster,” Peña explained. “They need more and more memory but adding larger and more DIMMs with the current technology is not feasible due to the power constraints on the overall system.”
New memory technologies have emerged in the last couple of years that allow architects to combine traditional DDR DIMMs and ECC memory with other technologies to build large, diverse memory systems at lower power. One of these technologies is persistent memory, which provides much larger memory capacity at a fraction of the power of DDR DIMMs. In the enterprise market, persistent memory is gaining traction in applications like large in-memory computing and databases, which use DRAM as system memory and persistent memory for fast recovery from outages. To leverage both memory technologies, though, software vendors have to modify their own proprietary codes – and no tools exist for HPC developers to automatically optimize their applications for heterogeneous memory systems.
Peña and his team at BSC are using a system in the Intel-BSC Exascale Lab with 6 TB of Intel Optane persistent memory (often shorthanded to PMEM) and “Cascade Lake” Xeon SP processors to develop new HPC software that intelligently places data across a heterogeneous memory hierarchy for the best performance of a workload. These tools will enable HPC application developers to seamlessly take advantage of the new memory capacities and technologies being integrated into next-generation HPC clusters, helping to reduce power demand.
“We’re trying to reduce server power while accelerating applications by using Intel Optane PMEM and intelligently managing both where the data is located in the memory system and its movements,” Peña said. “We can take advantage of the big memory footprint that the new technology offers and put more data closer to the processor. There is a slightly longer latency than DRAM, but we don’t have to pay for the penalty of even more latency going to other storage technologies.”
Peña is no stranger to heterogeneous memory architectures. At the US Department of Energy’s Argonne National Laboratory, he pioneered work on heterogeneous memory systems in HPC clusters. Now at BSC, Peña and his team, under the direction of Computer Sciences department director Jesus Labarta and in close collaboration with Harald Servat of Intel, are developing new tools for large HPC workloads that expose and leverage multiple types of memory.
Big Memory, Big Power
Those who build really big computing systems know the challenges memory brings to the architect. Today’s large HPC cluster builders typically size memory between 2 GB and 3 GB per core for the best performance, a practice seen widely across the leading x86-based clusters on the Top500 list. A 2016 study revealed that High Performance Linpack (HPL) scores tend to plateau around 2 GB of main memory per core in HPC systems. BSC’s MareNostrum 4, with 3,456 nodes (165,888 Xeon Platinum 8160 processor cores), has 2 GB/core on all but a relatively few of the nodes; some 216 of its nodes offer a larger capacity of 8 GB/core to accommodate much bigger data sets. The same study predicted that as data and systems continue to scale (e.g., to one million cores), it will take 7 GB to 16 GB of memory per core to reach as much as 99 percent of the ideal HPL performance. At that scale, based on data from the memory industry, memory power will have a very large impact on server power consumption.
Memory manufacturer Crucial recommends budgeting about three watts per 8 GB of DDR3 or DDR4 memory – and more for RDIMMs. Using that three-watt figure as the basis for calculating how memory power scales, an HPC node with 56 x86 cores and 112 GB of memory (2 GB/core) could need 42 watts to 50 watts for its memory. Large-memory nodes with 8 GB/core could consume 168 watts or more on memory alone, depending on the configuration and vendor specs. And the picture has not improved significantly over the years, even as DRAM has gotten faster and geometries smaller: according to Peña, the power-reduction curve for DRAM is flattening, from a factor of 1.5X per generation between 2000 and 2010 to 1.2X between 2010 and 2018. So, looking ahead to massively large systems based on processors with 112 cores and more per node, memory power by itself becomes unwieldy, possibly reaching 700 watts or more in a single node at the projected 16 GB per core.
Today’s HPC operators want to support bigger data, while reducing power. Intel Optane PMEM lets them do that. Compared with the recommended 3 watts per 8 GB (375 mW/GB) for standard DDR3 and DDR4 DIMMs, Intel Optane PMEM’s 128 GB memory modules consume only 117 mW/GB, and the 512 GB modules use just 35 mW/GB – a 10X reduction in power compared to DRAM DIMMs.
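To make those figures concrete, here is a quick back-of-the-envelope calculation in C, using only the per-GB numbers quoted above. The node capacities are the article’s examples, and the results are rough estimates rather than vendor specifications:

```c
#include <stdio.h>

/* convert a capacity in GB and a per-GB figure in mW into watts */
static double watts(double capacity_gb, double mw_per_gb)
{
    return capacity_gb * mw_per_gb / 1000.0;
}

int main(void)
{
    const double ddr4_mw_gb    = 375.0; /* ~3 W per 8 GB of DDR3/DDR4      */
    const double pmem128_mw_gb = 117.0; /* 128 GB Optane PMEM module       */
    const double pmem512_mw_gb =  35.0; /* 512 GB Optane PMEM module       */

    /* 56-core node at 2 GB/core and 8 GB/core, as in the text */
    printf("DDR4, 112 GB (2 GB/core):        %5.0f W\n", watts(112.0,  ddr4_mw_gb));
    printf("DDR4, 448 GB (8 GB/core):        %5.0f W\n", watts(448.0,  ddr4_mw_gb));

    /* the same 448 GB served by Optane PMEM modules */
    printf("PMEM (128 GB modules), 448 GB:   %5.0f W\n", watts(448.0,  pmem128_mw_gb));
    printf("PMEM (512 GB modules), 448 GB:   %5.0f W\n", watts(448.0,  pmem512_mw_gb));

    /* hypothetical future node: 112 cores at 16 GB/core = 1,792 GB */
    printf("DDR4, 1792 GB (16 GB/core):      %5.0f W\n", watts(1792.0, ddr4_mw_gb));
    printf("PMEM (512 GB modules), 1792 GB:  %5.0f W\n", watts(1792.0, pmem512_mw_gb));
    return 0;
}
```

The DDR4 lines reproduce the 42-watt, 168-watt, and roughly 700-watt figures above; the PMEM lines show where the order-of-magnitude savings come from.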
“Large non-volatile memory as a basis for hierarchical memory solutions is a great candidate,” Peña added. “It’s byte-addressable, so we can use it for regular load-stores, offers large capacity, and uses less power.”
Innovating a Heterogeneous Memory HPC Software Ecosystem
To directly manage the persistent memory and simultaneously use other types of memory in a cluster, Peña’s team has built an optimization toolset based on Extrae, a general-purpose profiler developed by BSC, and on Extended Valgrind for Object-differentiated Profiling (EVOP), among other tools. Optimization is a multi-stage process.
The tools first analyze how data is used during a normal program execution, profiling accesses to the different data objects along with their demand and latencies. The profiling run then produces a large file listing the access information for all of the data objects.
“Knowing how each data object is accessed during execution helps us decide in the optimization step where those have to be allocated in the different memories,” Peña described. “In a simplified view, we associate metrics with the different data objects. Then we count the number of accesses or the number of last level cache misses for each object. From this, we can apply different algorithms for memory allocations to maximize the performance.”
With the access profile and sizes of all data objects, plus the size and type of each memory tier, the placement is formulated as a multiple knapsack problem. Knapsack optimization algorithms attempt to fit the items of the greatest value into a given ‘container’ (or knapsack) with limited capacity. In Peña’s work, the memory objects are the items and the memories in the system are the knapsacks. The output provides guidance for allocating the data objects to the appropriate memories in the hierarchy.
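As a rough illustration of the knapsack framing – not the hmem advisor’s actual algorithm – a greedy variant ranks objects by profiled accesses per byte and fills the fastest tier first. The objects, capacities, and access counts below are hypothetical:

```c
#include <stdio.h>
#include <stdlib.h>

struct object { const char *name; double gib; double accesses; };
struct tier   { const char *name; double free_gib; };

/* sort objects by profiled accesses per GiB, highest first */
static int by_density(const void *a, const void *b)
{
    const struct object *x = a, *y = b;
    double dx = x->accesses / x->gib, dy = y->accesses / y->gib;
    return (dy > dx) - (dy < dx);
}

int main(void)
{
    struct tier tiers[] = {              /* fastest tier first */
        { "DRAM",   96.0 },              /* 96 GB of DDR4 */
        { "PMEM", 1536.0 },              /* 1.5 TB of Optane PMEM */
    };
    struct object objs[] = {             /* hypothetical profiling results */
        { "mesh",        64.0, 9.0e5 },
        { "halo_buf",     2.0, 4.0e6 },
        { "checkpoint", 512.0, 2.0e4 },
    };
    size_t ntiers = sizeof tiers / sizeof tiers[0];
    size_t nobjs  = sizeof objs  / sizeof objs[0];

    qsort(objs, nobjs, sizeof objs[0], by_density);

    /* place each object in the first (fastest) tier with enough room left */
    for (size_t i = 0; i < nobjs; i++) {
        const char *placed = "unplaced";
        for (size_t t = 0; t < ntiers; t++) {
            if (objs[i].gib <= tiers[t].free_gib) {
                tiers[t].free_gib -= objs[i].gib;
                placed = tiers[t].name;
                break;
            }
        }
        printf("%-10s (%6.0f GiB) -> %s\n", objs[i].name, objs[i].gib, placed);
    }
    return 0;
}
```

In this toy run the small, hot halo buffer and the mesh land in DRAM, while the large, rarely touched checkpoint buffer spills to PMEM.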
“After profiling, a script we call the hmem advisor, for heterogeneous memory advisor, parses the large data object profile and generates a distribution list of different objects to the memories,” Peña continued. “Then we can run the application binary without changes and, as regular mallocs are called, we have a runtime library, an interposer, that intercepts these calls to allocate the different data objects to the appropriate memory tier.”
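The interposition mechanism Peña describes can be sketched with the standard LD_PRELOAD technique. The example below demonstrates only the interception: the choose_pmem() rule is a hypothetical placeholder for the hmem advisor’s distribution list, and allocations are not actually redirected, since the article does not detail how the runtime reaches the PMEM tier.

```c
/* Build:  gcc -shared -fPIC -o libhmem_sketch.so hmem_sketch.c -ldl
 * Run:    LD_PRELOAD=./libhmem_sketch.so ./application               */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

static void *(*real_malloc)(size_t);
static unsigned long dram_allocs, pmem_allocs;

/* Hypothetical placement rule standing in for the hmem advisor's
 * per-object distribution list. */
static int choose_pmem(size_t size)
{
    return size >= (1UL << 20);          /* "large" allocations go to PMEM */
}

void *malloc(size_t size)
{
    if (!real_malloc)                    /* look up the libc malloc once */
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

    if (choose_pmem(size))
        pmem_allocs++;                   /* a real runtime would hand this to
                                            a PMEM-aware allocator instead  */
    else
        dram_allocs++;                   /* small, hot objects stay in DRAM */

    return real_malloc(size);            /* sketch: always use the libc heap */
}

__attribute__((destructor))
static void report(void)
{
    fprintf(stderr, "hmem sketch: %lu DRAM-bound, %lu PMEM-bound allocations\n",
            dram_allocs, pmem_allocs);
}
```

Because the library sits in front of libc, the application binary runs unmodified, exactly as described above; a production interposer would also need to handle free() and allocator re-entrancy.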
Currently, Peña’s team’s code runs statically. It makes empirical decisions based on the profiling run, allocating memory objects for the best performance. The next step is to deliver a much more dynamic code base that intelligently accommodates changes at run time and responds to user-specified marking of data.
“Today, the tool focuses on allocating data for optimized performance,” Peña explained. “If there are standard DIMMs, ECC DIMMs, and Intel Optane PMEM modules in a system, we would allocate the most called data to the standard DIMMs, then less frequently demanded data to ECC, and even less demanded objects to NVRAM. But we envision the tool to not only be dynamic at run time but be able to respond to user marks of data as well. For example, if the user wants to ensure data is protected, we will allocate it to ECC instead of standard DRAM. Or, if it has certain access patterns, such as many writes, we will allocate it to regular memory instead of NVRAM, which has slower write speeds.”
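A minimal sketch of that envisioned hint-aware policy might look like the following, where the thresholds, field names, and example objects are purely illustrative assumptions:

```c
#include <stdio.h>

enum tier { TIER_DRAM, TIER_ECC_DRAM, TIER_PMEM };

struct hints {
    int needs_protection;   /* user marked the object as needing protection */
    double write_fraction;  /* writes / total accesses from the profile     */
    double accesses;        /* profiled access count                        */
};

/* Hypothetical policy: user marks override the default frequency ranking. */
static enum tier place(const struct hints *h)
{
    if (h->needs_protection)
        return TIER_ECC_DRAM;   /* protected data goes to ECC DIMMs            */
    if (h->write_fraction > 0.5)
        return TIER_DRAM;       /* write-heavy data avoids slower PMEM writes  */
    if (h->accesses > 1.0e6)
        return TIER_DRAM;       /* most frequently demanded objects            */
    if (h->accesses > 1.0e4)
        return TIER_ECC_DRAM;   /* less frequently demanded objects            */
    return TIER_PMEM;           /* everything else uses the large PMEM tier    */
}

int main(void)
{
    const char *names[] = { "DRAM", "ECC DRAM", "PMEM" };
    struct hints checkpoint = { 1, 0.9, 5.0e3 };   /* protected, write-heavy */
    struct hints scratch    = { 0, 0.7, 2.0e5 };   /* write-heavy workspace  */
    struct hints lookup_tbl = { 0, 0.01, 3.0e2 };  /* cold, read-only table  */

    printf("checkpoint -> %s\n", names[place(&checkpoint)]);
    printf("scratch    -> %s\n", names[place(&scratch)]);
    printf("lookup_tbl -> %s\n", names[place(&lookup_tbl)]);
    return 0;
}
```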
Peña is targeting large HPC workloads, such as LAMMPS and OpenFOAM, but the code will also run on smaller applications, such as the Intel Distribution for HPCG, Lulesh, miniFE, and SNAP. Peña’s team is benchmarking its code against Intel Optane PMEM’s Memory Mode, in which the memory controller manages data placement, presenting the persistent memory as system memory and using DRAM as a cache.
“We are testing our code on most of the applications from the US Department of Energy, like LAMMPS, OpenFOAM, and NWChem. A key goal of this project is to enable large applications to run with high performance on systems with large NVRAM capacities and smaller amounts of DRAM. We are currently seeing performance improvements in many mini-applications, plus up to 18 percent in OpenFOAM and 10 percent on LAMMPS compared to Memory Mode,” Peña concluded.
By using low-power persistent memory alongside a mix of other technologies, HPC architects can deliver exascale systems at lower power than with DRAM alone. Peña’s team’s work will give HPC developers a seamless way to take advantage of those supercomputers, while laying the foundations of a new software ecosystem for heterogeneous memory systems.
Ken Strandberg is a technical storyteller. He writes articles, white papers, seminars, web-based training, video and animation scripts, and technical marketing and interactive collateral for emerging technology companies, Fortune 100 enterprises, and multi-national corporations. Mr. Strandberg’s technology areas include Software, HPC, Industrial Technologies, Design Automation, Networking, Medical Technologies, Semiconductor, and Telecom. He can be reached at ken@catlowcommunications.com.