Exascale-class energy efficiency cannot be defined by a single number. Although the Green500 rankings provide a one-shot view of performance and efficiency, the complex interplay between large-system operating systems, real-world applications, and the various tuning capabilities is worth digging into. The problem is that there is not necessarily a transparent view into that interplay.
With so many knobs and bobs in the AMD Zen 2 architecture, which is the CPU basis for a number of pre-exascale and future machines, it’s useful to know which efficiency elements are handled in hardware and which are up to the user to tweak. An HPC team from TU Dresden took a deep dive into the Zen 2 microarchitecture to pick all of this apart and shared some rules of thumb about how to eke maximum energy efficiency out of AMD Epyc processors.
From frequencies to characterizations of Epyc idle states to the accuracy of the processor’s own internal power monitoring mechanism, the team walks through the various pros and cons of the hardware-specific efficiency features. Their tests use a dual-socket system with AMD Epyc 7502s (32 cores) using the 2-channel interleaving mode. While 1.5, 2.2, and 2.5GHz are all available core frequencies, they stuck with the 2.5GHz maximum and the default memory speed (1.6GHz). The OS is Ubuntu Linux 18.04.
In their efficiency analysis, the team uses AMD’s RAPL (Running Average Power Limit) interface, which is new; the company introduced it to replace Application Power Management (APM). The team has a great deal of data on the accuracy of the RAPL readouts, and the results are mixed. As they explain, “RAPL implementation on our system does not correctly represent the impact of data on power consumption, possibly affecting measurement accuracy in workloads with biased data.”
This might not be as black and white as it sounds. They explain that this RAPL implementation could still be used to leak information about the processed data through very small differences in the distribution of power consumption samples. “The results indicate that this is due to indirect effects, e.g., an increased temperature based on the number of set bits. Nevertheless, distinguishing the operand weight from RAPL values on this system would take substantially more samples compared to a physical measurement. Moreover, on our test system, RAPL is not accessible to unprivileged users.”
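For readers who have not worked with RAPL-style counters, it helps to see how a power figure is derived from them. The counters report cumulative energy, so average power comes from the difference between two readings divided by the elapsed time, with a correction if the counter wrapped around in between. The Python sketch below illustrates only that arithmetic; it is not the TU Dresden team's tooling. The function name, the microjoule units, and the `counter_modulus_uj` parameter are assumptions made for illustration.

```python
def average_power_watts(energy_start_uj, energy_end_uj, elapsed_s, counter_modulus_uj):
    """Average power (W) between two cumulative energy readings.

    Hypothetical helper: readings are assumed to be in microjoules, and
    counter_modulus_uj is assumed to be the value at which the counter
    wraps back to zero. At most one wraparound between readings is handled.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")

    delta_uj = energy_end_uj - energy_start_uj
    if delta_uj < 0:
        # The end reading is smaller than the start reading, so the
        # counter wrapped once during the interval.
        delta_uj += counter_modulus_uj

    joules = delta_uj / 1e6
    return joules / elapsed_s


if __name__ == "__main__":
    # 2 J consumed over 2 s, no wraparound.
    print(average_power_watts(1_000_000, 3_000_000, 2.0, 10_000_000))  # 1.0
    # Counter wrapped: 1 J before the wrap plus 1 J after it, over 1 s.
    print(average_power_watts(9_000_000, 1_000_000, 1.0, 10_000_000))  # 2.0
```

Any number computed this way inherits the accuracy limits the team describes, so it is best checked against an external physical measurement where one is available.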
Based on assessments covering P-states, C-states, frequency options (and mixed frequencies), idle states and threads, and use of RAPL to understand all of these factors, the TU Dresden team made a few broad recommendations for maximum energy efficiency. The strongest suggestion has to do with RAPL:
“Energy measurements of AMD’s RAPL implementation should be considered inaccurate. No DRAM domain is provided, and DRAM energy consumption is not (fully) included in the package domain. Therefore, AMD’s RAPL is unsuitable to optimize total energy consumption.”
They note that hardware threads should not be disabled in the operating system because this can “disable package C-states and significantly increase idle power consumption under specific circumstances.” The hardware threads that are being used should run at the lowest frequency; otherwise they can pull up the frequency of other threads on the same core.
Another finding, not relevant to all users, is that one should avoid using mixed frequencies on a single CCX (core complex). “This can lead to performance losses on cores with lower frequency settings.” On that note, they add that keeping tabs on processor frequency is critical, especially since throttling can happen, which is a major problem for HPC codes using 256-bit SIMD instructions.
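The mixed-frequency recommendation can be checked mechanically. A Zen 2 CCX holds up to four cores, so one option is to group the per-core frequency settings by CCX and flag any group with more than one distinct value. The Python sketch below is a minimal illustration, not part of the TU Dresden work. The function name is hypothetical, and the assumption that consecutive core IDs share a CCX should be verified against the actual topology of the machine.

```python
from collections import defaultdict

# A Zen 2 CCX holds up to four cores.
CORES_PER_CCX = 4


def mixed_frequency_ccxs(core_freq_khz, cores_per_ccx=CORES_PER_CCX):
    """Return {ccx_index: sorted distinct frequencies} for CCXs with mixed settings.

    core_freq_khz maps a core ID to its configured frequency in kHz.
    Assumption: core IDs are numbered so that consecutive IDs share a CCX
    (core_id // cores_per_ccx); check this against the real topology.
    """
    freqs_by_ccx = defaultdict(set)
    for core_id, freq in core_freq_khz.items():
        freqs_by_ccx[core_id // cores_per_ccx].add(freq)

    # Keep only CCXs where more than one frequency setting is in use.
    return {
        ccx: sorted(freqs)
        for ccx, freqs in freqs_by_ccx.items()
        if len(freqs) > 1
    }


if __name__ == "__main__":
    settings = {
        0: 2_500_000, 1: 2_500_000, 2: 1_500_000, 3: 2_500_000,  # mixed
        4: 2_200_000, 5: 2_200_000, 6: 2_200_000, 7: 2_200_000,  # uniform
    }
    print(mixed_frequency_ccxs(settings))  # {0: [1500000, 2500000]}
```

On Linux, the per-core settings could be gathered from the cpufreq entries in sysfs before being passed to a check like this one.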
Graphical and other breakdowns of the results can be found here.