Speeds and feeds are great, but hardware is only as useful as the software that can harness it, and, for AMD, that’s the ROCm software stack.
If you’re not familiar, ROCm is to AMD’s Instinct GPU accelerators what CUDA is to Nvidia GPUs. The respective software suites are essential to tapping into each hardware platform’s full potential, and ROCm is a particularly big one, encompassing everything from drivers and development tools to the libraries and APIs necessary for running compute workloads on AMD kit.
This week, the House of Zen unleashed a rather sizable upgrade to the open source software stack with ROCm 6.3, which promises a substantial performance uplift not only for everyone’s favorite buzzword, AI, but also for some good old-fashioned Fortran and fast Fourier transform (FFT) workloads.
ROCm 6.3 Embraces SGLang, Flash-Attention-2
This week’s release brings with it a number of notable improvements to the platform including support for SGLang, an open source model runner similar to vLLM.
Citing a research paper published late this spring, AMD says SGLang achieved more than a 6x increase in throughput over vLLM while cutting latency by up to 3.7x. These performance gains are the result of “KV cache reuse, the exploitation of parallelism within a single program, and faster constrained decoding,” the researchers explained.
They also note that the largest speed-ups were seen with shorter outputs – a scenario we imagine would be common for a chatbot – because the key-value (KV) cache could be reused more readily. Longer outputs, on the other hand, saw nearly zero advantage.
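To get a feel for why shared prefixes matter, here’s a deliberately simplified Python sketch of prefix-based KV cache reuse. It’s illustrative only: the real engine (the SGLang team calls the technique RadixAttention) stores per-token key and value tensors in a radix tree rather than the flat dictionary used here, and the class and function names below are our own invention.

```python
# Deliberately simplified stand-in for prefix-based KV cache reuse. The real
# engine stores per-token key/value tensors in a radix tree; here a flat dict
# of token-prefix tuples is enough to show why shared prefixes save work.

class PrefixKVCache:
    def __init__(self):
        self._cache = {}  # token-prefix tuple -> cached "KV state" placeholder

    def longest_cached_prefix(self, tokens):
        """Length of the longest prompt prefix we've already computed KV for."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._cache:
                return n
        return 0

    def prefill(self, tokens):
        """'Compute' KV for a prompt, skipping tokens covered by a cached prefix."""
        reused = self.longest_cached_prefix(tokens)
        for n in range(reused + 1, len(tokens) + 1):
            self._cache[tuple(tokens[:n])] = f"kv[0:{n}]"  # stand-in for real tensors
        return reused, len(tokens) - reused  # (tokens reused, tokens newly computed)


cache = PrefixKVCache()
system_prompt = [101, 7592, 2088]            # shared system prompt, as token IDs
req_a = system_prompt + [2054, 2003, 1029]   # first user question
req_b = system_prompt + [3835, 2154, 999]    # different question, same prefix

print(cache.prefill(req_a))  # (0, 6) -- nothing cached yet, compute all six tokens
print(cache.prefill(req_b))  # (3, 3) -- the system prompt's KV entries are reused
```

In this toy, the second request only has to compute KV entries for its own question tokens; the shared system prompt comes straight from the cache, which is the effect that pays off most when prompts share long prefixes and outputs are short.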
Along with better performance for AI inference workloads, ROCm 6.3 also extends support for a re-engineered implementation of Flash-Attention-2 that uses AMD’s Composable Kernel (CK) library on the backend.
Flash-Attention was originally developed as a way of keeping memory consumption from ballooning at longer sequence or context lengths. Without it, memory consumption scales quadratically with sequence length. Flash-Attention overcame this by exploiting the asymmetry between a GPU’s small pool of fast on-chip memory and its larger but slower HBM, achieving linear scaling instead.
The downside was that, while memory efficient, the original approach left compute performance on the table, achieving only 25 to 40 percent of peak FLOPS. With Flash-Attention-2, researchers were able to push that utilization to between 50 and 73 percent.
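For the curious, the core trick can be sketched in a few lines of numpy: process the keys and values in tiles and keep a running (“online”) softmax, so the full N x N score matrix never has to exist in memory. This is a toy rendering of the algorithm’s general shape, not of the Composable Kernel implementation ROCm actually ships.

```python
import numpy as np

def naive_attention(q, k, v):
    # Materializes the full N x N score matrix: memory grows quadratically with N.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, tile=128):
    # Processes keys/values in tiles with a running softmax, so peak extra memory
    # is O(N * tile) rather than O(N^2).
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)
    row_max = np.full(q.shape[0], -np.inf)       # running max of scores per query
    row_sum = np.zeros(q.shape[0])               # running softmax denominator
    for start in range(0, k.shape[0], tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        s = (q @ kt.T) * scale                   # only N x tile scores at a time
        new_max = np.maximum(row_max, s.max(axis=-1))
        correction = np.exp(row_max - new_max)   # rescale earlier partial results
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ vt
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v))
```

The tiled version produces the same result as the naive one while only ever holding an N x tile slice of the score matrix, which is where the linear memory scaling comes from.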
According to AMD, optimizations for Flash-Attention-2 in ROCm 6.3 promise a 3x performance improvement for backward-pass operations, like you’d see in training and fine-tuning, along with a more efficient forward pass compared to Flash-Attention-1. How much more efficient that forward pass is, AMD doesn’t say.
Improved Fortran Compilers And Multi-Node Fast Fourier Transforms
ROCm 6.3 also delivers a few improvements for those using Instinct for something other than AI. Among them is a new Fortran compiler that supports direct GPU offloading via OpenMP.
Nearly seven decades after its commercial release, there’s still a lot of Fortran code in the wild. This poses something of a challenge, as the number of folks with a low-level understanding of the programming language continues to dwindle, as do the pipelines for learning it.
The new compiler is backward compatible with existing Fortran code, so, in theory, leveraging GPU acceleration on Instinct shouldn’t require much refactoring.
Along with the new compiler, AMD is also rolling out support for multi-node FFTs in rocFFT.
FFTs are commonly employed in all manner of HPC applications, including oil and natural gas exploration, climate modeling, and other scientific research. With the release of ROCm 6.3, AMD is enabling FFT workloads to be distributed across multiple Instinct accelerators using the Message Passing Interface (MPI), allowing larger datasets to be processed on compressed timescales.
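The general pattern behind a distributed FFT looks something like the mpi4py-and-numpy sketch below: each rank transforms the axis it holds locally, the ranks perform an all-to-all exchange to transpose the data, and each then transforms the remaining axis. To be clear, this is a conceptual illustration of slab decomposition using host-side numpy arrays, not rocFFT’s actual multi-node API, and the script name in the comment is made up.

```python
# Conceptual sketch of a slab-decomposed, multi-rank 2D FFT using mpi4py and
# numpy. It shows the general MPI pattern (local 1D FFTs plus a global
# transpose); it is not rocFFT's multi-node API. Run with something like:
#   mpirun -np 4 python dist_fft_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

N = 512                                    # global grid is N x N; assumes N % nranks == 0
rows, cols = N // nranks, N // nranks
local = np.random.rand(rows, N) + 0j       # this rank's horizontal slab of the input

# Step 1: FFT along the axis that is entirely local to this rank.
local = np.fft.fft(local, axis=1)

# Step 2: global transpose via all-to-all, so the other axis becomes local.
send = np.ascontiguousarray(local.reshape(rows, nranks, cols).transpose(1, 0, 2))
recv = np.empty_like(send)
comm.Alltoall(send, recv)
col_slab = recv.reshape(N, cols)           # this rank now owns N-row column slabs

# Step 3: FFT along the formerly distributed axis, now local.
col_slab = np.fft.fft(col_slab, axis=0)

if rank == 0:
    print(f"rank 0 holds a {col_slab.shape} slab of the {N}x{N} transformed grid")
```

In the real thing, the slabs would live in GPU memory on the Instinct accelerators, with the library orchestrating the local transforms and the MPI exchanges between nodes.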
And, while perhaps not as sexy as AI inference or even distributed FFTs, ROCm 6.3 also introduces a number of enhancements for those running computer vision workloads. These include updates to the rocDecode, rocJPEG, and rocAL libraries, which add support for the royalty-free AV1 video codec, GPU-accelerated JPEG decoding, and audio preprocessing, respectively.
The Gateway To Performance Is Software
The release of ROCm 6.3 continues to underscore just how big a difference optimized software can make. However, this isn’t unique to AMD.
In MLPerf training, Nvidia has previously demonstrated a 1.3X performance improvement over its initial “Hopper” H100 submissions, thanks in part to software optimizations. With that said, these kinds of software optimizations are arguably even more important for AMD, which trails its rival in both market share and awareness.
Thus far, AMD has managed to unlock sizable gains through its ROCm stack since the launch of its MI300-series parts back in December. At its Advancing AI event in October, the company boasted that, between ROCm 6.0 and the 6.2 release this summer, it had squeezed out an additional 2.4X improvement in inference performance and a 1.8X improvement in training.
Continued gains like this could help the House of Zen pull further ahead of Nvidia’s H100 and H200 in both categories. As it stands, AMD says its newly announced “Antares” MI325X accelerators will offer about 1.4X better latency than the H200 when running Llama 3.1 405B at FP8 precision. However, training performance on the smaller Llama 2 70B is only on par with the HBM3E-boosted Hopper part. That’s despite the fact that the MI300X and its bandwidth-boosted sibling, the MI325X, have not only more but also faster memory, as well as higher floating point performance, than either of Nvidia’s Hopper chips. This suggests that future ROCm releases could unlock additional performance gains – something AMD will certainly need to compete with Nvidia’s “Blackwell” B100 and B200 chips due out this quarter.
Glad to see that Fortran compiler in there! It harks back (if I understand correctly) to the language’s design decision to store matrices and higher-dimensional arrays in column-major (colexicographic) order, whereas C uses row-major (lexicographic) order. That basic difference makes performance-optimized computational libraries developed for one language (to improve cache usage and to partition work across cores in a way that minimizes inter-core communication) roughly the “inverse” of what they are in the other, and generally incompatible (it mangles linear indexing).
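A quick numpy toy (illustrative only) shows what I mean, with the same element sitting at a different linear offset depending on the layout:

```python
# Toy illustration of row-major (C) vs column-major (Fortran) storage in numpy.
import numpy as np

a = np.arange(6).reshape(2, 3)       # [[0, 1, 2], [3, 4, 5]]

row_major = np.ravel(a, order='C')   # C order: walk each row in turn    -> [0 1 2 3 4 5]
col_major = np.ravel(a, order='F')   # Fortran order: walk each column   -> [0 3 1 4 2 5]

# The same element a[i, j] lives at a different linear offset in each layout.
i, j = 0, 2                          # a[0, 2] == 2
print(row_major[i * 3 + j])          # offset 2 = i*ncols + j  -> prints 2
print(col_major[i + j * 2])          # offset 4 = i + j*nrows  -> prints 2
```

Hand a column-major array to a routine that assumes row-major indexing and it effectively sees the transpose, which is exactly the incompatibility I was getting at.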
Porting SGLang and Flash-Attention-2 to ROCm 6.3 and Instinct devices should also prove invaluable given the performance uplifts they gave to the A100s on which they were initially developed and tested. Overall, this is a great software update by AMD IMHO!