Achieving Maximum Compute Throughput: PCIe vs. SXM2

Paresh Kharya, Group Manager, Product Marketing, NVIDIA.


The explosion of applications and services powered by artificial intelligence (AI) is rapidly changing how the world converts data into insight, and driving innovation across many industries. Partners Hewlett Packard Enterprise (HPE) and NVIDIA® are committed to providing AI-ready systems that team the world’s most powerful server platforms with the latest cutting-edge GPU technologies, sharing a goal of helping every customer discover the full potential of GPU computing.

AI systems rely on large datasets and massively parallel processing power to simulate human intelligence and solve highly complex problems. These techniques are continuously evolving as computers gain greater insights from data, and become able to think and learn in a way that is similar to how the human brain builds knowledge from life experiences. Computational power and data storage capabilities have advanced to the point where specialized AI techniques like deep learning are now in use every day to drive breakthroughs in areas like speech recognition, computer vision, and predictive analytics.

As deep neural networks (DNNs) evolve and become even better at recognizing patterns in massive amounts of data, traditional CPU technologies are unsuited to these increasing computational demands. Deep learning, high performance computing (HPC), and graphics require massively parallel computation, and only GPU computing provides the power needed to fuel these workloads. GPUs are also highly effective at quickly training deep learning models using much larger training sets and a fraction of the compute infrastructure.

When these trained models are deployed in production, processing speed becomes critical, because users expect fast responses and an intuitive experience. That responsiveness must be paired with massive throughput to keep pace with the rapidly growing volume of requests to AI-powered services like visual search, personalized recommendations, and automated customer service.

Comparing PCIe and SXM2

Internal data buses, which carry data and operations between a computer’s internal components, can have a considerable effect on a system’s overall throughput. PCI Express (PCIe) is the current high-speed serial computer expansion bus standard, designed to provide lower latency and higher data transfer rates than older buses such as PCI and PCI-X. Each device connected via PCIe lanes has its own dedicated point-to-point connection, so devices do not compete for bandwidth on a shared bus.

PCIe is sufficient for the needs of many modern computing systems, and is the most commonly used standard across the industry. Numerous devices in a system (including GPUs) are connected using PCIe lanes, and for some GPU set-ups, using these available lanes provides sufficient bandwidth. However, for high-end GPUs and multi-GPU systems running massively parallel functions and moving large amounts of data, the PCIe bus can quickly become overloaded and cause a performance bottleneck.

To address these issues with PCIe, NVIDIA introduced the SXM2 mezzanine connector. SXM2 allows the GPU to use NVIDIA® NVLink™, NVIDIA’s high-bandwidth, energy-efficient interconnect that enables ultra-fast CPU-to-GPU and GPU-to-GPU communication. HPE’s servers take advantage of the GPU-to-GPU communication. The SXM2 design allows GPUs to operate beyond the restrictions of the PCIe bus so they can reach their maximum performance potential.

Maximum throughput for superior application performance

NVLink provides a high-speed path between GPUs, allowing them to communicate at peak data rates of 300 gigabytes per second (GB/s), about 10X faster than PCIe. Unlike PCIe, where devices share a central hub to communicate, NVLink gives each GPU multiple paths to choose from, forming a mesh that enables the highest-bandwidth data interchange between GPUs. This significantly speeds up applications and delivers greater raw compute performance than the same GPUs connected over PCIe. Because the total bandwidth between the GPU and other parts of the system increases, data flow and total system throughput improve, which in turn enhances workload performance.
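The 300 GB/s and 10X figures can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is illustrative only and rests on assumptions the article does not state: that the PCIe baseline is a PCIe 3.0 x16 slot (8 GT/s per lane, 128b/130b encoding), and that the NVLink figure is the bidirectional aggregate of six links at 25 GB/s per direction on a Tesla V100. The function names are hypothetical helpers, not part of any NVIDIA or HPE tool.

```python
# Back-of-the-envelope comparison of peak interconnect bandwidth.
# Assumptions (not stated in the article): PCIe 3.0 x16 as the baseline,
# and six NVLink links at 25 GB/s per direction each on a Tesla V100.


def pcie3_bandwidth_gbs(lanes: int = 16) -> float:
    """Peak one-direction PCIe 3.0 bandwidth in GB/s.

    Each lane signals at 8 GT/s with 128b/130b encoding; divide by 8
    to convert bits to bytes.
    """
    return 8.0 * lanes * (128 / 130) / 8


def nvlink_bandwidth_gbs(links: int = 6, per_link_gbs: float = 25.0) -> float:
    """Peak one-direction NVLink bandwidth in GB/s summed over all links."""
    return links * per_link_gbs


def transfer_seconds(gigabytes: float, bandwidth_gbs: float) -> float:
    """Idealized time to move a payload at peak bandwidth (ignores overhead)."""
    return gigabytes / bandwidth_gbs


if __name__ == "__main__":
    pcie_bidir = 2 * pcie3_bandwidth_gbs()
    nvlink_bidir = 2 * nvlink_bandwidth_gbs()
    print(f"PCIe 3.0 x16, bidirectional: {pcie_bidir:.1f} GB/s")
    print(f"NVLink, 6 links, bidirectional: {nvlink_bidir:.1f} GB/s")
    print(f"Ratio: {nvlink_bidir / pcie_bidir:.1f}x")
```

Under these assumptions the ratio comes out a little under 10, which is consistent with the rounded 10X figure. These are peak theoretical rates; measured throughput in a real system will be lower on both interconnects.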

The NVIDIA® Tesla® V100 accelerator is built for HPC and deep learning, and is based on NVIDIA’s new Volta architecture. The Tesla V100 is available both as a traditional GPU accelerator board for PCIe-based servers and as an SXM2 module for NVLink-optimized servers. The PCIe format lets HPC data centers deploy the most advanced GPUs in PCIe-based nodes and support a mix of CPU and GPU workloads on their existing hardware. NVLink-optimized servers provide the best performance and strong scaling for hyperscale and HPC data centers running applications that scale to multiple GPUs, such as deep learning.

On select HPE ProLiant servers, HPE supports computational accelerator modules based on NVIDIA’s new Volta architecture. The NVIDIA V100 SXM2 accelerator is currently supported in certain HPE ProLiant DL-series, ML-series, and SL-series servers, as well as in systems such as the HPE Apollo sx40 and the HPE SGI 8600. The HPE SGI 8600 powers TSUBAME at the Tokyo Institute of Technology, the most power-efficient supercomputer in the world.

NVIDIA and HPE are committed to delivering a portfolio of GPU-optimized systems that are designed to bring deep learning to every customer. To learn more about the latest breakthroughs in GPU computing and deep learning, please follow me on Twitter at @pareshkharya. You can also visit @HPE_HPC and @NVIDIADC for more news and updates on HPC and GPU acceleration.
