OpenCL Optimizations Make Case for FPGAs in HPC

Nicole Hemsoth Prickett

7 years ago

The use of FPGAs in HPC is limited less by the capabilities of current hardware and more by the challenges in programming them without sacrificing performance.

Recent work from Boston University has shown that, with key optimizations, an OpenCL implementation of the 3D fast Fourier transform (FFT), a common HPC workload, on Arria 10 FPGAs can outperform FFT-specific IP cores as well as GPU and CPU implementations of the same problem.

The team has implemented a broadside Radix-2 FFT module on a single FPGA. A set of code structure optimizations lets the compiler take advantage of the inherent flexibility of the FPGA to discover all available parallelism. The compiler then hands the code off to HDL, where the team can isolate the compute pipelines from the rest of the OpenCL system to address any limitations in the language that prevent the data structures from being fully realized. This logic is then moved into the existing 3D FFT shells that were originally built for IP cores.
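For readers unfamiliar with the term, the following is a minimal software sketch of a radix-2 FFT, which splits a transform of length n into two transforms of length n/2 and recombines them with "butterfly" operations. It is a textbook recursive formulation in Python, not the team's OpenCL kernel or its FPGA pipeline; the function names `fft_radix2` and `dft_naive` are illustrative assumptions.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT (textbook form).

    Assumption: len(x) is a power of two. This is an illustration only,
    not the pipelined hardware design described in the article.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]

    even = fft_radix2(x[0::2])  # transform of even-indexed samples
    odd = fft_radix2(x[1::2])   # transform of odd-indexed samples

    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine one even and one odd term using a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft_naive(x):
    """Direct O(n^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

The butterflies within one stage are independent of each other, which is the kind of parallelism that can be unrolled into hardware pipelines.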

The research team uses Altera OpenCL for its work, which has some limitations that the team has tried to address with its OpenCL-HDL approach. They avoid using a full OpenCL-generated system because of language compatibility issues, the overhead of the board support package, and the problem of porting legacy codes from RTL to OpenCL. The team also notes that partial reconfiguration limits the board area where new logic can be implemented, which makes it tough to handle larger designs.

A set of ping-pong buffers is used to source and sink vectors. The initial grid is loaded into Buffer 1, while the final result can be read from Buffer 2. By providing a variable data path for writes to both buffers using MUXes, the 3D FFT shell can be used as an intermediate stage for a number of different applications.
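The ping-pong idea can be sketched in software as two buffers that swap roles after every pass: one is read while the other is written. The class below is a hypothetical Python model, not the team's hardware shell; the name `PingPong` and its methods are assumptions made for illustration. With three passes (one per dimension of a 3D FFT), data that starts in Buffer 1 ends up in Buffer 2, which matches the description above.

```python
class PingPong:
    """Toy model of a two-buffer (ping-pong) scheme.

    Assumption: each stage maps a list to a list of the same length.
    Index 0 plays the role of Buffer 1 and index 1 the role of Buffer 2.
    """

    def __init__(self, data):
        self.buffers = [list(data), [None] * len(data)]
        self.src = 0  # index of the buffer currently being read

    def run_stage(self, stage):
        """Read from the source buffer, write to the other, then swap roles."""
        dst = 1 - self.src
        self.buffers[dst][:] = stage(self.buffers[self.src])
        self.src = dst

    def result(self):
        """Return the buffer that holds the most recent output."""
        return self.buffers[self.src]


# Usage: three passes stand in for the three directional transforms.
pp = PingPong([1, 2, 3])
for _ in range(3):
    pp.run_stage(lambda v: [2 * e for e in v])
```

In hardware the swap is a change of read and write paths rather than a copy, which is where the multiplexers on the write side come in.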

“Both OpenCL-HDL and IP core based designs can utilize the same shell for performing a 3D FFT. Consequently, the former can be seamlessly integrated into existing logic initially designed for the latter. Since FFT is a linear operation, a high dimensional FFT can be broken down into a series of 1D directional transforms in any order.”

“In order to avoid constructing complex and expensive memory structures that can stream data every cycle in all dimensions, a transpose is performed on the overall directional FFT result to reorder data within buffers. This reordering rotates the grid and allows a different directional 1D FFTs to be performed for the same data access pattern.”
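The two quoted points, that a 3D FFT decomposes into 1D transforms and that a transpose lets every pass use the same access pattern, can be illustrated with a short Python sketch. It always transforms along the last axis and then rotates the grid so that the next axis moves into that position. The helper names (`dft_1d`, `transform_last_axis`, `rotate`, `fft_3d`) are assumptions for illustration, and a naive O(n^2) 1D DFT stands in for the team's radix-2 pipelines to keep the example short.

```python
import cmath


def dft_1d(v):
    """Naive 1D DFT; a stand-in for a real FFT pipeline."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transform_last_axis(grid):
    """Apply the 1D transform to every innermost row (same access pattern each pass)."""
    return [[dft_1d(row) for row in plane] for plane in grid]


def rotate(grid):
    """Transpose step: the element at [a][b][c] moves to [c][a][b]."""
    A, B, C = len(grid), len(grid[0]), len(grid[0][0])
    return [
        [[grid[a][b][c] for b in range(B)] for a in range(A)]
        for c in range(C)
    ]


def fft_3d(grid):
    """3D DFT as three (transform last axis, rotate) passes.

    After three rotations the axes are back in their original order,
    so the output is indexed [kx][ky][kz].
    """
    for _ in range(3):
        grid = rotate(transform_last_axis(grid))
    return grid
```

Because only the innermost axis is ever streamed, the memory structure needs to deliver data efficiently in one direction only; the cost is the reordering performed between passes.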

In this work, the OpenCL code is compiled using Altera OpenCL SDK v16.0.2. The IP core used is the Altera FFT IP core. The CPU code runs on a fourteen-core 2.4 GHz Intel Xeon E5-2680v4 with the ICC compiler and MKL DFTI. The GPU used is an NVIDIA Tesla P100 PCIe 12GB, which has 3584 CUDA cores and a peak bandwidth of 549 GB/s. The GPU FFT code is written using the cuFFT library and compiled with CUDA 8.0.

Execution time for 3D FFT implementations.

Results above show that the average speedup achieved is 29x vs. CPU-MKL, 4.1x vs. GPU cuFFT, and 1.1x vs. the IP core FFT.

Full details about the optimizations can be found here.