Much has been written about the potential for FPGAs to take a leading role in accelerating deep learning. In practice, though, the hurdles between concept and high-performance hardware design are still taller than many AI shops are willing to scale. That is particularly true when GPUs dominate training and, in a pinch, standard CPUs will do just fine for datacenter inference, since they involve little developer overhead.
Still, Xilinx and its competitor Intel, with its Altera assets, are working to make deep learning on FPGAs easier with a variety of techniques that reproduce key elements of deep learning workflows. The focus is on inference specifically, since that is where the energy efficiency and performance story is clearest.
For instance, the Programmable Solutions Group at Intel, where the Altera teams were integrated post-acquisition, has just developed an FPGA overlay for deep learning inference that demonstrates some respectable results on an Arria 10 1150 FPGA.
“Intel’s DLA (deep learning accelerator) is a software-programmable hardware overlay on FPGAs to realize the ease of use of software programmability and the efficiency of custom hardware designs.”
The team explains that on the hardware side of DLA, they have split the configurable parameters in two. Some are set at runtime, so the overlay can switch quickly between different neural network frameworks, while others are tuned at compile time for performance. This is done through a VLIW network that Intel says delivers “full reprogrammability without incurring any performance or efficiency overhead.” This is quite a claim, since overlays tend to come with serious overhead.
Another interesting angle to what Intel has developed is the Xbar interconnect, which connects all the core functions the neural net requires. The team says this means the overlay does not have to include every possible function at runtime. Instead, it is possible to pick from a suite of pre-optimized kernels for the select frameworks DLA uses.
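To make the kernel-selection idea concrete, here is a minimal Python sketch, not Intel's implementation. The kernel names and the `select_kernels` helper are hypothetical and chosen only for illustration; the point is that an overlay build would carry only the kernels a given network actually uses, rather than every function in the library.

```python
# Hypothetical library of pre-optimized kernels; the names are illustrative only.
KERNEL_LIBRARY = frozenset({"conv", "relu", "pool", "lrn", "softmax", "lstm_gate"})

def select_kernels(network_ops, library=KERNEL_LIBRARY):
    """Return the sorted set of kernels a network needs from the library.

    Raises KeyError if the network uses an operation the library lacks.
    """
    needed = set(network_ops)
    missing = needed - library
    if missing:
        raise KeyError(f"no pre-optimized kernel for: {sorted(missing)}")
    return sorted(needed)

# A small CNN needs only three of the six kernels in the library.
cnn_kernels = select_kernels(["conv", "relu", "conv", "pool"])
```

In this toy version the selected list stands in for the functions that would be attached to the interconnect; everything else is left out of the build.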
In addition to the hardware in DLA, the team has also developed a graph compiler, which breaks down a neural net into subgraphs, schedules subgraph execution, and allocates explicit cache buffers to optimize the use of a custom-developed stream buffer and filter caches. These slicing, scheduling, and allocation passes allow for more efficient hardware implementations of target frameworks.
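The three compiler passes described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the DLA compiler: the network is treated as a simple chain of layers, the `Layer` type, byte sizes, and buffer budget are hypothetical, and slicing is a greedy grouping that closes a subgraph whenever the next layer's output would overflow the on-chip buffer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    name: str
    out_bytes: int  # size of the layer's output (hypothetical figure)

def slice_subgraphs(layers, budget):
    """Slicing pass: group consecutive layers so each group's outputs fit in the buffer."""
    groups, current, used = [], [], 0
    for layer in layers:
        if layer.out_bytes > budget:
            raise ValueError(f"{layer.name} does not fit in the buffer")
        if current and used + layer.out_bytes > budget:
            groups.append(current)      # close the current subgraph
            current, used = [], 0
        current.append(layer)
        used += layer.out_bytes
    if current:
        groups.append(current)
    return groups

def allocate(group):
    """Allocation pass: give each layer's output an explicit byte offset in the buffer."""
    offsets, cursor = {}, 0
    for layer in group:
        offsets[layer.name] = cursor
        cursor += layer.out_bytes
    return offsets

def compile_graph(layers, budget):
    """Scheduling pass: run subgraphs in order, each with its own buffer layout."""
    return [
        (step, [layer.name for layer in group], allocate(group))
        for step, group in enumerate(slice_subgraphs(layers, budget))
    ]

net = [Layer("conv1", 400), Layer("conv2", 300), Layer("pool", 200), Layer("fc", 500)]
plan = compile_graph(net, budget=800)
```

With an 800-byte budget, the chain splits into two subgraphs, and each layer's output gets a fixed offset within the buffer for the step in which it runs. A real compiler would also have to handle branching graphs and buffer reuse, which this sketch ignores.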
The full paper provides far more detail on benchmark results using both the hardware and the software. Takeaways include a measured 3× improvement on ResNet 101 HD, 12× on LSTM cells, and 900 fps on GoogLeNet on Intel’s Arria 10 FPGAs.
The team hopes to further develop DLA to encompass more use-cases such as multi-FPGA deployment and “to implement similar overlays for different application domains such as genomics, packet processing, compression and encryption to further make FPGAs accessible for high-throughput computation.”
More details about Intel’s DLA architecture, as well as benchmark results, can be found in the full paper.