Over the last year, particularly following Intel’s acquisition of FPGA maker Altera, field programmable gate arrays (FPGAs) have risen to the fore as a potential accelerator answer to the performance and power walls facing a much wider breadth of applications. Xilinx, Altera, and a range of smaller players in the space are now pointing to machine learning, on-the-fly processing of analytical workloads, and other emerging areas as places to extend their reach.
At the core of this, at least for the non-specialist in programmable hardware, is OpenCL, a framework that brings to FPGAs the programming ease already familiar from other accelerators, including GPUs. While it is generally agreed that this added layer of abstraction costs some performance compared with programming the device directly, it does mean that scientists and other end users might be more willing to give FPGAs a second look beyond the areas where they are already used, often as an alternative to an expensive custom ASIC.
This story is playing out at the high energy physics research center CERN, which has historically kept an open mind about adopting and integrating diverse accelerator, memory, storage, and other technologies, as long as experiments can be captured and processed faster and with less power. On the processing and acceleration front, the center uses standard x86 processors, with the possibility of more ARM parts in the future, as well as GPUs. Custom ASICs and FPGAs have also been in use at various CERN sites for a range of monitoring, signal processing, and networking tasks.
For the curious, a 2008 paper sets forth the many ways FPGAs have been used at the center. With coming technology refreshes and upgrades at CERN research sites, however, one might expect a new set of use cases for FPGAs, specifically as processors for major portions of CERN workloads. With OpenCL as a gateway, the possibilities for non-FPGA experts have opened wide, making FPGAs a source of interest not just for traditional uses at CERN, or even for workload acceleration, but as key processing elements for vital segments of major experiments, including the LHCb experiment at the Large Hadron Collider.
Following its upgrade, this experiment will draw on 500 data sources, each producing data at 100 Gbps. This presents challenges on both the data acquisition and the algorithm acceleration fronts that put even state-of-the-art FPGAs to the test, according to Srikanth Sridharan, a Marie Curie Fellow at CERN. His team is looking at how to use OpenCL for FPGAs in ways that go beyond acceleration, while also using OpenCL to show that it can be adopted by the larger community of domain scientists who have little time to dig into the complexities of hardware description languages and techniques.
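As a rough back-of-the-envelope figure, and assuming all 500 sources stream at their full 100 Gbps rate simultaneously, the aggregate input bandwidth works out as follows:

```latex
B_{\text{total}} = N_{\text{sources}} \times B_{\text{source}}
                 = 500 \times 100~\text{Gb/s}
                 = 5 \times 10^{4}~\text{Gb/s}
                 = 50~\text{Tb/s}
                 = \frac{50}{8}~\text{TB/s} = 6.25~\text{TB/s}
```

The byte figure simply divides by 8 bits per byte; real link overheads such as line encoding and protocol framing are ignored here.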
Sridharan has spent much of his career working with FPGAs, both in industry at companies like Qualcomm and at large research hubs like the NSF Center for High Performance Reconfigurable Computing (CHREC). Now at CERN, he is turning his attention to the role FPGAs might play in high energy physics, although not in the ways one might expect. While his team is working on accelerating various algorithms (the area where a great deal of research into GPUs and other accelerators tends to fit), his most recent work lies in a new idea: using FPGAs not as accelerators, but as low-power, high-performance processors for data acquisition systems.
“The idea behind implementing a data acquisition (DAQ) system was to explore the possibility of using OpenCL for more than just acceleration. Many of the design elements needed to realize a DAQ system in OpenCL already exist, mostly as FPGA vendor extensions, some of which are going into subsequent versions of the OpenCL specification. However, a small number of elements are missing, preventing full realization of a complete DAQ system, but since these elements have simple, feasible solutions, they could also be implemented if the FPGA tool vendors so desire.”
FPGAs are not a new addition to the data acquisition (DAQ) platform at CERN, but using OpenCL to program FPGAs for more than acceleration is new. Sridharan says FPGAs were initially used in this part of the experiment workload to “collate the streaming data coming off the front end electronics over multiple channels.” He says they can also be used in the “low level trigger system where the acquired data needs to be quickly processed to arrive at trigger decisions.” He adds that the custom nature of this work and the need to operate in high-radiation environments “make any other technology unsuitable for these purposes, and ASICs are suitable only for high volume production and are unviable for these applications due to prohibitive costs.”
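To make the idea of collating streaming data from multiple channels more concrete, here is a minimal sketch in Python. It is not CERN’s or the LHCb team’s implementation. It assumes a hypothetical fragment format (`Fragment`, carrying an `event_id` and a `payload`) and a hypothetical `collate` function that services each input channel in round-robin order, holding fragments until every channel has delivered its piece of a given event. A real FPGA-based DAQ design would do this in hardware on fixed-width data words, but the sketch shows the basic bookkeeping involved.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment format: illustrative only, not an LHCb data format."""
    event_id: int  # event this fragment belongs to
    payload: int   # stand-in for the channel's data word(s)


def collate(channels: Sequence[Sequence[Fragment]]) -> Dict[int, List[int]]:
    """Merge per-channel fragment streams into complete events.

    An event is emitted only once a fragment for it has arrived on every
    channel; its payloads are ordered by channel index. Events still missing
    a fragment from some channel are left out of the result.
    """
    n = len(channels)
    pending: Dict[int, Dict[int, int]] = {}  # event_id -> {channel: payload}
    complete: Dict[int, List[int]] = {}
    longest = max((len(ch) for ch in channels), default=0)
    # Visit the streams round-robin, one fragment per channel per step,
    # loosely mimicking logic that services each input link in turn.
    for step in range(longest):
        for ch_idx, stream in enumerate(channels):
            if step >= len(stream):
                continue  # this channel has no more fragments
            frag = stream[step]
            slots = pending.setdefault(frag.event_id, {})
            slots[ch_idx] = frag.payload
            if len(slots) == n:  # every channel has reported this event
                complete[frag.event_id] = [slots[i] for i in range(n)]
                del pending[frag.event_id]
    return complete


if __name__ == "__main__":
    ch0 = [Fragment(1, 10), Fragment(2, 20), Fragment(3, 30)]
    ch1 = [Fragment(2, 21), Fragment(1, 11)]
    events = collate([ch0, ch1])
    print(sorted(events.items()))  # prints [(1, [10, 11]), (2, [20, 21])]
```

Here fragments for events 1 and 2 arrive in different orders on the two channels, yet both events are assembled, while event 3, seen only on channel 0, stays pending. Keying on an event identifier rather than on arrival position lets channels deliver fragments in different orders; the cost is buffering for events that are still incomplete.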
While the DAQ tests showed OpenCL on FPGAs to be viable, apart from the few missing pieces he notes above (none of which should prevent future use), algorithm acceleration on FPGAs using the Altera compiler for OpenCL produced mixed results. In one test, FPGAs performed far better and far more efficiently than the GPUs used for this part of the task; in the other, they performed significantly worse. This could be due in part to a lack of thorough optimization, Sridharan says, but the team will continue to investigate.
Overall, he says, an optimized FPGA implementation on the algorithm acceleration side would give a clearer picture, but ultimately, “it cannot be denied that OpenCL makes exploiting FPGAs for acceleration as easy as exploiting GPUs. That is a long way from the days of painstaking efforts to create a cycle accurate HDL design, functionally verifying it, debugging the design errors, and fixing the timing violations to realize a working system.” Further, he notes that where optimization work for FPGAs has been done, the performance-per-watt story is an attractive one for CERN. “Extracting more parallelism from the algorithm, creating an FPGA optimized implementation, investigating the huge drop in performance for some kernels, and also accurate power profiling of the design could be the direction of future work.”
“It remains to be seen how such a system would perform compared to a custom implementation in VHDL/Verilog, but there definitely exists a case for OpenCL in this application due to the massive productivity gain and ease of use it offers,” he says. He adds that because OpenCL is more widely accessible, non-FPGA experts can design, debug, and maintain the code.
A description and classification of the test results for both the DAQ and algorithm acceleration workloads can be found here.