Many have waited years to hear someone like Prahlad Venkatapuram, Senior Director of Engineering at Meta, say what came out this week at the RISC-V Summit:
“We’ve identified that RISC-V is the way to go for us moving forward for all the products we have in the roadmap. That includes not just next-generation video transcoders but also next-generation inference accelerators and training chips.”
He explained that in the last four years of planning to bring the RISC-V architecture into the fold, Meta has not only rolled out production hardware but also set the stage for future custom RISC-V silicon via a standardized RISC-V-based control system, made scalable so that any IP Meta develops for any domain will be adaptable and connect to the NoC (network-on-chip) with ease.
In other words, Meta has a template to quickly move any such new silicon into production, which is a big deal for those looking for RISC-V success stories at massive scale. All of this comes at a time when high-end GPUs are in short supply, with pricing to match.
Venkatapuram says Meta’s reasoning in looking to RISC-V was driven by a need to accelerate all “the business critical things we couldn’t do on CPUs” and also by “power efficiency, performance, and absolute low latency on the server.” He adds that flexibility to support different workloads and resiliency in the architecture were also paramount.
“Anytime we design or deploy, we expect it to be there for 3-4 years so there has to be resiliency and also programmability—we want to put software in charge of how we use our hardware resources.”
He adds that 64-bit addressing is critical, that both vector and SIMD capabilities in the core are essential, and he stressed the need for deep customization. “It’s apparent RISC-V does all of these things; it’s open, has strong support, there are multiple providers of IP and a growing ecosystem we’ve seen emerge over the last 4-5 years.” But ultimately, customization is key.
The video transcoding hardware Meta built on RISC-V is by itself a good illustration of the customization piece. According to Venkatapuram, Meta’s Scalable Video Processor (MSVP), which is where Meta’s production RISC-V journey began, is in production and handling 100% of video uploads across its Facebook, Messenger and Instagram services. “We were doing this on CPU earlier but we’ve now replaced 85% of those so we’re using only 15% of those CPUs.”
The real story, the one that should make the processor world sit up and take notice, is that Meta is skipping the ubiquitous GPU and building AI inference and training chips on RISC-V. The Facebook giant revealed some of this work, a project that has been ongoing since 2020, in May, as we covered at the time.
For now, the RISC-V AI processors are dedicated to accelerating recommendation models, both in inference and training. The architecture is not unfamiliar: an 8×8 grid of processing elements, each element hosting two RISC-V cores (one scalar, one vector) and a single core for control. The scalar and vector cores sync with a command processor, which works in conjunction with baked-in fixed functions developed at Meta.
Beyond that, not much is known, but we’ll press for more insight, including how much RISC-V silicon Meta has in production.
While all of this is promising, there are some challenges that do give pause, although they do not seem to diminish Venkatapuram’s optimism.
Despite massive customization, Meta still needs more from the existing IP options. “There’s very little in terms of offerings that allow seamless integration of custom instruction and resources into RTP, simulators, software tools and compilers,” he explains. Another challenge is a lack of interoperability among the various vendors, though he did not go into detail. Ultimately, the sense was that the challenges are not insurmountable.
One of the most important hurdles, especially as Meta seeks to build more production AI workloads on RISC-V, is support for matrix extensions. Matrix math is a key component of AI, and while RISC-V has vector extensions, there are no standard extensions for matrices, he explains. He cites work happening on this front by a number of vendors, including Stream Computing and T-Head Semi, but ultimately, whatever they come up with should be standardized.
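To make the matrix extension point concrete, here is a minimal Python sketch of a matrix multiply broken into the vector multiply-accumulate steps a vector-only core would have to issue. It is an illustration only, not Meta's code and not a model of any vendor's instruction set; the vector length `VLEN` and the function names are assumptions made up for this example.

```python
VLEN = 4  # hypothetical: elements handled by one vector instruction


def vector_fma(acc, scalar, vec):
    """Model one vector multiply-accumulate: acc[i] += scalar * vec[i]."""
    return [a + scalar * v for a, v in zip(acc, vec)]


def matmul_with_vectors(a, b):
    """Multiply a (m x k) by b (k x n) using only vector FMA steps.

    Returns the product and the number of vector operations issued.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    ops = 0
    for i in range(m):
        # Walk each output row in strips of at most VLEN elements.
        for j0 in range(0, n, VLEN):
            j1 = min(j0 + VLEN, n)
            acc = c[i][j0:j1]
            for p in range(k):
                acc = vector_fma(acc, a[i][p], b[p][j0:j1])
                ops += 1
            c[i][j0:j1] = acc
    return c, ops


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product, vector_ops = matmul_with_vectors(a, b)
print(product, vector_ops)
```

In this toy model the count of vector operations grows as m × k × ceil(n / VLEN). A matrix instruction would instead operate on a whole tile at once, and the lack of a standard way to express that is the gap Venkatapuram describes.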
Venkatapuram emphasized above all the importance of broader ecosystem support, from support for all the major libraries and tools to the hardware ecosystem:
“RISC-V, due to its open standard nature, has the potential to attract a bigger set of third party providers of tools, software, peripherals, and more than a proprietary ISA but this potential has yet to be realized fully.”