What is that famous maxim in computer science about something that doubles on a regular cadence?
Moore’s Law is one answer, but that’s a doubling (in this case, transistors) every two years. Along those lines, we need a new term for machine learning models, which double in complexity every two months. Unlike Moore’s Law, with its historic cost predictability, ML’s Law demands a massive increase in cost with each transition.
Measuring training in the somewhat fuzzy “petaflop-days” unit is a bit odd, but it at least adds color to the main point: In 2018, Google trained BERT, with 340 million parameters, in 9 petaflop-days. In 2019, Microsoft one-upped that with an 11 billion parameter “Megatron” model run in 900 petaflop-days. Last year, they bolstered that, using one trillion parameters, an achievement that took 25,000 petaflop-days to train. Those are leaps of more than 1,000x in under three years, and ML’s Law, unlike its transistor counterpart, shows no sign of slowing.
And this is where the neat package and historically predictable curves of the Moore’s Law comparison end. Because, frankly, training starts looking even more impossible if we drag that line out to 2025. Extrapolate the curve and you end up with an insane number of GPUs to parallelize work across (GPT-3, at 175 billion parameters, already takes roughly four months to train on a thousand GPUs).
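For a sense of scale, here is a rough back-of-envelope in Python. It uses the common 6 × parameters × tokens approximation for training FLOPs and an assumed sustained throughput of roughly 30 teraflops per GPU; both are our illustrative assumptions, not figures from Cerebras or anyone else.

```python
# Back-of-envelope training cost for GPT-3-scale models.
# Assumptions (ours, illustrative): training FLOPs ~= 6 * parameters * tokens,
# and ~30 TFLOPS sustained per GPU in mixed precision.

PFLOP_DAY = 1e15 * 86_400  # one petaflop sustained for a full day, in FLOPs

def petaflop_days(params: float, tokens: float) -> float:
    """Approximate training compute using the common 6*N*D rule of thumb."""
    return 6 * params * tokens / PFLOP_DAY

def training_days(params: float, tokens: float, gpus: int, tflops_per_gpu: float = 30) -> float:
    """Wall-clock days on a GPU cluster, assuming perfect scaling (it never is)."""
    cluster_flops = gpus * tflops_per_gpu * 1e12
    return 6 * params * tokens / cluster_flops / 86_400

# GPT-3: 175 billion parameters, roughly 300 billion training tokens (published figures).
print(f"{petaflop_days(175e9, 300e9):,.0f} petaflop-days")    # ~3,600
print(f"{training_days(175e9, 300e9, gpus=1000):,.0f} days")  # ~120 days, i.e. the four months above
```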
Hardware is the bottleneck. And even if we leave cost out entirely, that problem is intractable without either a fundamentally new type of model, one designed for vast parallelism and instant data movement, or a new way of computing. Quantum isn’t the answer (yet). Massively heterogeneous architectures with thousands of chiplets each dispatching what they do best are a nice middle ground. But from what we can see at TNP, the only thing that might scale with demand is a waferscale approach. And to date, there’s only one company doing it.
In a chat ahead of today’s Hot Chips sessions, Andrew Feldman, Cerebras CEO and founder, tells us the trick now is to take a device that can already handle a huge model and scale it out to many waferscale systems without making users jump through the scalability hoops of a many-node GPU training machine.
Cerebras is adding some noteworthy features that take aim at ML’s Law on both the capacity and scalability fronts. Cost isn’t quite as easy to track, but we’ll get more clarity there as we see where the scalability sweet spot lands for the waferscale startup’s early customers, mostly supercomputing centers, national labs, and drug discovery shops.
The big news is that Cerebras thinks it’s possible to string 192 CS-2 waferscale systems together without performance degradation. The “how” on that is worth an explanation. It starts with a revision of how they saw systems coming together with their first-generation machines.
Cerebras sees disaggregation as the only way to deliver both capacity and scalability. They’ve separated memory and compute so that model memory (where the parameters sit) can scale independently of how many CS-2s are lashed together. That model memory lives in an external appliance called MemoryX, which can hold up to 120 trillion parameters.
In previous generations, they talked about their Swarm fabric, which connected the 850,000 cores on the waferscale device. That has now been extended and renamed SwarmX, taking the same fabric off-chip to connect up to 192 CS-2s with linear scaling, to a point, of course.
One of the tricks to this was to allow for a new software execution mode. The funny thing about it is that they already had the techniques; they just had to reverse the order for the largest-scale machines and models.
Before, the sheer size of the wafer allowed parameters to be held locally while activations streamed in. For extra-large models (which will be the norm for the world’s largest systems buyers) it’s flip-flopped: the activations are held on the wafer and the parameters stream in. In other words, the weights live off the wafer entirely in that memory storage (MemoryX) piece, which can handle up to 2.4 petabytes, Feldman says. This means they can, in theory, keep trillions of weights off chip in MemoryX, which is still a bit mysterious but is probably a DRAM/flash hybrid that connects over the new SwarmX fabric.
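To make that flip concrete, here is a toy sketch of the weight-streaming idea in Python. Every name and interface in it is invented for illustration; Cerebras hasn’t published MemoryX or SwarmX internals, and the real system obviously isn’t passing NumPy arrays through a dict lookup. (For what it’s worth, 2.4 petabytes spread over 120 trillion parameters works out to about 20 bytes per parameter, roughly what weights plus optimizer state need in mixed-precision training.)

```python
import numpy as np

# Toy sketch of weight streaming: activations stay resident on the (simulated) wafer,
# while each layer's weights are fetched on demand from an external parameter store.
# This mirrors the MemoryX idea only conceptually; names and interfaces are invented.

class ExternalParamStore:
    """Stand-in for an off-wafer memory appliance holding all layer weights."""
    def __init__(self, layer_shapes):
        self.weights = {i: np.random.randn(*s).astype(np.float32) * 0.01
                        for i, s in enumerate(layer_shapes)}

    def stream(self, layer_id):
        # In a real system this would be a fabric transfer, not a dict lookup.
        return self.weights[layer_id]

def forward(x, store, num_layers):
    """Activations (x) never leave 'the wafer'; weights arrive one layer at a time."""
    for i in range(num_layers):
        w = store.stream(i)          # parameters streamed in from off-chip
        x = np.maximum(x @ w, 0.0)   # compute stays local to the activations
    return x

shapes = [(512, 512)] * 4
store = ExternalParamStore(shapes)
out = forward(np.random.randn(8, 512).astype(np.float32), store, len(shapes))
print(out.shape)  # (8, 512)
```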
Disaggregation had to be the fundamental design goal, Feldman says, pointing to the costs of how GPUs work. “Each time you want to add compute, you have to add memory and vice versa. Very quickly the solution becomes insanely expensive. Our goal was to separate, to allow an off-chip store that could take advantage of all the different memory technologies and have enough smarts to stream the parameters—all the info needed by the wafer—in such a way that the wafer can stay busy at all times.”
“We had to invent techniques to use a memory store for these massive weights but also get them to the wafer so it’s never waiting. We could do that because the wafer is so big and can do so much work at one time.” The challenge, of course, is organizing that work so it is delivered to the wafer, and perhaps many of them, at a pace that keeps the hardware busy at all times.
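One standard way to keep a compute engine fed is to prefetch the next chunk of weights while the current one is in use, classic double buffering. The sketch below illustrates only that general pattern, under our own assumptions; it is not a description of Cerebras’s actual scheduler, and every name in it is made up.

```python
import threading, queue, time

# Minimal double-buffering sketch: prefetch the next layer's weights while the current
# layer computes, so the compute unit is (ideally) never waiting on the memory store.
# Purely illustrative -- not how Cerebras's scheduler actually works.

def prefetcher(store, num_layers, buf: queue.Queue):
    """Producer: pull weights for each layer from the store ahead of the compute loop."""
    for layer_id in range(num_layers):
        buf.put(store(layer_id))      # blocks when the buffer is full (bounded prefetch)
    buf.put(None)                     # sentinel: no more layers

def compute_loop(buf: queue.Queue):
    """Consumer: compute on whatever weights have already arrived."""
    while (weights := buf.get()) is not None:
        time.sleep(0.01)              # stand-in for the actual layer computation

fake_store = lambda layer_id: f"weights[{layer_id}]"
buf = queue.Queue(maxsize=2)          # two slots = classic double buffering
t = threading.Thread(target=prefetcher, args=(fake_store, 8, buf))
t.start()
compute_loop(buf)
t.join()
```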
To be fair, splitting that work across millions of GPUs would not be practical. The software footwork would be prohibitive. So how is it that Cerebras can do this across potentially many millions of cores (there are 850,000 per wafer, across as many as 192 wafers) in a way that makes sense? This is where things get hazy and we were treated to (no kidding) a slide with a great big red “easy” button that you push to make magic come out.
Sensing, if not hearing, the groan, Feldman explains that in their approach, users configure a single CS-2 and that exact same image is loaded onto all the additional CS-2s in the cluster. The only manual intervention is telling the cluster which data goes to which part of the machine. We’ll get some clarity on that in short order.
“The only thing different in these machines is giving them a different portion of the dataset. We’ve shown impressive scaling properties as we’ve increased the number of CS-2s. The speedup increases linearly and for neural networks at ten billion parameters that speedup remains linear. Add a second system, the time is cut in half, add four and it takes a quarter of the time,” Feldman says.
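Conceptually, that is plain data parallelism: one configuration replicated across every system, with the dataset sharded among them. A minimal sketch of the idea, with invented names and numbers, assuming perfect linear scaling:

```python
# Toy illustration of the data-parallel recipe described above: every system runs the
# same configuration, and the only per-system difference is which shard of the dataset
# it sees. Names and numbers here are invented for illustration.

def shard_dataset(num_samples: int, num_systems: int):
    """Assign sample indices round-robin; real pipelines shard files or batches."""
    return {s: list(range(s, num_samples, num_systems)) for s in range(num_systems)}

def ideal_time(single_system_hours: float, num_systems: int) -> float:
    """Perfect linear scaling: double the systems, halve the wall-clock time."""
    return single_system_hours / num_systems

shards = shard_dataset(num_samples=1_000_000, num_systems=4)
print({k: len(v) for k, v in shards.items()})   # 250,000 samples per CS-2
for n in (1, 2, 4, 8):
    print(n, ideal_time(single_system_hours=100, num_systems=n))  # 100, 50, 25, 12.5
```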
That linearity doesn’t hold forever, of course. Feldman admits that at around 60 systems the scaling tapers off for networks of that ten billion parameter size, and for 100 billion-parameter networks it starts falling apart at 118 systems. But still, if that holds up in the real world, a) it’s worth remarking upon and b) at current growth rates, those insane-scale parameter counts aren’t inconceivable.
“Today, Cerebras moved the industry forward by increasing the size of the largest networks possible by 100 times,” Feldman says. “Larger networks, such as GPT-3, have already transformed the natural language processing (NLP) landscape, making possible what was previously unimaginable. The industry is moving past 1 trillion parameter models, and we are extending that boundary by two orders of magnitude, enabling brain-scale neural networks with 120 trillion parameters.”