For those involved in large-scale system planning, design, and procurement for campus-wide supercomputers, the target is balance. Depending on which workloads and goals drive the design optimization, whether peak performance, heterogeneity, or broad applicability and usability, these decisions can be tremendously complex.
The question is, what happens when we insert an overarching design consideration into the mix, one that goes beyond processors, accelerators, networks, or storage? That addition is cost effectiveness, but in a far more nuanced form than we traditionally hear about in academic supercomputing.
While cost effectiveness is almost always at the forefront of these decisions, many large academic research computing centers have established ways of pulling together a budget from multiple departments, along with broader university and other funding. The goal is to build a system within those constraints and provide the best campus-wide resource possible. What many of these centers do not do, however, is provide ongoing, complete cost recovery for the systems they operate.
The requirement to achieve 100 percent cost recovery for an academic supercomputing resource makes the design and selection of a large system even more complicated. It is not simply a matter of starting with a general budget and building from that groundwork; it is about understanding the ROI from a revolving set of workloads across many departments over the lifetime of a machine.
All of this takes some dramatically new thinking about supercomputing investment and its value over time, says Brock Palen, who heads the Advanced Research Computing Technology Services group at The University of Michigan. His team was tasked with helping put together the specs for the campus-wide ‘Great Lakes’ supercomputer that Dell delivered this year.
In addition to considering all of the performance and application requirements from the base of over 2,500 anticipated users of the machine, Palen and his team had to envision how particular components could yield dramatic performance gains that could be traced back to the overall monetary value of the system.
That, as one might imagine, is no small task, especially in a university system that leads the nation in research funding: around $1.5 billion each year, which means a vast, evolving set of workloads ranging from large-scale parallel jobs that could take over the machine to small, one-core jobs. Serving both types of workloads is always a challenge for university supercomputing centers, but Palen’s decisions had to optimize for both user requirements and ROI.
The 13,000-core “Great Lakes” supercomputer has to account for everything, from the hardware to datacenter maintenance, electricity, personnel, software environments, and full depreciation. It has also been designed to expand by up to 50 percent in the future, which adds another decision factor: how the system will scale as user applications evolve.
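To get a sense of what full cost recovery implies in practice, here is a minimal sketch of how a break-even service rate could be derived from the cost categories listed above. Every figure and the utilization assumption below are hypothetical placeholders for illustration, not actual Great Lakes numbers.

```python
# Hypothetical break-even rate calculation for a full-cost-recovery model.
# Every figure below is a placeholder, not an actual Great Lakes number.

annual_costs = {
    "hardware_depreciation": 2_000_000,  # purchase price spread over the service life
    "datacenter_maintenance": 300_000,
    "electricity": 400_000,
    "personnel": 900_000,
    "software_environments": 200_000,
}

cores = 13_000
hours_per_year = 8_760
assumed_utilization = 0.80  # assumed fraction of core-hours actually billed

billable_core_hours = cores * hours_per_year * assumed_utilization
break_even_rate = sum(annual_costs.values()) / billable_core_hours

print(f"Billable core-hours per year: {billable_core_hours:,.0f}")
print(f"Break-even rate: ${break_even_rate:.4f} per core-hour")
# 100 percent cost recovery means the rates charged back to departments must
# cover this figure across the mix of CPU, GPU, and large-memory nodes.
```

The point of the sketch is simply that every cost category, including depreciation, lands in the rate users pay, which is why component choices are judged on ROI rather than sticker price alone.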
The question is, how does a university strike that balance?
We will discuss a few of those areas in a moment, but at the highest level, the system design for maximum, ongoing cost recovery was aided by the University of Michigan’s infrastructure partner, Dell. Dell’s seasoned teams worked with Palen and his group to balance cost effectiveness, optimal performance, and wide usability against the university’s application profiles and user requirements. Dell also helped them implement traditional HPC infrastructure with some high-value, innovative elements, including HDR100. In fact, the University of Michigan Great Lakes system is the first supercomputer in the world to sport this capability from Mellanox, delivering far higher bandwidth and higher port counts at a price point that fits the university’s strict budget mechanisms.
Extensive application profiling from the previous “Flux” supercomputer helped inform the system design, but so did keeping a careful eye on what future workloads might require. At the forefront of this forward-looking design element was the rise of machine learning across a wide range of application areas, including one of the more noteworthy initiatives at the University of Michigan: the MCity project and its 32-acre autonomous and connected vehicle test track.
These emerging workloads, with deep learning as a central feature, spurred the addition of not only more GPU nodes than the University of Michigan had previously, but also PowerEdge servers equipped with powerful accelerators, the Nvidia V100 Tensor Core GPUs and their AI software stack, to dramatically speed AI training in particular.
As one might imagine, building a system robust enough for traditional HPC simulations along with the emerging demands AI places on compute and networks in particular is not a cheap undertaking. While full cost recovery was the goal, so was elite application performance. However, as Palen argues, some components of the system might be expensive choices from a capital expense standpoint, yet the cost recovery and long-term ROI pan out better than one might expect.
Using the high-end Dell-Nvidia AI solution stack as an example, Palen notes “it’s not just about demand. The speedups for certain applications by moving to the Nvidia V100 GPUs on Great Lakes blow the legacy GPUs out of the water.” He adds that, indeed, as accelerators get faster, one spends more, “but if these jobs run far faster, it’s worth the higher cost.” Ultimately, the customer ends up saving money and realizing the value of the whole solution. The same rings true for the spend on the premium network fabric from Mellanox. That expansive boost in bandwidth is great news for users, but it also brings efficiency and ROI benefits: the university can get far more out of a single port, which cuts costs in other ways.
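To make Palen’s tradeoff concrete, here is a minimal sketch of the cost-per-job arithmetic he is describing. The hourly rates and the speedup figure are hypothetical, chosen only to illustrate the shape of the argument; actual Great Lakes rates and measured speedups will differ by application.

```python
# Hypothetical illustration of the "faster but pricier can still be cheaper" argument.
# None of these numbers are actual Great Lakes rates; they are placeholders.

legacy_rate = 1.00      # assumed cost per GPU-node-hour on an older accelerator ($)
v100_rate = 2.50        # assumed cost per GPU-node-hour on a V100 node ($)
legacy_runtime = 10.0   # hours a training job takes on the older accelerator
speedup = 5.0           # assumed V100 speedup for this particular workload

legacy_cost = legacy_rate * legacy_runtime            # $10.00 per job
v100_cost = v100_rate * (legacy_runtime / speedup)    # $5.00 per job

print(f"Legacy cost per job: ${legacy_cost:.2f}")
print(f"V100 cost per job:   ${v100_cost:.2f}")
# Even at 2.5x the hourly rate, a 5x speedup halves the cost per job,
# and the researcher gets results in a fifth of the time.
```

The same logic applies whenever the speedup outpaces the price premium: the pricier node recovers its cost faster because each billed hour does more work.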
Ultimately, Palen’s team had to build a system attractive to a wide base of university user groups, since each department spends its own resources for compute time on the system. It should also be noted that users could just as easily take their research to the cloud if queue times or other constraints became a problem. This keeps Palen’s group on its toes to deliver a resource that is highly available and primed with the latest, most capable hardware for the most demanding applications, along with efficient, cost-controlled lighter nodes for single-core jobs.
In many ways, the University of Michigan will be operating its supercomputer in the same intensively cost-controlled way as enterprises, whose IT teams keep careful daily watch on both fixed and evolving operational costs. This is a new way of thinking about supercomputing investment, but it’s also a fresh approach to designing a system for the long term.
“It helps to have a partner like Dell along with us. We would have had a difficult time without their assistance from beginning to end. From technology advising to important pieces of the system, especially the HDR100, which they had to use their leverage with Mellanox to round up for us on this machine, they have been a driver for this,” Palen says, adding, “We would work with them in a heartbeat.”
Read in much more detail how Dell has worked with the University of Michigan to make 100 percent cost recovery supercomputing possible.