Over the course of the pandemic, you have probably encountered data from IHME, the Institute for Health Metrics and Evaluation. The grant-supported organization, based at the University of Washington’s School of Medicine, is global in reach, providing population health, mortality, and other statistics to help guide policymakers and worldwide health efforts.
As one might imagine, bringing together population data from massive sources (not to mention in multiple formats, which is its own data integration challenge, consuming up to 40% of IHME developer time) and analyzing it to produce reports for data laymen takes significant computational, data science, storage, and other resources. With COVID-19, those demands were sent soaring.
This new push to beef up infrastructure, which already includes a 27,000-core on-prem cluster (a combination of AMD and Intel, with more AMD on the horizon), includes a rethink of the storage backbone. As Serkan Yalcin, Director of IT and Technology Infrastructure/DevOps, tells The Next Platform, they were not sure what needs their modelers would have, but they invested in four different Qumulo appliances, from flash to high-capacity, to fit demand.
IHME is already using around 80% of the Qumulo capacity for its work but expects future projects to keep pushing the limits of what it requires. The organization has ambitious goals, including a global forecast that seeks to provide population projections to 2100. Helpful to that and other projects is the arrival of a donated DGX appliance, which Yalcin’s teams are currently experimenting with to see how their codes will translate to a GPU beast, especially since the organization is relatively new to GPUs (its main cluster is CPU only). For now, they are able to work with standard 10GbE and 100GbE networking on the main machine.
Yalcin was already familiar with Qumulo, having spent nearly a year of his career (between separate stints at IHME) at the company managing key customer relationships, which might have played a role in the selection. Nonetheless, the decision makes sense given the need to integrate quickly with existing storage infrastructure with a file system that could bridge the divide. Prior to the storage upgrade, IHME had been using Quantum’s StorNext appliances but Yalcin says they were facing an issue with speed given the high-throughput and mixed analytics workloads of their modeling team.
Western Digital and Qumulo paired up to support IHME’s COVID-19 health analytics and vaccine rollout work. IHME uses Qumulo’s scalable File Data Platform with Western Digital’s Ultrastar HDDs and SSDs to process up to 2PB per day and provide public pandemic research, statistics, and projections, including for the vaccine rollout. This helps IHME distill hundreds of millions of data points into a single visualization, which allows policymakers to view results and communicate them to their teams.
To give a sense of their workloads, consider that IHME isn’t just providing a visualization of current static cases. Policymakers can use their data as an easy-to-digest way to view trends under different conditions (wearing of masks, easing of restrictions, etc.). They are also bringing in data to help project hospital resource use and the effects of social distancing. As one might imagine, a lot of projections means great potential for integrating more AI/ML into the mix, which will be good for the end users but could mean yet another required boost in infrastructure for the grant-driven organization.
With all the data integration challenges at IHME and the need to be able to scale as new information comes in or becomes less important, it might seem they could get around all the storage and datacenter maintenance by moving to the cloud. Yalcin says they are exploring some work with Microsoft AI now on their Azure platform, but he adds the cost, based on their calculations, would be 3-4X more using an AWS or other cloud platform.
“We have a data gravity problem,” Yalcin explains. “We have about 8 PB of data on disk and another 15 archived.” He says that data transfer and archiving would be a big part of that cost, especially since so much is on disk now, with jobs taking a terabyte of RAM and up to 50-70 CPU cores.
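To put the data gravity point in rough numbers, here is a back-of-the-envelope sketch of how long it would take just to move that data. It rests on assumptions that are ours, not IHME’s: decimal petabytes, a single link running at full line rate with no protocol overhead, and link speeds borrowed from the 10GbE and 100GbE figures above purely for illustration. The function name `transfer_days` is hypothetical.

```python
# Rough transfer-time estimate for the data volumes Yalcin cites.
# Assumptions (not from IHME): decimal petabytes, one link, and
# sustained line-rate throughput unless `efficiency` is lowered.

PB = 10**15  # bytes in a decimal petabyte


def transfer_days(size_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link of link_gbps gigabits per second."""
    bits = size_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400  # seconds per day


on_disk = 8 * PB    # "about 8 PB of data on disk"
archived = 15 * PB  # "another 15 archived"

for gbps in (10, 100):  # illustrative link speeds only
    days = transfer_days(on_disk + archived, gbps)
    print(f"{gbps} Gb/s: {days:.1f} days to move on-disk plus archived data")
```

Even under these generous assumptions, moving all 23 PB takes roughly three weeks at 100 Gb/s and about seven months at 10 Gb/s, before any egress or storage charges enter the picture.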
With storage locked down for now, the next step for IHME is to look again at compute, which Yalcin says is currently dominated by Intel processors, with the sweet spot being around 28 cores. “We just finished getting AMDs out a year and a half ago. A large portion of the machine was AMD originally. We are after threads; we have massive data manipulations we need to do, each year we’re producing around 250 TB,” Yalcin adds.
It’s always interesting to get the infrastructure scoop behind services we use or see daily. As we’ve noted previously, storage is the one place where anything biotech/medical is seeing hardware infrastructure growth this year. It sounds like that might shift to compute for IHME as they try to loop in more data points for larger predictions but by that time there might be a bigger GPU angle to their story.