Data science, AI/ML, and HPC have been influential in tackling complex problems like the coronavirus pandemic. Unfortunately, large-scale, cross-discipline collaborative efforts of this kind are the exception rather than the norm in the United States. Now a group of scientists from across the country is looking to change that.
“Many nations are positioning themselves to be scientifically competitive in the years to come,” reads an article contributed to the journal Science. “But the U.S. is falling behind in the accessibility and connectedness of its research, computing, and data infrastructure.”
The article’s authors aren’t calling for more advanced supercomputers or for bigger facilities, staffs, or budgets. Instead, they’re asking policy makers and their financiers to make it easier for researchers to do their jobs by establishing an open research commons (ORC) for scientific data and compute resources.
The scientists contend that the failure to establish such a repository has left much of the country’s scientific research and data siloed and fragmented across the many scientific institutions in the United States. This, they argue, has put the U.S. at something of a disadvantage compared with China and countries in Europe, which have already established ORC-like initiatives of their own.
The scientists envision something akin to the Cirrus banking network, which can deliver funds to anyone anywhere in the world in a matter of hours or days, but for scientific data.
“Often data on disparate topics – such as a country’s homelessness rates, average income, neighborhood food and health resources, air pollution, flood risk, predicted water resources, and predicted average temperature – are spread across a range of locations on the web, infrastructures, and management regimes,” they wrote.
Unfortunately, the way research is often funded limits the scope to new and novel ideas, said Christine Kirkpatrick, division director of research data services at the San Diego Supercomputer Center and co-author of the article published in Science. “There doesn’t tend to be as much money to do something that needs ongoing infrastructure.”
But if that data is properly normalized, attributed, and tagged with the right metadata, it can then be reused and even correlated against other data. One example cited in the article is using data on homeless populations and climate to determine which locations in the U.S. are at the highest risk from global warming.
Suddenly, society-changing science can be done by anyone, regardless of their background or level of funding, Kirkpatrick tells The Next Platform.
“If the data was already available in a way that was properly annotated, and has provenance, and all these things that take so much time, we could, instead, have citizen scientists and researchers inquiring in these kind of large-data commons, looking for things to study,” she says.
What’s more, instead of making an educated guess and using resources to see if there’s a connection between two disparate data sets, AI/ML could be used to passively identify data sets that have a high degree of correlation.
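To make that idea concrete, here is a minimal sketch of what such a passive scan might look like. It is not taken from the Science article: it swaps in a plain pairwise Pearson correlation for whatever AI/ML method a real ORC might use, and it assumes every data set has already been normalized to a shared region identifier. The data set names, region keys, values, and the `correlated_pairs` helper are all invented for illustration.

```python
from itertools import combinations
from math import sqrt

# Toy, invented values keyed by a hypothetical shared region ID (not real data).
datasets = {
    "homelessness_rate": {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0},
    "avg_temperature": {"A": 2.0, "B": 4.0, "C": 6.0, "D": 8.0},
    "flood_risk": {"A": 4.0, "B": 1.0, "C": 3.0, "D": 2.0},
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlated_pairs(datasets, threshold=0.8, min_overlap=3):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    results = []
    for (name_a, a), (name_b, b) in combinations(datasets.items(), 2):
        # Only compare regions that appear in both data sets.
        shared = sorted(a.keys() & b.keys())
        if len(shared) < min_overlap:
            continue
        r = pearson([a[k] for k in shared], [b[k] for k in shared])
        if abs(r) >= threshold:
            results.append((name_a, name_b, r))
    # Strongest relationships first.
    return sorted(results, key=lambda item: -abs(item[2]))


if __name__ == "__main__":
    for name_a, name_b, r in correlated_pairs(datasets):
        print(f"{name_a} ~ {name_b}: r = {r:.2f}")
```

The scan only works because every data set shares the same keys, which is exactly the normalization and metadata work the authors say is missing today. A flagged pair is a lead for a researcher to investigate, not evidence of cause and effect.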
The good news is the barriers to establishing an ORC aren’t technological. “Much of the hardware, software, and knowledge needed to make the ORC a reality largely exists,” the article in Science reads.
Instead, the authors contend, the issue is an institutional one, rooted in leadership and sustained commitment.
“What is needed is commitment from the US Congress and the administration so that parties can come together to chart the future and address the fragmented nature of the US research computing and data enterprise by establishing an ORC,” the article states.
It is slightly ironic that the Science article advocating for an ORC, “an interoperable collection of data and compute resources […] accessible to all,” is itself behind a paywall.