Researchers Make a Case for National Strategy around Integrated, Accessible Data

Published July 15, 2022

 Credit: sdecoret, 123RF

By Cynthia Dillon, SDSC External Relations

Scientific research is trending rapidly toward being more open, more accessible and more supportive of rapid-response discoveries. At the same time, researchers are collaborating across the U.S. to address complex challenges, such as COVID-19 and supply chain issues.

Around the world there are robust responses to the need for a unified open research commons (ORC): an interoperable collection of data and compute resources within both the public and private sectors that is user-friendly and broadly accessible. Yet while other nations are gearing up for future competitiveness in this way, the U.S. is lagging behind, according to a group of researchers from several universities and institutes across the U.S. and in the Netherlands.

The problem, according to the researchers from places such as MIT, Johns Hopkins and Argonne National Lab, is that the U.S. needs a more committed effort toward making research computing and data infrastructure accessible and connected.

According to the researchers, including San Diego Supercomputer Center’s (SDSC) Christine Kirkpatrick, the lag compromises competitiveness and leadership and limits beneficial U.S. contributions to global science.

“The U.S. has critical mass in experts, forward-thinking program officers and no end to the societal challenges and science use cases that call for a unified research commons, yet, it calls for organization at a level higher than these initiatives are usually funded. Immediate and sustained leadership and support in the U.S. are needed to chart the course, starting with policymakers and research funders,” said Kirkpatrick, division director of Research Data Services at SDSC, who leads SDSC’s FAIR (findable, accessible, interoperable and reusable) efforts via the U.S. GO FAIR Office located at SDSC.

In an article published in Science, the researchers affirm the value of broad cooperation around technology and data. For example, they point to shared governance and infrastructure, as well as standard agreements, that permit a shared system such as the North American electrical grid to direct electricity to where it is needed. They also cite the CIRRUS banking network, which can deliver funds from an individual’s bank account to most places around the world. The researchers note that similar coordination in the research enterprise could pay enormous dividends.

“We now have vast amounts of publicly available research data, but to fully leverage the potential power of these data beyond individual and often heroic efforts, these data need to be identified, made interoperable and aligned so that they can be broadly used by the scientific community,” said Philip Bourne, first author of the paper, currently with the University of Virginia’s School of Data Science, and formerly with SDSC as associate director of the RCSB Protein Data Bank, and with UC San Diego as a professor of pharmacology, and bioinformatics and systems biology.

According to the researchers, data on disparate topics (such as a county’s homelessness rates, average income, neighborhood food and health resources, air pollution, flood risk, predicted water resources and predicted average temperature) often are spread across a range of locations on the web, infrastructures and management regimes.

“If these data were integrated, brought together based on common data elements in each dataset, we could use these data for powerful analyses, like identifying locations with high homeless populations that are also likely to be hit hardest by floods, droughts or heat waves, or places with poor cardiac health that also have high or increasing particulate matter 2.5 (PM2.5) pollution, which could lead to more heart attacks,” noted Bourne.
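The kind of integration Bourne describes can be sketched in a few lines of Python. This is only an illustration of joining datasets on a common data element: the county identifiers, the values, the thresholds and the function names `integrate` and `flag_hotspots` are all hypothetical and do not come from the Science article or from any real dataset.

```python
# Illustrative only: county IDs and all values below are made up.
homelessness = {  # hypothetical county_id -> homeless people per 10,000 residents
    "A": 42.0,
    "B": 8.5,
    "C": 31.0,
}
flood_risk = {  # hypothetical county_id -> flood risk score between 0 and 1
    "A": 0.81,
    "B": 0.77,
    "C": 0.12,
    "D": 0.90,  # no homelessness record, so it drops out of the join
}


def integrate(*datasets):
    """Join datasets on the keys they all share (the common data element)."""
    shared = set(datasets[0]).intersection(*datasets[1:])
    return {key: tuple(d[key] for d in datasets) for key in sorted(shared)}


def flag_hotspots(joined, homeless_min, flood_min):
    """Return counties at or above both (hypothetical) thresholds."""
    return [
        county
        for county, (homeless_rate, flood_score) in joined.items()
        if homeless_rate >= homeless_min and flood_score >= flood_min
    ]


joined = integrate(homelessness, flood_risk)
hotspots = flag_hotspots(joined, homeless_min=30.0, flood_min=0.5)
print(hotspots)
```

The join only works because both tables use the same county identifier. When datasets sit in separate infrastructures with different identifiers and formats, that shared key has to be established first, which is the coordination problem the researchers describe.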

Support by policymakers and funders who are driving the development of research infrastructure can facilitate such work, similar to the urgent cooperation seen among scientists during dire times of need, such as the COVID pandemic, the threat of war and the disruption to the global economy.

The researchers hold that the U.S. has a vibrant research ecosystem with no lack of computation and data resources. But the struggle lies in the cultural and institutional obstacles, which require policy leadership and a sustained commitment to overcome.

The approach the researchers call for is a coherent national strategy that includes mutually beneficial U.S. industry partnerships; formal executive representation in international ORC-focused initiatives; AI-ready data and long-term data preservation for reproducibility; professional data stewardship; and, ultimately, a federal commitment to charting the future and establishing a national ORC. According to the researchers, incentives to create a unified system are paramount.

“Scientists are not yet presented with the adequate incentives. Mandates from funders, such as data-sharing policies, help, but there are not enough definitions of requirements and rewards for complying or, indeed, a unification of what is expected of researchers regardless of the source of their research funding,” explained Kirkpatrick, who also serves as secretary general for the International Science Council’s (ISC) Committee on Data (CODATA).

Kirkpatrick pointed to some of the efforts SDSC, a leader in high-performance computing (HPC), data-intensive computing and cyberinfrastructure, has made toward supporting accessibility and connectedness:

  • Participation in the National Science Data Fabric pilot project – the first infrastructure capable of bridging the gap between massive scientific data sources, the Internet2 network connectivity and an extensive range of HPC facilities and commercial cloud resources around the nation;
  • The Open Storage Network – a national, distributed storage resource for sharing data at scale;
  • The National Research Platform – an NSF-funded innovative, all-in-one system that combines computing resources, research and education networks, edge computing devices and other instruments to expedite science and enable transformative discoveries;
  • CloudBank – an NSF-funded service to help researchers access and use public cloud computing resources;
  • Open Science Chain – a program for providing a secure method to efficiently share and verify data and metadata while maintaining privacy restrictions necessary for the reuse of the scientific data;
  • Open Science Grid – a consortium that builds and operates a set of pools of shared computing and data capacity for distributed high-throughput computing; and
  • Participation in the AI Institute for Intelligent Cyberinfrastructure with Computational Learning in the Environment (ICICLE) – an institute focused on next-generation intelligent cyberinfrastructure for user-friendly AI applications.

About the collaborative Science article, Bourne noted, “It was wonderful to engage with Christine on this important policy forum and to reengage with SDSC where I spent many happy years. Collectively, we have made an important statement for the future of research computing, and I look forward to helping turn words into action.”