Published November 1, 2021
Kimberly Mann Bruch and Cynthia Dillon, SDSC Communications; Robin Lally, Rutgers University
Established in 1971 as the first open access digital data resource for biology and medicine, the Protein Data Bank (PDB) is now a leading global resource for experimental data integral to scientific discovery. In 50 years, it has gone from just seven protein structures within its data bank to more than 180,000 structures used worldwide by researchers to unlock the mysteries of human disease.
The PDB was founded by Board of Governors Distinguished Professor Emerita of Chemistry and Chemical Biology Helen Berman at Rutgers-New Brunswick. Berman also established the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), which operates the U.S. data center for the global PDB archive and makes PDB data available at no charge to all data consumers without usage limitations.
In 1998, the RCSB PDB brought the PDB to UC San Diego – specifically to the San Diego Supercomputer Center (SDSC) on campus. Site Lead Jose Duarte reflected on the ways in which this worldwide collaboration has transformed how scientists collect and share their structural biology data.
“Not only does the PDB continue to showcase the power of open data and community, but more recently we are proving to be an enabler of a major AI breakthrough coined AlphaFold, which achieves highly accurate computational modeling of protein structures by learning from 50 years of data deposited at the PDB,” Duarte said. “Thanks to open data and standards defined by the PDB, we have also witnessed the birth of a thriving sub-branch of bioinformatics known as structural bioinformatics, which is a direct consequence of the existence of the PDB.”
PDB Research Highlights
Researchers ranging from computational chemists to artists have used the PDB in their work. One such user is Rommie Amaro, distinguished professor of theoretical and computational chemistry at UC San Diego.
“The PDB has been essential to much of my lab's work and countless others – we have used data from the PDB to gain important insights into the molecular piece parts of the SARS-CoV-2 virus, as well as used the data, together with molecular dynamics simulations, integrative modeling, and AI to understand how the viral spike protein opens,” said Amaro. “Importantly, the PDB showed the biological world how to think about, organize, collect and develop data into a useful ecosystem – this ecosystem shows the centrality of PDB data across different biological domains and scales.”
Another “power” user of PDB is Artist/Scientist David Goodsell, a professor of computational biology at the Scripps Research Institute and a research professor at Rutgers University. “The PDB is an essential resource for education and outreach, providing a detailed look at the molecules that perform the processes of life,” he said.
According to Goodsell, who also serves as scientific outreach lead for the RCSB PDB, the companion portal PDB101.RCSB.org provides materials that allow exploration of these structures by students, educators and the general public. “It’s the first place I go when I need to create accurate illustrations of enzymes, DNA, ribosomes or just about any other biomolecule,” he said.
PDB’s Impact
More than $5 billion in funding has been provided by the National Institutes of Health (NIH) to structural biologists in the U.S. who have generated more than 50,000 of the structures currently available from the PDB.
Biomedical researchers using the structure data stored in the PDB have published more than two million scientific papers, some of which have helped researchers and pharmaceutical companies tackle major health challenges, including heart disease, cancer, diabetes, Alzheimer’s disease and HIV/AIDS.
Ann Stock, distinguished professor in the Department of Biochemistry and Molecular Biology at Robert Wood Johnson Medical School and associate director of the Center for Advanced Biotechnology and Medicine (CABM), said the data shared through the PDB is central to understanding biological systems at the molecular level – an integral part of drug development being done to treat human diseases by both biotechnology and pharmaceutical companies.
“While some investigators wanted to keep information to themselves to guide their own investigations in the early days of structural biology, the PDB enabled data sharing and had support of the government and the academic scientific community, who understood that this information was critical to researchers throughout the world,” Stock said.
Today, this means that structural biology researchers who want to publish in peer-reviewed scientific journals must share their data via the PDB.
“Sharing of scientific data is something that has evolved with a lot of progress over the last couple of decades,” Stock said. “The PDB was one of the first databases that provided a comprehensive set of data for a particular field and set policies early on about what needed to be shared.”
According to Stephen Burley, professor and Henry Rutgers Chair at Rutgers-New Brunswick and director of the RCSB PDB and the Rutgers Institute for Quantitative Biomedicine, open access to 3D structure information from the PDB facilitated discovery and development of more than 90 percent of the 210 new drugs approved by the U.S. Food and Drug Administration (FDA) between 2010 and 2016. “Looking more closely at the 54 new anti-cancer drugs approved by the FDA in 2010 to 2018 revealed that more than 70 percent of them were the products of structure-guided drug discovery accelerated by open access to PDB structures of the drug targets.”
The PDB is managed by the Worldwide Protein Data Bank partnership, with data centers in the U.S., Europe and Asia.
About SDSC
The San Diego Supercomputer Center (SDSC) is a leader and pioneer in high-performance and data-intensive computing, providing cyberinfrastructure resources, services and expertise to the national research community, academia and industry. Located on the UC San Diego campus, SDSC supports hundreds of multidisciplinary programs spanning a wide variety of domains, from astrophysics and earth sciences to disease research and drug discovery. SDSC’s newest National Science Foundation-funded supercomputer, Expanse, supports SDSC’s theme of “Computing without Boundaries” with a data-centric architecture, public cloud integration and state-of-the-art GPUs for incorporating experimental facilities and edge computing.