Published 09/02/1998
For more information, contact:
Ann Redelfs, NPACI/SDSC
619-534-5032 (redelfs@sdsc.edu)
SAN DIEGO, CALIF. -- A biologist might never think to compare a snake toxin and a harmless bacterial enzyme involved in nitrogen transfer. However, a database created by biologists at the San Diego Supercomputer Center (SDSC) at the University of California, San Diego, is already turning up discoveries such as a structural feature shared by these proteins. The uncovered feature may provide clues to some distant evolutionary relationship between the proteins, or at least an energetically favorable arrangement repeatedly seen in nature.
In the September 1998 issue of Protein Engineering, SDSC researchers Ilya Shindyalov and Phil Bourne report on a Web-accessible database they created (see http://cl.sdsc.edu/ce.html) that compares more than 8,000 publicly available protein structures with one another. Building the database would have taken a year and a half on a smaller computer; on the CRAY T3E at SDSC, it took several weeks.
"The result is a database of structure neighbors -- proteins with varying degrees of structural similarity," Bourne said. "This tool is a very valuable community resource in our quest to understand more about biological systems." The database is supported by SDSC's Biological Data Representation and Query initiative and the Molecular Science thrust area of the National Partnership for Advanced Computational Infrastructure (NPACI).
While there are many methods for comparing protein structures, the SDSC database is one of the few Web-accessible resources with calculated comparisons. The Web site includes alignment and Java-based visualization tools to investigate a protein's structure neighbors. Users may also submit structures not already in the database and get comparison results via e-mail.
BUILDING THE NEIGHBORS DATABASE
If DNA holds the blueprint of life, then proteins turn that flat blueprint into a working 3-D city -- a living organism. When a stretch of DNA expresses itself as a protein, the protein curls and folds upon itself into a distinctive shape that determines how the protein functions. The 3 billion bases of the human genome produce an estimated 70,000 proteins.
Until recently, science hadn't pinned down enough protein structures to make comparisons and draw conclusions about the structure and organization of living systems. Today, with more than 8,000 structures available, useful inferences can be made, but comparing 8,000 proteins against 8,000 others remains a daunting chore.
On a typical computer, it takes about 30 seconds to compare one protein polypeptide chain against another using the database's new comparison algorithm. The 8,000 structures contain more than 11,000 polypeptide chains, and comparing each chain against every other would take about 57 years. Using standard shortcuts based upon protein sequence, biologists can reduce the time required on a typical computer to 1.7 years. Bourne and Shindyalov cut that time dramatically, to several weeks, using 24,000 processor hours on SDSC's 256-processor CRAY T3E.
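The 57-year figure can be reproduced with a short back-of-the-envelope calculation. The sketch below is an illustration, not the researchers' own accounting. It assumes exactly 11,000 chains (the text says "more than 11,000"), that each unordered pair of chains is compared once, and a 365.25-day year. The function names are made up for this example.

```python
SECONDS_PER_COMPARISON = 30           # from the text: about 30 s per chain pair
CHAINS = 11_000                       # assumption: "more than 11,000" taken as 11,000
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def all_pairs(n):
    """Number of unordered pairs among n chains (each pair compared once)."""
    return n * (n - 1) // 2


def years_for_all_pairs(n, seconds_each=SECONDS_PER_COMPARISON):
    """Single-computer time, in years, to compare every pair of n chains."""
    return all_pairs(n) * seconds_each / SECONDS_PER_YEAR


def dedicated_wall_clock_hours(processor_hours, processors):
    """Wall-clock hours if every processor were busy for the whole run."""
    return processor_hours / processors


if __name__ == "__main__":
    print(f"pairs to compare: {all_pairs(CHAINS):,}")
    print(f"all-against-all:  {years_for_all_pairs(CHAINS):.1f} years")
    print(f"24,000 processor hours on 256 processors: "
          f"{dedicated_wall_clock_hours(24_000, 256):.2f} hours minimum")
```

Under these assumptions there are about 60 million pairs, which at 30 seconds each comes to roughly 57 years. By the same arithmetic, and assuming the same 30 seconds per comparison, the 1.7-year figure corresponds to roughly 3 percent of all pairs. The last line is only a lower bound: 24,000 processor hours spread over all 256 processors is under four days of dedicated time, so the reported "several weeks" suggests the run did not occupy the whole machine continuously.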
"With a handful of new structures coming in per day, it's possible to update the database on a regular computer," Bourne said. "But it would not have been possible to establish the initial database without a supercomputer."
LEARNING FROM THE NEIGHBORS
Experiments have shown that proteins with very different functions and sequences -- like the toxin and enzyme odd couple -- can have very similar 3-D structures. Molecular biologists so far have not had much success detecting these similarities from the genetic blueprint alone, so there's a lot of interest in comparing 3-D structures.
"For example, similar structures might suggest that two proteins may be related by evolution," Bourne said. The database might also give biologists a deeper understanding of biological processes.
For example, when a user explores the folding pattern of a snake neurotoxin, the database shows that a similar fold occurs in an E. coli enzyme called glutamine amidotransferase. Since one protein is a toxin and the other is involved in nitrogen transfer, it is hard to imagine their having similar functions. Instead, the fold appears to be an energetically favorable pattern that nature has adopted as a tool to accomplish various biological processes.
The SDSC database uses a method developed by Shindyalov and Bourne called combinatorial extension. Combinatorial extension is a fast method based on local geometry, unlike some other methods that depend on large-scale features such as secondary structure. Work is already underway to compare the SDSC database to two other available protein neighbors databases, FSSP (Fold classification based on Structure-Structure alignment of Proteins) and VAST (Vector Alignment Search Tool).
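To give a feel for what "based on local geometry" can mean, the sketch below compares short fragments of two chains using only the distances between points inside each fragment. It is a generic illustration and not the combinatorial extension algorithm itself. The fragment length of 8, the cutoff of 1.0, the input format (one 3-D coordinate per residue) and all function names are assumptions made for this example.

```python
import math


def internal_distances(fragment):
    """Distances between every pair of points within one fragment."""
    n = len(fragment)
    return [math.dist(fragment[i], fragment[j])
            for i in range(n) for j in range(i + 1, n)]


def fragment_difference(frag_a, frag_b):
    """Mean absolute difference between the internal distances of two
    equal-length fragments. Zero means the two local shapes match."""
    if len(frag_a) != len(frag_b):
        raise ValueError("fragments must have the same length")
    if len(frag_a) < 2:
        raise ValueError("fragments need at least two points")
    da = internal_distances(frag_a)
    db = internal_distances(frag_b)
    return sum(abs(x - y) for x, y in zip(da, db)) / len(da)


def similar_fragments(chain_a, chain_b, length=8, cutoff=1.0):
    """Start positions (i, j) where a fragment of chain_a and a fragment
    of chain_b have similar local geometry. Both defaults are illustrative."""
    hits = []
    for i in range(len(chain_a) - length + 1):
        for j in range(len(chain_b) - length + 1):
            diff = fragment_difference(chain_a[i:i + length],
                                       chain_b[j:j + length])
            if diff <= cutoff:
                hits.append((i, j))
    return hits
```

Because internal distances do not change when a structure is rotated or moved, this kind of comparison needs no prior superposition of the two chains. A full structure-alignment method would still have to join matching fragments into one consistent alignment, which this sketch does not attempt.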
"The three methods use three fundamentally different approaches to a non-trivial problem and hence produce different answers," Bourne said. "There is a large core of comparisons they agree on but many weak comparisons that differ. It is these that are potentially the most interesting. This competition also helps the science advance at a rapid rate."
The San Diego Supercomputer Center (SDSC) is a research unit of the University of California, San Diego, and the leading-edge site of the National Partnership for Advanced Computational Infrastructure. SDSC is sponsored by the National Science Foundation through NPACI and by other federal agencies, the State and University of California, and private organizations. For additional information about SDSC, contact Ann Redelfs at SDSC, 619-534-5032, redelfs@sdsc.edu.
NPACI was established in 1997 as part of the National Science Foundation's Partnerships for Advanced Computational Infrastructure program to empower the environment for tomorrow's scientific discovery. Led by UC San Diego, the partnership receives support from the NSF, the State of California, the University of California and other agencies. NPACI activities are built upon the foundation established by SDSC, which since 1985 has served the country as a national laboratory for computational science and engineering.