GRID COMPUTING

Knowledge from Biological Data Collections

PROJECT LEADERS
Russ Altman, Stanford University
Reagan Moore, SDSC
PARTICIPANTS
Chaitan Baru, Phil Bourne, Jon Genetti, Michael Gribskov, Greg Johnson, John Moreland, Dave Nadeau, Bernard Pailthorpe, Arcot Rajasekar, Ilya Shindyalov, SDSC
Hector Garcia-Molina, Andreas Paepcke, Allison Waugh, Glenn Williams, Stanford University
Andrew Grimshaw, Katherine Holcomb, University of Virginia
Thomas Hacker, University of Michigan
David Hillis, University of Texas
Lennart Johnsson, Montgomery Pettitt, University of Houston
Bertram Ludäscher, UC San Diego

The amount of information in biological data collections is growing at a phenomenal rate. Analytical tools are necessary to make sense of the wealth of data collected by biologists. Such tools need to be scalable, to deal with the size and heterogeneous nature of the data collections; accessible, so researchers can easily find the data they need; and mobile, to move large data sets across the Grid. This is where bioinformatics, the bridge between biology and computing technology, steps in. Bioinformatics is the acquisition of biological knowledge by means of computational tools for the organization, management, and mining of biological data.

Russ Altman of Stanford University and Reagan Moore of SDSC lead NPACI's Bioinformatics Infrastructure for Large-Scale Analyses alpha project. The project uses data manipulation, analysis, and visualization infrastructure to integrate data from molecular structure resources such as the Protein Data Bank (PDB). Legion, the metacomputing system developed at the University of Virginia, will then be used to perform large-scale computations across the databases. Together, these components create a discovery environment that will apply to many scientific disciplines.

FEDERATING DATABASES

One of the challenges of building bioinformatics infrastructure is federating the databases, that is, bringing them together to make them accessible from a single interface. This challenge is being faced head-on by projects such as the PDB, which needs to link to databases that contain complementary information.
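
As a rough illustration of what "accessible from a single interface" can mean, the sketch below places two toy in-memory sources behind one query call. Every name in it (`FederatedCatalog`, the source labels, the sample records and the identifier `X001`) is hypothetical and invented for this example; it does not reflect the software actually used by the PDB or NPACI.

```python
from typing import Callable, Dict, List

# A "source" is any callable that maps an identifier to a list of records.
Source = Callable[[str], List[dict]]


class FederatedCatalog:
    """Hypothetical single interface over several independent sources."""

    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}

    def register(self, name: str, source: Source) -> None:
        self._sources[name] = source

    def query(self, identifier: str) -> Dict[str, List[dict]]:
        # Ask every registered source and group the answers by source name.
        return {name: source(identifier) for name, source in self._sources.items()}


# Made-up data standing in for a structure collection and a complementary one.
STRUCTURES = {"X001": [{"kind": "structure", "chains": 2}]}
SEQUENCES = {"X001": [{"kind": "sequence", "length": 154}]}

catalog = FederatedCatalog()
catalog.register("structures", lambda key: STRUCTURES.get(key, []))
catalog.register("sequences", lambda key: SEQUENCES.get(key, []))

result = catalog.query("X001")
print(result)
```

The point of the design is that the caller issues one query and never needs to know how many collections sit behind the interface, or where they are held.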

"These projects are similar in that they generate knowledge by organizing information into collections and federating the collections so that they can talk to each other," Moore said. "The researcher pulls out of the application what he needs to do another analysis and uses the results in the next application. If we can automate the transfer of information in this cycle, then we may be able to make sense of the human genome, determine the proteins expressed between different genes, and understand how proteins govern cell processes and organ development."

"The PDB is an example of a collection that is replicated across multiple sites around the world," Altman said. "It works like a digital library that supports queries against protein structures, so that a user can find everything that is identified with that protein structure. One of the goals of our alpha project is to automate the annotation of PDB data entries, and to develop infrastructure to support routine annotation and analysis of PDB entries."

Altman leads the Helix Group at the Stanford Medical Informatics laboratory, which has developed algorithms for identifying active sites and binding sites in PDB entries. "As part of large-scale structural genomics initiatives, these automated annotation tools may be invaluable for providing a first assessment of the functional capabilities of molecules," Altman said.

LARGE-SCALE ANALYSES

In the analysis component of the project, participants are developing molecular scanning and comparison algorithms for various collections. Legion will recruit the computing resources required for large-scale analyses, such as "linear scans" through the databases and "all-to-all" comparisons across databases, as well as analyses that require identifying the geometric and thermodynamic properties of molecules. The system will also provide access to protein structure and sequence databases. The result will be an infrastructure that integrates computations and data analyses across multiple, heterogeneous collections.
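
The difference between a linear scan and an all-to-all comparison can be sketched in a few lines. The example below is a toy under stated assumptions: the entries, their two-number "feature vectors," the predicate, and the distance function are all invented for illustration and are not the project's algorithms. It also compares entries within one collection, a simplification of comparisons across databases.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Vector = Tuple[float, ...]

# Made-up entries: an identifier mapped to an invented feature vector.
ENTRIES: Dict[str, Vector] = {
    "A": (1.0, 0.0),
    "B": (0.0, 1.0),
    "C": (1.0, 1.0),
    "D": (2.0, 0.0),
}


def linear_scan(entries: Dict[str, Vector],
                predicate: Callable[[Vector], bool]) -> List[str]:
    """Visit each entry once and keep those that satisfy the predicate."""
    return [key for key, vec in entries.items() if predicate(vec)]


def squared_distance(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def all_to_all(entries: Dict[str, Vector]) -> Dict[Tuple[str, str], float]:
    """Compare every unordered pair of entries exactly once."""
    return {
        (i, j): squared_distance(entries[i], entries[j])
        for i, j in combinations(sorted(entries), 2)
    }


hits = linear_scan(ENTRIES, lambda vec: vec[0] >= 1.0)
pairs = all_to_all(ENTRIES)
print(hits, len(pairs))
```

A linear scan over n entries performs n evaluations, while an all-to-all comparison performs n(n-1)/2. That quadratic growth is why all-to-all comparisons are the kind of workload for which recruiting many machines pays off.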

Recently, Altman's team performed a preliminary scan of about one-tenth of the PDB using Legion and Feature, a code developed by the Helix Group. Feature was coupled to Legion, Altman said, to scan for calcium binding sites. This was the first major test of the alpha project's capabilities.

"Over the past year, we arrived at a common information model for exchanging information content," Moore said. "This is a major step because we're starting to understand how to organize information and manipulate the organization independent of the content."

The key to this manipulation is the SDSC Storage Resource Broker (SRB), which assembles distributed data sets and manipulates them as a single coherent collection. NPACI partners operating the PDB have accelerated the schedule for the transition to production access, and the Data-Intensive Computing Environments thrust area has released version 1.1.7 of the SDSC SRB, which can use the Grid Security Infrastructure for authentication.
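
One way to picture treating distributed data sets as a coherent collection is a catalog that maps a single logical name to the physical copies held at different sites. The sketch below is only that picture in code: `LogicalCollection`, its methods, and the site and path strings are hypothetical and are not the SRB's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LogicalCollection:
    """Hypothetical catalog: one logical name, many physical copies."""

    replicas: Dict[str, List[str]] = field(default_factory=dict)

    def add_replica(self, logical_name: str, location: str) -> None:
        self.replicas.setdefault(logical_name, []).append(location)

    def locate(self, logical_name: str) -> List[str]:
        # Return a copy so callers cannot mutate the catalog by accident.
        return list(self.replicas.get(logical_name, []))

    def names(self) -> List[str]:
        return sorted(self.replicas)


collection = LogicalCollection()
# Made-up site names and paths, for illustration only.
collection.add_replica("entry-1", "site-a:/archive/entry-1")
collection.add_replica("entry-1", "site-b:/mirror/entry-1")
collection.add_replica("entry-2", "site-a:/archive/entry-2")

print(collection.names())
print(collection.locate("entry-1"))
```

Users work with the logical names only; which site serves a request is a detail the catalog resolves on their behalf.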

"The initial collections, such as the PDB, are growing rapidly," Moore said. "We now have the basic pieces, but need to put them together into a high-performance computation infrastructure. Thus, we need to combine the databases with the SDSC SRB and with the metacomputing capabilities of Legion. A lot of computing capability is available to handle this information, but we need a system that is usable by biologists, not just those who are experts at high-performance computing."

"The application of automation and computation to the gathering and analysis of biological data will directly affect the lives of most inhabitants of our planet," Altman said. "The production of more and better food and drugs, as well as improved public health, are obvious areas of change. And bioinformatics techniques are essential in enabling researchers to answer today's pressing questions about diseases, pharmaceuticals, and the processes of life." --AV
