Skip to content

Computing the Tree of Life

NPACI MOLECULAR SCIENCE |Contents | Next
PROJECT LEADERS
David Hillis
University of Texas, Austin

Paul Lewis
University of Connecticut

A fundamental objective of biology is to produce an accurate tree of life for the world's 1.7 million known species. Before this tree can be completed, biologists face three computational challenges. They must develop the tools to analyze the large data sets and determine relationships for all living organisms, train people to collect the data and use the computational tools once they are developed, and engineer an infrastructure that can support the data collection and computational systems. Along with a team of scientists at the University of Texas, Austin, David Hillis is working to overcome these obstacles with the help of NPACI supercomputers.

PARALLEL ALGORITHMS FOR GENETICS

COMMUNITY-WIDE TOOLS

Figure 1. A Branch from the Tree of Life
A phylogenetic estimate from a small sample of the tree of life created by David Hillis. With current computational tools, phylogenetic analyses for 1,000 species is possible with adequate computer resources. However, it is currently impossible to reach a reasonable solution for 500,000 species, even with months of computation. Hillis, Paul Lewis, and their collaborators are studying how to make analysis of such large data sets possible.
Through the study of evolutionary relationships--a field known as phylogenetics--researchers can now use computers to reconstruct the history of a virus such as influenza or HIV, determine its origin, and use this information to predict the course of the epidemic or to develop potential vaccines. With the help of supercomputers, scientists like Hillis will someday take phylogenetics a step further and complete the tree of life.

PARALLEL ALGORITHMS FOR GENETICS

Hillis, director of Texas' School of Biological Sciences, is working on approaches to reconstruct the history of thousands of species and, with this history, to interpret biological information on the behavior, physiology, or molecular evolution of the species. Hillis' primary limitation is the inability to handle the large, complex data sets that are becoming more and more common within phylogenetic analysis.

"In practice, a modern phylogenetic analysis consists of finding the pattern of relationships that best accounts for the DNA or RNA sequence data, given a particular model of evolution," Hillis said. "Evaluating a potential solution to a phylogenetic problem is computationally very intensive, and the number of possible solutions to a phylogenetic problem increases precipitously as more organisms are added. This puts the phylogeny problem into the class of computational problems that are NP-complete; the solution time goes up exponentially as a function of the number of sequences in the analysis."

The rapid increase of the size of phylogenetic data sets is largely due to discoveries in molecular biology and genetics that are producing gene sequences of more and more organisms. These large sets of sequences create phylogenetic problems that are very difficult to solve--even with the world's most powerful supercomputers.

Because of the difficulty of the problem, there is no provably optimal algorithm for solving it; instead, there are several heuristic approaches. Hillis' project is looking at the performance of an approach based on genetic algorithms, which can be easily scaled to take advantage of large pools of computer resources. They are intrinsically parallel algorithms, and the project has used the Cray T3E at the Texas Advanced Computation Center to refine and test the approach.

Top| Contents | Next

COMMUNITY-WIDE TOOLS

Through collaboration with Paul Lewis, professor of ecology and evolutionary biology at the University of Connecticut, Hillis' group is testing a genetic-algorithm-based software package for parallel supercomputers that helps scientists depict the relationships among large sets of organisms. Lewis' program, GAML, has recently been applied to estimate the evolutionary relationships among the more than 100,000 species of flowering plants. Due to its large size, the flowering plant problem is such that conventional approaches cannot be used to address it effectively.

"GAML has been shown to have great potential to work across large, distributed networks of workstations, as well as on supercomputers," Hillis said. "Future large-scale phylogenetic analyses might therefore take advantage of unused computer resources or, as has been done most successfully by the Search for Extraterrestrial Intelligence, of the computer resources of interested amateurs."

Beyond his own phylogenetics studies, Hillis is working to make such computational tools widely available for determining and manipulating large sets of complex data, such as the entire database of tens of thousands of HIV sequences. Once Hillis and Lewis perfect GAML, they will share it with the biological community. And as a partner in NPACI's Bioinformatics Infrastructure alpha project, Hillis is contributing phylogenetics algorithms to the suite of large-scale analyses that the infrastructure will make possible.

"With GAML, we are hopeful that we are one step closer to reconstructing the complete tree of life," Hillis said. "Reconstructing the tree of life is a major objective of evolutionary biology and will require a serious, coordinated international effort to complete. However, once we complete this feat, scientists will have the framework in place to incorporate newly discovered species into the tree. We will then have a more lucid understanding of the patterns of biodiversity and an extremely powerful tool for interpreting all aspects of biological variation." --KMB *

Top| Contents | Next