
DATA-INTENSIVE COMPUTING


Data Caches Form Framework
for Distributed Data-handling Testbed

PROJECT LEADER
Reagan Moore
SDSC

The goal of the NPACI Data-intensive Computing Environment (DICE) thrust area is to develop software to allow a researcher to identify, access, and analyze data located anywhere on the high-speed networks connecting the partners. To reach this goal, NPACI's partners are working with a computing infrastructure of unprecedented size--the system will be capable of supporting analysis of terabyte-sized data collections.

The difficulty is that the information to be analyzed resides at the NPACI partner sites where experts have assembled their domain-specific data collections. To be useful to other researchers in the partnership, this information must be organized and accessible on a global scale. The solution being implemented by the DICE thrust area is to establish data caches to serve several functions:

  • They will allow faster access to key discipline-specific databases at partner sites.
  • They will permit computational science researchers to stage data for use in simulations at computational resources that have terabyte-sized disk farms.
  • Most importantly, they will form a production testbed on which NPACI partners can develop the tools for integrating distributed and diverse hardware components into a seamless data-handling environment.

"We are building a unique data-handling architecture that integrates tape storage, disk cache, I/O support nodes, and parallel database and archival storage software systems," said Reagan Moore, leader of the DICE thrust area. "The system is being designed to manipulate data on a scale two orders of magnitude larger than has been possible under the prior NSF Supercomputer Centers program."

Data caches are being set up at 10 partner sites, including the five sites that house NPACI high-performance computing resources (Figure 1), to serve research projects in NPACI applications thrust areas and in other areas of computational science that access very large data sets. The caches initially will hold more than four terabytes of data in HPSS and ADSM tape archives, in DB2, Oracle, Informix, Sybase, Illustra, and Objectivity databases, and in Unix file systems on IBM, Sun, SGI, and Digital platforms.

The goal is to allow applications running on NPACI supercomputers to discover and access data in any of these storage locations. To make this possible, the Storage Resource Broker (SRB) from SDSC is being used as the initial access mechanism. Work has also started on integrating the SRB with digital library technology to automate support for services such as information discovery and data analysis.
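The broker idea can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the class names `Broker`, `CatalogEntry`, and `InMemoryBackend`, the method names, and the sample data set are assumptions, and none of it reflects the actual SRB interface. It shows only the general pattern of one catalog that maps logical names to whichever storage system holds the data.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class StorageBackend(Protocol):
    """Anything that can return the bytes stored at a path (hypothetical)."""

    def read(self, path: str) -> bytes: ...


class InMemoryBackend:
    """Stand-in for a tape archive, database, or Unix file system."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)

    def read(self, path: str) -> bytes:
        return self._objects[path]


@dataclass
class CatalogEntry:
    resource: str               # which registered storage system holds the data
    path: str                   # location inside that storage system
    attributes: Dict[str, str]  # descriptive metadata used for discovery


class Broker:
    """Hypothetical broker: one interface in front of many storage systems."""

    def __init__(self) -> None:
        self._backends: Dict[str, StorageBackend] = {}
        self._catalog: Dict[str, CatalogEntry] = {}

    def register_resource(self, name: str, backend: StorageBackend) -> None:
        self._backends[name] = backend

    def register_dataset(self, logical_name: str, resource: str,
                         path: str, **attributes: str) -> None:
        if resource not in self._backends:
            raise KeyError(f"unknown resource: {resource}")
        self._catalog[logical_name] = CatalogEntry(resource, path, attributes)

    def discover(self, **query: str) -> List[str]:
        """Return logical names whose metadata matches every query term."""
        return sorted(
            name for name, entry in self._catalog.items()
            if all(entry.attributes.get(k) == v for k, v in query.items())
        )

    def get(self, logical_name: str) -> bytes:
        """Fetch data by logical name; the caller never sees the backend."""
        entry = self._catalog[logical_name]
        return self._backends[entry.resource].read(entry.path)


# Usage with made-up data: the application asks by discipline, not by location.
broker = Broker()
broker.register_resource("archive-a", InMemoryBackend({"/sky/plate1": b"pixels"}))
broker.register_dataset("sky-survey-plate-1", "archive-a", "/sky/plate1",
                        discipline="astronomy")
names = broker.discover(discipline="astronomy")
data = broker.get(names[0])
```

In this sketch the application code stays the same when a data set moves to a different storage system; only the catalog entry changes.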


[Table: Partner site | Cache size (GB) | Hardware (OS) | Netwo...]

Figure 1: Planned NPACI Data Caches for FY '98.


INITIAL APPLICATIONS

Initially, the data caches will serve specific projects and the general data-intensive computing effort. But each of the applications areas also will affect--and be affected by--developments in the DICE, Programming Tools and Environments, and Interaction Environments technology thrusts.

Specific NPACI DICE efforts include digital library projects at UC Santa Barbara, UC Berkeley, the University of Michigan, and Stanford University. Development work on data collections is being done at Caltech, the University of Houston, Washington University, the University of Maryland, UC Santa Cruz, and UCLA. Development of interfaces, data-handling systems, and tools is being conducted at Oregon State University, the University of Maryland, SDSC, and UC Davis.

The size of the cache at each site will depend on the requirements of the application thrust it serves. Some scientific applications have an inherent need for massive data support. Researchers in Earth Systems Science, for example, typically compare simulation output from Global Climate Models with observational data from satellites. Neuroscientists compare multiple brain images to determine a standard structure. Both groups are expected to manipulate terabytes of data by the year 2000.

Projects in these areas will be the first to apply the tools of the data cache testbed to their needs:

  • Neuroscience. Caches at UCLA, Washington University in St. Louis, and UCSD/SDSC will hold and integrate neuroscience metadata (including brain mapping data) to support the exchange of images among federated data collections.
  • Earth Systems Science. Four data caches--holding a climate database at UC San Diego, an Earth systems database at UCLA, a data server for California natural resources at UC Davis, and a repository for satellite-acquired land cover data at the University of Maryland--will support NPACI research projects. These will become elements of a larger, distributed Earth Systems Digital Library.
  • Molecular Science. The University of Houston is setting up a cache to hold a molecular dynamics trajectory database. Another cache at the University of Houston's Keck Center for Computational Biology will maintain a database of enhanced images of molecules.
  • Astronomy. In addition to supporting computational science research on the HP Exemplar, a cache at Caltech will support the Digital Sky project, which will integrate multiple sky surveys taken in various regions of the electromagnetic spectrum into one comprehensive catalog.

NPACI remote caches also include archival storage testbed systems at the University of Texas and UC Davis and a Web information server testbed at Oregon State University.




INFRASTRUCTURE INTEGRATION

The two technological factors that make the data cache scheme practical are high-performance computing for processing requests at the cache site and high-speed networking for moving large data sets from the cache to a client site. All of the NPACI cache sites are, or will soon be, connected to the NSF's vBNS research network or the University of California's CalREN-2 network, with backbones operating at up to 622 million bits per second now and at 2.4 billion bits per second in the near future.
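To give a sense of scale for these link speeds, the short calculation below estimates how long one terabyte would take to move at each rate. It is an idealized sketch under stated assumptions: a decimal terabyte (10^12 bytes), the full link rate available to a single transfer, and no protocol or storage overhead. Real transfers would be slower. The function and constant names are illustrative only.

```python
def transfer_time_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time: total bits divided by link rate (no overhead)."""
    return size_bytes * 8 / link_bits_per_second


TERABYTE = 10**12        # assumption: decimal terabyte
CURRENT_RATE = 622e6     # 622 million bits per second
FUTURE_RATE = 2.4e9      # 2.4 billion bits per second

for label, rate in (("622 Mbit/s", CURRENT_RATE), ("2.4 Gbit/s", FUTURE_RATE)):
    hours = transfer_time_seconds(TERABYTE, rate) / 3600
    print(f"1 TB at {label}: {hours:.2f} hours")
```

Under these assumptions, a terabyte takes about 3.6 hours at the current backbone rate and a little under one hour at the planned rate.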

Data support mechanisms are an active research area for the partnership. Researchers at the University of Southern California, Argonne National Laboratory, and SDSC are integrating the Globus caching system with the SDSC SRB. This will permit an application that uses the Globus metacomputing environment to access data sets stored in any of the NPACI data caches. Researchers at the University of Maryland are developing more sophisticated Active Data Repositories (ADR) that optimize the layout of data on disk. By integrating ADR with the SRB, it will be possible to optimize the layout of data into a cache at the same time as the data is being fetched.

Current efforts at SDSC, Lawrence Livermore National Laboratory, Caltech, and UC Davis include creating high-performance I/O systems through the use of parallel I/O channels, developing advanced archival storage and database systems to store and organize data, developing discipline-specific databases, and developing tools to analyze massive amounts of data.

"The data caches give us the flexibility to experiment with novel data management tools and environments in a distributed testbed," Moore said. "At the same time, we are improving the ability of the application thrust areas to establish domain-specific data repositories and build upon prior knowledge." --MG
