Published 08/04/2003
Reagan Moore, co-director of the Data and Knowledge Systems (DAKS) program at the San Diego Supercomputer Center (SDSC), gave a presentation on "Real-life Experiences with Data Grids" at the recent DOE Science Computing Conference in Arlington, Virginia. The conference focused on next-generation DOE computational resources: how they relate to the needs of science applications ranging from climate change, fusion energy, and astrophysics to biology, and what computer and network resources will be required to meet those needs.
The conference provided an important opportunity to communicate the relevance of SDSC's advanced data and knowledge management technologies for large-scale scientific research, as well as to pursue existing and potential new collaborations between DOE and SDSC projects.
The presentation covered data-intensive analysis and data management activities in SDSC's DAKS group, reflecting the group's extensive experience working with large numbers of scientists and other researchers to develop and apply data and knowledge management methods across a wide range of disciplines. More than 50 active projects currently manage 66 terabytes of data collections in nearly 10 million files at SDSC.
"SDSC is collaborating in a large number of federally sponsored efforts involving real-world data management, based on production use of the SRB and advanced data management technologies," said Moore. "The central issues involve technologies to manipulate massive data collections, and examination of the important challenge of knowledge generation and the extraction of relationships from data."
The data and knowledge management methods that Moore described are being used in dozens of projects across numerous Federal agencies. These range from helping NIH-funded neuroscientists share brain data across the country in the Biomedical Informatics Research Network and enabling astronomers to integrate multi-terabyte image collections in the NSF's National Virtual Observatory, to publishing data for science and engineering education in the National Science Digital Library and developing persistent archives that ensure long-term access to electronic Federal records for the National Archives and Records Administration.
The emerging end-to-end data and knowledge management architecture in DAKS technologies spans sensor data ingestion, using methods such as Object Ring Buffers to manage real-time data; data organization in collections that manage data context; data sharing through data grids that manage heterogeneity; data publication in digital libraries that support discovery and browsing; data preservation in persistent archives that manage technology evolution; and data analysis in processing pipelines that are used to generate new knowledge. An important area of current research is the development of increasingly sophisticated data analysis in inference systems that manage knowledge extraction from massive data sets, providing researchers with powerful new tools for insight and knowledge discovery.
-Paul Tooby
San Diego Supercomputer Center (SDSC) - http://www.sdsc.edu/
Data and Knowledge Systems (DAKS) program - http://daks.sdsc.edu/
SDSC Storage Resource Broker (SRB) - http://www.sdsc.edu/DICE/SRB/
Department of Energy (DOE) - http://www.energy.gov/engine/content.do