Published 07/30/2003
The San Diego Supercomputer Center (SDSC) at the University of California, San Diego has released the initial version of the SKIDLkit data mining toolkit, giving scientists an easy-to-use set of advanced data mining capabilities. In developing SKIDLkit, researchers in the SDSC Knowledge and Information Discovery Lab (SKIDL) focused on end-to-end applications, working in close collaboration with discipline scientists in Earth systems science, medical science, and the safety monitoring of civil infrastructure such as highway bridges.
"A key part of what SDSC is all about is providing scientists with access to powerful software that is practical for them to use," said Tony Fountain, director of SKIDL at SDSC. "SKIDLkit is a good example of this, and underscores SDSC's recognized leadership in providing usable data technologies."
SKIDLkit, which is open source software written in C and Matlab, is now online and available for download at http://daks.sdsc.edu/skidl/skidldownloads.html. A user guide and technical reports on using these data mining tools for microarray analysis and mining hyperspectral remote sensing data are also available. These challenging data mining problems are known as "high dimensional" because they involve hundreds or even thousands of features, or potentially governing variables, for each observation.
"We've assembled and integrated open source data mining tools to help discipline scientists analyze their data, without having to be experts in complex computer science issues," said Fountain. "We don't build software we can get elsewhere, but the tools in SKIDLkit simply weren't available for the applications we were interested in, and so we had to build them, and now we've packaged them for community use."
Data Mining Tools for High Dimensional Data
Typical data mining problems are characterized by a relatively small number of features, or variables that potentially govern the phenomenon, and a large number of observations. For example, a business data mining problem might involve a file with 10,000 customers, or "observations," but only half a dozen features such as name, address, date, and purchases. In modern scientific research, however, a new class of problems is emerging that reverses this pattern. In these high dimensional problems, the number of features, or potentially governing variables, is very large, while observations are relatively scarce.
Since existing data mining or machine learning tools don't handle this kind of problem well, the SKIDL team set out to develop tools particularly suited for mining such high dimensional data sets, to help scientists efficiently sort through the forest of variables and zero in on the ones most likely to be of interest for further detailed analysis.
SKIDLkit contains algorithms and methods for both feature selection and predictive modeling. For feature selection, SKIDLkit includes both filter and wrapper methods. A filter method applies a statistical test to each feature across the full data set and uses the resulting scores to identify the key features. The three filter methods included are t-test, prediction strength, and Bhattacharyya distance. Wrapper methods instead search for a good subset of features by repeatedly training and evaluating a predictive model on candidate subsets. The wrapper methods included are genetic algorithms and recursive feature elimination.
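To make the filter idea concrete, here is a minimal sketch of a t-test filter in Python. It is an illustration only, not SKIDLkit's code (which is written in C and Matlab): the function names, the choice of Welch's form of the t-statistic, and the toy data are all assumptions made for this example.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (variances need not be equal)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if standard_error == 0:
        return 0.0
    return (mean(a) - mean(b)) / standard_error


def rank_features_by_t(samples, labels, top_k):
    """Return the indices of the top_k features with the largest |t|
    between class 0 and class 1. Hypothetical helper for illustration."""
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        class0 = [row[j] for row, y in zip(samples, labels) if y == 0]
        class1 = [row[j] for row, y in zip(samples, labels) if y == 1]
        scores.append((abs(welch_t(class0, class1)), j))
    scores.sort(reverse=True)  # highest score first
    return [j for _, j in scores[:top_k]]


# Toy data (made up): 6 observations, 4 features.
# Only feature 2 differs clearly between the two classes.
samples = [
    [1.0, 5.1, 0.1, 3.0],
    [1.2, 4.9, 0.2, 3.1],
    [0.9, 5.0, 0.0, 2.9],
    [1.1, 5.0, 2.1, 3.0],
    [1.0, 5.2, 1.9, 3.2],
    [1.1, 4.8, 2.0, 2.8],
]
labels = [0, 0, 0, 1, 1, 1]

top = rank_features_by_t(samples, labels, top_k=1)
print(top)  # [2]
```

Each feature is scored independently of the others, which is what keeps a filter cheap when there are thousands of features. A wrapper method would instead retrain a model for every candidate subset.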
These approaches are coupled with predictive modeling algorithms, including support vector machines and Bayesian belief networks. After the key features are selected, a model is created and iteratively tested and improved. Finally, new data samples that have not previously been analyzed can be efficiently classified using the model. In addition, domain scientists can gain insight into the dynamics and key variables in the domain by studying the selected features.
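The select-then-classify workflow described above can be sketched in a few lines of Python. This is a hedged illustration, not SKIDLkit's implementation: a simple nearest-centroid classifier stands in for the support vector machines and Bayesian belief networks the toolkit provides, and the function names, the `selected` feature list, and the data are hypothetical.

```python
from statistics import mean


def project(row, selected):
    """Keep only the selected feature columns of one observation."""
    return [row[j] for j in selected]


def fit_centroids(samples, labels, selected):
    """Build a nearest-centroid model: one mean vector per class,
    computed over the selected features only."""
    centroids = {}
    for cls in set(labels):
        rows = [project(r, selected) for r, y in zip(samples, labels) if y == cls]
        centroids[cls] = [mean(column) for column in zip(*rows)]
    return centroids


def classify(row, centroids, selected):
    """Assign a new observation to the class with the closest centroid."""
    x = project(row, selected)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(centroids, key=lambda cls: squared_distance(centroids[cls]))


# Toy training data (made up): 6 observations, 4 features.
train = [
    [1.0, 5.1, 0.1, 3.0],
    [1.2, 4.9, 0.2, 3.1],
    [0.9, 5.0, 0.0, 2.9],
    [1.1, 5.0, 2.1, 3.0],
    [1.0, 5.2, 1.9, 3.2],
    [1.1, 4.8, 2.0, 2.8],
]
train_labels = [0, 0, 0, 1, 1, 1]

# Assume an earlier feature selection step picked feature 2.
selected = [2]

model = fit_centroids(train, train_labels, selected)
prediction = classify([1.0, 5.0, 1.8, 3.0], model, selected)
print(prediction)  # 1
```

Because the model sees only the selected columns, classifying a new sample costs a handful of operations regardless of how many features the raw data contain.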
In developing SKIDLkit, to ensure that the tools would be practical to use for a range of problems, the SKIDL researchers worked closely with scientists in three different applications: environment, cancer diagnosis and bridge safety.
Environment
In the National Partnership for Advanced Computational Infrastructure (NPACI) Earth Systems Science (ESS) thrust, Fountain and SKIDL researcher Peter Shin collaborated with colleagues at the NSF Long-Term Ecological Research Network (LTER) at the University of New Mexico, along with researchers at the Northwest Alliance for Computational Science and Engineering (NACSE) at Oregon State University. The researchers applied SKIDLkit tools in land cover classification using hyperspectral remote sensing data, where each pixel in the image contains over 200 frequencies of reflected light. "In the past, remote sensing data traditionally involved fewer than 10 frequency bands," said Fountain. "Now, hyperspectral data has over 200 bands, so it's more difficult for researchers to identify which bands predict a given type of ground cover, and SKIDLkit helps them answer this question."
Cancer Diagnosis
The second problem the SKIDL team tackled is in medical science, where Fountain and SKIDL researcher Hector Jasso have been collaborating with colleagues in the Moores UCSD Cancer Center to improve early diagnosis of ovarian cancer. In this research, results from microarray gene expression experiments are being stored in high-throughput databases for visualization and further analysis. These high dimensional data may include expression levels for 15,000 genes (a large number of features) for only 100 patients (a small number of observations). SKIDLkit contributes to this research by using machine learning algorithms that can improve early detection of cancer, enhance prognosis, and give medical researchers new understanding of the complex phenomena underlying cancer.
Bridge Safety
As a third application in yet another field, the SKIDLkit tools are being applied to analyze sensor network data in monitoring civil infrastructure, such as the structural integrity of highway bridges. In this NSF Information Technology Research (ITR) project, which is being done in collaboration with the UCSD Department of Structural Engineering, a potentially large number of sensors sends back a stream of time series data on a bridge, and SKIDLkit tools can help engineers identify which sensors to rely on for building a predictive model that will best characterize the structural integrity of the bridge.
"It's very gratifying to find this range of utility in SKIDLkit," said Fountain. "The software has been able to identify common characteristics and yield useful results across very distinct phenomena, ranging from environmental science to proteomics and civil infrastructure."
With an eye to the future evolution of Web services and grid services, the SKIDL team has also produced prototype online data mining applications using Web services standards such as Simple Object Access Protocol (SOAP) and the Web Services Description Language (WSDL). In addition, they have prototyped these data mining tools in Globus 3 to explore their operation with grid services.
- Paul Tooby