Mon Sep 08 09:19:12 EDT 2008



Project News
DataCutter 3.2 Release: The newest version was released April 29, 2004, and can be downloaded from the Releases page.



DataCutter Project

Middleware for Filtering Large Archival Scientific Datasets in a Grid Environment

The Ohio State University
Biomedical Informatics Department
Multiscale Computing Laboratory
and
The University of Maryland
Department of Computer Science

As networks connecting computational resources get faster, an increasing number of applications are making use of collective computational resources across local and wide-area networks. One of the consequences of this ability is that scientific and engineering simulations can now generate unprecedented amounts of data. In addition, vast amounts of data are being gathered by advanced sensors and instruments at geographically distributed locations, such as satellites, microscopes, and medical imagers. At the same time, disks continue to become larger and cheaper as they become commoditized; large disk-based storage systems are increasingly easy and inexpensive to set up.

The availability of low-cost systems has greatly enhanced scientists' ability to generate large-scale scientific datasets, store them, and share them across many disparate, distributed, and heterogeneous locations and parties. This trend results in very large datasets distributed across a network of storage and computational resources.

The primary goal of generating data through large-scale simulations or sensors is to better understand the causes and effects of physical phenomena. Understanding can be achieved by querying, mining, and analyzing such massive and complex datasets so that the scientist can gain insights into the problem at hand. This is accomplished in two phases: data exploration and data analysis. In the data exploration phase, the data of interest is extracted from all relevant datasets, using efficient indexing schemes to quickly locate the data items that satisfy a request. In the data analysis phase, application-specific knowledge is used to process and transform the data into a new data product that can be more efficiently consumed by another program or analyzed interactively.

In this project we have been developing a framework, called DataCutter, that is designed to enable exploration and analysis of scientific datasets in distributed and heterogeneous environments. The programming model in DataCutter, called filter-stream programming, represents the components of a data-intensive application as a set of filters. Each filter can potentially be executed on a different host across a wide-area network. Data exchange between any two filters is achieved by streams, which are uni-directional pipes that deliver data in fixed-size buffers. DataCutter provides a core set of services, on top of which application developers can implement more application-specific services, or which they can combine with existing Grid services such as metadata management, resource management, and authentication services. The main design objective in DataCutter is to extend and apply features of the Active Data Repository (ADR), namely support for accessing subsets of datasets via range queries and user-defined filtering operations, to very large datasets in a shared distributed computing environment.
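The filter-stream idea can be illustrated with a minimal Python sketch. This is not the DataCutter API: the names `Stream`, `range_filter`, `scale_filter`, `collect_filter`, and `run_pipeline` are hypothetical, threads stand in for filters that would run on separate hosts, and the buffer size of 4 is an arbitrary choice. The sketch only shows the shape of the model: a range query selects a subset of records, a user-defined filter transforms them, and each stream carries data one way in fixed-size buffers.

```python
from queue import Queue
from threading import Thread

BUFFER_SIZE = 4   # items per buffer; arbitrary value chosen for illustration
_END = object()   # sentinel that marks the end of a stream


class Stream:
    """Hypothetical uni-directional pipe that delivers fixed-size buffers."""

    def __init__(self, buffer_size=BUFFER_SIZE):
        self.buffer_size = buffer_size
        self._queue = Queue()
        self._pending = []

    def write(self, item):
        # Accumulate items and ship a buffer once it is full.
        self._pending.append(item)
        if len(self._pending) == self.buffer_size:
            self._flush()

    def _flush(self):
        if self._pending:
            self._queue.put(self._pending)
            self._pending = []

    def close(self):
        # The final buffer may be only partially filled.
        self._flush()
        self._queue.put(_END)

    def buffers(self):
        # Yield buffers in arrival order until the stream is closed.
        while True:
            buf = self._queue.get()
            if buf is _END:
                return
            yield buf


def range_filter(records, lo, hi, out):
    """Source filter: emit records whose key lies in the range [lo, hi]."""
    for key, value in records:
        if lo <= key <= hi:
            out.write((key, value))
    out.close()


def scale_filter(inp, out, factor):
    """User-defined filter: transform each value read from the input stream."""
    for buf in inp.buffers():
        for key, value in buf:
            out.write((key, value * factor))
    out.close()


def collect_filter(inp, result):
    """Sink filter: gather the final data product into a list."""
    for buf in inp.buffers():
        result.extend(buf)


def run_pipeline(records, lo, hi, factor):
    """Wire three filters together with two streams and run them concurrently."""
    s1, s2 = Stream(), Stream()
    result = []
    threads = [
        Thread(target=range_filter, args=(records, lo, hi, s1)),
        Thread(target=scale_filter, args=(s1, s2, factor)),
        Thread(target=collect_filter, args=(s2, result)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Calling `run_pipeline([(k, float(k)) for k in range(10)], 2, 6, 10.0)` selects the records with keys 2 through 6 and scales their values by 10. Because each filter only sees its input and output streams, a filter could in principle be moved to another host without changing the others, which is the property the programming model relies on.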

This research has been supported by the National Science Foundation under Grants ACI-9619020 (UC Subcontract 10152408), EIA-0121177, EIA-0203846, ACI-0130437, and ACI-9982087; by Lawrence Livermore National Laboratory under Grants B500288 and B517095 (UC Subcontract 10184497); and by Ohio Board of Regents BRTTC BRTT02-0003.

Any opinions, findings, and conclusions or recommendations expressed on this web site are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
