As networks connecting computational resources get faster, an
increasing number of applications are making use of collective
computational resources across local and wide-area networks. One of
the consequences of this ability is that scientific and engineering
simulations can now generate unprecedented amounts of data. In addition,
vast amounts of data are being gathered at geographically distributed
locations by advanced sensors and instruments such as satellites, microscopes,
and medical imagers. At the same time, disks continue to become larger and
cheaper as they become commoditized; large disk-based storage
systems are increasingly easy and inexpensive to set up.
The availability of low-cost systems has greatly enhanced scientists'
ability to generate large-scale scientific datasets, store them, and
share them across many disparate, distributed, and heterogeneous locations and
parties. This trend results in very large datasets distributed across a
network of storage and computational resources.
The primary goal of
generating data through large scale simulations or sensors is to
better understand the causes and effects of physical
phenomena. Understanding can be achieved by querying, mining and
analyzing such massive and complex datasets so that the scientist can
gain insights into the problem at hand. This is accomplished in two phases:
data exploration and data analysis. In the data exploration phase, the data of
interest is extracted from all relevant datasets through the use of
efficient indexing schemes to quickly locate the data items that satisfy a
request. In the data analysis phase, application-specific knowledge is used to
process and transform the data into a new data product that can
be more efficiently consumed by another program or analyzed
interactively.
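
The two phases can be illustrated with a minimal sketch. It assumes a dataset stored as chunks, each tagged with a 2-D bounding box; the names `Chunk`, `explore`, and `analyze` are hypothetical and chosen only for this example. The exploration step here is a linear scan over chunk metadata, standing in for the efficient indexing scheme (for example, a spatial tree) that a real system would use, and the analysis step is a simple mean, standing in for application-specific processing.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A box is ((xmin, ymin), (xmax, ymax)). This layout is an assumption of the sketch.
Box = Tuple[Tuple[float, float], Tuple[float, float]]


@dataclass
class Chunk:
    """A piece of a dataset together with the region of space it covers."""
    box: Box
    values: List[float]


def intersects(a: Box, b: Box) -> bool:
    """Return True if the two axis-aligned boxes overlap (edges included)."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def explore(index: List[Chunk], query: Box) -> List[Chunk]:
    """Exploration phase: select the chunks whose region meets the range query.

    A linear scan is used for brevity; it is a stand-in for a real index.
    """
    return [chunk for chunk in index if intersects(chunk.box, query)]


def analyze(chunks: List[Chunk]) -> Optional[float]:
    """Analysis phase: reduce the selected data to a smaller data product.

    Here the product is the mean of all selected values, or None if no
    data was selected.
    """
    values = [v for chunk in chunks for v in chunk.values]
    return sum(values) / len(values) if values else None
```

A caller would build the list of `Chunk` objects once, then pass each range query through `explore` and hand the result to `analyze`, so that only the data of interest reaches the processing step.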
In this project we have been developing a framework, called
DataCutter, that is designed to enable exploration and analysis of
scientific datasets in distributed and heterogeneous environments. The
programming model in DataCutter, called filter-stream programming,
represents components of a data-intensive application as a set of
filters. Each filter can potentially be executed on a different host
across a wide-area network. Data exchange between any two filters is
achieved by streams, which are uni-directional pipes that deliver
data in fixed-size buffers. DataCutter provides a core set of
services, on top of which application developers can implement more
application-specific services or which they can combine with existing
Grid services such as metadata management, resource management, and
authentication services. The main design objective in DataCutter is to extend
the features of the Active Data Repository (ADR), namely support for
accessing subsets of datasets via range queries and user-defined
filtering operations, and apply them to very large datasets in a shared
distributed computing environment.
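
The filter-stream idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not DataCutter's actual API: the `Stream` class and the `read_filter`, `select_filter`, `sum_filter`, and `run_pipeline` functions are hypothetical names, threads within one process stand in for filters placed on different hosts, and a bounded in-memory queue stands in for a network connection.

```python
import queue
import threading
from typing import Iterable, Iterator, List

_END = object()  # sentinel marking end-of-stream (an assumption of this sketch)


class Stream:
    """Uni-directional pipe that delivers data in buffers of a fixed maximum size."""

    def __init__(self, buffer_size: int, capacity: int = 8):
        self.buffer_size = buffer_size
        self._q: queue.Queue = queue.Queue(maxsize=capacity)

    def write_all(self, items: Iterable[int]) -> None:
        """Pack items into buffers of `buffer_size`; the last one may be partial."""
        buf: List[int] = []
        for item in items:
            buf.append(item)
            if len(buf) == self.buffer_size:
                self._q.put(buf)
                buf = []
        if buf:
            self._q.put(buf)

    def close(self) -> None:
        self._q.put(_END)

    def buffers(self) -> Iterator[List[int]]:
        """Yield buffers until the writer closes the stream."""
        while True:
            buf = self._q.get()
            if buf is _END:
                return
            yield buf


def read_filter(data: Iterable[int], out: Stream) -> None:
    """Source filter: pushes the raw data downstream."""
    out.write_all(data)
    out.close()


def select_filter(inp: Stream, out: Stream, lo: int, hi: int) -> None:
    """Range-query filter: forwards only values with lo <= value <= hi."""
    for buf in inp.buffers():
        out.write_all(x for x in buf if lo <= x <= hi)
    out.close()


def sum_filter(inp: Stream, result: List[int]) -> None:
    """Sink filter: reduces the selected values to a single number."""
    result.append(sum(sum(buf) for buf in inp.buffers()))


def run_pipeline(data: Iterable[int], lo: int, hi: int, buffer_size: int = 4) -> int:
    """Connect three filters with two streams and run each in its own thread."""
    s1, s2 = Stream(buffer_size), Stream(buffer_size)
    result: List[int] = []
    threads = [
        threading.Thread(target=read_filter, args=(data, s1)),
        threading.Thread(target=select_filter, args=(s1, s2, lo, hi)),
        threading.Thread(target=sum_filter, args=(s2, result)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

Because each filter touches only its input and output streams, the same three filters could in principle be placed on different machines by replacing the queue inside `Stream` with a network transport. Placing the selection filter close to the data source would then reduce the volume of data sent downstream.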
This research has been supported by the National Science Foundation
under Grants ACI-9619020 (UC Subcontract 10152408), EIA-0121177,
EIA-0203846, ACI-0130437, ACI-9982087, Lawrence Livermore National
Laboratory under Grants B500288 and B517095 (UC Subcontract 10184497),
and Ohio Board of Regents BRTTC BRTT02-0003.
Any opinions, findings, and conclusions or recommendations expressed
on this web site are those of the author(s) and do not necessarily reflect
the views of the National Science Foundation.