What are grids?
Grids are geographically distributed, heterogeneous collections of
computing resources that can be accessed through a single point of contact.
They provide computational power or access to data at scales beyond even
the largest single system.
Several different kinds of grids have been proposed (and some have been
Computational grids, whose primary purpose is to support
high-performance computation. These extend parallel and distributed
systems to unprecedented scale and heterogeneity.
Access grids, whose primary purpose is to allow ad hoc
groups of people to use the resources of their individual organizations
as a single, properly protected resource.
Data grids, whose primary purpose is to move large volumes of
data around effectively.
Why datacentric grids?
The Datacentric Grid Project aims to design and implement grids
for data-intensive operations in which data is moved as little as
possible. In other words, we aim to reverse the traditional (but
increasingly inappropriate) view that processors are the critical
resource in systems and hence that data should move to processors.
There are six factors that motivate datacentric grids:
The volume of data quadruples every 18 months, while the available
performance per processor doubles in the same time period. Unless
the number of processors increases unrealistically rapidly, most of
this data will never be touched.
Bandwidth is increasing, but not at the same rate as stored data. In
the medium term, it will not be cost-effective to move raw data freely
across global-scale networks.
The latency of transmission over global distances is significant
and hence imposes a substantial performance penalty on programs that use
non-local data. Even if input data can be prefetched and results
post-stored, this latency shows up in response time.
Large amounts of data have a property very like inertia: data is cheap
to store and cheap to keep moving, but the transitions between these
two states are expensive in time and hardware.
There are increasing legal and political restrictions on the movement
of data in support of privacy. For example, customer data cannot be
exported freely from a number of jurisdictions.
If data is perceived to have value, then those who own it may allow it
to be used under their control but not to be removed (there's a reason
that the world's great deposit libraries do not allow their books to
Data, especially large data, is increasingly immovable, and this must be
taken into account in designing ways to compute with it. In particular,
data repositories will have to be provided with local front-ends, with
significant computing power, so that computations can be done as close
to their data as possible.
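As a rough illustration of the first factor above, the following sketch
works out how quickly the number of processors would have to grow just to
keep touching all of the data. The growth factors and the 18-month period
are the ones quoted above; the function names and the chosen time horizons
are assumptions made for this example only.

```python
# Growth rates quoted in the text: every 18 months the volume of data
# quadruples and the performance of a single processor doubles.
PERIOD_MONTHS = 18
DATA_GROWTH = 4.0
PROCESSOR_GROWTH = 2.0


def growth(factor_per_period: float, months: float) -> float:
    """Compound growth after `months`, given a factor per 18-month period."""
    return factor_per_period ** (months / PERIOD_MONTHS)


def processors_needed(months: float, initial_processors: float = 1.0) -> float:
    """Processors required to touch all the data as often as today.

    Assumption: work scales linearly with data volume and is perfectly
    parallel, which is optimistic.
    """
    return initial_processors * growth(DATA_GROWTH, months) / growth(PROCESSOR_GROWTH, months)


if __name__ == "__main__":
    for years in (0, 3, 6):
        months = years * 12
        print(f"after {years} years: {processors_needed(months):.0f}x the processors")
```

Under these assumptions the processor count must itself double every 18
months: a factor of 4 after three years and 16 after six.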
What are datacentric grids?
The datacentric grid requires a new framework, including:
a programming model, possibly embodied in a kind of query language;
a resource discovery and execution planning scheme that explicitly
handles data structure, location, and arrangement;
a runtime system;
specialized compute server/data repositories.
A typical scenario for a datacentric application is shown below.
A user, who may be using a thin and perhaps mobile client, wants
to start an application.
The application now enters the execution planning phase. This involves
discovering where appropriate data is located, where resources are available
to execute the cycles required by the application, and translating the
application into a set of tasks to be executed at these locations. This
is fairly conventional in the sense that these issues arise in computational
grids. However, there are several extra constraints: computations that
condense data or extract conceptual information from it must be colocated
with the data on which they will act; there will often be communication
between these computations during the process of knowledge extraction,
so the choice of which versions of datasets to use is further constrained
by considerations of communication bandwidth and latency; and there will often
be further processing of the results of knowledge extraction which can
take place at other compute servers.
Finally, the results of the application
must be rendered appropriately for the user's client and returned to it.
If the application executes within a virtual grid (probably the normal
situation) then its access to public sources of data must also not
reveal the purpose of the overall computation.
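The colocation and communication constraints of the execution planning
phase can be sketched as a small search problem. This is only an
illustration, not the project's planner: the catalogue of dataset copies,
the site names, the latency figures, and the function names are all
hypothetical, and exhaustive search is used purely for brevity.

```python
from itertools import product

# Hypothetical catalogue: dataset name -> sites that hold a copy of it.
REPLICAS = {
    "sales": ["kingston", "tokyo"],
    "weblogs": ["toronto", "sydney"],
}

# Hypothetical latencies between pairs of sites, in milliseconds.
LATENCY_MS = {
    frozenset(["kingston", "toronto"]): 5,
    frozenset(["kingston", "sydney"]): 120,
    frozenset(["tokyo", "toronto"]): 90,
    frozenset(["tokyo", "sydney"]): 60,
}


def latency(a: str, b: str) -> int:
    """Latency between two sites; zero when they are the same site."""
    return 0 if a == b else LATENCY_MS[frozenset([a, b])]


def plan(datasets):
    """Choose one copy of each dataset for the extraction tasks.

    Each extraction task must run at the site that holds its data
    (colocation), so choosing a copy also fixes where the task runs.
    Among all combinations, pick the one with the lowest total pairwise
    latency between the communicating extraction tasks.
    """
    best_cost, best_choice = None, None
    for choice in product(*(REPLICAS[d] for d in datasets)):
        cost = sum(
            latency(a, b)
            for i, a in enumerate(choice)
            for b in choice[i + 1:]
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, choice
    return dict(zip(datasets, best_choice)), best_cost


if __name__ == "__main__":
    assignment, cost = plan(["sales", "weblogs"])
    print(assignment, cost)
```

With the made-up figures above, the planner runs the two extraction tasks
at kingston and toronto, the pair of copies with the lowest latency between
them. A realistic planner would also weigh bandwidth, available cycles, and
where the later processing of results should run.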
The closest work to this is Bob Grossman's DataSpace (including
PMML and Terra) Project. The most important differences between
datacentric grids and DataSpace are: (a) DataSpace assumes all data
is table structured, whereas datacentric grids work with structured,
semi-structured, and unstructured (e.g., text) data; and (b) DataSpace
assumes the WWW's model of moving data to the client and executing
the required operations there, whereas datacentric grids never move
data in its raw form.
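The idea of moving the operation to the data, rather than the data to the
client, can be sketched as follows. The Repository class, its run method,
and the summarize function are hypothetical names invented for this
example; they do not describe any interface of the project or of DataSpace.

```python
from statistics import mean


class Repository:
    """Hypothetical data repository with a local compute front-end."""

    def __init__(self, name, records):
        self.name = name
        self._records = list(records)  # raw data is held at the repository

    def run(self, condense):
        """Execute a condensing function next to the data.

        Only the function's result is returned to the caller.
        """
        return condense(self._records)


def summarize(records):
    """Condense raw records into a small summary."""
    return {"count": len(records), "mean": mean(records)}


# The client ships the operation to the repository and gets back a summary.
repo = Repository("site-a", [12.0, 15.0, 9.0, 14.0])
summary = repo.run(summarize)
print(summary)
```

Nothing in this sketch stops a submitted function from simply returning the
raw records; a real front-end would have to enforce that restriction itself.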
Motivating Computational Grids, which discusses the reasons for
using and providing resources for computational grids. We conclude that
there are good reasons for such grids, but they are not the ones
usually mentioned in the literature.
The Case for Datacentric Grids, which describes the motivation
and goals of this project.
Slides from a panel presentation David
Skillicorn made at the High Performance Data Mining Workshop at
the SIAM Scientific Data Mining Conference (shorter version of the
Data Mining, Parallelism, and Grids, a presentation
of our group's recent work in high-performance data mining and how it fits
with datacentric grids.