Datacentric Grid Project
Grids are geographically distributed, heterogeneous collections of computing resources that can be accessed through a single point of contact. They provide computational power or access to data at scales beyond even the largest single system.
Several different kinds of grids have been proposed (and some have been built):
The Datacentric Grid Project aims to design and implement grids for data-intensive operations in which data is moved as little as possible. In other words, we aim to reverse the traditional (but increasingly inappropriate) view that processors are the critical resource in systems and hence that data should move to processors.
There are six factors that motivate datacentric grids:
The datacentric grid requires a new framework, including:
A typical scenario for a datacentric application is shown below. A user, who may be using a thin and perhaps mobile client, wants to start an application. The application now enters the execution planning phase. This involves discovering where appropriate data is located, where resources are available to execute the cycles required by the application, and translating the application into a set of tasks to be executed at these locations. This is fairly conventional in the sense that these issues arise in computational grids. However, there are several extra constraints: computations that condense data or extract conceptual information from it must be colocated with the data on which they will act; there will often be communication between these computations during the process of knowledge extraction, so the choice of which versions of datasets to use is further constrained by considerations of communication bandwidth and latency; and there will often be further processing of the results of knowledge extraction which can take place at other compute servers. Finally, the results of the application must be rendered appropriately for the user's client and returned to it.
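The replica-selection part of execution planning can be illustrated with a small sketch. It is not the project's planner: the names `Replica` and `plan_extraction`, the site names, and the symmetric latency table are all hypothetical, and exhaustive search over replica choices is assumed only because it is easy to read. The sketch captures two of the constraints above: each knowledge-extraction task runs at a site that already holds a copy of its dataset (so raw data never moves), and the copies are chosen to keep communication between the extraction tasks cheap.

```python
from dataclasses import dataclass
from itertools import combinations, product


@dataclass(frozen=True)
class Replica:
    """One stored copy of a dataset at a grid site (hypothetical model)."""
    dataset: str
    site: str


def plan_extraction(datasets, replicas, latency):
    """Pick one replica site per dataset, minimising total pairwise latency.

    The extraction task for each dataset is colocated with the chosen
    replica. `latency` maps frozenset({site_a, site_b}) to a cost;
    tasks placed at the same site are assumed to communicate for free.
    Returns (placement dict, total cost).
    """
    candidates = {
        d: [r.site for r in replicas if r.dataset == d] for d in datasets
    }
    for d, sites in candidates.items():
        if not sites:
            raise ValueError(f"no replica available for dataset {d!r}")

    best_choice, best_cost = None, float("inf")
    # Assumption: exhaustive search; fine for a handful of datasets only.
    for choice in product(*(candidates[d] for d in datasets)):
        cost = sum(
            latency[frozenset((a, b))]
            for a, b in combinations(choice, 2)
            if a != b
        )
        if cost < best_cost:
            best_choice, best_cost = choice, cost
    return dict(zip(datasets, best_choice)), best_cost


# Illustrative data only: two datasets, each replicated at two sites.
replicas = [
    Replica("sales", "kingston"),
    Replica("sales", "toronto"),
    Replica("logs", "toronto"),
    Replica("logs", "vancouver"),
]
latency = {
    frozenset(("kingston", "toronto")): 5,
    frozenset(("kingston", "vancouver")): 40,
    frozenset(("toronto", "vancouver")): 35,
}
placement, cost = plan_extraction(["sales", "logs"], replicas, latency)
```

With this data, both extraction tasks are placed at the one site that holds copies of both datasets, so no inter-site communication is needed. A real planner would also have to account for available compute cycles, bandwidth, and the later processing and rendering stages, which this sketch leaves out.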
If the application executes within a virtual grid (probably the normal situation) then its access to public sources of data must also not reveal the purpose of the overall computation.
The closest work to this is Bob Grossman's DataSpace (including PMML and Terra) Project. The most important differences between datacentric grids and DataSpaces are: (a) DataSpace assumes all data is table-structured, whereas datacentric grids work with structured, semi-structured, and unstructured (e.g. text) data; and (b) DataSpace assumes the WWW's model of moving data to the client and executing the required operations there, whereas datacentric grids never move data in its raw form.