Datacentric Grid Project

What are grids?

Grids are geographically distributed, heterogeneous collections of computing resources that can be accessed through a single point of contact. They provide computational power or access to data at scales beyond even the largest single system.

Several different kinds of grids have been proposed (and some have been built):

Why datacentric grids

The Datacentric Grid Project aims to design and implement grids for data-intensive operations in which data is moved as little as possible. In other words, we aim to reverse the traditional (but increasingly inappropriate) view that processors are the critical resource in systems and hence that data should move to processors.

There are six factors that motivate datacentric grids:

  1. The volume of data quadruples every 18 months, while the available performance per processor doubles in the same time period. Unless the number of processors increases unrealistically rapidly, most of this data will never be touched.

  2. Bandwidth is increasing, but not at the same rate as stored data. In the medium term, it will not be cost-effective to move raw data freely across global-scale networks.

  3. The latency of transmission over global distances is significant and hence imposes a significant performance hit on programs that use non-local data. Even if input data can be prefetched and results post-stored, this latency shows up in response time.

  4. Large amounts of data have a property very like inertia - it is cheap to store and cheap to keep moving, but the transitions between these two states are expensive in time and hardware.

  5. There are increasing legal and political restrictions on the movement of data in support of privacy. For example, customer data cannot be exported freely from a number of jurisdictions.

  6. If data is perceived to have value, then those who own it may allow it to be used under their control but not to be removed (there's a reason that the world's great deposit libraries do not allow their books to circulate).

Data, especially large data, is increasingly immovable, and this must be taken into account in designing ways to compute with it. In particular, data repositories will have to be provided with local front-ends, with significant computing power, so that computations can be done as close to their data as possible.
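The arithmetic behind factor 1 can be made concrete with a small sketch. It assumes smooth exponential growth at exactly the rates quoted above (data quadrupling and per-processor performance doubling every 18 months); the function names are illustrative and not part of the project.

```python
def growth(factor_per_period: float, months: float, period_months: float = 18.0) -> float:
    """Growth multiple after `months`, given a fixed factor per period."""
    return factor_per_period ** (months / period_months)


def processors_needed_factor(months: float) -> float:
    """How many times more processors are needed after `months`
    to touch the same fraction of the stored data as today.

    Assumes data volume x4 and per-processor performance x2 every 18 months.
    """
    data = growth(4.0, months)         # growth in stored data
    performance = growth(2.0, months)  # growth in per-processor performance
    return data / performance


if __name__ == "__main__":
    for years in (3, 6, 9):
        months = 12 * years
        print(years, "years:",
              "data x", growth(4.0, months),
              "| per-processor performance x", growth(2.0, months),
              "| processors needed x", processors_needed_factor(months))
```

Under these assumptions the number of processors must itself double every 18 months just to keep up. After six years (four periods), data has grown 256-fold but per-processor performance only 16-fold, leaving a 16-fold gap.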

What are datacentric grids?

The datacentric grid requires a new framework, including:

A typical scenario for a datacentric application is shown below. A user, who may be using a thin and perhaps mobile client, wants to start an application. The application then enters the execution planning phase. This involves discovering where appropriate data is located, discovering where resources are available to execute the cycles required by the application, and translating the application into a set of tasks to be executed at these locations. These issues are fairly conventional, since they also arise in computational grids. However, there are several extra constraints. First, computations that condense data or extract conceptual information from it must be colocated with the data on which they will act. Second, there will often be communication between these computations during the process of knowledge extraction, so the choice of which versions of datasets to use is further constrained by considerations of communication bandwidth and latency. Third, there will often be further processing of the results of knowledge extraction, which can take place at other compute servers. Finally, the results of the application must be rendered appropriately for the user's client and returned to it.

If the application executes within a virtual grid (probably the normal situation) then its access to public sources of data must also not reveal the purpose of the overall computation.
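The execution planning phase described above can be illustrated with a minimal sketch. This is not the project's planner: the `Site`, `Task`, and `plan` names, the site names, and the cycle counts are all hypothetical, and the sketch ignores communication bandwidth, latency, and the choice among dataset versions. It shows only the colocation constraint: a task that condenses a dataset may run only at a site that holds that dataset, while follow-on processing may run at any site with spare cycles.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Site:
    name: str
    datasets: frozenset  # names of datasets stored at this site
    free_cycles: int     # spare compute capacity (arbitrary units)


@dataclass(frozen=True)
class Task:
    name: str
    cycles: int
    dataset: Optional[str] = None  # if set, the task must run where this data lives


def plan(tasks: List[Task], sites: List[Site]) -> Dict[str, str]:
    """Greedily map each task to a site name.

    Data-bound tasks are restricted to sites holding their dataset;
    among eligible sites, the one with the most remaining cycles wins.
    """
    remaining = {s.name: s.free_cycles for s in sites}
    placement: Dict[str, str] = {}
    for task in tasks:
        if task.dataset is not None:
            candidates = [s for s in sites if task.dataset in s.datasets]
        else:
            candidates = list(sites)
        candidates = [s for s in candidates if remaining[s.name] >= task.cycles]
        if not candidates:
            raise ValueError(f"no site can run task {task.name!r}")
        chosen = max(candidates, key=lambda s: remaining[s.name])
        remaining[chosen.name] -= task.cycles
        placement[task.name] = chosen.name
    return placement


# Hypothetical grid: two data repositories with local front-ends, one compute server.
sites = [
    Site("repo_a", frozenset({"sales"}), free_cycles=10),
    Site("repo_b", frozenset({"web_logs"}), free_cycles=10),
    Site("compute_c", frozenset(), free_cycles=50),
]
tasks = [
    Task("condense_sales", cycles=8, dataset="sales"),
    Task("condense_logs", cycles=6, dataset="web_logs"),
    Task("merge_models", cycles=20),       # works on condensed results only
    Task("render_for_client", cycles=2),
]
placement = plan(tasks, sites)
```

In this example the two condensing tasks are pinned to the repositories that hold their data, and the later stages fall to the compute server because it has the most spare capacity. A realistic planner would also have to weigh the communication between colocated computations, which this greedy sketch leaves out.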

Related Work

The closest work to this is Bob Grossman's DataSpace (including PMML and Terra) Project. The most important differences between datacentric grids and DataSpaces are: (a) DataSpace assumes all data is table structured, whereas datacentric grids work with structured, semi-structured, and unstructured (e.g. text) data; and (b) DataSpace assumes the WWW's model of moving data to the client and executing the required operations there, whereas datacentric grids never move data in its raw form.

Project publications

Back to David Skillicorn's home page.