Parallel Data Mining

David Skillicorn
Department of Computing and Information Science
Queen's University
Kingston.
skill@cs.queensu.ca

This document summarizes the slides used at this workshop. It is necessarily a little disjointed, and omits some of the more technical material.

Origins of Data Mining

The potential data is huge. A dataset might contain information about 10^9 transactions, each involving 1000 possible attributes.

BUT human variability is also huge. There needs to be greater awareness of how quickly data ages.

The number of organisations involved is also growing rapidly. It's hard to get reliable information, but the number of `industrial' users in the TOP500 increased from 153 (Nov'97) to 207 (Nov'98).

About half of these are probably doing some form of data mining.

Commercial applications in TOP500 (Nov 1998)

Rank Owner
64 State Farm
79 Charles Schwab
91 Oracle/IBM
117 Chase Manhattan
119 Sears
173-6 Commerzbank
178-9 Deutsche Morgan Grenfell
206 Prudential
211 Lexis Nexis
224 SMVG Bern

Approaches to Data Mining

There are three main approaches to data mining ...

``I suspect that..."

Sometimes called verification or exploratory data mining.

An analyst has a hypothesis, and uses data to verify it, or partially verify it.

This leads to a revised hypothesis, which begins the process again.

Response time is important so that trains of thought can be followed. The approach is mostly based on OLAP technology.

``I know who, now tell me why, OR now predict who else..."

This is based on existing practices which are reasonably well understood. Such techniques are called supervised, because the algorithms train on customers who are known to have some property X.

The transparent approach: these are my customers with property X, now tell me what else they have in common.

The opaque approach: give me a black box that identifies (new) customers with property X.

Examples: churn reduction (contract renewal), automated judgement (mortgage approval, medical billing), targeting (bulk mailing, retail marketing).

Transparent approaches may be more useful, but require more sophistication to exploit.

``Tell me something interesting..."

Here the algorithm is to find hypotheses that are likely given the available data, and also interesting, i.e. unlikely to have occurred by chance. Such techniques are called unsupervised.

Typical outcomes are of the form ``there is a cluster of customers with the following characteristics".

Examples: market segmentation, unsuspected correlations.

This approach seems to have the greatest potential since it provides information that is not accessible in any other way.

Data mining algorithms can be regarded abstractly as either:

There are also important pre- and post-processing issues:

These issues are the same for both sequential and parallel data mining, so we won't address them explicitly.

What's the role of parallel computing?

Most data mining algorithms are computationally expensive:

On the other hand...

So the case for parallelism in data mining is not proven, although many suspect it will be important, and perhaps crucial.

Relevant Parallel Architectural Developments

The last five years has seen a strong convergence on two basic parallel architectures, which are often called clusters and clumps.

Their properties constrain what's sensible for parallel algorithms to do.

Clusters

are collections of processor boards to which Network Interface Cards (NICs) have been added; these are connected using a high-speed network such as Gbit Ethernet or Myrinet. They use lightweight protocols to give user-space to user-space latencies of ~10 microseconds and bandwidths of ~98% of the hardware bandwidth.

At present, such systems cost $US1000 per Gflop and $US1000 per Gbps between processor and network.

Clusters have taken over the traditional high-performance scientific market.

Clumps

are small shared-memory submachines, connected by a larger-scale interconnect that also provides a shared-memory abstraction, but with lower performance. Typically a single application runs within each of the submachines. Migrating applications, running multiple synchronous copies of applications, and hot-swapping any piece of hardware is easy.

Clumps have taken over the 7x24 market for high-reliability, e.g. web servers, search engines, transaction processing.

Two architectural styles that are not very useful

Some perennial bad ideas in parallel computing

All parallel computation ideas have analogues in human organisations -- what do you do when a job is too big or complex for one person.

Cost Models

To know whether a data mining technique is worth executing in parallel, we must be able to determine its complexity. For this we need a realistic cost model.

The cost of a parallel program has two components:

The details of communication topology, technology, etc. can be accurately captured by a single parameter, the permeability of an architecture to uniformly-random traffic.

This parameter, written g, is in units of instructions executed per byte transmitted.

Communication performance is limited by access to the communication network rather than by capacity inside it. Hence the time taken for communication is limited by the processor that sends or receives the most data.

The overall cost of a parallel program is accurately modelled by:

cost = MAX processors w_i + MAX processors h_i g
where the MAX is taken over the set of processors, w_i is the number of instructions executed by processor i, and h_i is the number of bytes transmitted or received by processor i (the BSP cost model).

The + could also be a maximum, but some account must be taken of the interleaving between computation and communication.

Notice that the units of both terms are the same -- instructions executed. This makes it easy to compare algorithms with different mixes of computation and communication.

Data mining techniques

Association Rules

Find the frequency of occurrence of each subset of the m attributes in the dataset (the Frequent Set Problem).

Using these, construct rules of the form

A,B,C --> D
that occur sufficiently often (the support of {A, B, C, D}) and with sufficient confidence (support of {A, B, C, D} / support of { A, B, C}).

Decision Trees

An implementation of the game of Twenty Questions. Build a tree of decision points, each one chosen to maximise its effectiveness at splitting the remaining data (using entropy or information content). Each leaf has an associated classification.

New instances pass down the tree depending on the decision at each interior node and end up at a leaf which is assumed to be their classification.

A decision tree is a classifier. However, the conjunction of the decisions along the path to each leaf is a rule for classifying, so the classification is transparent.

Neural networks

The dataset is input to a neural network, which adjusts the weights on its internal edges in response to it. In the supervised case, the output for each case is compared to what it should be, and the difference also has an effect on the internal weights.

The data is normally fed to the network many times. Each pass through the data is called an epoch.

The trained neural network is a classifier, but there is no straightforward way of determining why it classifies as it does.

Singular value decomposition

Consider the n by m dataset as a matrix, A. It has a decomposition of the form

A = U Sigma V
where U and V are orthogonal, and Sigma is diagonal. The values on the diagonal are decreasing, and are called the singular values.

If we keep only the first k singular values then we get this situation ...

Each row of U_k can be regarded as a vector in k-dimensional space representing a customer; and each vector in V_k a vector representing an attribute. The representation in this low dimensional space is the ``closest" to the original m-dimensional space, i.e. A is close to A_k.

Such low-dimensional spaces make clusters easy to see for unsupervised approaches; and make it possible to measure how a customer is like other customers or attributes for supervised approaches.

SVD has been enormously successful for text searching (where it is called latent semantic indexing) and is just starting to be used for other data mining applications.

Inductive logic programming

(not a good name)

Think of the data about customers as propositions. Use induction to derive new propositions or predicates that describe them in more general ways.

ILP is useful when there is an underlying fundamental structure to the data, rather than just preferences.

ILP doesn't have to look like logic: for example kDNF.

Genetic algorithms

An implementation of Darwinian evolution of concepts.

Generate initial concepts randomly.

Give them a score ("fitness") based on how well they describe the dataset.

Replicate concepts in proportion to their score; allow some crossover; and some random mutations.

Repeat.

Genetic algorithms are very powerful, but almost nothing can be said about convergence and how long it will take. Many people find them attractive because you don't have to understand the problem.

Access Issues

Note that, hidden inside all of these algorithms, are steps in which a concept's descriptive success has to be evaluated. This almost invariably requires looking at the entire dataset, which is expensive.

It's extremely difficult to use an abstraction of the dataset instead, because such as abstraction is really what you're after.

So we can't expect to find `clever' representations that somehow make access cheap.

And incurring the overhead of a full database is probably not realistic.

So there's a considerable amount of work to be done on access methods that don't make distributional assumptions, but are still fast.

Hashing looks like the right approach at first. But most queries against the data set are `partial match' queries because a concept often looks like a data item with some don't cares.

Existing partial match access techniques usually rely on an implicit ordering of the dimensions -- another abstraction which begs the question in a data mining setting.

Costing algorithms in the abstract

Consider a generic sequential data mining algorithm. Suppose that the dataset consists of n rows of size m (i.e. n customers with m attributes each), and that the algorithms produces s concepts.

Such an algorithm usually consists of a loop that executes some number k_s times -- perhaps producing large concepts from smaller ones.

Its complexity is given by a formula

cost_s = k_s [ STEP(nm, s) + ACCESS(nm) ]
where STEP gives the cost of a single iteration of the loop, and ACCESS is the cost of accessing data. STEP depends on s as well as nm since it is possible for the set of derived concepts to be larger than the input.

Strategies for parallelizing data mining algorithms

Approximate concepts

Partition data into p subsets, one per processor.

Execute (some variant of) the sequential algorithm on each subset.

Exchange information about what each processor learned with the others.

Repeat.

Example

Association rules. Partition the dataset among the processors. Compute the support of each subset locally, then exchange these support values so that each processor knows the total support.

The cost of an approximate concept algorithm has the form

cost_a = k_a [ STEP(nm/p, r) + ACCESS(nm/p) + rpg + RES(rp) ]
where r is the size of the approximate concepts generated by each processor, rpg is the cost of a total exchange between the processors of these approximate concepts, and RES(rp) is the computation cost of using these approximation to compute better approximations for the next iteration.

It is reasonable to assume that

STEP(nm/p, r) = STEP(nm, r)/p
and
ACCESS(nm/p) = ACCESS(nm)/p
so we get
cost_a  = cost_s/p + k_a (rpg + RES(rp))
In other words, we get a p-fold speedup, except for an overhead term, provided k_s and k_a are of comparable size.

Perhaps surprisingly, exchanging results frequently can improve accuracy rapidly for some algorithms, so k_a << k_s.

This gives a ``double" speedup

cost_a = k_a/k_s cost_s/p + overhead

Partial concepts

Partition the data based on attributes into p subsets.

Execute a special variant of a data mining algorithm on this subset to derive concepts or model parameters that apply only to this subset of the attributes.

Repeat.

Combine those partial concepts that go together sensibly to produce concepts describing the whole dataset.

Example

Genetic algorithms. Partition the `chromosomes' by the columns to which they refer. Evolve each set of partial chromosomes independently. Score the fitness of combined chromosomes periodically.

The cost of a partial concept algorithm has the form

cost_p = k_p [ SPECIAL(n, m/p, r) + ACCESS(n, m/p) + EXCH(n,m,r,p)g + RES(n,m,r,p) ]
where SPECIAL gives the complexity of a single step of the special algorithm, and EXCH gives the cost of exchanging the partial concepts.

Independent search strategies

Do not partition the data. Instead execute the same algorithm p times, using some randomisation technique to direct each to a different part of the search space of concepts.

Example

Genetic algorithms. Run multiple copies of the sequential algorithm. Select the best descriptions at the end (cf. evolution of separate populations).

The cost of an independent search strategy algorithm has the form

cost_i = k_i [ STEP(nm, s) + ACCESS(nm) ] + s g + s
It is clear that this approach only makes sense if we have a reason to expect that k_i = k_s / p.

Comparisons

The cost of these different implementation strategies depends on the values of the ks. However, assuming that they aren't worse than k_s we can rank the different parallelization approaches:

Approximate concept algorithms are better than partial concept algorithms which are better than independent search algorithms.

It is easy to see, intuitively, why this should be. Partial match techniques generate many concepts that cannot make sense as part of larger concepts. So the sets generated by each independent algorithm are much larger than the sets that result after resolution -- which means that each algorithm is doing needless work.

In the same way, independent search doesn't make use of the fact that other processors may have discovered useful knowledge that could prune the search of this particular processor, again doing unnecessary work.

Are these abstract algorithm schemas and their costs reflected in real data mining algorithms?

A Datailed Example - Neural networks

Consider a neural network with L layers, and M neurons per layer, with full connections from each layer to the next and preceding layers. The number of weights (one per connection) is W = LM^2.

Suppose that the number of examples in the training dataset is N.

An approximate concept approach to learning is exemplar parallelism -- each processor trains an identical initial network on a pth fraction of the examples.

At the end of each epoch, processors exchange their error vectors and combine them (deterministic learning).

Partial concept approaches have each processor responsible for some subset of the neurons. Two possibilities are layer parallelism in which each processor is responsible for some layers of neurons, and neuron parallelism in which each processor is responsible for a set of neurons.

Only exemplar parallelism produces significant speedup. The cost for a single training epoch is:

C_EP = N(AW) / p + W(p - 1)g 
where A is a constant that depends on the particular training algorithm.

By contrast, the cost of layer parallelism (per epoch) is:

C_LP = [ (N-1) + 2(L-1) ] [ AW/p + 2Mg ]
and of neuron parallelism is:
C_NP = [ (N-1) + 2(L-1) ] [ AW/p + 2WL/p g ]
These partial concept approaches do about the same amount of computation as exemplar parallelism but require much more communication.

This is an information-theoretic argument and so applies to any architecture that does not exploit specific topological features (e.g. Kumar, Shekhar, Amin).

There's no point in considering implementations that partition the neural network, let alone partitioning the weight set. They can't possibly be the basis for a scalable parallel technique.

Double Speedup

But the number of iterations (k_a) can also be dramatically reduced.

Stochastic learning generates an error vector that is used to update weights after every example.

Deterministic learning generates the error vector only after the entire set of examples, at the end of the epoch.

Batch learning generates the error vector after a subset of the examples, called a batch.

If each batch is sufficiently large the error vector is a good approximation to the deterministic error vector, but has required much less processing to discover.

The number of epochs required to achieve a given level of convergence as a function of batch size is shown below:

Let b = N/B be the number of batches in each epoch.

The cost of exemplar parallelism per epoch is:

C_EP^batch = b [ N/b AW/p  + (p-1)Wg ]
and so the total cost for training for E epochs is
C_EP^total = E [ NAW/p + b(p-1)Wg ]

For batch sizes in the range where the number of epochs required for convergence depends linearly on B, we can write E = c/b, for some constant c.

C_EP^total = c/b NAW/p + c(p-1)W g
The cost of communication is independent of the number of batches. Thus, minimising the overall cost requires minimising the computation term, which happens when b is as large as possible in the appropriate range. This occurs when b = b' = N/B'.

Note the denominator! b is often ~20, so speedups are significant even when p is small.

Similar empirical results have been achieved for association rules.

Approximate concept approaches seem to work well for association rules, decision trees, and neural networks because concepts derived on part of the data are likely to reflect concepts that hold on the whole dataset. (Of course, this also suggests that sampling is likely to be effective as well.)

Approximate concept approaches do not work for inductive logic programming because concepts are supposed to be true or false (rather than having some quantitative support). The existence of negative data is a particular problem. Partial concept approaches can work, but (as far as I know) it hasn't been addressed. (?Fuzzy logic.)

It isn't clear what strategy is best for genetic algorithms, but all will work.

Almost nothing has been done to parallelize SVD, so it's unclear which approach is best (although approximate and partial concept approaches are closely related in this setting).

Summary

Very little work on parallel data mining has really been done.

The extra power of parallel computing seems, on the face of it, to be useful because of the size and complexity of the problems.

The chief benefit of an abstract cost approach to parallelizing data mining is that it can rule out techniques that will not produce performance improvements on real systems.

Resources

Biography

David Skillicorn is a Professor in the Department of Computing and Information Science at Queen's University. His primary research interest is general-purpose parallel computing, which spans a full range from theoretical models of parallelism, to implementations such as BSP. He is also interested in applications that require parallelism, especially non-traditional applications such as structured text and data mining.