David Skillicorn - Recent Papers in Data Mining

## Detecting Unusual and Deceptive Communication in Email

with P.S. Keila

Deception theory suggests that deceptive writing is characterized by reduced frequency of first-person pronouns and exclusive words, and elevated frequency of negative emotion words and action verbs. We apply this model of deception to the Enron email dataset, and then apply singular value decomposition to elicit the correlation structure between emails. This allows us to rank emails by how well they fit the profile of detection. Those emails that are highly ranked using this approach include deceptive emails; other emails that are ranked highly using these frequency counts also indicate organizational dysfunctions such as improper communication of information. Hence this approach can be used as a tool for both external investigation of an organization, and internal management and regulatory compliance.

A version will appear in CASCON2005.

## Dealing with Complex Patterns in XML Processing

with F. Huang and C.B.Jay

The handling of search patterns for data access is critical to XML processing. Most efforts to integrate XML processing into general-purpose programming languages add different features to existing languages for different kinds of patterns, or even create completely new languages. The resulting languages only work for certain kinds of patterns and are hard to extend to others in a well typed manner. This poses burdens for both language designers and programmers, and restricts XML applications. In this paper we present an alternative approach that handles many kinds of patterns within the same framework. The key idea is to express complex patterns by pattern structures in one general-purpose programming language based on the pattern calculus. With this approach, adding a new kind of patterns is just a matter of programming, not language design; programs can be highly parametric over well-typed complex patterns.

## Better Bond Angles in the Protein Data Bank

with C.R. Robinson

The Protein Data Bank (PDB) contains, at least implicitly, empirical information about the bond angles between pairs of amino acids. There is considerable variation in the observed values for a given amino acid pair, and it is not clear whether this variation represents a wide range of conformal possibilities or is due to noise. We show, by applying singular value decompositions to the sets of examples of particular amino acid sequences, that there appear to be relatively few possible conformations for a given amino acid pair, and hence noise is a plausible explanation for the variation in the raw data. This has implications for secondary structure prediction which typically depends on the PDB values.

## A Distributed Approach for Prediction in Sensor Networks

with S.M. McConnell

Sensor networks in which the sensors are capable of local computation create the possibility of training and using predictors in a distributed way. We have previously shown that global predictors based on voting the predictions of local predictors each trained on one (or a few) attributes can be as accurate as a centralized predictor. We extend this approach to sensor networks. We also show that, when the target function drifts over time, sensors are able to make local decisions about when they need to retrain to capture the changing class boundaries.

## Beyond Keyword Filtering for Message and Conversation Detection

Keyword filtering is a commonly used way to select, from a set of intercepted messages, those that need further scrutiny. An obvious countermeasure is to replace words that might be on a keyword list by others. We show that this strategy itself creates a signature in the altered messages that makes them readily detectable using several forms of matrix decomposition. Not only can unusual messages be detected, but sets of related messages can be detected as conversations, even when their endpoints have been obscured (by using transient email addresses, stolen cell phones and so on).

## Structure in the Enron Email Dataset

with P. Keila

We investigate the structures present in the Enron email dataset using singular value decomposition and semidiscrete decomposition. Using word frequency profiles we show that messages fall into two distinct groups, whose extrema are characterized by short messages and rare words versus long messages and common words. It is surprising that length of message and word use pattern should be related in this way. We also investigate relationships among individuals based on their patterns of word use in email. We show that word use is correlated to function within the organization, as expected. We also show that word use among those involved in alleged criminal activity may be slightly distinctive.

## Emotions as a Metaphor for Altering Operational Behavior in Autonomic Computing

with R. Chandarana

The ability to change operational behavior in response to changes in both external and internal environment is an important aspect of autonomic computing. Managing such behavioral changes is challenging. We propose emotions as a useful mechanism for understanding the structure and use of behavioral changes, and present the design and implementation of the Emotion System, a stand-beside environment for ordinary programs. Both the metaphor and the implementation are designed to make it easy for software developers to integrate autonomic computing into their systems.

Queen's Technical Report 2004-491 (pdf).

## Programming with Heterogeneous Structures: Manipulating XML data using bondi

with F. Huang and C. B. Jay

Manipulating semistructured data, such as XML, does not fit well within conventional programming languages. A typical manipulation requires finding all occurrences of a structure whose context may be different in different places, and both aspects cause difficulty. These problems of data access are often addressed by special-purpose languages which then pass their results to other, general-purpose languages, creating the impedance mismatch problem. Alternative approaches add a few new means of data access to existing languages without being able to achieve the desired expressive power. We show how an existing programming language, bondi, can be used as is to handle XML data, since data access patterns, called signposts, are first-class expressions which can be passed as parameters to functions.

## Distributed Data Mining for Astrophysical Datasets

with S.M. McConnell

Over the past decade, data mining has gained an important role in astronomical data analysis. Traditionally, such analysis is performed on data at a single location. However, one of the main motivational force behind a virtual observatory is the distributed nature of both data and computational resources. Existing data-mining methods for distributed data are either communication-intensive or result in a loss of accuracy. In this paper, we introduce a general approach to supervised data mining that allows data to remain distributed, but still produces satisfactory results. We demonstrate by applying the approach to a number of astronomical datasets.

## Information Discovery within Organizations using the Athens System

with N. Vats

The serendipitous discovery of novel information in the web is a challenging task. Existing retrieval tools, like search engines, can retrieve information about known topics. However, they cannot retrieve information about novel topics, that is topics whose existence is unknown to the user and which may be potentially interesting.

We present Athens, a system for discovering novel information in the web. Athens comprises three fundamental components: closure to find the essential content of a set of search query terms; probing to create new contextualized queries for retrieving information of wider scope; and clustering to remove less relevant information. Given a set of initial query terms, the system repeats these steps twice to reach novel information relative to the initial query topic. This paper describes an application of the Athens system to web-based data for two organizations: IBM and Microsoft. We compare the novel information generated for the two organizations against a query and discuss the encouraging results obtained.

## The Athens System for Novel Information Discovery

with N. Vats

Discovering novel information, that is information whose existence is not suspected or for which suitable descriptors are not known, is difficult in large information repositories. The Athens system finds novel information, given an initial set of keywords representing known information, using a focused and contextualized search. We present the design of the system and show its use in a number of different settings.

Queen's Technical Report 2004-489 (pdf).

## Novel Information Discovery for Intelligence and Counterterrorism

with N. Vats

Intelligence analysts construct hypotheses from large volumes of data, but are often limited by social and organizational norms and their own preconceptions and biases. The use of exploratory data-mining technology can mitigate these limitations by requiring fewer assumptions. We present the design of the Athens system, which discovers novel information, relative to a specified set of existing knowledge, in large information repositories such as the world wide web. We illustrate the use of the system by starting from the terms "al Qaeda" and "bin Laden" and running the Athens system as if on September 12th, 2001. This provides a picture of what novel information could have been known at the time. This is of some intrinsic interest, but also serves to validate the performance of the system since much of this novel information has been discovered by conventional means in the intervening three years.

Queen's Technical Report 2004-488 (pdf).

## Building Predictors from Vertically Distributed Data

with S.M. McConnell

Due in part to the large volume of data available today, but more importantly to privacy concerns, data are often distributed across institutional, geographical and organizational boundaries rather than being stored in a centralized location. Data can be distributed by separating objects or attributes: in the homogeneous case, sites contain subsets of objects with all attributes, while in the heterogeneous case sites contain subsets of attributes for all objects. Ensemble approaches combine the results obtained from a number of classifiers to obtain a final classification. In this paper, we present a novel ensemble approach, in which data is partitioned by attributes. We show that this method can successfully be applied to a wide range of data and can even produce an increase in classification accuracy compared to a centralized technique. As an ensemble approach, our technique exchanges models or classification results instead of raw data, which makes it suitable for privacy preserving data mining. In addition, both final model size and runtime are typically reduced compared to a centralized model. The proposed technique is evaluated using a decision tree, a variety of datasets, and several voting schemes. This approach is suitable for physically distributed data as well as privacy preserving data mining.

## Developing a Characterization of Business Intelligence Workloads for Sizing New Database Systems

with T.J. Wasserman, T.P. Martin and H. Rizvi

Computer system sizing involves estimating the amount of hardware resources needed to support a new workload not yet deployed in a production environment. In order to determine the type and quantity of resources required, a methodology is required for describing the new workload. In this paper, we discuss the sizing process for database management systems and describe an analysis for characterizing business intelligence (BI) workloads, using the TPC-H benchmark as our workload basis. The characterization yields four general classes of queries, each with different characteristics. Our approach for sizing a BI application’s database tier quantifies a new BI workload in terms of the response time goals and mix of the different query classes obtained from the characterization analysis.

## Applying Matrix Decompositions to Counterterrorism

Governments collect data in which they hope to find patterns of terrorist activity. It is hard to know what such patterns look like and, in any case, terrorists are actively trying to avoid leaving any distinctive traces. However, if they work as a group, it is impossible to avoid some correlation among their attributes and actions. We show that such correlation can be detected, partly because it is likely to be qualitatively different from the correlations among groups with more innocent purpose. We show that matrix decompositions, in particular singular value decomposition and semidiscrete decomposition, have several useful properties for this problem. In many cases it is possible to identify a terrorist group with few false positives.

Queen's Technical Report 2004-484 (pdf).

## Detecting Related Message Traffic

Governments routinely intercept messages as part of counterterrorism efforts. Messages of particular interest may sometimes be identified by particular content, or by the identity of their sender and receivers. We consider the problem of identifying messages between members of a threat group when neither of these conditions apply. We show that clusters of related messages can be identified when they use words in correlated ways (which all conversations do) \emph{and} the words are used with the `wrong' frequency. The proposed technique therefore complements the use of a watch list of words, since the greater the awareness that particular words should not be used, the greater the use of inappropriate words that will reveal the existence of related groups of messages.

Paper (pdf) appeared in Proceedings of the Workshop on Link Analysis, Counterterrorism and Privacy at the SIAM Data Mining Conference, 2004.

## Strategies for Winnowing Microarray Data

with P. Kennedy. S. Simoff, and D. Catchpoole.

The analysis of microarray datasets is complicated by the magnitude of the available information. It seems plausible that only a fairly small fraction of genes or proteins behave differently with respect to differences in patient status. Most data mining techniques are significantly hampered by irrelevant or redundant information. Hence it is useful to reduce datasets to manageable size by discarding such useless information. We present techniques for winnowing microarray datasets using singular value decomposition and semidiscrete decomposition, and show how they can be tuned to extract some information about the internal correlative structure of large datasets.

Paper (pdf) appeared in the Proceedings of the Workshop on Bioinformatics at the SIAM Data Mining Conference, 2004.

## Detecting Mineralization Using Partial Element Extractions: A Case Study

with D.R. Cohen

Detecting deep mineralization from partial element concentrations in surface and near-surface samples requires detecting the signature of one geochemical process against a background that arises from many other geochemical processes. Since the last glaciation of the Canadian Shield, it is believed that only H${}^+$ can have diffused through substantial overburden, and hence that the only available signatures for underlying mineralization must be based, directly or indirectly, on pH. We show that direct observation of pH may not be a reliable signature for underlying mineralization, but that signatures based on the indirect effects of a pH low are readily visible using matrix decompositions.

Paper (pdf) appeared in the Proceedings of the Workshop on Scientific Data Mining at the SIAM Data Mining Conference, 2004.

## Extracting and Explaining Biological Knowledge in Microarray Data

with P. Kennedy. S. Simoff, and D. Catchpoole.

High throughput technologies produce large biological datasets that may lead to greater understanding of the biological mechanisms behind diseases such as cancer. However, progress has been slow in extracting meaningful information from these datasets.We describe a method of clustering lists of genes mined from a microarray dataset using functional information from the Gene Ontology. The method uses relationships between terms in the ontology both to build clusters and to extract meaningful cluster descriptions. The approach is general and may be applied to assist explanation other datasets associated with ontologies.

Paper (pdf), version appeared at PAKDD 2004.

## Signature Detection in Geochemical Data Using Singular Value Decomposition and SemiDiscrete Decomposition

with D.R. Cohen, S. Gatehouse, I. Dalrymple.

The identification of weak geochemical signatures, derived from the effects of mineralisation on regolith samples, requires the appropriate combination of chemical analysis procedures and multivariate data processing methods capable of recognising those signatures. Existing multivariate data processing methods commonly assume that signatures related to the effects of mineralisation take the form of separated or outlying clusters or values. Trace element patterns are typically more subtle, including situations where geochemical signatures influenced by mineralisation may take the form of denser sub-clusters within larger clusters. A new approach to identifying such sub-clusters, involving the combination of singular value decomposition (SVD) and semi-discrete decomposition (SDD), has been investigated.

At the Mandamah porphyry Cu-Au deposit, deeply weathered mineralisation is covered by 50 m of alluvium. Partial geochemical extractions (potassium acetate, hydroxylamine.HCl, aqua regia and sodium pyrophosphate) were conducted on surface soil samples collected from grids over mineralised and non-mineralised areas. There were few discernible patterns in the raw partial extraction data related to the location of mineralisation, apart from sporadically elevated Ba and lower Ca, Mg and REE. K-means clustering provided no interpretable patterns in the data. SVD-SDD processing of the partial extraction data, however, isolated clusters in which the majority of samples were located above or adjacent to the buried mineralisation. The results indicate the SVD-SDD technique may assist in discriminating mineralised from non-mineralised geochemical signatures.

Paper (Word), a version was presented at the 21st International Geochemical Exploration Symposium (IGES), Dublin, 2003.

## Distributed Data-Intensive Computation and the Datacentric Grid

The analysis of large datasets has become an important tool in understanding complex systems in areas such as economics and business, science, and engineering. Such datasets are often collected in a geographically distributed way, and cannot, in practise, be gathered into a single repository. Applications that work with such datasets, for example data-mining algorithms, cannot control most aspects of the data's partitioning or arrangement. As a consequence, both new architectures and new algorithms are needed. This paper presents the design of the DCGrid, a datacentric grid, and describes some parallel and distributed algorithms for data mining that are particularly suited to large-scale distributed datasets. Some of these algorithms permit genuine superlinear speedup so they are effective even for modest levels of parallelism.

## The Datacentric Grid and the Public/Private Boundary

PPS Slides of a talk at the Data Mining and Exploration Middleware for Distributed and Grid Computing Workshop at the University of Minnesota, September 2003.

## Clusters Within Clusters: SVD and Counterterrorism

We argue that one important aspect of terrorism detection is the ability to detect small-scale, local correlations against a background of large-scale, diffuse correlations. Singular value decomposition (SVD) maps variation, and hence correlation, into proximity in low-dimensional spaces. We show, using artificial datasets whose plausibility we argue for, that SVD is effective at detecting local correlation in this setting.

Queen's Technical Report 2003-463 (postscript).

## Outlier Detection Using SemiDiscrete Decomposition

with S. McConnell

Semidiscrete decomposition (SDD) is usually presented as a storage-efficient analogue of singular value decomposition. We show, however, that SDD actually works in a completely different way, and is best thought of as a bump-hunting technique; it is extremely effective at finding outlier clusters in datasets. We suggest that SDD's success in text retrieval applications such as latent semantic indexing is fortuitous, and occurs because such datasets typically contain a large number of small clusters.

Queen's Technical Report 2001-452 (postscript).

## The Case for Datacentric Grids

We argue that the properties of online data make it effectively immovable. Massively parallel computations involving such data must therefore be constructed in a completely different way -- one that replaces the processor-centric assumptions that underlie almost all programming models by datacentric assumptions. We discuss the implications of this change for grid architectures and their programming models.

Queen's Technical Report 2001-451 (postscript).

A version of this paper appears in the Workshop on Massively Parallel Processing at IPDPS 2002.

## Parallelizing Arcing and Boosting

Carol Yu and David Skillicorn

Bagging and boosting are two general techniques for building predictors based on small samples from a dataset. We show that boosting can be parallelized, and then present performance results for parallelized bagging and boosting using OC1 decision trees and two standard datasets. The main results are that sample sizes limit achievable accuracy, regardless of computational time spent; that parallel boosting is more accurate than parallel bagging; and (unexpectedly) that parallel boosting is also cheaper than parallel bagging (at least over OC1).

Queen's Technical Report 2001-442 (postscript).

## Parallel Inductive Logic for Data Mining

Yu Wang and David Skillicorn

Data mining is the process of automatic extraction of novel, useful and understandable patterns in very large databases. High-performance, scalable, and parallel computing algorithms are crucial in data mining as datasets grow in size and complexity. Inductive logic is a research area in the intersection of machine learning and logic programming, which has been recently applied to data mining. Inductive logic studies learning from examples, within the framework provided by clausal logic. It provides a uniform and expressive means of representation: examples, background knowledge, and induced theories are all expressed in first-order logic. Such an expressive representation is computationally expensive, so it is natural to consider improving the performance of inductive logic data mining using parallelism. We present a parallelization technique for inductive logic, and implement a parallel version of a core inductive logic programming system, Progol. Performance results on several datasets and platforms are reported.

A version of this paper appears in KAIS, November 2001.

The full version of Yu Wang's thesis is available as Queen's Technical Report 2000-436 (postscript).

## Parallel Frequent Set Counting

Computing the frequent subsets of large multi-attribute data is both computation- and data-intensive. The standard parallel algorithms require multiple passes through the data. The cost of data access may easily outweigh any performance gained by parallelizing the computational part. We address three opportunities for performance improvement: using an approximate algorithm that requires only a single pass over the data; statically partitioning the attributes to provide rudimentary load- and storage-balancing; and using a probabilistic technique to avoid generating most of the lattice of subsets implied by each object's data. The computation required is only slightly larger than levelwise algorithms, but the amount of data access is much smaller.

Postscript (Queen's TR1999-434)

A version of this paper will appear in Parallel Computing.

## Strategies for Parallel Data Mining

We present a set of cost measures that can be applied to parallel algorithms to predict their computation, data access, and communication performance. These measures make it possible to compare different possible parallel implementation strategies for data mining techniques without the necessity to benchmark each one. We give general cost expressions for three common parallelizing strategies, and show how to instantiate these cost expressions for a particular technique, neural networks.

Postscript (Queen's TR1999-426)

A version of this paper appeared in IEEE Concurrency, Oct-Dec 1999.

## Parallel Data Mining

A tutorial at IBM's CASCON'98 in Toronto, December 3rd, 1998.

## Partial Match Queries Using Error Correcting Code Signatures

An important component of some data mining algorithms is determining the frequency of some set of attributes in an extremely large dataset. This requires partial match queries, in which some attributes have "don't care" values. We present a partial match query algorithm that uses the codewords of error-correcting codes as signatures.

Keywords: partial match queries, signatures, error correcting codes, frequent sets, association rules.

Postscript (Queens Technical Report 1998-419)

## Strategies for Parallelizing Data Mining

We classify parallelization strategies for data mining algorithms, concentrating on those techniques in which training data is partitioned, and extracted properties shared between processors at the end of each phase. This approach has been extensively investigated for association rules and decision trees. We sketch some similar work for supervised and unsupervised neural networks, and for the kDNF technique. Possibilities for more sophisticated, hybrid, algorithms are also indicated.

Keywords: Data Mining, association rules, decision trees, neural networks, kDNF, inductive logic programming, total exchange.

This paper appears in the proceedings of the Workshop on High-Performance Data Mining, in association with IPPS/SPDP'98.

Comment: Once you realise that communication-rich operations, such as total exchange, are not necessarily performance killers (a lesson made clear from BSP performance results), algorithms that use them become more attractive. This paper argues for data mining algorithms that exploit communication, and illustrates with some skimpy discussion of kDNF, a simple but useful algorithm introduced to me by David Page.

## Using the BSP Cost Model for Optimal Parallel Neural Network Training

with R.O. Rogers,

We derive cost formulae for three different parallelisation techniques for training both supervised and unsupervised networks. These formulae are parameterised by properties of the target computer architecture. It is therefore possible to decide both which technique is best for a given parallel computer, and which parallel computer best suits a given technique. One technique, exemplar parallelism, is far superior for almost all parallel computer architectures. Formulae also take into account optimal batch learning as the overall training approach. Cost predictions are made for several of today's popular parallel computers.

Keywords: Data mining, neural networks, parallelism, bulk synchronous parallelism, {BSP}, cost analysis, supervised learning, unsupervised learning, batch learning, deterministic learning, stochastic learning

A shorter version appears in the Proceedings of BioSP3, a workshop in conjunction with IPPS/SPDP'98, and in Future Generation Computer Systems, Vol. 14, 1998, 409--424.

Comment: This paper essentially polishes off the neural network parallelisation problem. It uses what is at bottom an information-theoretic argument to show that many of the ways to parallelise neural network learning cannot possibly perform well because they require too much communication. The bottom line is that training a copy of the network on each available processor, and then resolving the error vectors, is the only sensible thing to do.

It's possible to get around these results for particular specialised topologies, but architectures that use them are not common these days. The problem may become interesting for a while when optical interconnects become available.

The paper also makes the point that batch learning is a useful adjunct, and can, by itself, give substantial performance improvements.

The results are quite insensitive to the particular learning algorithm used, and apply to both supervised and unsupervised networks.

## Strategies for Parallelizing Supervised and Unsupervised Learning in Artificial Neural Networks Using the BSP Cost Model

with R.Owen Rogers,

We use the cost system of BSP (Bulk Synchronous Parallelism) to predict the performance of three different parallelization techniques for both supervised and unsupervised learning in artificial neural networks. We show that exemplar parallelism outperforms techniques that partition the neural network across processors, especially when the number of exemplars is large, typical of applications such as data mining.

Keywords: Bulk synchronous parallelism, BSP, batch learning, supervised learning, unsupervised learning, backpropagation, competitive layer network, cost metrics, performance evaluation, Cray T3D, Cray T3E, IBM SP2, SGI PowerChallenge.

Postscript (Queen's TR97-406)

Comment: This is an early, and more detailed, version of the paper above.