David Skillicorn - Recent Publications

## Understanding High-Dimensional Spaces

D.B. Skillicorn
Springer, to appear 2012.

## The Mental State of Influencers

D.B. Skillicorn and C. Leuprecht
submitted to a conference.

Most analysis of influence looks at the mechanisms used, and how effectively they work on the intended audience. Here we consider influence from another perspective: what do the language choices made by influencers enable us to detect about their internal mental state, strategies and assessments of success. We do this by examining the language used by the U.S. presidential candidates in the high-stakes attempt to get elected. Such candidates try to influence potential voters, but must also pay attention to the parallel attempts by their competitors to influence the same pool. We examine seven channels: persona deception, the attempt by each candidate to seem as attractive as possible; nouns, as surrogates for content; positive and negative language; and three categories that have received little attention, verbs, adverbs, and adjectives. Although the results are preliminary, several intuitive and expected hypotheses are supported, but some unexpected and surprising structures also emerge. The results provide insights into related influence scenarios where open-source data is available, for example marketing, business reporting, and intelligence.

## Lessons from a Jihadi Corpus

D.B. Skillicorn
submitted to a conference.

We analyze the posts in the Islamic Awareness forum, using models for content, for Salafist-Jihadist language, and for deception. These last two models each produce a single-factor ranking enabling, in each case, the most useful subset of posts to be selected for further analysis. Posts that rank highly for Salafist-Jihadist language rank low for deception, suggesting that faking extremist websites is probably an ineffective strategy. The process described here is a template for analysis of many kinds of open-source corpora where language models of what makes posts interesting are known.

## Course-of-Action Decision Support Using Partial Graph Cuts

R. Sawilla and D.B. Skillicorn
submitted to a conference.

Because of increasing vulnerabilities, maturing attack tools, and increasing dependence on computer network infrastructure, tools to support network defenders are essential. Course-of-action recommendation research has often assumed a goal of perfect network security. In reality, network administrators balance security with usability and so tolerate vulnerabilities and imperfect security. We provide realistic course-of-action decision support for network administrators by representing attack graphs as forward directed hypergraphs, and minimizing connectivity by optimizing the removal of nodes to separate source and target nodes as much as possible, even when complete disconnection is impractical. We introduce vertex closures and closure-relation graphs in forward directed hypergraphs as the underlying framework. Computing an optimal course-of-action is NP-hard but we design a polynomial-time greedy algorithm that almost always produces an optimal solution. Our algorithms are integrated with commercial and open-source network security software.

## Detecting Fraud in Financial Reports

D.B. Skillicorn and L. Purda
submitted to a conference.

Fraud in public companies has a large financial impact, and yet is only weakly detected by those who look for it; many incidents have been detected only when whistleblowers have come forward. We examine the problem of detecting fraud from the textual component of the quarterly and annual reports that public companies are required to file. Using an empirically derived set of words, we achieve prediction accuracy up to 88\%. Frauds rarely involve only a single quarter, so it is actually more useful to consider prediction performance on a per-incident basis. The truthfulness probability of our measure shows consistent decreases in the quarters leading up to a fraud, creating opportunities for proactive enforcement. We also compare the prediction performance of our word list with Pennebaker's deception model, and with a set of fixed lists suggested in the literature, only one of which has any predictive power.

## Global Similarity in Social Networks with Typed Edges

D.B. Skillicorn and Q. Zheng
submitted to a conference.

Almost all social network analysis treats edges (relationships) as having different intensities (weights), but the same qualitative properties. We address the problem of modelling edges of qualitatively different types that nevertheless interact with one another. For example, influence flows along friend and colleagues edges differently, but treating the two sets of different kinds of edges as independent graphs surely misses interesting and useful structure.

We model the subgraph corresponding to each edge type as a layer, and show how to weight the edges connecting the layers to produce a consistent spectral embedding, including for directed graphs. This embedding can be used to compute social network properties of the combined graph, to predict edges, and to predict edge types. We illustrate with Padgett's dataset of Florentine families in the 15th Century.

## Vectors of Extremism Across the Canada-US Border

C. Leuprecht, T. Hataley, and D.B. Skillicorn
to appear at IEEE Intelligence and Security Informatics, 2012.

The idea that Canadian-based terrorists pose a threat to the United States continues to resonate with Americans. We subject this hypothesis to empirical testing by analyzing terrorist-related activity across the Canada-US border. Drawing on 13 cases with 27 terrorist connections, the evidence substantiates the presence of cross-border interactions, but does not confirm common perceptions about America’s northern border: there is no consistent threat emanating from Canada. Rather, differentials in the availability of ideas and resources drive threat vectors across the border in both directions. The bulk of violent extremists exploiting these cross-border markets of opportunity do so to propagate terrorism beyond North America.

## Ranking Documents by Internal Variability

D.B. Skillicorn and P.K. Chandrasekaran
to appear at IEEE Intelligence and Security Informatics, 2012.

An analyst, presented with a corpus too large to read every document, must find some selection mechanism. A model for interestingness can be used to rank the documents so that only the subset at the top of the ranking need be examined. However, in many open-source intelligence settings, such a model is not known in advance. We design three measures for ranking documents by internal variability as a weak surrogate for interestingness. Selecting those documents ranked highly by these measures selects a superset of the documents an analyst might need to read, no matter what the specific model, and reduces the size of the corpus by an order of magnitude. We also discover that many corpora contain documents that are highly variable, but not interesting, and show how to remove them.

## Outlier Detection Using Semantic Sensors

D.B. Skillicorn
to appear at IEEE Intelligence and Security Informatics, 2012.

We describe a technique that calculates the expected relationships among attributes from training data, and uses this to generate anomaly scores reflecting the intuition that a record with anomalous values for \emph{related} attributes is more anomalous than one with anomalous values for unrelated attributes. The expected relations among attributes are calculated in two ways: using a data-dependent projection via singular value decomposition, and using the maximal information coefficient. Sufficiently anomalous records are displayed on a sensor dashboard, making it possible for an analyst to judge why each record has been classified as anomalous. The technique is illustrated for an intrusion detection dataset, and a set of contract descriptors.

## Home-grown radicalism in the West: Using survey evidence to model the structure of extremist attitudes

D.B. Skillicorn, C. Leuprecht, and D.B. Skillicorn
to appear in Canadian Journal of Political Science.

Theories of radicalization make implicit predictions about variation among attitudes in the communities from which radicals are drawn. This article subjects some popular theories to empirical testing. The precise process by which individuals come to sympathize with, provide material support for, or actually engage in political violence is fully comprehensible only by longitudinal analysis, but much existing work tries to reconstruct the process by looking only at one part of its outcomes: those who have become radicalized. The result is a large number of theories and mechanisms, with little compelling empirical support. A cross-sectional snapshot of an at-risk community cannot definitively support a particular theory of radicalization, but it can rule out those whose predictions about attitudes are at variance with the empirical results. We designed a survey instrument to gauge attitudes to issues widely believed to be relevant to radicalization and deployed it among Muslim communities in Ottawa. The results are remarkably inconsistent with patterns of variation in attitudes predicted by popular theories of radicalization. Instead, they show variation of attitudes along three independent dimensions: social/economic/political satisfaction/dissatisfaction, moral/religious satisfacton/dissatisfaction, and a dimension that appears to be associated with radicalization. This suggests that governments may have less policy leverage to mitigate radicalization than generally supposed.

## Finding Interesting Documents in a Corpus

P.K. Chandrasekaran and D.B. Skillicorn
Text Mining Workshop 12 at the SIAM International Conference on Data Mining, 2012.

Finding the most "interesting" document, or ranking documents by interestingness, in a large corpus is an important problem. We consider the most general case where there is no external model of interestingness, and suggest that internal variability is a plausible and useful surrogate. We define three measures of variability for documents, establish baselines for the corpus-wide structure of variability, and rank documents from four demonstration corpora using these measures. Although validation is necessarily problematic, we show that these measures reveal different aspects of variability, but rank highly documents that could be considered interesting in a useful sense.

## Knowledge Discovery in Adversarial Settings

D.B. Skillicorn
in Holmes and Jain, Data Mining: Foundations and Intelligent Paradigms, Springer, 2012.

In adversarial settings, the interests of modellers and those being modelled do not run together. This includes domains such as law enforcement and counterterrorism but, increasingly, also more mainstream domains such as customer relationship management. The conventional strategy, maximizing the fit of a model to the available data, does not work in adversarial settings because the data cannot all be trusted, and because it makes the results too predictable to adversaries. Some existing techniques remain applicable, others can be used if they are modified to allow for the adversarial setting, while others must be discarded. General principles for this domain are discussed and the implications for practice outlined.

## Novel Idea Generation, Collaborative Filtering, and Group Innovation Processes

T.A. Jenkin, D.B. Skillicorn and Y. Chan
Proceedings of the International Conference on Information Systems (ICIS), 2011.

Organizations that innovate encounter challenges due to the complexity and ambiguity of generating and making sense of novel ideas. Exacerbated in group settings, we describe these challenges and propose potential solutions. Specifically, we design group processes to support novel idea generation and selection, including use of a novel-information discovery (NID) tool to support creativity and brainstorming, as well as group support system and collaborative-filtering tools to support evaluation and decision making. Results indicate that the NID tool increases efficiency and effectiveness in creative tasks and that the collaborative-filtering tool can support the decision-making process by focusing the group’s attention on ideas that might otherwise be neglected. Combining these two novel tools with group processes provides valuable contributions to both research and practice.

## Mitigating the linkability problem in anonymous reputation management

M. Hussain and D.B. Skillicorn
Journal of Internet Services and Applications, Vol. 2, 2011.

Trust plays a key-role in enhancing user experience at service providers. Reputation management systems are used to quantify trust, based on some reputation metrics. Anonymity is an important requirement in these systems, since most individuals expect that they will not be profiled by participating in the feedback process. Anonymous Reputation management (ARM) systems allow individuals to submit their feedback anonymously. However, this solves part of the problem. Anonymous ratings by one individual can be linked to each other. This enables the system to easily build a profile of that individual. Data mining techniques can use the profile to re-identify that individual. We call this the linkability problem. This paper presents an anonymous reputation management system that avoids the linkability problem. This is achieved by constructing a system that empowers individuals to interact and rate service providers, securely and anonymously.

## Computational Approaches to Suspicion in Adversarial Settings

D.B. Skillicorn
Information Systems Frontiers, 2011.

Intelligence and law enforcement agencies collect large datasets, but have difficulty focusing analyst attention on the most significant records and structures within them. We address this problem using suspicion, which we interpret as relevant anomaly, as the measure associated with data records and individuals. For datasets collected about widespread activities in which the signs of adversarial activity are rare, we suggest ways to build predictive models of suspicion. For datasets collected as the result of lawful interception, we suggest a model of suspicion spreading using the social network implied by the intercepted data.

## Reverse Engineering Co-maintenance Relationships Using Conceptual Analysis of Source Code

S. Grant, J.R. Cordy, and D.B. Skillicorn
WCRE2011, 18th Working Conference on Reverse Engineering, 2011.

In this work, we explore the relationship between topic models and co-maintenance history by introducing a visualization that compares conceptual cohesion within changelists. We explain how this view of the project history can give insight about the semantic architecture of the code, and we identify a number of patterns that characterize particular kinds of maintenance tasks. We examine the relationship between comaintenance history and concept location, and visualize the distribution of changes across concepts to show how these techniques can be used to predict co-maintenance of source code methods.

## Contextualized Semantic Analysis of Web Services

S. Grant, D. Martin, J. R. Cordy, D. B. Skillicorn
WSE2011, 13th IEEE International Symposium on Web Services Evolution, 2011, Best Paper.

The poor locality of operation descriptions expressed in the Web Service Description Language (WSDL) makes them difficult to analyze and compare in web service discovery tasks. This problem has led us to develop a new method for service operation comparison involving contextualizing operation descriptions by inlining related type information from other sections of the service description. In this paper, we show that this contextualization of web service descriptions can enable topic models (statistical techniques for identifying relationships) to produce semantically meaningful results that can be used to reverse engineer service-oriented web systems and automatically identify related web service operations. Specifically, we model contextualized WSDL service operations using Latent Dirichlet Allocation, and show how this approach can be used to more accurately find similar web service operations.

## A Framework for Analyst Focus from Computed Significance

D.B. Skillicorn and M.A.J. Bourassa
Counterterrorism and Open Source Intelligence, Springer Lecture Notes in Social Networks, 2011.

Attention is the critical resource for intelligence analysts, so tools that provide focus are useful. One way to determine focus is by computing significance. In the context of a known model, new data can be placed on a spectrum defined by: normal, anomalous, interesting, novel, or random; and significance is greatest towards the middle of this spectrum. However, significance also depends on the mental state of the analyst (and the organization). A framework for understanding significance is defined, and its impact on the knowledge discovery process is explored.

## Interestingness: Directing Analyst Attention to Significant Data

M.A.J. Bourassa, J. Fugère, and D. Skillicorn
IEEE Open Source Intelligence and Web Mining Conference, 2011.

## How are Concepts Maintained?

S. Grant, J.R. Cordy, and D.B. Skillicorn
Poster, CASCON 2011.

## The Curse of Dimensionality: A Blessing to Personalized Medicine

D.R. Catchpoole, P. Kennedy, D.B. Skillicorn, and S.Simoff
Journal of Clinical Oncology, 2010.

## Building Models when Adversary Data is Rare

D.B. Skillicorn
IEEE Intelligent Systems, 2010.

Security and intelligence data often contains numerous of records associated with good actions, but few, perhaps none, associated with bad actions. For example, aircraft passenger screening systems contain billions of records about normal travelers for each record about a terrorist or hijacker. In crime, fraud, tax evasion, and money laundering, the ratio of good to bad records might not be quite as great, but examples of bad records are still rare. Many of these settings are not only unbalanced in the number of examples of different kinds of records, they are also adversarial because those whose actions generate the bad records are trying to conceal themselves and also perhaps disrupt the modeling that is being done. In these settings, we still need to be able to build robust inductive models (predictors, clusterings, and rankings) from the data, but there are new issues to be considered.

## The Case for Service Provider Anonymity

M. Hussain and D.B. Skillicorn
Proceedings of the 2010 IEEE International Symposium on Signal Processing and Information Technology, 2010.

While the majority of the research on anonymity is focused on individuals, there is an increasing number of scenarios that demand anonymity for service providers as well. For example, in many business to consumer (B2C) scenarios, service providers sell their surplus to individuals for lower prices through arbitrageurs. Those service providers must remain anonymous, to avoid discouraging customers from buying directly from the service provider, at the regular price. Some providers wish to be anonymous for various reasons, including but not limited to, escaping denial of service attacks and avoiding censorship. Enabling service providers to prove relations to other providers is also needed in business to business (B2B) scenarios. This paper shows that allowing service providers to act as individuals in privacy preserving systems does not fulfill the job. The paper describes a system that allows individuals and service providers to participate in secure and anonymous transactions.

## Applying Interestingness Measures to Ansar Forum Texts

D.B. Skillicorn
ISI KDD Workshop at KDD 2010.

Documents from the Ansar aljihad forum are ranked using a number of word-usage models. Analysis of overall content shows that postings fall strongly into two categories. A model describing Salafist-jihadi content generates a very clear single-factor ranking of postings. This ranking could be interpreted as selecting the most radical postings, and so could direct analyst attention to the most significant documents. A model for deception creates a multifactor ranking that produces a similar ordering, with low-deception postings identified with highly Salafist-jihadi ones. This suggests either that such postings are extremely sincere, or that personal pronoun use and intricate structuring are also markers of Salafist-jihadi language. Although the overall approach is relatively straightforward, the choice of parameters to maximize the usefulness of the results is intricate.

## Using decomposition methods and visualization of global gene expression and SNP data to identify leukaemia patients at risk of treatment failure

D.R. Catchpoole, N. Ho, G. Morton, P. Kennedy, S. Simoff, and D. Skillicorn
Poster, 2nd Worldwide Innovative Networking in Personalized Cancer Medicine, Paris, 2010.

## Reading Between the Lines: Detecting Fraud from the Language of Financial Reports

L. Purda and D.B. Skillicorn
Northern Finance Association Conference 2010.

We ask whether the text of the Management Discussion and Analysis section (MD&A) of quarterly and annual reports holds important clues regarding the integrity of the financial report. By building a detection algorithm based on the most frequent words present in a sample of fraudulent and truthful reports, we correctly classify approximately 87% of all reports as either fraudulent or not. This rate falls to 80% for a hold-out-sample excluded from the training set. While the method is particularly strong in correctly capturing financial misrepresentations, truthful statements are incorrectly labeled as fraudulent approximately 16% of the time. These false positive classifications frequently occur in quarters immediately surrounding actual occurrences of fraud. We compare our method to a more traditional approach suggested by Dechow, Ge, Larson and Sloan (2009) that is based on quantitative variables. We find that the correlation between predictions from the two methods is only 0.13 suggesting that the text of financial reports can be a powerful supplement to the identification of fraudulent reports using quantitative variables alone.

## Hardening Adversarial Prediction with Anomaly Tracking

M.A.J. Bourassa and D.B. Skillicorn
IEEE Intelligence and Security Informatics 2009.

Predictors are often regarded as black boxes that treat all incoming records exactly the same, regardless of whether or not they resemble those from which the predictor was built. This is inappropriate, especially in adversarial settings where rare but unusual records are of critical importance and some records might occur because of deliberate attempts to subvert the entire process. We suggest that any predictor can, and should, be hardened by including three extra functions that watch for different forms of anomaly: input records that are unlike those previously seen (\emph{novel} records); records that imply that the predictor is not accurately modelling reality (\emph{interesting} records); and trends in predictor behavior that imply that reality is changing and the predictor should be updated. Detecting such anomalies prevents silent poor predictions, and allows for responses, such as: human intervention, using a variant process for some records, or triggering a predictor update.

## Guarantee-Based Access Control

M. Hussain and D.B. Skillicorn
PASSAT 2009.

New web technologies, such as the Semantic Web and the Social Web, have changed the way services and information are accessed. These new technologies allow more usable and interoperable services to be realized. They help service providers to reach more individuals. Automatic service discovery and invocation can also benefit from these technologies.

Unfortunately, several threats to security, privacy, and trust comes along with these benefits. For example, making services interoperable increases the chance of profiling individuals, which constitutes a privacy threat. Security measures, such as access control, should handle the potential threats of opening previously encapsulated web services.

We present a new access-control model, guarantee-based access control. The new model constructs certificates based on guarantees rather than attributes. These guarantees are then used as the basis for access-control decisions. This allow access control to be fine-grained, and avoids the burden of design and management of roles. The model suits the open nature of new web technologies. It resists threats to individuals' privacy, such as profiling. The model also permits access rights to be based on a set of individuals in a particular structured relationship.

## Knowledge Discovery for Counterterrorism and Law Enforcement

D.B. Skillicorn,
Chapman and Hall, 2008.

This book summarizes the state of the art in adversarial knowledge discovery when it was published.

## Understanding Complex Datasets: Matrix Decompositions for Data Mining

D.B. Skillicorn,
Chapman and Hall, 2007.

This book summarizes the ways in which various matrix decompositions (singular value decomposition, semidiscrete decomposition, independent component analysis, non-negative matrix factorization) can be used as data mining tools.

## Classification Using Streaming Random Forests

H. Abdulsalam, D.B. Skillicorn, P. Martin,
IEEE Transactions on Knowledge and Data Engineering, Vol. 22, February 2010.

We consider the problem of data-stream classification, where the data arrives in a conceptually-infinite stream, and the opportunity to examine each record is brief. We introduce a stream-classification algorithm that is online, running in amortized O(1) time, able to handle intermittent arrival of labelled records, and able to adjust its parameters to respond to changing class boundaries (concept drift') in the data stream. In addition, when blocks of labelled data are short, the algorithm is able to judge internally whether the quality of models updated from them is good enough for deployment on unlabelled records, or whether further labelled records are required. Unlike most proposed stream-classification algorithms, multiple target classes can be handled.

Experimental results on real and synthetic data show that accuracy is comparable to a conventional classification algorithm that sees all of the data at once and is able to make multiple passes over it.

D.B. Skillicorn,
IEEE Intelligent Systems, Vol. 24, No. 6, 54--61, Nov/Dec 2009.

In adversarial settings, there are those who wish to conceal their existence, properties and activities from data analysis. This substantially changes the knowledge discovery process -- finding a model that best fits' the data is unhelpful because it provides adversaries with predictable ways to hide, and ways to manipulate. We survey some of the implications for algorithms and process, and suggest some open problems.

## Patterns of Word Use for Deception in Testimony

A. Little and D.B. Skillicorn,
Security Informatics, Annals of Information Systems Vol. 9, Springer, 25--40, 2010.

Several models for deception in text, based on changes in usage frequency of certain classes of words, have been proposed. These are empirically derived from settings in which individuals are asked to lie or be truthful in freeform text. We consider the problem of detecting deception in testimony, where the content generated must necessarily be responsive to questions, where there is the opportunity for immediate followup if the possibility of deception is detected by the questioner, and where those who have reasons to be deceptive have time and motivation to rehearse potential answers.

Using the testimony to the Gomery Commission, a situation in which many witnesses had some motivation to be deceptive, we propose and validate a model for deception in testimony. We achieve substantial (80%) agreement with media estimates of who was motivated to testify in a deceptive way.

## Personas: Beyond Identity Protection by Information Control: A Report to the Privacy Commissioner of Canada

D.B. Skillicorn and M. Hussain,

As individuals interact on larger and larger scales, what makes up their identity also has to expand. In a village, only a single name is enough. In a country, other attributes such as a social insurance or taxpayer number become necessary. Across the world, a multinational hotel chain may want to identify the same individual whenever he or she stays at one of their hotels, but there is no single global identifier that makes this possible.

Because identities are important in so many interactions, not all of them known to and supervised by individuals, there are considerable economic and privacy risks when identity information is misused. Today, the standard solution to this problem is to mandate legal or policy rules that restrict the flow and use of identity information.

This solution is starting to fail for two reasons. First, and most importantly, new developments in data mining and data fusion allow identities to be constructed from data that has not previously been considered identifying. Systems often do not control this kind of data as tightly as traditionally identifying data. Second, increasingly information that was supposed to be controlled is released accidentally. Once this has been done, there is no way to call it back, but also no way (short of Witness Protection Programs) to create fresh identities for individuals who have been compromised.

We design and describe a new approach to this problem, based on the creation and use of artificial identities, called \emph{personas}. These identities can only be traced to the individuals who created them under tightly controlled circumstances, but personas can act for individuals in almost all situations, from ecommerce to border crossing.

## Distributed Prediction in Large-Scale Astrophysical Datasets

S.M. McConnell, D.B. Skillicorn and J. Irwin,
Special Issue on Data/Text Mining Applications on Supercomputing and Grid Environments, Mediterranean Journal of Computers and Networks, Volume 3, Number 2, 62--71, 2007.

Different records of a dataset are often collected in different locations. Increasingly, different attributes of the same record are collected in different locations as well. Astrophysical data are an important example of this class since different instruments collect data about the same objects in different wavebands; instruments also collect data about different objects, for example because of their relative positions. We present a lightweight distributed prediction algorithm based on local predictor construction on partitions of a dataset, even partitions by attributes, followed by global voting. The computation and communication requirements of the algorithm are compatible with distributed systems and grids; privacy is preserved because raw data never leave local sites; and prediction accuracy is comparable with a centralized algorithm using the same base predictor.

## Detecting Word Substitutions in Text

SW. Fong, D.B. Skillicorn, D. Roussinov,
IEEE Transactions on Knowledge and Data Engineering, Special Issue on Intelligence and Security Informatics, 2008.

Searching for words on a watchlist is one way in which large-scale surveillance of communication can be done, for example in intelligence and counterterrorism settings. One obvious defense is to replace words that might attract attention to a message with other, more innocuous, words. For example, the sentence the attack will be tomorrow" might be altered to the complex will be tomorrow", since complex' is a word whose frequency is close to that of attack'.

Such substitutions are readily detectable by humans since they do not make sense. We address the problem of detecting such substitutions automatically, by looking for discrepancies between words and their contexts, and using only syntactic information. We define a set of measures, each of which is quite weak, but which together produce per-sentence detection rates around 90% with false positive rates around 10%. Rules for combining per-sentence detection into per-message detection can reduce the false positive and false negative rates for messages to practical levels. We test the approach using sentences from the Enron email and Brown corpora, representing informal and formal text respectively.

## Distributed Prediction from Vertically Partitioned Data

S.M. McConnell and D.B. Skillicorn,
Journal of Parallel and Distributed Computing, Vol. 68, 16--36, 2008.

We address the problem of clustering of data that is vertically partitioned, that is where local sites hold some of the attributes of all of the records. This situation is natural when data is collected by channels that are physically separated. For distributed clustering, we design a technique that computes global clusters from local clusters computed at each site, without requiring either large amounts of storage or computation at the center. The technique is simple, outperforms a centralized algorithm for most datasets, distributes computation and data access, and preserves the privacy of each local site.

## Novel Information Discovery for Intelligence and Counterterrorism

D.B. Skillicorn, N. Vats,
Decision Support Systems, 2006.

Intelligence analysts construct hypotheses from large volumes of data, but are often limited by social and organizational norms and their own preconceptions and biases. The use of exploratory data-mining technology can mitigate these limitations by requiring fewer assumptions. We present the design of the Athens system, which discovers novel information, relative to a specified set of existing knowledge, in large information repositories such as the world wide web. We illustrate the use of the system by starting from the terms "al Qaeda" and "bin Laden" and running the Athens system as if on September 12th, 2001. This provides a picture of what novel information could have been known at the time. This is of some intrinsic interest, but also serves to validate the performance of the system since much of this novel information has been discovered by conventional means in the intervening

## Extracting Knowledge from Graph Data in Adversarial Settings

D.B. Skillicorn,
Mathematical Methods in Counterterrorism, Springer, 2009.

Graph data captures connections and relationships among individuals, and between individuals and objects, places, and times. Because many of the properties of graphs are emergent, they are resistant to manipulation by adversaries. This robustness comes at the expense of more-complex analysis algorithms. We describe several approaches to analysing graph data, illustrating with examples from the relationships within al Qaeda.

## Individual and Collective Analysis of Anomalies in Message Traffic

D.B. Skillicorn,
in Terrorism Informatics: Knowledge Management and Data Mining for Homeland Security, Chen, Reid, Sinai, Silke, Ganor, Eds., 20pp, 2008.

We consider four properties by which intercepted messages can be selected for deeper analysis: their external properties, their content, their authorship, and the mental state of their authors. We argue that, rather than trying to differentiate directly between good' messages and bad' messages, it is better to use a two-pronged approach, where a simple detection scheme triggers a reaction in authors of `bad' messages. This reaction is easier to detect than the original difference. We also suggest that differentiation is more effective when it is done for sets of messages, rather than on a message by message basis.

## Social Network Analysis via Matrix Decompositions

D.B. Skillicorn,
in Emergent Information Technologies and Enabling Policies for Counter-Terrorism, R.L. Popp and J. Yen, Eds., Wiley, 2006, 367--392.

## Structures in Collaborative Tagging: An Empirical Analysis

A. Fani Marvasti and D.B. Skillicorn,
Proceedings of the Thirty-Third Australasian Computer Science Conference (ACSC 2010), Brisbane, Australia, Conferences in Research and Practice in Information Technology Volume 102, 109--116.

It is widely believed that users choose meaningful tags; that the combinations of these tags from multiple users helps to elicit meaning for tagged objects; and that implicit communities can be discovered among the users who use a particular tag or tag a particular object. We examine data on tagging practices from the Delicious web site (delicious.com) and find that there is little support for any of these beliefs. Although users individually do seem to have a small set of tags that they use in a controlled and effective way, they also use very large sets of other tags in a much more haphazard way. These poorly managed tags obscure much of the collective sense making and implicit community structure. We derive some suggestions for improving collaborative tagging systems.

## Guarantee-Based Access Control

M.Hussain and D.B. Skillicorn,
SecCom-09 (International Symposium on Secure Computing 2009), August 2009.

New web technologies, like the Semantic Web and the Social Web, have changed the way services and information are accessed. The new technologies allow for more usable and interoperable services to be realized. They help service providers to reach more individuals. Automatic service discovery and invocation can also benefit from these technologies.

Unfortunately, several threats regarding security, privacy, and trust comes along with these benefits. For example, making services interoperable increases the chance of profiling individuals, which constitutes a privacy threat. Security measures, like access control, should handle the potential threats of opening the previously encapsulated web services.

We present a new access control model, guarantee-based access control. The new model constructs individuals certificates' based on guarantees rather than attributes. The guarantees are then used as the basis for access control decisions. This allow access control to be fine-grained, and avoids the burden of the management of roles. The model suits the openness nature of the new web technologies. It resists the threats against individuals privacy, such as profiling. The model permits access to be not only based on one individual, but also based on a set of individuals having different relationships.

## Hardening Adversarial Prediction with Anomaly Tracking

M.A.J. Bourassa and D.B. Skillicorn,
IEEE Intelligence and Security Informatics 2009, 43--48.

Predictors are often regarded as black boxes that treat all incoming records exactly the same, regardless of whether or not they resemble those from which the predictor was built. This is inappropriate, especially in adversarial settings where rare but unusual records are of critical importance and some records might occur because of deliberate attempts to subvert the entire process. We suggest that any predictor can, and should, be hardened by including three extra functions that watch for different forms of anomaly: input records that are unlike those previously seen (novel records); records that imply that the predictor is not accurately modelling reality (interesting records); and trends in predictor behavior that imply that reality is changing and the predictor should be updated. Detecting such anomalies prevents silent poor predictions, and allows for responses, such as: human intervention, using a variant process for some records, or triggering a predictor update.

## Automated Concept Location Using Independent Component Analysis

S. Grant, J. Cordy and D.B. Skillicorn,
Short paper, 15th Working Conference on Reverse Engineering (WCRE), Antwerp, October 2008.

## Classifying Evolving Data Streams Using Dynamic Streaming Random Forests

H. Abdulsalam, D. B. Skillicorn, and P. Martin,
19th International Conference on Database and Expert Systems Applications - DEXA'08, Turin, December 2008.

## Persona-based Identity Management: A Novel Approach to Privacy Protection

M. Hussain and D.B. Skillicorn,
In proceedings of the 13th Nordic Workshop on Secure IT-Systems, Technical University of Denmark, pp. 201-212, Lyngby, Denmark, October 2008.

Protecting privacy is often conceived and implemented as a matter of controlling the flow of identity and identity-related information. This, for instance, is the strategy of much privacy legislation and business privacy policy. Such approaches are increasingly weakened by the ability of data-mining techniques to fuse these partial identities into a complete identity, even using attributes that are not normally considered to be identifying.

We propose instead that the solution to protecting privacy is the use of artificial identities, generated robustly using cryptographic techniques and managed by individuals as they wish (with ultimate traceability when required by, for example, law enforcement). The existence of a framework that can protect privacy even in the presence of large-scale data collection and analysis makes it possible for individuals and their agents to participate in online activity without the reluctance that is increasingly associated with current approaches.

## Detecting Deception in Testimony

A. Little and D.B. Skillicorn,
IEEE International Conference on Intelligence and Security Informatics 2008, June 2008, Taiwan, 13--18.

Several models for deception in text, based on changes in usage frequency of certain classes of words, have been proposed. These are empirically derived from settings in which individuals are asked to lie or be truthful in freeform text. We consider the problem of detecting deception in testimony, where the content generated must necessarily be responsive to questions, where there is the opportunity for immediate followup if the possibility of deception is detected by the questioner, and where those who have reasons to be deceptive have time and motivation to rehearse potential answers. Using the testimony to the Gomery Commission, a situation in which many witnesses had some motivation to be deceptive, we propose and validate a model for deception in testimony. We achieve substantial (80%) agreement with media estimates of who was motivated to testify in a deceptive way.

## Subverting Prediction in Adversarial Settings

J.G. Dutrisac and D.B. Skillicorn,
IEEE International Conference on Intelligence and Security Informatics 2008, June 2008, Taiwan, 19--24 (Honorable Mention, Best Paper).

We show that two mainstream prediction techniques, support vector machines and decision trees, are easily subverted by inserting carefully-chosen training records. Furthermore, the relationship between the properties of the inserted record(s) and the regions for which the predictor will subsequently misclassify can be inferred, so desired misclassifications can be forced. In adversarial settings, it is plausible that manipulation of this kind will be attempted, so this has implications for the design of prediction systems and the use of off-the-shelf technology, especially as support vector machines are one of the most powerful prediction algorithms known.

## Hiding Clusters in Adversarial Settings

J.G. Dutrisac and D.B. Skillicorn,
IEEE Conference on Intelligence and Security Informatics 2008, June 2008, Taiwan, 185--187.

In adversarial settings, records associated with those who want to conceal their existence or activities tend to be unusual because of their illicit status; but not too unusual because of efforts to make them as normal as possible. Clusters of such records will not be single outliers or even outlying clusters, but rather small clusters on the fringes of normal clusters. Such structures are undetectable by many mainstream clustering algorithms, for example those based on distance and convexity. We show that even more sophisticated clustering algorithms are easily subverted by the addition of only a few carefully chosen records. Robust clustering in adversarial settings will require the development of more sophisticated algorithms tailored to this domain.

## Integrative Visual Data Mining of Biomedical Data: Investigating cases in Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia

P. Kennedy, S.J. Simoff, D.R. Catchpoole, D.B. Skillicorn, F. Ubaudi and Ahmad Al-Oqaily,
Springer, 23pp.

## Detecting Word Substitutions: PMI vs. HMM

D. Roussinov, SW. Fong, and D.B. Skillicorn,
(Poster), SIGIR 2007, Amsterdam, 2007.

## Random Forests for Streaming Data

H. Abdulsalam, P. Martin, and D.B. Skillicorn,
Workshop on Multimedia Mining and Knowledge Discovery, Waterloo, 17-18th October, 2007, Poster (Runner-up for Best Poster).

## Streaming Random Forests

H. Abdulsalam, D.B. Skillicorn, and P. Martin,
The Eleventh International Database Engineering and Applications Symposium (IDEAS'07), Banff, August 2007.

## Matching for Mobile Users in the Publish/Subscribe Paradigm

A. M. Roumani and D. B. Skillicorn,
International Conference on Wireless Information Networks and Systems (WINSYS 2007), Barcelona, Spain, July 28-31, 2007.

In a mobile publish/subscribe paradigm, user service discovery and recommendation requires matching user preferences with properties of published services. This is a difficult problem when users are mobile, wirelessly connected, and dynamically roaming. The available data is very large, and the matching must be computed in real time. Existing heuristics are quite ineffective. We propose novel algorithms that use singular value decomposition as a dimension-reduction technique. We introduce "positive" nearest-neighbor matching to find services whose attribute values exceed those of a new user subscription. For n services and m preference attributes, reasonable matches can be found in time O(m log n), using O(nm) storage.

## Positive Matching of Resources in Computational Grids

A.M. Roumani and D.B. Skillicorn,
2007 Internation Conference on Grid Computing and Applications (GCA'07), Las Vegas, 2007.

In computational grids, resource discovery and selection requires matching job requirements to available resources. The properties of each job must be matched with those of existing resources as accurately as possible. The difficulty of the problem increases with the number of attributes for each resource/job. Existing heuristics are quite ineffective.

We propose novel algorithms that use singular value decomposition as a dimension-reduction technique, and match in a collection of lower-dimensional spaces. We introduce "positive" nearest-neighbor matching, the problem of finding resources whose attribute values exceed those of a given job. Singular value decomposition is, in some sense, an obvious approach but requires careful application to work effectively in this setting. Performance and quality of matches are measured using datasets representing applications in the grid. We show that reasonable matches can be found in time O(m log n), using O(nm) storage space, where n is the number of resources and m is the number of attributes for which a job might have requirements.

## Finding the Positive Nearest Neighbor in Recommender Systems

A.M. Roumani and D.B. Skillicorn,
2007 International Conference on Data Mining (DMIN'07), Las Vegas, 2007.

## Novel-Knowledge Discovery: Challenges and Design Theory

T.A. Jenkin, Y.E. Chan and D.B. Skillicorn,
Proceedings of the Annual Conference of the Administrative Sciences Association of Canada, Ottawa, Ontario, June 2007.

## Detecting Anomalies in Graphs

D.B. Skillicorn,
IEEE Conference on Intelligence and Security Informatics (ISI2007), IEEE CS Press, 209-216, 2007.

Graph data represents relationships, connections, or affinities. Innocent relationships produce repeated, and so common, substructures in graph data. We present techniques for discovering anomalous substructures in graphs, for example small cliques, nodes with unusual neighborhoods, or small unusual subgraphs, using extensions of spectral graph techniques commonly used for clustering. Although not all anomalous structure represents terrorist or criminal activity, it is plausible that all terrorist or criminal activity creates anomalous substructure in graph data. Using our techniques, unusual regions of a graph can be selected for deeper analysis.

## Adaptiveness in Well-Typed Java Bytecode Verification

F.Y. Huang, C.B. Jay and D.B. Skillicorn,
CASCON 2006, 248--262, 2006.

## Improving a Textual Deception Detection Model

S. Gupta and D.B. Skillicorn,
CASCON 2006, 345--348, 2006.

## Does CFS Have a Biological Basis? -- A Constructionist Approach

P.J. Kennedy, S.J. Simoff, D.R. Catchpoole, F. Abaudi, A. Al-Oqaily, S. Yildiz, Y. Du, D.B. Skillicorn,
CAMDA '06, June 2006, 5pp.

## Measures to Detect Word Substitution in Intercepted Communication

SW. Fong, D.B. Skillicorn, D. Roussinov,
IEEE Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, LNCS 3975, 2006.

## Detecting Word Substitution in Adversarial Communication

SW. Fong, D.B. Skillicorn, D. Roussinov,
Workshop on Link Analysis, Counterterrorism, and Security, SIAM International Data Mining Conference, April 2006.

## Programming with Heterogeneous Structures: Manipulating XML data Using bondi

F.Y. Huang, C.B. Jay, D.B. Skillicorn,
Australasian Computer Science Week Conference, Hobart, Tasmania, January 14--16th 2006.

Manipulating semistructured data, such as XML, does not fit well within conventional programming languages. A typical manipulation requires finding all occurrences of a structure matching a structured search pattern, whose context may be different in different places, and both aspects cause difficulty. If a special-purpose query language is used to manipulate XML, an interface to a more general programming environment is required, and this interface typically creates a typing mismatch and incurs runtime overhead for type checking. However, adding XML manipulation to a general-purpose programming language has proven difficult because of problems associated with expressiveness and typing.

In this paper we show an alternative approach that handles many kinds of patterns within an existing strongly-typed general-purpose programming language called bondi. The key ideas are to express complex search patterns as structures of simple patterns, pass these complex patterns as parameters to generic data-processing functions and traverse heterogeneous data structures by a generalized form of pattern matching. These ideas are made possible by the language's support for pattern calculus, which enables path and pattern polymorphism. With this approach, adding a new kind of patterns is just a matter of programming, not language design.