David Skillicorn - Recent Publications
Most analysis of influence looks at the mechanisms used, and how effectively they work on the intended audience. Here we consider influence from another perspective: what do the language choices made by influencers enable us to detect about their internal mental states, strategies, and assessments of success? We do this by examining the language used by U.S. presidential candidates in the high-stakes attempt to get elected. Such candidates try to influence potential voters, but must also pay attention to the parallel attempts by their competitors to influence the same pool. We examine seven channels: persona deception, the attempt by each candidate to seem as attractive as possible; nouns, as surrogates for content; positive and negative language; and three categories that have received little attention: verbs, adverbs, and adjectives. Although the results are preliminary, several intuitive and expected hypotheses are supported, but some unexpected and surprising structures also emerge. The results provide insights into related influence scenarios where open-source data is available, for example marketing, business reporting, and intelligence.
We analyze the posts in the Islamic Awareness forum, using models for content, for Salafist-Jihadist language, and for deception. These last two models each produce a single-factor ranking enabling, in each case, the most useful subset of posts to be selected for further analysis. Posts that rank highly for Salafist-Jihadist language rank low for deception, suggesting that faking extremist websites is probably an ineffective strategy. The process described here is a template for analysis of many kinds of open-source corpora where language models of what makes posts interesting are known.
Because of increasing vulnerabilities, maturing attack tools, and increasing dependence on computer network infrastructure, tools to support network defenders are essential. Course-of-action recommendation research has often assumed a goal of perfect network security. In reality, network administrators balance security with usability and so tolerate vulnerabilities and imperfect security. We provide realistic course-of-action decision support for network administrators by representing attack graphs as forward directed hypergraphs, and minimizing connectivity by optimizing the removal of nodes to separate source and target nodes as much as possible, even when complete disconnection is impractical. We introduce vertex closures and closure-relation graphs in forward directed hypergraphs as the underlying framework. Computing an optimal course-of-action is NP-hard but we design a polynomial-time greedy algorithm that almost always produces an optimal solution. Our algorithms are integrated with commercial and open-source network security software.
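The greedy course-of-action idea can be illustrated on an ordinary directed graph (the paper itself works with forward directed hypergraphs, vertex closures, and closure-relation graphs, which this sketch does not model). The toy graph, node names, and scoring rule below are invented for illustration:

```python
from collections import deque

def reachable(adj, removed, src):
    """Nodes reachable from src in a directed graph, skipping removed nodes."""
    seen = {src}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in removed and v not in seen:
                seen.add(v)
                q.append(v)
    return seen

def greedy_course_of_action(adj, src, tgt, budget):
    """Remove up to `budget` intermediate nodes, greedily picking the node
    whose removal best separates src from tgt (first by disconnecting tgt,
    then by shrinking the reachable set), stopping early once tgt is cut off."""
    removed = set()
    candidates = (set(adj) | {v for vs in adj.values() for v in vs}) - {src, tgt}
    for _ in range(budget):
        best, best_score = None, None
        for c in candidates - removed:
            r = reachable(adj, removed | {c}, src)
            score = (tgt in r, len(r))
            if best_score is None or score < best_score:
                best, best_score = c, score
        if best is None:
            break
        removed.add(best)
        if tgt not in reachable(adj, removed, src):
            break
    return removed

# Toy attack graph: two routes from the attacker's foothold to the database.
g = {"attacker": ["web", "mail"], "web": ["app"], "mail": ["app"], "app": ["db"]}
cut = greedy_course_of_action(g, "attacker", "db", budget=2)
```

Here removing the single choke-point node suffices, so the greedy choice also happens to be optimal; when complete disconnection is impractical, the same loop simply minimizes how much of the graph stays reachable.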
Fraud in public companies has a large financial impact, and yet is only weakly detected by those who look for it; many incidents have been detected only when whistleblowers have come forward. We examine the problem of detecting fraud from the textual component of the quarterly and annual reports that public companies are required to file. Using an empirically derived set of words, we achieve prediction accuracy up to 88%. Frauds rarely involve only a single quarter, so it is actually more useful to consider prediction performance on a per-incident basis. The truthfulness probability of our measure shows consistent decreases in the quarters leading up to a fraud, creating opportunities for proactive enforcement. We also compare the prediction performance of our word list with Pennebaker's deception model, and with a set of fixed lists suggested in the literature, only one of which has any predictive power.
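As a rough illustration of word-list scoring (not the paper's empirically derived list or its actual classifier), a report can be scored by the net rate of fraud-linked over truth-linked words. The word lists below are invented placeholders:

```python
def fraud_score(text, fraud_words, truth_words):
    """Net rate of fraud-linked over truth-linked words in a report;
    higher scores suggest closer scrutiny is warranted."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    f_hits = sum(t in fraud_words for t in tokens)
    t_hits = sum(t in truth_words for t in tokens)
    return (f_hits - t_hits) / len(tokens)

# Hypothetical word lists, for illustration only -- the paper's list is
# derived empirically from labelled filings.
FRAUD = {"restated", "unforeseen", "challenging"}
TRUTH = {"consistent", "stable", "recurring"}

risky = fraud_score("results were restated after unforeseen charges", FRAUD, TRUTH)
calm = fraud_score("revenue remained stable and consistent", FRAUD, TRUTH)
```

Tracking such a score across consecutive quarters is what makes the per-incident view possible: a sustained drift, rather than one bad quarter, is the signal.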
Almost all social network analysis treats edges (relationships) as having different intensities (weights), but the same qualitative properties. We address the problem of modelling edges of qualitatively different types that nevertheless interact with one another. For example, influence flows along friend and colleague edges differently, but treating the two sets of different kinds of edges as independent graphs surely misses interesting and useful structure.
We model the subgraph corresponding to each edge type as a layer, and show how to weight the edges connecting the layers to produce a consistent spectral embedding, including for directed graphs. This embedding can be used to compute social network properties of the combined graph, to predict edges, and to predict edge types. We illustrate with Padgett's dataset of Florentine families in the 15th Century.
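A minimal sketch of the layered-embedding idea: place each layer's adjacency matrix on a diagonal block of a supra-adjacency matrix, connect the copies of each node across layers with a coupling weight, and embed via the leading eigenvector (here by power iteration). The coupling scheme and toy data are assumptions for illustration, not the paper's weighting:

```python
def supra_adjacency(layers, w):
    """Supra-adjacency matrix for k layers over the same n nodes: layer
    adjacency matrices on the diagonal blocks, coupling weight w between
    the k copies of each node."""
    n, k = len(layers[0]), len(layers)
    N = n * k
    A = [[0.0] * N for _ in range(N)]
    for l, adj in enumerate(layers):
        for i in range(n):
            for j in range(n):
                A[l * n + i][l * n + j] = float(adj[i][j])
    for i in range(n):
        for l1 in range(k):
            for l2 in range(k):
                if l1 != l2:
                    A[l1 * n + i][l2 * n + i] = w
    return A

def leading_eigvec(A, iters=200):
    """Power iteration for the dominant eigenvector: a one-dimensional
    spectral embedding of the combined multilayer graph."""
    n = len(A)
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        m = max(abs(v) for v in y) or 1.0
        x = [v / m for v in y]
    return x

# Two edge types over the same three nodes: a 'friend' triangle and a
# single 'colleague' tie.
friend    = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
colleague = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
emb = leading_eigvec(supra_adjacency([friend, colleague], w=0.5))
```

Each node gets one coordinate per layer, so the embedding can separate a node's role as a friend from its role as a colleague while still being computed from one consistent matrix.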
The idea that Canadian-based terrorists pose a threat to the United States continues to resonate with Americans. We subject this hypothesis to empirical testing by analyzing terrorist-related activity across the Canada-US border. Drawing on 13 cases with 27 terrorist connections, the evidence substantiates the presence of cross-border interactions, but does not confirm common perceptions about America’s northern border: there is no consistent threat emanating from Canada. Rather, differentials in the availability of ideas and resources drive threat vectors across the border in both directions. The bulk of violent extremists exploiting these cross-border markets of opportunity do so to propagate terrorism beyond North America.
An analyst, presented with a corpus too large to read every document, must find some selection mechanism. A model for interestingness can be used to rank the documents so that only the subset at the top of the ranking need be examined. However, in many open-source intelligence settings, such a model is not known in advance. We design three measures for ranking documents by internal variability as a weak surrogate for interestingness. Selecting those documents ranked highly by these measures selects a superset of the documents an analyst might need to read, no matter what the specific model, and reduces the size of the corpus by an order of magnitude. We also discover that many corpora contain documents that are highly variable, but not interesting, and show how to remove them.
We describe a technique that calculates the expected relationships among attributes from training data, and uses this to generate anomaly scores reflecting the intuition that a record with anomalous values for related attributes is more anomalous than one with anomalous values for unrelated attributes. The expected relations among attributes are calculated in two ways: using a data-dependent projection via singular value decomposition, and using the maximal information coefficient. Sufficiently anomalous records are displayed on a sensor dashboard, making it possible for an analyst to judge why each record has been classified as anomalous. The technique is illustrated for an intrusion detection dataset, and a set of contract descriptors.
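A simplified sketch of relation-weighted anomaly scoring, using plain Pearson correlation in place of the paper's SVD projection and maximal information coefficient; the data is invented:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation of two attribute columns (0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

def anomaly_scores(data):
    """Score each record so that joint deviation on correlated attribute
    pairs counts for more than deviation on unrelated pairs."""
    cols = list(zip(*data))
    m = len(cols)
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    rel = [[abs(pearson(cols[i], cols[j])) for j in range(m)] for i in range(m)]
    scores = []
    for row in data:
        z = [abs(v - mu) / sd for v, mu, sd in zip(row, mus, sds)]
        scores.append(sum(rel[i][j] * z[i] * z[j]
                          for i in range(m) for j in range(i + 1, m)))
    return scores

# The two attributes move together in the first four records; the last
# record breaks the relationship and should score highest.
data = [(1, 2), (2, 4), (3, 6), (4, 8), (10, 1)]
scores = anomaly_scores(data)
```

Because each pairwise deviation is weighted by the strength of the attributes' relationship, the dashboard can also report *which* pair drove a record's score, which is what lets an analyst judge why it was flagged.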
Theories of radicalization make implicit predictions about variation among attitudes in the communities from which radicals are drawn. This article subjects some popular theories to empirical testing. The precise process by which individuals come to sympathize with, provide material support for, or actually engage in political violence is fully comprehensible only by longitudinal analysis, but much existing work tries to reconstruct the process by looking only at one part of its outcomes: those who have become radicalized. The result is a large number of theories and mechanisms, with little compelling empirical support. A cross-sectional snapshot of an at-risk community cannot definitively support a particular theory of radicalization, but it can rule out those whose predictions about attitudes are at variance with the empirical results. We designed a survey instrument to gauge attitudes to issues widely believed to be relevant to radicalization and deployed it among Muslim communities in Ottawa. The results are remarkably inconsistent with patterns of variation in attitudes predicted by popular theories of radicalization. Instead, they show variation of attitudes along three independent dimensions: social/economic/political satisfaction/dissatisfaction, moral/religious satisfaction/dissatisfaction, and a dimension that appears to be associated with radicalization. This suggests that governments may have less policy leverage to mitigate radicalization than generally supposed.
Finding the most "interesting" document, or ranking documents by interestingness, in a large corpus is an important problem. We consider the most general case where there is no external model of interestingness, and suggest that internal variability is a plausible and useful surrogate. We define three measures of variability for documents, establish baselines for the corpus-wide structure of variability, and rank documents from four demonstration corpora using these measures. Although validation is necessarily problematic, we show that these measures reveal different aspects of variability, but rank highly documents that could be considered interesting in a useful sense.
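One plausible variability measure (a hypothetical stand-in, not necessarily one of the paper's three): average dissimilarity between the vocabularies of consecutive windows of words, so that a document whose successive passages keep introducing new vocabulary ranks as more variable:

```python
def variability(text, window=5):
    """Internal variability as the average dissimilarity (1 - Jaccard
    overlap) between the vocabularies of consecutive windows of words."""
    words = text.lower().split()
    segs = [set(words[i:i + window]) for i in range(0, len(words), window)]
    segs = [s for s in segs if s]
    if len(segs) < 2:
        return 0.0
    dis = [1 - len(a & b) / len(a | b) for a, b in zip(segs, segs[1:])]
    return sum(dis) / len(dis)

repetitive = variability("the cat sat on the mat " * 6)
varied = variability("quarterly revenue fell while network traffic spiked "
                     "and the committee debated unrelated tagging policies")
```

Ranking a corpus by such a score and reading only the top of the ranking is the selection mechanism the abstract describes; the baseline distribution of scores over the whole corpus tells the analyst where "high" begins.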
In adversarial settings, the interests of modellers and those being modelled do not run together. This includes domains such as law enforcement and counterterrorism but, increasingly, also more mainstream domains such as customer relationship management. The conventional strategy, maximizing the fit of a model to the available data, does not work in adversarial settings because the data cannot all be trusted, and because it makes the results too predictable to adversaries. Some existing techniques remain applicable, others can be used if they are modified to allow for the adversarial setting, while others must be discarded. General principles for this domain are discussed and the implications for practice outlined.
Organizations that innovate encounter challenges due to the complexity and ambiguity of generating and making sense of novel ideas. These challenges are exacerbated in group settings; we describe them and propose potential solutions. Specifically, we design group processes to support novel idea generation and selection, including use of a novel-information discovery (NID) tool to support creativity and brainstorming, as well as group support system and collaborative-filtering tools to support evaluation and decision making. Results indicate that the NID tool increases efficiency and effectiveness in creative tasks and that the collaborative-filtering tool can support the decision-making process by focusing the group’s attention on ideas that might otherwise be neglected. Combining these two novel tools with group processes provides valuable contributions to both research and practice.
Trust plays a key role in enhancing user experience at service providers. Reputation management systems are used to quantify trust, based on some reputation metrics. Anonymity is an important requirement in these systems, since most individuals expect that they will not be profiled by participating in the feedback process. Anonymous Reputation Management (ARM) systems allow individuals to submit their feedback anonymously. However, this solves only part of the problem: anonymous ratings by one individual can still be linked to each other, enabling the system to easily build a profile of that individual, and data mining techniques can use the profile to re-identify that individual. We call this the linkability problem. This paper presents an anonymous reputation management system that avoids the linkability problem. This is achieved by constructing a system that empowers individuals to interact with and rate service providers, securely and anonymously.
Intelligence and law enforcement agencies collect large datasets, but have difficulty focusing analyst attention on the most significant records and structures within them. We address this problem using suspicion, which we interpret as relevant anomaly, as the measure associated with data records and individuals. For datasets collected about widespread activities in which the signs of adversarial activity are rare, we suggest ways to build predictive models of suspicion. For datasets collected as the result of lawful interception, we suggest a model of suspicion spreading using the social network implied by the intercepted data.
In this work, we explore the relationship between topic models and co-maintenance history by introducing a visualization that compares conceptual cohesion within changelists. We explain how this view of the project history can give insight about the semantic architecture of the code, and we identify a number of patterns that characterize particular kinds of maintenance tasks. We examine the relationship between co-maintenance history and concept location, and visualize the distribution of changes across concepts to show how these techniques can be used to predict co-maintenance of source code methods.
The poor locality of operation descriptions expressed in the Web Service Description Language (WSDL) makes them difficult to analyze and compare in web service discovery tasks. This problem has led us to develop a new method for service operation comparison involving contextualizing operation descriptions by inlining related type information from other sections of the service description. In this paper, we show that this contextualization of web service descriptions can enable topic models (statistical techniques for identifying relationships) to produce semantically meaningful results that can be used to reverse engineer service-oriented web systems and automatically identify related web service operations. Specifically, we model contextualized WSDL service operations using Latent Dirichlet Allocation, and show how this approach can be used to more accurately find similar web service operations.
Attention is the critical resource for intelligence analysts, so tools that provide focus are useful. One way to determine focus is by computing significance. In the context of a known model, new data can be placed on a spectrum defined by: normal, anomalous, interesting, novel, or random; and significance is greatest towards the middle of this spectrum. However, significance also depends on the mental state of the analyst (and the organization). A framework for understanding significance is defined, and its impact on the knowledge discovery process is explored.
Security and intelligence data often contains many records associated with good actions, but few, perhaps none, associated with bad actions. For example, aircraft passenger screening systems contain billions of records about normal travelers for each record about a terrorist or hijacker. In crime, fraud, tax evasion, and money laundering, the ratio of good to bad records might not be quite as great, but examples of bad records are still rare. Many of these settings are not only unbalanced in the number of examples of different kinds of records, they are also adversarial because those whose actions generate the bad records are trying to conceal themselves and also perhaps disrupt the modeling that is being done. In these settings, we still need to be able to build robust inductive models (predictors, clusterings, and rankings) from the data, but there are new issues to be considered.
While the majority of the research on anonymity is focused on individuals, there is an increasing number of scenarios that demand anonymity for service providers as well. For example, in many business to consumer (B2C) scenarios, service providers sell their surplus to individuals for lower prices through arbitrageurs. Those service providers must remain anonymous, to avoid discouraging customers from buying directly from the service provider, at the regular price. Some providers wish to be anonymous for various reasons, including but not limited to, escaping denial of service attacks and avoiding censorship. Enabling service providers to prove relations to other providers is also needed in business to business (B2B) scenarios. This paper shows that allowing service providers to act as individuals in privacy-preserving systems is not sufficient. The paper describes a system that allows individuals and service providers to participate in secure and anonymous transactions.
Documents from the Ansar aljihad forum are ranked using a number of word-usage models. Analysis of overall content shows that postings fall strongly into two categories. A model describing Salafist-jihadi content generates a very clear single-factor ranking of postings. This ranking could be interpreted as selecting the most radical postings, and so could direct analyst attention to the most significant documents. A model for deception creates a multifactor ranking that produces a similar ordering, with low-deception postings identified with highly Salafist-jihadi ones. This suggests either that such postings are extremely sincere, or that personal pronoun use and intricate structuring are also markers of Salafist-jihadi language. Although the overall approach is relatively straightforward, the choice of parameters to maximize the usefulness of the results is intricate.
We ask whether the text of the Management Discussion and Analysis section (MD&A) of quarterly and annual reports holds important clues regarding the integrity of the financial report. By building a detection algorithm based on the most frequent words present in a sample of fraudulent and truthful reports, we correctly classify approximately 87% of all reports as either fraudulent or not. This rate falls to 80% for a hold-out sample excluded from the training set. While the method is particularly strong in correctly capturing financial misrepresentations, truthful statements are incorrectly labeled as fraudulent approximately 16% of the time. These false positive classifications frequently occur in quarters immediately surrounding actual occurrences of fraud. We compare our method to a more traditional approach suggested by Dechow, Ge, Larson and Sloan (2009) that is based on quantitative variables. We find that the correlation between predictions from the two methods is only 0.13, suggesting that the text of financial reports can be a powerful supplement to the identification of fraudulent reports using quantitative variables alone.
Predictors are often regarded as black boxes that treat all incoming records exactly the same, regardless of whether or not they resemble those from which the predictor was built. This is inappropriate, especially in adversarial settings where rare but unusual records are of critical importance and some records might occur because of deliberate attempts to subvert the entire process. We suggest that any predictor can, and should, be hardened by including three extra functions that watch for different forms of anomaly: input records that are unlike those previously seen (novel records); records that imply that the predictor is not accurately modelling reality (interesting records); and trends in predictor behavior that imply that reality is changing and the predictor should be updated. Detecting such anomalies prevents silent poor predictions, and allows for responses, such as: human intervention, using a variant process for some records, or triggering a predictor update.
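A minimal sketch of the first of the three watchdogs, novelty detection by distance to the training data; the wrapper interface, threshold, and toy data are all assumptions for illustration:

```python
class HardenedPredictor:
    """Wrap a base predictor with a novelty watchdog: inputs far from
    anything seen in training are flagged rather than silently scored."""
    def __init__(self, predict, train_points, radius):
        self.predict = predict
        self.train_points = train_points
        self.radius = radius

    def __call__(self, x):
        # Euclidean distance to the nearest training record.
        nearest = min(sum((a - b) ** 2 for a, b in zip(x, p)) ** 0.5
                      for p in self.train_points)
        if nearest > self.radius:
            return ("novel", None)   # route to human review or a variant process
        return ("ok", self.predict(x))

train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
model = HardenedPredictor(lambda x: x[0] > 1.0, train, radius=1.5)
status_ok, _ = model((1.1, 0.9))
status_novel, _ = model((10.0, -10.0))
```

The other two watchdogs would wrap the same call: one comparing predictions against eventual outcomes to spot interesting records, one tracking prediction statistics over time to trigger retraining.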
New web technologies, such as the Semantic Web and the Social Web, have changed the way services and information are accessed. These new technologies allow more usable and interoperable services to be realized. They help service providers to reach more individuals. Automatic service discovery and invocation can also benefit from these technologies.
Unfortunately, several threats to security, privacy, and trust come along with these benefits. For example, making services interoperable increases the chance of profiling individuals, which constitutes a privacy threat. Security measures, such as access control, should handle the potential threats of opening previously encapsulated web services.
We present a new access-control model, guarantee-based access control. The new model constructs certificates based on guarantees rather than attributes. These guarantees are then used as the basis for access-control decisions. This allows access control to be fine-grained, and avoids the burden of designing and managing roles. The model suits the open nature of new web technologies. It resists threats to individuals' privacy, such as profiling. The model also permits access rights to be based on a set of individuals in a particular structured relationship.
This book summarizes the state of the art in adversarial knowledge discovery when it was published.
This book summarizes the ways in which various matrix decompositions (singular value decomposition, semidiscrete decomposition, independent component analysis, non-negative matrix factorization) can be used as data mining tools.
We consider the problem of data-stream classification, where the data arrives in a conceptually-infinite stream, and the opportunity to examine each record is brief. We introduce a stream-classification algorithm that is online, running in amortized O(1) time, able to handle intermittent arrival of labelled records, and able to adjust its parameters to respond to changing class boundaries ("concept drift") in the data stream. In addition, when blocks of labelled data are short, the algorithm is able to judge internally whether the quality of models updated from them is good enough for deployment on unlabelled records, or whether further labelled records are required. Unlike most proposed stream-classification algorithms, multiple target classes can be handled.
Experimental results on real and synthetic data show that accuracy is comparable to a conventional classification algorithm that sees all of the data at once and is able to make multiple passes over it.
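Not the paper's algorithm, but a minimal illustration of the online-update idea it relies on: a nearest-centroid classifier whose centroids are updated in constant time per labelled record, with exponential decay so the centroids can track drifting class boundaries:

```python
class StreamClassifier:
    """Online nearest-centroid classifier: each labelled record updates its
    class centroid in O(d) time, and exponential decay lets centroids drift
    as class boundaries move. Any number of classes is handled."""
    def __init__(self, dim, decay=0.1):
        self.dim, self.decay = dim, decay
        self.centroids = {}

    def learn(self, x, label):
        # First record of a class seeds its centroid; later records nudge it.
        c = self.centroids.setdefault(label, list(x))
        for i in range(self.dim):
            c[i] += self.decay * (x[i] - c[i])

    def predict(self, x):
        return min(self.centroids,
                   key=lambda lab: sum((a - b) ** 2
                                       for a, b in zip(x, self.centroids[lab])))

clf = StreamClassifier(dim=2)
for x, y in [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]:
    clf.learn(x, y)
pred = clf.predict((0.2, 0.4))
```

The paper's algorithm adds the pieces this sketch omits: an internal quality judgment on models built from short labelled blocks, and an amortized O(1) bound over the whole stream.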
In adversarial settings, there are those who wish to conceal their existence, properties and activities from data analysis. This substantially changes the knowledge discovery process -- finding a model that best "fits" the data is unhelpful because it provides adversaries with predictable ways to hide, and ways to manipulate. We survey some of the implications for algorithms and process, and suggest some open problems.
Several models for deception in text, based on changes in usage frequency of certain classes of words, have been proposed. These are empirically derived from settings in which individuals are asked to lie or be truthful in freeform text. We consider the problem of detecting deception in testimony, where the content generated must necessarily be responsive to questions, where there is the opportunity for immediate followup if the possibility of deception is detected by the questioner, and where those who have reasons to be deceptive have time and motivation to rehearse potential answers.
Using the testimony to the Gomery Commission, a situation in which many witnesses had some motivation to be deceptive, we propose and validate a model for deception in testimony. We achieve substantial (80%) agreement with media estimates of who was motivated to testify in a deceptive way.
As individuals interact on larger and larger scales, what makes up their identity also has to expand. In a village, only a single name is enough. In a country, other attributes such as a social insurance or taxpayer number become necessary. Across the world, a multinational hotel chain may want to identify the same individual whenever he or she stays at one of their hotels, but there is no single global identifier that makes this possible.
Because identities are important in so many interactions, not all of them known to and supervised by individuals, there are considerable economic and privacy risks when identity information is misused. Today, the standard solution to this problem is to mandate legal or policy rules that restrict the flow and use of identity information.
This solution is starting to fail for two reasons. First, and most importantly, new developments in data mining and data fusion allow identities to be constructed from data that has not previously been considered identifying. Systems often do not control this kind of data as tightly as traditionally identifying data. Second, information that was supposed to be controlled is increasingly released accidentally. Once this has been done, there is no way to call it back, but also no way (short of Witness Protection Programs) to create fresh identities for individuals who have been compromised.
We design and describe a new approach to this problem, based on the creation and use of artificial identities, called "personas". These identities can only be traced to the individuals who created them under tightly controlled circumstances, but personas can act for individuals in almost all situations, from ecommerce to border crossing.
Different records of a dataset are often collected in different locations. Increasingly, different attributes of the same record are collected in different locations as well. Astrophysical data are an important example of this class since different instruments collect data about the same objects in different wavebands; instruments also collect data about different objects, for example because of their relative positions. We present a lightweight distributed prediction algorithm based on local predictor construction on partitions of a dataset, even partitions by attributes, followed by global voting. The computation and communication requirements of the algorithm are compatible with distributed systems and grids; privacy is preserved because raw data never leave local sites; and prediction accuracy is comparable with a centralized algorithm using the same base predictor.
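A toy sketch of the local-predictors-plus-global-voting scheme, with per-class centroids standing in for whatever base predictor each site actually uses; only votes, never raw attributes, cross site boundaries. The waveband framing and data are invented:

```python
def train_local(rows, labels):
    """Local step: per-class centroids over the attributes this site holds,
    standing in for an arbitrary base predictor."""
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    return {y: [sum(col) / len(col) for col in zip(*rs)]
            for y, rs in by_class.items()}

def local_vote(model, row):
    """Each site predicts from its own attributes only."""
    return min(model, key=lambda y: sum((a - b) ** 2
                                        for a, b in zip(row, model[y])))

def global_predict(site_models, site_rows):
    """Global step: majority vote over local predictions; raw attributes
    never leave their sites, only votes do."""
    votes = [local_vote(m, r) for m, r in zip(site_models, site_rows)]
    return max(set(votes), key=votes.count)

labels = ["bright", "bright", "dim", "dim"]
site1 = [(10.0,), (9.0,), (1.0,), (2.0,)]   # e.g. one waveband's attributes
site2 = [(8.0,), (9.0,), (2.0,), (1.0,)]    # e.g. another waveband's
models = [train_local(site1, labels), train_local(site2, labels)]
pred = global_predict(models, [(9.5,), (8.5,)])
```

The communication cost is one vote per site per query, which is what makes the scheme compatible with distributed systems and grids.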
Searching for words on a watchlist is one way in which large-scale surveillance of communication can be done, for example in intelligence and counterterrorism settings. One obvious defense is to replace words that might attract attention to a message with other, more innocuous, words. For example, the sentence "the attack will be tomorrow" might be altered to "the complex will be tomorrow", since "complex" is a word whose frequency is close to that of "attack".
Such substitutions are readily detectable by humans since they do not make sense. We address the problem of detecting such substitutions automatically, by looking for discrepancies between words and their contexts, and using only syntactic information. We define a set of measures, each of which is quite weak, but which together produce per-sentence detection rates around 90% with false positive rates around 10%. Rules for combining per-sentence detection into per-message detection can reduce the false positive and false negative rates for messages to practical levels. We test the approach using sentences from the Enron email and Brown corpora, representing informal and formal text respectively.
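A toy version of the context-discrepancy idea, much cruder than the paper's set of measures: flag words whose immediate bigram contexts never occur in a background corpus. The background text and sentence are invented:

```python
from collections import Counter

def bigrams(text):
    """Bigram counts from a background corpus."""
    w = text.lower().split()
    return Counter(zip(w, w[1:]))

def oddness(sentence, background):
    """Per-word oddness: fraction of the word's immediate contexts (previous
    and next word) never seen in the background corpus. A substituted word
    tends to sit in contexts it has never occurred in."""
    w = sentence.lower().split()
    scores = {}
    for i, word in enumerate(w):
        ctx = []
        if i > 0:
            ctx.append((w[i - 1], word))
        if i < len(w) - 1:
            ctx.append((word, w[i + 1]))
        scores[word] = sum(background[c] == 0 for c in ctx) / len(ctx)
    return scores

bg = bigrams("the attack will happen soon . the attack will be swift . "
             "a complex is a group of buildings .")
s = oddness("the complex will be tomorrow", bg)
```

Any single measure of this kind is weak and noisy; the paper's point is that combining several weak measures per sentence, and then combining sentences per message, is what brings the error rates down to practical levels.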
We address the problem of clustering of data that is vertically partitioned, that is where local sites hold some of the attributes of all of the records. This situation is natural when data is collected by channels that are physically separated. For distributed clustering, we design a technique that computes global clusters from local clusters computed at each site, without requiring either large amounts of storage or computation at the center. The technique is simple, outperforms a centralized algorithm for most datasets, distributes computation and data access, and preserves the privacy of each local site.
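A deliberately simple sketch of the two-stage scheme: each site clusters its own attribute(s), and the center combines only the local cluster labels into global clusters. The one-dimensional gap heuristic used as the local clusterer is an assumption for illustration, not the paper's technique:

```python
def local_cluster_1d(values, threshold):
    """Toy local clusterer: sort the site's single attribute and start a
    new cluster whenever the gap exceeds threshold; returns a label per
    record, in the original record order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    label = 0
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > threshold:
            label += 1
        labels[cur] = label
    return labels

def combine(local_labels):
    """Center step: a global cluster is a distinct combination of local
    labels; only labels, never raw attribute values, leave the sites."""
    keys = list(zip(*local_labels))
    ids = {}
    return [ids.setdefault(k, len(ids)) for k in keys]

site_a = [0.1, 0.2, 5.0, 5.1]       # one attribute held at site A
site_b = [10.0, 10.2, 10.1, 20.0]   # another attribute held at site B
g = combine([local_cluster_1d(site_a, 1.0), local_cluster_1d(site_b, 1.0)])
```

The center's work is proportional to the number of records times the number of sites, not the number of attributes, and privacy is preserved because each site ships only its label vector.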
Intelligence analysts construct hypotheses from large volumes of data, but are often limited by social and organizational norms and their own preconceptions and biases. The use of exploratory data-mining technology can mitigate these limitations by requiring fewer assumptions. We present the design of the Athens system, which discovers novel information, relative to a specified set of existing knowledge, in large information repositories such as the world wide web. We illustrate the use of the system by starting from the terms "al Qaeda" and "bin Laden" and running the Athens system as if on September 12th, 2001. This provides a picture of what novel information could have been known at the time. This is of some intrinsic interest, but also serves to validate the performance of the system since much of this novel information has been discovered by conventional means in the intervening years.
Graph data captures connections and relationships among individuals, and between individuals and objects, places, and times. Because many of the properties of graphs are emergent, they are resistant to manipulation by adversaries. This robustness comes at the expense of more-complex analysis algorithms. We describe several approaches to analysing graph data, illustrating with examples from the relationships within al Qaeda.
We consider four properties by which intercepted messages can be selected for deeper analysis: their external properties, their content, their authorship, and the mental state of their authors. We argue that, rather than trying to differentiate directly between "good" messages and "bad" messages, it is better to use a two-pronged approach, where a simple detection scheme triggers a reaction in authors of "bad" messages. This reaction is easier to detect than the original difference. We also suggest that differentiation is more effective when it is done for sets of messages, rather than on a message by message basis.
It is widely believed that users choose meaningful tags; that the combinations of these tags from multiple users helps to elicit meaning for tagged objects; and that implicit communities can be discovered among the users who use a particular tag or tag a particular object. We examine data on tagging practices from the Delicious web site (delicious.com) and find that there is little support for any of these beliefs. Although users individually do seem to have a small set of tags that they use in a controlled and effective way, they also use very large sets of other tags in a much more haphazard way. These poorly managed tags obscure much of the collective sense making and implicit community structure. We derive some suggestions for improving collaborative tagging systems.
New web technologies, such as the Semantic Web and the Social Web, have changed the way services and information are accessed. These technologies allow more usable and interoperable services to be realized. They help service providers to reach more individuals. Automatic service discovery and invocation can also benefit from these technologies.
Unfortunately, several threats to security, privacy, and trust come along with these benefits. For example, making services interoperable increases the chance of profiling individuals, which constitutes a privacy threat. Security measures, such as access control, should handle the potential threats of opening previously encapsulated web services.
We present a new access-control model, guarantee-based access control. The new model constructs individuals' certificates based on guarantees rather than attributes. The guarantees are then used as the basis for access-control decisions. This allows access control to be fine-grained, and avoids the burden of managing roles. The model suits the open nature of the new web technologies. It resists threats to individuals' privacy, such as profiling. The model also permits access rights to be based not only on one individual, but on a set of individuals in different relationships.
Protecting privacy is often conceived and implemented as a matter of controlling the flow of identity and identity-related information. This, for instance, is the strategy of much privacy legislation and business privacy policy. Such approaches are increasingly weakened by the ability of data-mining techniques to fuse these partial identities into a complete identity, even using attributes that are not normally considered to be identifying.
We propose instead that the solution to protecting privacy is the use of artificial identities, generated robustly using cryptographic techniques and managed by individuals as they wish (with ultimate traceability when required by, for example, law enforcement). The existence of a framework that can protect privacy even in the presence of large-scale data collection and analysis makes it possible for individuals and their agents to participate in online activity without the reluctance that is increasingly associated with current approaches.
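One way such an artificial identity might be derived, sketched here with a keyed hash (an assumption for illustration; the paper's cryptographic construction is not reproduced, and key escrow below stands in for the traceability requirement):

```python
# Hedged sketch: a keyed hash of a real identity yields a stable pseudonym
# that only the key holder can link back to the underlying individual.
import hmac, hashlib

def pseudonym(real_identity, secret_key):
    return hmac.new(secret_key, real_identity.encode(),
                    hashlib.sha256).hexdigest()[:16]

# Key escrowed with a trusted authority would permit ultimate traceability.
key = b"escrowed-with-trusted-authority"
alias = pseudonym("alice@example.com", key)
print(alias)
print(alias == pseudonym("alice@example.com", key))  # stable across uses
```

An individual could manage several such pseudonyms, one per context, so that data collected in different settings cannot be fused into a complete identity.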
Several models for deception in text, based on changes in usage frequency of certain classes of words, have been proposed. These are empirically derived from settings in which individuals are asked to lie or be truthful in freeform text. We consider the problem of detecting deception in testimony, where the content generated must necessarily be responsive to questions, where there is the opportunity for immediate followup if the possibility of deception is detected by the questioner, and where those who have reasons to be deceptive have time and motivation to rehearse potential answers. Using the testimony to the Gomery Commission, a situation in which many witnesses had some motivation to be deceptive, we propose and validate a model for deception in testimony. We achieve substantial (80%) agreement with media estimates of who was motivated to testify in a deceptive way.
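Word-class frequency models of this family can be sketched as follows; the word lists, classes, and weights below are illustrative placeholders, not the validated model from the paper:

```python
# Hedged sketch: scores text by the relative frequency of a few word
# classes. Real models use much larger, empirically derived word lists.

FIRST_PERSON = {"i", "me", "my", "mine"}          # lower under deception
EXCLUSIVE = {"but", "except", "without", "unless"}  # lower under deception
NEGATIVE = {"hate", "worthless", "enemy", "never"}  # higher under deception

def deception_score(text):
    words = text.lower().split()
    n = max(len(words), 1)
    fp = sum(w in FIRST_PERSON for w in words) / n
    ex = sum(w in EXCLUSIVE for w in words) / n
    neg = sum(w in NEGATIVE for w in words) / n
    # Higher score = more deception-like: penalize first-person and
    # exclusive words, reward negative-emotion words.
    return neg - fp - ex

print(deception_score("i never did it but i was there"))
```

Ranking documents by such a score, rather than thresholding it, is what allows the most deception-like testimony to be selected for closer examination.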
We show that two mainstream prediction techniques, support vector machines and decision trees, are easily subverted by inserting carefully chosen training records. Furthermore, the relationship between the properties of the inserted record(s) and the regions that the predictor will subsequently misclassify can be inferred, so desired misclassifications can be forced. In adversarial settings, it is plausible that manipulation of this kind will be attempted, so this has implications for the design of prediction systems and the use of off-the-shelf technology, especially as support vector machines are one of the most powerful prediction algorithms known.
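The effect can be illustrated with a toy 1-nearest-neighbour classifier standing in for the SVMs and decision trees studied here: one carefully placed training record flips the prediction for a chosen target point.

```python
# Toy illustration of training-set poisoning (1-NN used for brevity;
# not the SVM/decision-tree analysis of the paper).

def nn_predict(train, point):
    # train: list of ((x, y), label); returns label of nearest record
    nearest = min(train, key=lambda t: (t[0][0] - point[0]) ** 2
                                       + (t[0][1] - point[1]) ** 2)
    return nearest[1]

train = [((0, 0), "benign"), ((4, 4), "malicious")]
target = (1, 1)
print(nn_predict(train, target))   # "benign"

# Attacker inserts a single record near the target with the wrong label.
train.append(((1.2, 1.2), "malicious"))
print(nn_predict(train, target))   # "malicious"
```

The paper's point is stronger: for SVMs and decision trees, the attacker can predict which regions an inserted record will corrupt, making the misclassification targeted rather than accidental.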
In adversarial settings, records associated with those who want to conceal their existence or activities tend to be unusual because of their illicit status, but not too unusual because of efforts to make them seem as normal as possible. Clusters of such records will not be single outliers or even outlying clusters, but rather small clusters on the fringes of normal clusters. Such structures are undetectable by many mainstream clustering algorithms, for example those based on distance and convexity. We show that even more sophisticated clustering algorithms are easily subverted by the addition of only a few carefully chosen records. Robust clustering in adversarial settings will require the development of more sophisticated algorithms tailored to this domain.
In a mobile publish/subscribe paradigm, user service discovery and recommendation requires matching user preferences with properties of published services. This is a difficult problem when users are mobile, wirelessly connected, and dynamically roaming. The available data is very large, and the matching must be computed in real time. Existing heuristics are quite ineffective. We propose novel algorithms that use singular value decomposition as a dimension-reduction technique. We introduce "positive" nearest-neighbor matching to find services whose attribute values exceed those of a new user subscription. For n services and m preference attributes, reasonable matches can be found in time O(m log n), using O(nm) storage.
In computational grids, resource discovery and selection requires matching job requirements to available resources. The properties of each job must be matched with those of existing resources as accurately as possible. The difficulty of the problem increases with the number of attributes for each resource/job. Existing heuristics are quite ineffective.
We propose novel algorithms that use singular value decomposition as a dimension-reduction technique, and match in a collection of lower-dimensional spaces. We introduce "positive" nearest-neighbor matching, the problem of finding resources whose attribute values exceed those of a given job. Singular value decomposition is, in some sense, an obvious approach but requires careful application to work effectively in this setting. Performance and quality of matches are measured using datasets representing applications in the grid. We show that reasonable matches can be found in time O(m log n), using O(nm) storage space, where n is the number of resources and m is the number of attributes for which a job might have requirements.
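The "positive" matching criterion, shown here without the SVD projection step (a simplification; attribute names and the tie-breaking rule are our illustration), requires every resource attribute to meet or exceed the job's requirement:

```python
# Hedged sketch of positive nearest-neighbor matching. The real algorithm
# first projects into lower-dimensional spaces via SVD; here we match
# directly in the original attribute space for clarity.

def positive_matches(resources, job):
    # A resource qualifies only if it meets or exceeds every requirement.
    return [r for r in resources
            if all(v >= req for v, req in zip(r, job))]

def best_match(resources, job):
    matches = positive_matches(resources, job)
    if not matches:
        return None
    # Nearest positive match: least total over-provisioning.
    return min(matches, key=lambda r: sum(v - req for v, req in zip(r, job)))

resources = [(4, 8, 100), (2, 16, 50), (8, 32, 200)]  # e.g. cpu, mem, disk
print(best_match(resources, (2, 8, 50)))  # -> (2, 16, 50)
```

This brute-force version costs O(nm) per query; the SVD-based scheme is what brings query time down to O(m log n).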
Graph data represents relationships, connections, or affinities. Innocent relationships produce repeated, and so common, substructures in graph data. We present techniques for discovering anomalous substructures in graphs, for example small cliques, nodes with unusual neighborhoods, or small unusual subgraphs, using extensions of spectral graph techniques commonly used for clustering. Although not all anomalous structure represents terrorist or criminal activity, it is plausible that all terrorist or criminal activity creates anomalous substructure in graph data. Using our techniques, unusual regions of a graph can be selected for deeper analysis.
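As a much simpler stand-in for the spectral techniques described here (our illustration, not the paper's method), one can flag nodes whose degree greatly exceeds the mean degree of their neighborhood:

```python
# Hedged sketch: a degree-based "unusual neighborhood" check. The paper's
# approach uses extensions of spectral graph techniques instead.

def unusual_nodes(graph, threshold=2.0):
    deg = {v: len(nbrs) for v, nbrs in graph.items()}
    flagged = []
    for v, nbrs in graph.items():
        if not nbrs:
            continue
        mean_nbr_deg = sum(deg[u] for u in nbrs) / len(nbrs)
        # Flag nodes whose degree greatly exceeds their neighbors' mean.
        if deg[v] - mean_nbr_deg >= threshold:
            flagged.append(v)
    return flagged

# A hub attached only to degree-1 leaves stands out from its neighborhood.
graph = {"hub": ["a", "b", "c", "d"],
         "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"]}
print(unusual_nodes(graph))  # -> ["hub"]
```

Spectral methods generalize this idea: embedding nodes with Laplacian eigenvectors makes small cliques and unusual subgraphs separable, not just individual high-degree nodes.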
Manipulating semistructured data, such as XML, does not fit well within conventional programming languages. A typical manipulation requires finding all occurrences of a structure matching a structured search pattern, whose context may be different in different places, and both aspects cause difficulty. If a special-purpose query language is used to manipulate XML, an interface to a more general programming environment is required, and this interface typically creates a typing mismatch and incurs runtime overhead for type checking. However, adding XML manipulation to a general-purpose programming language has proven difficult because of problems associated with expressiveness and typing.
In this paper we present an alternative approach that handles many kinds of patterns within an existing strongly-typed general-purpose programming language, bondi. The key ideas are to express complex search patterns as structures of simple patterns, to pass these complex patterns as parameters to generic data-processing functions, and to traverse heterogeneous data structures by a generalized form of pattern matching. These ideas are made possible by the language's support for the pattern calculus, which enables path and pattern polymorphism. With this approach, adding a new kind of pattern is a matter of programming, not language design.
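Setting bondi's type system aside, the core idea of passing patterns as parameters to a generic traversal can be approximated in Python (our analogue for illustration, not the pattern calculus itself; patterns are modelled as predicates):

```python
# Hedged sketch: a generic traversal that takes a search pattern as a
# parameter and collects all matching substructures in heterogeneous
# nested data, regardless of where or how deeply they occur.

def find_all(data, pattern):
    matches = []
    if pattern(data):
        matches.append(data)
    if isinstance(data, dict):
        for v in data.values():
            matches.extend(find_all(v, pattern))
    elif isinstance(data, (list, tuple)):
        for v in data:
            matches.extend(find_all(v, pattern))
    return matches

doc = {"book": {"title": "Patterns",
                "chapters": [{"title": "Intro"}, {"title": "Calculus"}]}}
# Pattern: any dict that has a "title" key, whatever its context.
print(find_all(doc, lambda d: isinstance(d, dict) and "title" in d))
```

In bondi, path and pattern polymorphism let such traversals be written once, statically typed, over any data shape; here the dynamic typing hides the part of the problem the paper actually solves.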