Workshop on Link Analysis, Counterterrorism and Security

23rd April, 2005
Sutton Place Hotel
Newport Beach, California, USA
at the SIAM International Conference on Data Mining.

Workshop Co-chairs:

David Skillicorn
School of Computing
Queen’s University
Kingston, Canada
skill@cs.queensu.ca

Kathleen Carley
Institute for Software Research International
School of Computer Science
Carnegie Mellon University
Kathleen.carley@cmu.edu

Program Committee:

Program

8:25-8:30 Introduction

8:30-10:00

1. J. Diesner, K.M. Carley, Exploration of Communication Networks from the Enron Email Corpus 3-14

2. A. Chapanond, M.S. Krishnamoorthy, Bülent Yener, Graph Theoretic and Spectral Analysis of Enron Email Data 15-22

3. C. Priebe, Scan statistics on Enron Graphs 23-32

Coffee break

10:30-12:00

4. A. McCallum, A. Corrada-Emmanuel, X. Wang, The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks, with Application to Enron and Academic Email 33-44

5. M.W. Berry, M. Browne, Email Surveillance Using Nonnegative Matrix Factorization 45-54

6. P.S. Keila and D.B. Skillicorn, Structure in the Enron Email Dataset 55-64

Lunch

1:45-2:45

7. S. Lehmann, Live and Dead Nodes 65-70

8. Y. Duan, J. Wang, M. Kam and J. Canny, A Secure Online Algorithm for Link Analysis on Weighted Graph 71-81

Coffee break

3:15-4:15

9. H.W. Lauw, E.-P. Lim, T.-T. Tan, H.-H. Pang, Mining Social Network from Spatio-Temporal Events 82-93

10. B. Malin, Unsupervised Name Disambiguation via Social Network Similarity 93-102

Introduction

This workshop is the third on this topic at the SIAM International Conference on Data Mining. Link analysis, social network analysis, dynamic network analysis, and text analysis have long been used to understand how information flows in organizations, how people form relationships and connections, and how these affect decision making. These techniques have also been applied to understanding pathologies in organizations: how collusion and fraud reveal themselves in the links within and among organizations.

The growth of the Salafist terrorist movement, and in particular the attacks of September 11th, 2001, have turned research in this area from a largely academic pursuit into an important part of many countries' defensive technology. Such techniques are of value in identifying key actors, locating experts, identifying unique patterns of transactions, characterizing the shape of and differences in terrorist groups, and locating areas of expertise.

It has been clear from the start that successfully discovering terrorism, fraud, or other covert activities requires analyzing large, complex, and messy datasets. Furthermore, the patterns in these datasets are usually small in scale and hard to pick out against the background of normal everyday behavior. As Ted Senator has said, the problem is like trying to find a needle in a haystack of needle pieces. This creates difficult new problems for analysis techniques: pragmatic problems caused by the sheer size and complexity of the data, and discrimination problems of determining when some small variation in the structure of the data is potentially interesting. The situation is further complicated by the fact that such data is inherently messy, reflecting the vast array of original data sources (e.g., news plus web plus email), biases in data collection, and intentional ambiguities (such as false identities).

There are two broad kinds of analysis. The first looks at the properties of individual objects, perhaps people or messages or journeys, and tries to detect those that are anomalous in some useful way. The second looks at the relationships between objects, and tries to find patterns in their connections that are anomalous. There are also two broad kinds of approach. The first focuses on streams of data and tries to locate anomalies as new data arrives. The second treats the data as a single block in time, a snapshot of the world, and tries to locate anomalies within this snapshot. The research reported here is split between the two kinds of analysis, but focuses more on the data as a snapshot than as a stream.

One of the problems for academic researchers has been the availability of appropriate datasets against which to try techniques. One such dataset is the online movie database, but this is a stylized archive of transactions and does not reflect the vagaries of everyday communication. In the wake of the collapse of Enron, a large set of email records was released by the U.S. Federal Energy Regulatory Commission. This created an opportunity for researchers to try their techniques in a realistic way on a database of actual everyday transactions and to compare the results. This workshop is one opportunity to do so.

The Enron emails are of interest for several reasons. First, it is clear that email is not quite like either spoken or formal written communication. Email tends to occupy a middle ground: less formal than other forms of writing, but more formal than speech. The Enron emails provide a chance to investigate, empirically, what the language of email is like. Second, emails have a sender and one or more receivers, and so represent a form of connection between people. It is natural to build various forms of graphs to capture these connections, and then to see what they can tell us about how communication works, and how it connects to relationships and to power. This is complicated by the fact that: a) the senders/receivers may have multiple identities (email ids) and b) the receivers may be groups such as mailing lists. In addition, each sender and receiver can be characterized by the domain from which they are sending and sometimes by the role they play within a company (such as president or CEO).
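The graph construction described above can be illustrated with a minimal sketch. It assumes messages are already parsed into (sender, receivers) pairs, that an alias table and a list of mailing-list addresses are available, and that a simple message count is an adequate edge weight. The addresses, the table contents, and the function names are all hypothetical, chosen only for illustration; they are not taken from the Enron data or from any of the workshop papers.

```python
from collections import Counter

# Hypothetical toy messages: (sender, [receivers]). Not taken from the Enron data.
MESSAGES = [
    ("alice@corp.example", ["bob@corp.example", "all-staff@corp.example"]),
    ("a.smith@corp.example", ["bob@corp.example"]),
    ("bob@corp.example", ["alice@corp.example", "carol@other.example"]),
]

# Assumed alias table mapping each extra email id to one canonical id.
ALIASES = {"a.smith@corp.example": "alice@corp.example"}

# Assumed set of addresses known to be mailing lists rather than people.
MAILING_LISTS = {"all-staff@corp.example"}


def canonical(address, aliases):
    """Normalize case and whitespace, then map an alias to its canonical id."""
    address = address.strip().lower()
    return aliases.get(address, address)


def domain(address):
    """Return the domain part of an address, one way to characterize a node."""
    return address.rsplit("@", 1)[-1]


def build_email_graph(messages, aliases=None, mailing_lists=None):
    """Return a Counter mapping directed (sender, receiver) edges to message counts."""
    aliases = aliases or {}
    mailing_lists = mailing_lists or set()
    edges = Counter()
    for sender, receivers in messages:
        src = canonical(sender, aliases)
        # A set avoids double counting when two ids of one person get the same message.
        targets = {canonical(r, aliases) for r in receivers}
        for dst in targets:
            # Skip mailing lists and self-loops; both choices are assumptions.
            if dst in mailing_lists or dst == src:
                continue
            edges[(src, dst)] += 1
    return edges


graph = build_email_graph(MESSAGES, ALIASES, MAILING_LISTS)
```

Here the two ids for the same sender collapse into one node, so the edge from alice@corp.example to bob@corp.example has weight 2. Dropping mailing lists is only one option; expanding a list into its members, where membership is known, would be another.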

Third, emails are written for a purpose: they are about something. Examining the content of real emails can tell us how information flows in an organization, how information reflects relationships, and also how word usage and style might reflect relationships and power.

Fourth, emails are timestamped, so it is possible to look at how email usage changes with time. This is particularly interesting because Enron was undergoing a change in leadership, and the fraudulent scheme was unravelling, during the period over which these emails were captured. Connections can perhaps be made between patterns in the emails and activities in the outside world.

The results presented here are all preliminary. The sheer size and complexity of the dataset resulted in massive amounts of time being spent simply cleaning the data: for example, eliminating copies of messages, identifying when the same person had multiple ids, and so on. Researchers looking at the data were often forced to rebuild or extend their software to account for the scale of the data or for features in the dataset that were unanticipated (such as multiple email ids for the same person). As a result, the research reported here is as much about technology as it is about Enron, perhaps more so. Despite the preliminary level of the included research, much of interest has been observed. This is due to both the importance of the data and the value added by the new methodologies.
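As an illustration of the kind of cleaning step mentioned above, the following minimal sketch removes copies of messages by hashing a normalized form of each one. It assumes each message is a dictionary with sender, receivers, subject, and body fields, and that two messages are copies when these fields match after case and whitespace normalization. The field names and the matching rule are assumptions made for this example, not the procedure used by any of the workshop papers.

```python
import hashlib


def message_key(sender, receivers, subject, body):
    """Hash a normalized form of a message so that exact copies share a key."""
    parts = [
        sender.strip().lower(),
        ",".join(sorted(r.strip().lower() for r in receivers)),
        " ".join(subject.split()).lower(),
        " ".join(body.split()),  # collapse runs of whitespace in the body
    ]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def deduplicate(messages):
    """Keep the first occurrence of each message, preserving input order."""
    seen = set()
    unique = []
    for msg in messages:
        key = message_key(
            msg["sender"], msg["receivers"], msg["subject"], msg["body"]
        )
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique


# Hypothetical toy data: the second message is a copy of the first that
# differs only in letter case, receiver order, and whitespace.
SAMPLE = [
    {"sender": "alice@corp.example",
     "receivers": ["bob@corp.example", "carol@corp.example"],
     "subject": "Q3 numbers", "body": "See attached.\n"},
    {"sender": "Alice@Corp.Example",
     "receivers": ["carol@corp.example", "bob@corp.example"],
     "subject": "Q3  numbers", "body": "See   attached."},
    {"sender": "bob@corp.example",
     "receivers": ["alice@corp.example"],
     "subject": "Re: Q3 numbers", "body": "Thanks."},
]

cleaned = deduplicate(SAMPLE)
```

A rule this strict misses near-copies, such as forwarded messages with added headers; catching those would need a looser similarity measure, at the cost of more false matches.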

The papers in this session represent important advances in data mining that, in many cases, merge machine learning techniques, link analysis, and social network analysis into new capabilities. To be sure, reading these papers provides some understanding of the massive changes Enron was undergoing. However, the insights here are nowhere near as striking theoretically as the increase in capability afforded by the new methodologies. That being said, we note that the methodologies still need to be improved to be faster and more robust in the face of messy data. Outstanding issues remain, such as those surrounding automated identification of aliases, experts, and areas of discussion, automated ontology creation, and automated monitoring of streaming data.

We anticipate that this email dataset will continue to be studied for many years. It represents a unique point in American history and an unprecedented level of access to daily information. We expect that future work will move to using a unified, cleaned dataset. We also expect that future work will advance not just the methodologies but also our understanding of information flow, corporate planning, and corporate decision making. Thus, while the current workshop is very methodologically focused, we anticipate that future ones will be at least as theoretically focused.

We would like to thank the members of the Program Committee for their support in publicizing the workshop, and for helping us to review the submissions. We would also like to thank others who took the time to review submissions, necessarily with short timelines. And, of course, we would also like to thank the researchers who submitted papers. We received many submissions and, given the limited time available for the workshop, had to reject many worthy submissions.

Kathleen Carley
David Skillicorn