Other versions of the Enron data

This page contains processed versions of the Enron dataset that may be more directly useful than the raw data.

The data are provided as representations of an email-word sparse matrix. The rows correspond to the emails from the original dataset, in time stamp order, and the columns correspond to nouns, in descending order of frequency across the entire corpus. Each matrix entry gives the frequency of the corresponding noun in the corresponding email. The following files are available:
- Zipped matrix containing (word, frequency) pairs (53MB) (empty emails are represented by (d,-1))
- Zipped matrix containing (word index, frequency) pairs (34MB) (empty emails are represented by (-1,-1))
- List of email files in directory traversal order
- List of emails, sorted by time stamp, including index into file names (note that some messages were sent in year 1 and some were sent in years that haven't happened yet; the first sensible time stamp appears to be from December 1979 and the last from February 2004)
- Word list in decreasing frequency order
- Word list in alphabetical order (note that about 1000 strings at the beginning are not English words)
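
As a rough illustration of how the (word index, frequency) representation might be read, here is a minimal Python sketch. The on-disk layout is not described on this page, so the sketch assumes one line of text per email, each holding zero or more pairs written as `(index,frequency)`; the function name `parse_rows` and the sample lines are hypothetical. The only detail taken from the file descriptions above is that an empty email is marked by `(-1,-1)`.

```python
import re

# ASSUMPTION: one line per email, each line holding zero or more
# "(word_index,frequency)" pairs. The real file layout may differ.
PAIR = re.compile(r"\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)")


def parse_rows(lines):
    """Return {email_row: {word_index: frequency}}, skipping empty emails."""
    matrix = {}
    for row, line in enumerate(lines):
        entries = {}
        for word, freq in PAIR.findall(line):
            word, freq = int(word), int(freq)
            if word == -1 and freq == -1:
                continue  # sentinel pair marking an empty email
            entries[word] = freq
        if entries:
            matrix[row] = entries
    return matrix


# Invented sample lines, for illustration only.
sample = ["(0,3) (5,1)", "(-1,-1)", "(2,2)"]
rows = parse_rows(sample)
```

Storing each row as a dictionary keeps the matrix sparse: only nonzero entries take up memory, and the row number stays aligned with the time-stamp-ordered email list because empty emails are skipped rather than renumbered.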
Nouns were extracted using the Monty tagger, which tends to treat words it does not recognise as nouns. As a result, about a thousand strange strings appear at the beginning of the alphabetical word list.
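
One possible way to set aside the strange strings before using the word list is sketched below in Python. It is only a heuristic and not part of the original data preparation: it assumes the word list has already been read into a list of strings, and it treats any token that is not purely alphabetic ASCII as suspect. The function name `split_word_list` and the sample tokens are invented for illustration.

```python
def split_word_list(words):
    """Split tokens into (plausible, suspect) by a purely alphabetic test."""
    plausible, suspect = [], []
    for w in words:
        token = w.strip()
        if not token:
            continue  # ignore blank lines
        # ASSUMPTION: a real noun contains only ASCII letters.
        if token.isascii() and token.isalpha():
            plausible.append(token)
        else:
            suspect.append(token)
    return plausible, suspect


# Invented sample tokens, for illustration only.
sample_words = ["!!", "0x1f", "a1b2", "energy", "gas", "meeting"]
plausible, suspect = split_word_list(sample_words)
```

A test like this will not catch alphabetic strings that are simply not English words, so checking the remaining tokens against a dictionary may still be worthwhile.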
Thanks to Nikhil Vats for doing the data preparation. Please report errors and corrections to David Skillicorn.