Other forms of the Enron data

This page contains processed versions of the Enron dataset that may be more directly useful than the original.

  1. Representations of an email-word sparse matrix. The rows correspond to the emails from the original dataset, in timestamp order, and the columns correspond to nouns, in descending order of their frequency in the entire corpus. Each matrix entry gives the number of times the corresponding noun occurs in the corresponding email. The following files are available:

    Nouns were extracted using the Monty tagger, which tends to treat words it doesn't recognise as nouns. As a result, there are about a thousand strange strings at the beginning of the noun list.

    Thanks to Nikhil Vats for doing the data preparation. Please report errors and corrections to David Skillicorn.
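
The matrix layout described in item 1 can be illustrated with a minimal Python sketch. It is not the preparation script used for these files, and it makes no assumption about their on-disk format. The toy emails, the function name build_email_word_matrix, and the alphabetical tie-breaking between equally frequent nouns are all hypothetical, and the nouns are assumed to have been extracted already (no tagger is run here).

```python
from collections import Counter

# Toy stand-in data (hypothetical): (timestamp, nouns already extracted).
emails = [
    (3, ["meeting", "gas", "gas"]),
    (1, ["gas", "price", "gas", "contract"]),
    (2, ["price", "meeting"]),
]


def build_email_word_matrix(emails):
    """Return (vocab, triples) for a sparse email-word count matrix.

    Rows are emails in timestamp order; columns are nouns in descending
    order of corpus frequency. Only non-zero entries are stored, as
    (row, column, count) triples.
    """
    # Rows: emails sorted by timestamp.
    ordered = sorted(emails, key=lambda e: e[0])

    # Columns: nouns by descending corpus frequency.
    # Breaking ties alphabetically is an assumption made for this sketch.
    corpus_counts = Counter(n for _, nouns in ordered for n in nouns)
    vocab = sorted(corpus_counts, key=lambda n: (-corpus_counts[n], n))
    col = {noun: j for j, noun in enumerate(vocab)}

    # Entries: per-email noun counts, listed in column order within each row.
    triples = []
    for i, (_, nouns) in enumerate(ordered):
        counts = Counter(nouns)
        for noun in sorted(counts, key=lambda n: col[n]):
            triples.append((i, col[noun], counts[noun]))
    return vocab, triples


vocab, triples = build_email_word_matrix(emails)
```

Storing only the non-zero (row, column, count) entries is what makes the representation sparse: most nouns do not occur in most emails, so the full matrix would be almost entirely zeros.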