Giter VIP home page Giter VIP logo

r-sample-enron's Introduction

Analyzing the Enron Email Corpus - Demo

This R file analyses some of the Enron Email Corpus. It produces 4 PDF files, each containing a graph displaying how different persons are connected through emails present in the corpus.

The Enron Email Corpus can be downloaded from http://www.cs.cmu.edu/~enron/ and the contents should be untared in the same directory as the R script, producing a folder named 'enron_mail_20110402'. The first line of the R script needs to be adjusted to the environment. A script that does all of this automatically is provided (see below).

The R script itself contains more comments with details on each step in the analysis.

Data selection

Due to constrains on time and resources, only a subset of the emails is analysed.

Only emails from 'Sent' or 'Sent_Items' folders are analysed.

This greatly reduces the number of emails. 'Sent' folders will not contain spam and will only contain email sent on purpose by an Enron employee, making it uncessary to filter emails for validity. 'Sent' folders will furthermore not contain duplicates (i.e. an email in the 'Sent' folder of one person will be present in the inbox of another person, but not in any other 'Sent' folder), relieving us of the work of identifying such duplicates and removing them.

The 'Sent' type folders are filtered out with the following command:

find . -type d  | grep -i sent | grep -v presentation > sent_folders

Only emails sent from one person to exactly one person are analysed

Although the data set contains many emails sent to several receipients, parsing the To: field of the emails was skipped for time constraints. Only emails sent directly to one receipient are kept.

Analysis: Mails sent between people whose mailboxes are in the data set

Graph: 00.most.mails.pdf

The first analysis looks only at emails that are sent between people whose mailbox is in the data set. To do that, only emails are kept that are sent to someone that has also sent a mail him/herself.

The graph shows connections between people where more than 150 emails have been sent. This number reduces the data set to only 21 people, a number that can still be plotted nicely. The thickness of each arch in the graph shows the volume of email exchanged. The size of each vertex is proportional to the number of different people that person has sent emails to overall.

Within the selected data, unfortunately most mails sent directly seem to be quite disconnected from each other. The connection between Kay Mann and Suzanne Adams is exceptionally strong.

Analysis: Important people

We quickly analyse some important people within the network. A paper by Shetty and Adibi suggests these names:

  • Louise Kitchen
  • Mike Grigsby
  • Greg Whalley
  • Scott Neal
  • Kenneth Lay

Mails to or from important people: Whole network

Graph: 01.important.people.pdf

We first simply plot the whole network of people that have sent mails to or received mails from one of the members in the list of important people. The graph shows some interesting properties. While there are a lot of people that only exchange emails with one of the important people (this is probably in part due to their inboxes not being in the data set), there is a group of people (in the center of the graph) that is connected to nearly all of the important people. (E.g. John Lavorato, Jeffrey Shankman, etc)

Mails to or from important people: Network of 'well connected' people

Graph: 02.important.people.well.connected.pdf

The graph shows only those people in the network that have more than 4 connections. Within the selected data set, this will be people that are connected to most of the 'important people' in the network. One can see that nearly everybody is connected to Louise Kitchen.

Mails to or from important people: Strength of connections between 'well connected' people

Graph: 03.important.people.well.connected.weights.pdf

The last graph shows the network of 'well connected' people with thickness of lines indicating the number of emails sent from one person to the other. The connection between Louise Kitchen and John Lavarato and Sally Beck is exceptionally strong. Furthermore, John Lavarato has quite strong connections to all of the 'important people' mentioned above.

Running the analysis from scratch

The analysis can be run from scratch, using only the R script and assuming that the relevant libraries are installed by using

chmod u+x run.sh
./run.sh

Sources

Source of Enron Corpus: http://www.cs.cmu.edu/~enron/

Please note: This a demo of R rather than a proper analysis. Due to time constraints, only a small part of the emails was analyzed. No conclusions should be drawn from the results without analyzing the whole dataset.

r-sample-enron's People

Contributors

mrdavid avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.