Giter VIP home page Giter VIP logo

data-platform-enron-emails's Introduction

Streaming Data Platform - Enron Emails V2 xml

ScreenShot


  • ENRON-CONNECTOR

Generates events feeding .ZIP files (.XML format) off a local folder linked to AWS ESB remotely. Transforms XML's document tag + the content of its .TXT file (regardless it's an txt attachment) into AVRO format in streaming and send them to Kafka.

Files are processed asynchronously, taking advantage of the Hazelcast's ExecuteService destributed. Allowing to scale it out as needed.


  • KAFKA & SCHEMA REGISTRY

Kafka streams Avro files. Each Avro represents an single email details + body message.

Schema Registry, tracks Avro schema versions. UI. http://ADVERTISER-IP-HOST>:3030


  • ENRON-CONSUMER on SPARK.

The streaming job running on Spark, feeds messages from Kafka. It uses the spark streaming api, gets rid of the .TXT attachments.

Calculates the number of word within the body message per RDD, cleaning static things, such as company signature, automatic words generated by the mail service, numbers, asteriks...

Figures out the email of each recipient and calculates the relevancy. 1 for TO and 0.5 for CC and BCC recipients.

Sinks the results in Cassandra, using the Cassandra connector for Spark.


  • CASSANDRA

Contains analitics tables.

mail table. Contains the mail details + number of words within a message in real time

select * from enron.mail;

recipients_state. Contains all the existing unique recipients + their total relevancy regardless of order, in real time.

select * from enron.recipients_state;

mail_word_avg Avarage length, in words, of the emails.

select * from enron.mail_word_avg;

recipients_total top 100 recipient email addresses

select * from enron.recipients_total;

  • ZEPPELIN

http://IP-HOST:8080

Data visualization, allowing business users, developers, analytics run their scripts.

To link to cassandra, go to Interpreter -> Cassandra, and set in Host your host ip.

To get started, create a note, click on the save button, in order to have available those interpreters. To start query, type %cassandra and underneath, the query.


Requiremnts:

  • JAVA 8, 7 and SCALA 2.10

  • Docker 1.+

  • Docker Compose 1.+

  • Apache maven 3.+

  • sshfs - mount a remote system -


Configuration:

  • Mounting a remote system to a local folder. - Description for AWS -

1.Mount an AWS ESB and restore the Enron Emails Snapshot.

2.In order to read those files in ESB we mount a folder in our local host remotely using sshfs.

$ sudo apt-get install sshfs

SSHFS is a very powerful tool but its commands are a bit naughty.

Login as Root user. Pay attention the full path of your public/private key and no slash at the end of your target local.

# sshfs -o identityfile=<full path of your private key>.pem <instance user>@<instance aws public dns>:/<aws esb mount path>/edrm-enron-v2/ /<target local folder>

  • Editing docker-compose.yml.

1.Avro schema registry needs an advertiser host, so, we replace the value of all the environment variables called KAFKA_SCHEMA_REGISTRY to http://ADVERTISER-IP-HOST:8081

2.For enron-connector to read files, in enron-connector container -> volumes, we replace the path prefix :/enron/input to the target local folder (you configured this at the sshfs step).


  • Building the jars. We have to build it separately due to enron-connector should be built with JAVA 8 and enron-consumer with JAVA 7 since it is written in SCALA 2.10.

(Assuming JAVA 8 is by default)

1.Enron connector:

$ mvn -f enron-connector\pom.xml clean install

2.Enron consumer:

$ export JAVA_HOME = "<JAVA 7 install folder location>"
$ mvn -f enron-consumer\pom.xml clean install
  • Running Everything

Execute the bash file with name start-data-platform.sh

$ ./start-data-platform.sh

###Results

Execute this job to calculate the result once the data has landed to cassandra. Update your local folder to be mounted if needed

docker exec -i -t spark-master bash spark-submit --class com.fjbecerra.sql.AggregationJob --master local[*] --driver-memory 1G --executor-memory 1G /app/enron-consumer-0.0.1-SNAPSHOT-jar-with-dependencies.jar 

Quetion 1. Average length, in words, of the emails

ScreenShot

Question 2. Top 100 recipient email addresses

ScreenShot

data-platform-enron-emails's People

Contributors

fjbecerra avatar

Stargazers

Armen avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.