Enron-Python-Flask-Cassandra-Pig

This Hortonworks example extracts topics from the Enron emails via TF-IDF, stores them in Cassandra, and serves them with Flask, with help from the Pygmalion project, CassandraStorage, and pycassa. It accompanies the blog post at <>.

Environment Setup

Edit and run env.sh to inform CassandraStorage about your local Cassandra instance.

Cassandra Setup

Install Cassandra according to the instructions in the post, and then create our schema by running cassandra.txt in the cassandra-cli.

Test Pycassa

Run test_pycassa.py to verify it works.

Get the Enron Emails

Grab the Enron emails at https://s3.amazonaws.com/rjurney_public_web/hadoop/enron.avro

Run our Pig Script

Run cassandra_enron.pig to extract topics from the email bodies and store them in Cassandra. Note: if you are running this example in local mode, you may want to adjust the LIMIT statement so it processes fewer emails. Processing the entire corpus on one machine will take a long time to finish. This is where the utility of Hadoop comes in :)
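For readers unfamiliar with TF-IDF, here is a minimal Python sketch of one common formulation (term frequency times the log of inverse document frequency). It is not a port of cassandra_enron.pig, which may weight or normalize terms differently. The function name tf_idf and the input shape (a dict of message_id to token list) are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score every term in every document.

    docs: dict mapping a document id (e.g. a message_id) to a list of tokens.
    Returns a dict mapping each document id to {term: score}.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency (share of the document) times inverse document frequency.
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

A term that appears in every document scores 0 under this formulation, which is why common words drop out of the topic list.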

Serve up our data in our app

Run index.py, then plug a message_id (which you can get via SAMPLE/LIMIT in Pig) into the URL in your favorite browser to see the top 20 topics, as determined by TF-IDF, published in a web service. Voilà!
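The "top 20 topics" selection can be sketched in plain Python as follows. This is an illustration only, not the code in index.py: the Cassandra lookup is replaced by an ordinary dict of term to score, and the function name top_topics is hypothetical.

```python
import heapq


def top_topics(scores, n=20):
    """Return the n highest-scoring (term, score) pairs, best first.

    scores: dict mapping a term to its TF-IDF score for one message.
    """
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])
```

In a web service, a route handler would look up the scores for the requested message_id, pass them through a function like this, and return the result as JSON.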
