topical-topology's Introduction

This repository contains tools for studying Twitter using probabilistic topic modelling.

Topic inference is done with MALLET, using latent Dirichlet allocation (LDA).

## Setup and usage

To get started, first unpack the MALLET archive:

    tar -xzf mallet-2.0.6.tar.gz

settings.py holds the global settings for all the scripts.

Then run these scripts in the following order:

1. getsnowball.py: finds a 'snowball' of users
2. getLog.py: logs the last 200 tweets of each user in the snowball
3. extract.py: extracts the tweets from the logs into a format usable by MALLET
4. infertopics.py: uses MALLET to infer topics on the tweets using LDA
5. preparedata.py: parses the data into numpy arrays and persists them as .npy files

Then use the analysis scripts, or write your own.
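The script order listed above could be driven from a single Python entry point. The following is only a sketch: `run_pipeline` and the `runner` parameter are names made up for illustration, and it assumes each script can be started with no arguments from the repository root.

```python
import subprocess
import sys

# Script order taken from the list in this README.
PIPELINE = [
    "getsnowball.py",
    "getLog.py",
    "extract.py",
    "infertopics.py",
    "preparedata.py",
]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script in order and return the names of those that completed.

    `runner` is injectable (hypothetical parameter) so the order can be
    checked without actually launching the scripts.
    """
    completed = []
    for script in scripts:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, which stops the pipeline at the first failure.
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed
```

Calling `run_pipeline()` would run all five steps; passing a shorter list (for example `PIPELINE[2:]`) would resume from extract.py after the data has already been collected.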


## Config

To properly configure the scripts, you need to supply a file, config.cfg, with the following contents:


[OAuth]
accesstoken:realaccesstoken
accesstokenkey:realtokenkey
consumerkey:consumerkey
consumersecret:consumersecret
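
A file in this shape can be read with Python's standard `configparser` module, which accepts `:` as a key/value delimiter by default. This is a minimal sketch, not necessarily how the scripts themselves load the file; `load_oauth` is a hypothetical helper name.

```python
import configparser


def load_oauth(path="config.cfg"):
    """Return the four [OAuth] values from the config file as a dict."""
    parser = configparser.ConfigParser()
    # parser.read() silently skips missing files and returns the list of
    # files it actually parsed, so an empty list means nothing was loaded.
    if not parser.read(path):
        raise FileNotFoundError(path)
    section = parser["OAuth"]
    keys = ("accesstoken", "accesstokenkey", "consumerkey", "consumersecret")
    # A missing key raises KeyError, which flags an incomplete config early.
    return {key: section[key] for key in keys}
```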

topical-topology's People

Contributors

sbenthall, seanyhc


topical-topology's Issues

write get-log.py

get-log.py takes Twitter usernames from a file, runs twitter-log for each one, and dumps the logs into the log/ directory.

write fetch-names.py

Given a Twitter username, get the list of all users within N hops, where N is a parameter.

(Hops can be either follower or following relationships.)
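
One way to collect everyone within N hops is a breadth-first search that stops expanding at depth N. The sketch below assumes a caller-supplied `get_neighbors(username)` function (hypothetical) that returns the followers and followings of a user; how that function talks to the Twitter API is left out.

```python
from collections import deque


def users_within_hops(start, max_hops, get_neighbors):
    """Breadth-first search returning {username: hop distance} for every
    user reachable from `start` in at most `max_hops` hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distances[user] == max_hops:
            continue  # at the hop limit: record the user but do not expand
        for neighbor in get_neighbors(user):
            if neighbor not in distances:
                distances[neighbor] = distances[user] + 1
                queue.append(neighbor)
    return distances
```

Because each user is expanded at most once, the number of `get_neighbors` calls equals the number of users found within N-1 hops, which matters when every call costs an API request.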

exponential drop off on data collection error

The Twitter API docs recommend an exponential drop off (backoff) when Twitter returns an error, i.e., sleeping for an exponentially increasing time between retries until the request succeeds. We should implement this to improve our unmonitored data collection process.
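
A rough sketch of what this could look like, assuming the API call is wrapped in a zero-argument function. `call_with_backoff` and its parameters are hypothetical names, and the delay values are placeholders rather than numbers taken from the Twitter docs.

```python
import time


def call_with_backoff(request, max_retries=5, base_delay=1.0,
                      max_delay=300.0, sleep=time.sleep):
    """Call request(); on failure wait base_delay * 2**attempt seconds
    (capped at max_delay) and try again, up to max_retries retries."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            # Catching Exception keeps the sketch generic; real code should
            # catch only the error type raised for failed API requests.
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

With the defaults, the waits between attempts are 1, 2, 4, 8 and 16 seconds before the error is re-raised.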

400 error sometimes on lookupMulti

(I haven't yet reproduced this on the master branch, so maybe this is way off, but I just wanted to note this here...)

When the number of user_ids gets high enough, the Twitter API returns a 400 error on the lookup request.

This forces the individual lookup of IDs with the lookup method.

It would cut down on total requests if we could monitor lookupMulti and break it down into smaller batch requests if necessary.
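
A possible shape for this, sketched under the assumption that lookupMulti can be passed in as a function that takes a list of IDs and raises on a failed request. `lookup_in_batches` is a hypothetical name, and the default batch size of 100 is an assumption that should be checked against the current API limits.

```python
def lookup_in_batches(user_ids, lookup_multi, batch_size=100):
    """Look up user_ids in chunks, halving any chunk that lookup_multi rejects."""
    results = []
    pending = [user_ids[i:i + batch_size]
               for i in range(0, len(user_ids), batch_size)]
    while pending:
        batch = pending.pop(0)
        try:
            results.extend(lookup_multi(batch))
        except Exception:
            if len(batch) == 1:
                raise  # a single ID that still fails is a genuine error
            mid = len(batch) // 2
            # Put the two halves back at the front so result order is preserved.
            pending[:0] = [batch[:mid], batch[mid:]]
    return results
```

In the worst case this degrades to one request per ID, which is the current fallback behaviour, but whenever a smaller batch succeeds it saves requests.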

Consolidate all constants into settings.py file

There are several places where constants and configuration details are in our scripts. Sometimes they are repeated and need to be updated in multiple places.

A quick way to fix this would be to refactor them into a single settings.py script and import it into other scripts.

A problem with this is that the mallet.sh script is a bash script, so the configuration options in it (like the number of topics) cannot be set in a Python file.

One solution would be to rewrite the script as a python file that makes the command line calls.
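
As a sketch of that idea, the MALLET call could be assembled in Python from values that would otherwise live in settings.py. The paths, the default topic count and the function names below are assumptions for illustration; what mallet.sh actually passes to MALLET is not shown here, so the option list is only an example built from MALLET's `train-topics` command.

```python
import subprocess

# Hypothetical settings; in the project these would be imported from settings.py.
MALLET_BIN = "mallet-2.0.6/bin/mallet"
NUM_TOPICS = 20


def build_train_topics_command(input_file, topic_keys_file, doc_topics_file,
                               num_topics=NUM_TOPICS, mallet_bin=MALLET_BIN):
    """Build the argument list for MALLET's train-topics command."""
    return [
        mallet_bin, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", topic_keys_file,
        "--output-doc-topics", doc_topics_file,
    ]


def train_topics(*args, **kwargs):
    """Run MALLET; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_train_topics_command(*args, **kwargs), check=True)
```

Keeping the command construction separate from the call that runs it makes the options easy to inspect or log before anything is executed.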

pull 'snowball' info from accounts/ dir?

Currently, getlog.py uses the snowball.json file to get the metadata it needs to make the log requests.

This means that if getsnowball errors out before writing the snowball file, we can't use getlog.

But what if getlog could instead work from the cache in accounts?

That would let us use a bigger batch of data (though one collected in a more ad-hoc way).
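
A very rough sketch of reading from the cache, assuming (this is a guess, not something confirmed by the code) that accounts/ holds one JSON file per account. `load_cached_accounts` is a hypothetical name.

```python
import json
from pathlib import Path


def load_cached_accounts(cache_dir="accounts"):
    """Return {file stem: parsed JSON} for every *.json file in cache_dir.

    Assumes one JSON file per account; the real cache layout may differ.
    """
    accounts = {}
    for path in sorted(Path(cache_dir).glob("*.json")):
        try:
            with path.open() as handle:
                accounts[path.stem] = json.load(handle)
        except json.JSONDecodeError:
            # Skip files left half-written by an interrupted run.
            continue
    return accounts
```

getlog could then iterate over the returned dict instead of snowball.json, at the cost of losing whatever ordering or hop information the snowball file records.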
