sbenthall / topical-topology

Tools for probabilistic topic modeling of twitter

Shell 0.16% Python 99.84%

topical-topology's Introduction

This repository contains tools for studying Twitter using probabilistic topic modeling.

It uses MALLET to infer topics with latent Dirichlet allocation (LDA).

## Setup and usage

To run the tools, first unpack the MALLET archive:
tar -xzf mallet-2.0.6.tar.gz

settings.py holds the global settings for all the scripts.

Then run these scripts in the following order:

getsnowball.py -> finds a 'snowball' of users
getLog.py -> logs the last 200 tweets of each user in the snowball
extract.py -> extracts the tweets from the logs into a format usable by MALLET
infertopics.py -> uses MALLET to infer topics on the tweets using LDA
preparedata.py -> parses the data into numpy arrays and persists them as .npy files

Then use the analysis scripts, or write your own.
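The run order above can be scripted. This is a minimal sketch, assuming each script is run from the repository root and takes no command-line arguments (neither is confirmed here). `PIPELINE`, `run_pipeline`, and the `runner` parameter are illustrative names, not part of the repository.

```python
import subprocess
import sys

# Script names and order are taken from the list above.
PIPELINE = [
    "getsnowball.py",
    "getLog.py",
    "extract.py",
    "infertopics.py",
    "preparedata.py",
]


def run_pipeline(scripts=PIPELINE, runner=subprocess.check_call):
    """Run each script in order, stopping at the first failure."""
    for script in scripts:
        # check_call raises CalledProcessError on a non-zero exit status,
        # so later stages never run on incomplete data.
        runner([sys.executable, script])
```

Calling `run_pipeline()` would run all five stages with the current Python interpreter. The `runner` argument exists only so the order can be checked without executing anything.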


## Config

To properly configure the scripts, you need to supply a file, config.cfg, with the following contents:


[OAuth]
accesstoken:realaccesstoken
accesstokenkey:realtokenkey
consumerkey:consumerkey
consumersecret:consumersecret
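
A file in this format can be read with Python's standard `configparser` module, which accepts `:` as a key/value delimiter. The sketch below is one possible way to load it, not necessarily how the scripts in this repository do; `load_oauth` is an illustrative name.

```python
import configparser


def load_oauth(path="config.cfg"):
    """Return the four [OAuth] values from config.cfg as a dict."""
    parser = configparser.ConfigParser()
    # read() silently skips missing files, so check what was actually loaded.
    if not parser.read(path):
        raise FileNotFoundError(path)
    section = parser["OAuth"]
    keys = ("accesstoken", "accesstokenkey", "consumerkey", "consumersecret")
    # A missing key raises KeyError, which flags an incomplete config early.
    return {key: section[key] for key in keys}
```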

topical-topology's People

Contributors

tomoeyukishiro whyjustin parsegarden

topical-topology's Issues

add an option in getlog to grab accounts from file instead of snowball

write get-log.py

get-log.py takes Twitter usernames from a file, runs twitter-log on each, and dumps the logs into the log/ directory.

probability filter on getsnowball

Add a filter on fetch-names so that the graph is only traversed with some probability P (P is a parameter).
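
One way to read this is that each neighbor is followed independently with probability P. A minimal sketch under that assumption; `sample_neighbors` is a hypothetical helper, not existing code:

```python
import random


def sample_neighbors(neighbors, p, rng=random):
    """Keep each neighbor independently with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # rng.random() is uniform on [0, 1), so p=1 keeps everyone and p=0 no one.
    return [n for n in neighbors if rng.random() < p]
```

Passing a seeded `random.Random` instance as `rng` would make a snowball reproducible between runs.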

write fetch-names.py

Given a Twitter username, get the list of all people within N hops, where N is a parameter.

(hops can be either follower or following relationships)
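
This is a breadth-first search with a depth limit. A sketch, assuming a hypothetical `get_neighbors(user)` callable that returns the user's friends and followers combined (in practice it would wrap the Twitter API calls):

```python
from collections import deque


def users_within_hops(start, max_hops, get_neighbors):
    """Return {user: hop distance} for everyone within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_hops:
            continue  # do not expand past the hop limit
        for other in get_neighbors(user):
            if other not in distance:  # first visit is the shortest path
                distance[other] = distance[user] + 1
                queue.append(other)
    return distance
```

Users at exactly N hops are included in the result but never expanded, which avoids spending API requests on neighbors that would be discarded anyway.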

exponential drop off on data collection error

The Twitter API docs recommend an exponential drop off (exponential backoff) when Twitter returns an error, i.e., sleeping for an exponentially increasing time between retries until a request succeeds. We should implement this to improve our unmonitored data collection process.
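
A minimal sketch of such a retry loop. The names and the defaults (six tries, one-second base delay) are illustrative assumptions, and real code should catch only the HTTP errors it expects rather than every `Exception`:

```python
import time


def call_with_backoff(request, max_tries=6, base_delay=1.0, sleep=time.sleep):
    """Call request() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_tries):
        try:
            return request()
        except Exception:  # assumption: narrow this to the API's error type
            if attempt == max_tries - 1:
                raise  # out of retries; surface the last error
            sleep(base_delay * 2 ** attempt)  # base, 2x base, 4x base, ...
```

Capping the number of tries keeps an unattended run from sleeping forever if the API stays down.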

use OAuth on twitter requests

use OAuth on twitter requests to raise our rate limit

separate prep_data from analysis

combine fetch-names, get-logs, extracts, and mallet

String these together so we can generate topic models from the snowball around any arbitrary user.

in extract.py, get_screen_names(), add options to choose different snowballs

add exp delay to snowball

include a sleeper time to prevent rate limiting

add a sleep time to calls to the twitter API to keep it under the rate limit
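
A sketch of a simple throttle that enforces a minimum gap between calls. `Throttle` is a hypothetical helper, and the right interval depends on whatever rate limit currently applies to our requests:

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until min_interval has passed since the previous wait()."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

Calling `wait()` immediately before every API request sleeps only for the time still owed, so slow requests are not penalized twice.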

remove config.cfg from repository

so that I'm not forced to commit my update on this file

400 error sometimes on lookupMulti

(I haven't yet reproduced this on the master branch, so maybe this is way off, but I just wanted to note this here...)

When the number of user_ids gets high enough, the Twitter API returns a 400 error on the lookup request.

This forces us to fall back to looking up each ID individually with the lookup method.

It would cut down on total requests if we could monitor lookupMulti and break it down into smaller batch requests if necessary.
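
A sketch of that idea: send fixed-size batches and halve any batch that fails, so one bad request does not degrade into per-ID lookups. `lookup_multi` here is a stand-in callable for our lookupMulti, assumed to take a list of IDs and raise on an error response. The default batch size of 100 is an assumption based on the per-request cap described in the users/lookup documentation.

```python
def lookup_in_batches(user_ids, lookup_multi, batch_size=100):
    """Look up user_ids in batches, halving any batch that fails."""
    results = []
    for start in range(0, len(user_ids), batch_size):
        batch = user_ids[start:start + batch_size]
        results.extend(_lookup_or_split(batch, lookup_multi))
    return results


def _lookup_or_split(batch, lookup_multi):
    """Try the whole batch; on failure, retry each half separately."""
    try:
        return list(lookup_multi(batch))
    except Exception:  # assumption: narrow this to the 400 error in real code
        if len(batch) == 1:
            raise  # a single ID that still fails is a real error
        middle = len(batch) // 2
        return (_lookup_or_split(batch[:middle], lookup_multi)
                + _lookup_or_split(batch[middle:], lookup_multi))
```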

modify extract to take names from names.txt or from all logs/

modify extract so that it takes names from the names.txt file (which get-logs gets the names from) or else gets them from the logs/ directory

to save on total # of requests, do multiple lookups at once

With the lookup API, we can request multiple lookups at once

https://dev.twitter.com/docs/api/1/get/users/lookup

As an optimization to help us stay under the rate limit, we should change the getsnowball algorithm to take advantage of this.

Consolidate all constants into settings.py file

There are several places where constants and configuration details are in our scripts. Sometimes they are repeated and need to be updated in multiple places.

A quick way to fix this would be to refactor them into a single settings.py script and import it into other scripts.

A problem with this is that the mallet.sh script is a bash script, so its configuration options (like the number of topics) cannot be set in a Python file.

One solution would be to rewrite the script as a python file that makes the command line calls.
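
A sketch of what such a Python wrapper could look like. The `train-topics` command and its `--input`, `--num-topics`, `--output-topic-keys`, and `--output-doc-topics` options are MALLET's own; the binary path, output file names, and function names are assumptions, and the input is assumed to already be in MALLET's imported format.

```python
import subprocess


def train_topics_command(input_file, num_topics, output_dir,
                         mallet_bin="mallet-2.0.6/bin/mallet"):
    """Build the argument list for MALLET's train-topics command."""
    return [
        mallet_bin, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", output_dir + "/topic-keys.txt",
        "--output-doc-topics", output_dir + "/doc-topics.txt",
    ]


def infer_topics(input_file, num_topics, output_dir):
    """Run MALLET and raise CalledProcessError if it exits non-zero."""
    subprocess.check_call(train_topics_command(input_file, num_topics, output_dir))
```

With this in place, the number of topics could come from settings.py (for example a hypothetical `settings.NUM_TOPICS`) instead of being hard-coded in a shell script.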

pull 'snowball' info from accounts/ dir?

Currently, getlog.py uses the snowball.json file to get the metadata it needs to make the log requests.

This means that if getsnowball errors out before writing the snowball file, we can't use getlog.

But what if getlog could instead work from the cache in accounts?

That would let us use a bigger batch of data (though one collected in a more ad-hoc way).

in extract.py, option to view all the tweets as one doc

cache lookup, friend, and follower data

cache lookup, friend, and follower data for each user so that we can avoid unnecessary calls to the twitter API.
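
A minimal sketch of a file-backed cache, assuming one JSON file per user and data kind. The `accounts` default directory, the file naming scheme, and `cached_fetch` itself are illustrative assumptions; `fetch` stands in for whichever API call produces the data.

```python
import json
import os


def cached_fetch(user_id, kind, fetch, cache_dir="accounts"):
    """Return cached data for (user_id, kind), calling fetch() only on a miss."""
    path = os.path.join(cache_dir, "%s.%s.json" % (user_id, kind))
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    data = fetch(user_id)  # the only place the API is actually hit
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as handle:
        json.dump(data, handle)
    return data
```

Here `kind` would be something like `"lookup"`, `"friends"`, or `"followers"`. This sketch never expires entries, so stale data has to be cleared by deleting the files.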