sbenthall / topical-topology
Tools for probabilistic topic modeling of twitter
This repository contains tools for studying Twitter using probabilistic topic modelling. It uses MALLET to do the topic inference, using LDA.

## Setup and usage

To execute, first unpack the MALLET archive:

    tar -xzf mallet-2.0.6.tar.gz

settings.py has the global settings for all the scripts.

Then run these scripts in the following order:

- getsnowball.py -> finds a 'snowball' of users
- getLog.py -> logs last 200 tweets of each user in snowball
- extract.py -> extracts the tweets from the logs into a format usable by MALLET
- infertopics.py -> uses MALLET to infer topics on the tweets using LDA
- preparedata.py -> parses data into numpy arrays and persists them as .npy files

Then use the analysis scripts, or write your own.

## Config

To properly configure the scripts, you need to supply a file, config.cfg, with the following contents:

    [OAuth]
    accesstoken:realaccesstoken
    accesstokenkey:realtokenkey
    consumerkey:consumerkey
    consumersecret:consumersecret
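The `config.cfg` layout above is the format that Python's standard `configparser` module reads, including the `key:value` separator. The following is a minimal sketch of loading the `[OAuth]` section; the helper name `read_oauth` is hypothetical, and the repository's scripts may load the file differently.

```python
import configparser


def read_oauth(text):
    """Parse config.cfg-style text and return the [OAuth] section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["OAuth"])


# Placeholder values copied from the README; real credentials go in config.cfg.
SAMPLE = """\
[OAuth]
accesstoken:realaccesstoken
accesstokenkey:realtokenkey
consumerkey:consumerkey
consumersecret:consumersecret
"""

oauth = read_oauth(SAMPLE)
```

To read the real file instead of a string, `parser.read("config.cfg")` can replace `parser.read_string(text)`.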
get-log.py takes Twitter usernames from a file, runs twitter-log on each, and dumps the logs into the log/ directory
add a filter on fetch-names so that the graph is only traversed with some probability P (P is a parameter)
given a Twitter username, get the list of all people within N hops, where N is a parameter.
(hops can be either follower or following relationships)
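The N-hop collection and the probability-P filter described in these two issues could be combined in one breadth-first traversal. This is only a sketch under stated assumptions: `snowball`, `neighbors`, `max_hops`, and `p` are hypothetical names, and `neighbors` stands in for whatever call returns a user's followers and/or friends. It is not the code in getsnowball.py.

```python
import random
from collections import deque


def snowball(seed, neighbors, max_hops, p=1.0, rng=random):
    """Collect users within max_hops of seed, breadth first.

    neighbors(user) is an assumed callable returning followers and/or friends.
    Each newly discovered user is kept with probability p.
    Returns a dict mapping each collected user to its hop distance.
    """
    seen = {seed: 0}
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        depth = seen[user]
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for other in neighbors(user):
            if other in seen:
                continue
            if rng.random() >= p:
                continue  # skip this edge with probability 1 - p
            seen[other] = depth + 1
            queue.append(other)
    return seen
```

With `p=1.0` every edge is followed, so the result is the full N-hop neighborhood; smaller values of `p` thin the snowball and reduce the number of API calls.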
The Twitter API docs recommend exponential backoff when Twitter returns an error (i.e., sleep for an exponentially increasing time between retries until the request succeeds). We should implement this to improve our unmonitored data collection process.
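A minimal sketch of such a retry loop follows. The function name `call_with_backoff` and its parameters are hypothetical, and catching `Exception` is a placeholder that should be narrowed to whatever error the Twitter client in use actually raises.

```python
import time


def call_with_backoff(request, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call request() and retry on failure, doubling the wait each time.

    request is any zero-argument callable (an assumption for illustration).
    The last failure is re-raised once max_tries attempts have been made.
    """
    delay = base_delay
    for attempt in range(max_tries):
        try:
            return request()
        except Exception:  # placeholder: narrow to the client's HTTP error type
            if attempt == max_tries - 1:
                raise
            sleep(delay)
            delay *= 2  # exponential growth: 1s, 2s, 4s, ...
```

Passing `sleep` as a parameter keeps the function testable without real waiting.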
use OAuth on twitter requests to raise our rate limit
String these together so we can generate topic models from the snowball around any arbitrary user.
add a sleep time to calls to the twitter API to keep it under the rate limit
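One way to add that sleep is a small throttle that spaces calls evenly across the rate-limit window. The class name `Throttle` and the `calls_per_hour` parameter are assumptions for illustration; the actual limit should come from the Twitter API documentation or from settings.py.

```python
import time


class Throttle:
    """Sleep as needed so that calls are at least min_interval seconds apart."""

    def __init__(self, calls_per_hour, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / calls_per_hour
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

Calling `throttle.wait()` immediately before each API request keeps the request rate at or below the configured limit.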
so that I'm not forced to commit my update on this file
(I haven't yet reproduced this on the master branch, so maybe this is way off, but I just wanted to note this here...)
When the number of user_ids gets high enough, the Twitter API returns a 400 error on the lookup request.
This forces us to look up each ID individually with the lookup method.
It would cut down on total requests if we could monitor lookupMulti and break it down into smaller batch requests if necessary.
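The batch-splitting idea could look roughly like the sketch below. Here `lookup_multi` is passed in as a callable because its real signature in this repository is not shown, the default `batch_size` of 100 is an assumption, and the broad `except Exception` stands in for the specific 400 error.

```python
def lookup_batch(batch, lookup_multi):
    """Look up one batch; on failure, split it in half and retry each half."""
    try:
        return list(lookup_multi(batch))
    except Exception:  # placeholder for the 400 error from the lookup request
        if len(batch) <= 1:
            raise  # a single ID that still fails is a real error
        mid = len(batch) // 2
        return (lookup_batch(batch[:mid], lookup_multi)
                + lookup_batch(batch[mid:], lookup_multi))


def lookup_all(user_ids, lookup_multi, batch_size=100):
    """Look up all user_ids in order, batch_size at a time."""
    results = []
    for start in range(0, len(user_ids), batch_size):
        chunk = user_ids[start:start + batch_size]
        results.extend(lookup_batch(chunk, lookup_multi))
    return results
```

Halving on failure falls back to individual lookups only in the worst case, so most IDs are still fetched in multi-user requests.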
modify extract so that it takes names from the names.txt file (which get-logs gets the names from) or else gets them from the logs/ directory
With the lookup API, we can request multiple lookups at once
https://dev.twitter.com/docs/api/1/get/users/lookup
As an optimization to help us get around the rate limitation, we should change the getsnowball algorithm to take advantage of this
There are several places where constants and configuration details are in our scripts. Sometimes they are repeated and need to be updated in multiple places.
A quick way to fix this would be to refactor them into a single settings.py script and import it into other scripts.
A problem with this is that the mallet.sh script is a bash script, so its configuration options (like the number of topics) cannot be set in a Python file.
One solution would be to rewrite the script as a python file that makes the command line calls.
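A sketch of what such a Python wrapper might look like, assuming MALLET's `train-topics` command and its `--input`, `--num-topics`, `--output-topic-keys`, and `--output-doc-topics` options. The function names, default paths, and output file names are hypothetical, the input is assumed to be a file already imported into MALLET's format, and the contents of the existing mallet.sh are not shown here, so this may not match what it does.

```python
import subprocess


def mallet_train_command(input_path, num_topics,
                         mallet_bin="mallet-2.0.6/bin/mallet",
                         keys_path="topic-keys.txt",
                         doc_topics_path="doc-topics.txt"):
    """Build the argument list for a MALLET LDA training run."""
    return [
        mallet_bin, "train-topics",
        "--input", input_path,
        "--num-topics", str(num_topics),
        "--output-topic-keys", keys_path,
        "--output-doc-topics", doc_topics_path,
    ]


def run_mallet(input_path, num_topics, **kwargs):
    """Run MALLET and raise CalledProcessError if it exits with an error."""
    subprocess.check_call(mallet_train_command(input_path, num_topics, **kwargs))
```

With this shape, the number of topics can live in settings.py and be passed in as `num_topics`, which removes the duplicated constant from the shell script.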
Currently, getlog.py uses the snowball.json file to get the metadata it needs to make the log requests.
This means that if getsnowball errors out before writing the snowball file, we can't use getlog.
But what if getlog could instead work from the cache in accounts?
That would let us use a bigger batch of data (though one collected in a more ad-hoc way).
cache lookup, friend, and follower data for each user so that we can avoid unnecessary calls to the twitter API.
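A simple on-disk cache along these lines could store one JSON file per user and per kind of data. The function name `cached_fetch`, the file naming scheme, and the use of a directory named `accounts` are assumptions for illustration, not the repository's actual cache layout.

```python
import json
import os


def cached_fetch(user_id, kind, fetch, cache_dir="accounts"):
    """Return cached data for (user_id, kind), calling fetch only on a miss.

    kind is a label such as "lookup", "friends", or "followers".
    fetch(user_id) is an assumed callable that performs the real API request
    and returns JSON-serializable data.
    """
    path = os.path.join(cache_dir, "%s_%s.json" % (user_id, kind))
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = fetch(user_id)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```

Because the cache survives between runs, a crawl that errors out partway can resume without repeating requests that already succeeded.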