twarc

twarc is a command line tool and Python library for archiving Twitter JSON data. Each tweet is represented as a JSON object which is exactly what was returned from the Twitter API. It runs in three modes: search, stream and hydrate. When running in each mode twarc will stop and resume activity in order to work within the Twitter API's rate limits.
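The stop-and-resume behaviour can be pictured with a small sketch. This is not twarc's own code: it assumes the `x-rate-limit-remaining` and `x-rate-limit-reset` response headers returned by Twitter's REST API, and the helper name `seconds_to_wait` is hypothetical.

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how many seconds to pause before the next request.

    `headers` is a dict of response headers with lowercase keys.
    Assumption: the API reports x-rate-limit-remaining (requests left in
    the current window) and x-rate-limit-reset (Unix time of the reset).
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("x-rate-limit-remaining", 1))
    if remaining > 0:
        return 0.0  # requests are still available in this window
    reset = int(headers.get("x-rate-limit-reset", now))
    # Wait until the window resets; never return a negative delay.
    return max(0.0, reset - now)
```

A client would call `time.sleep(seconds_to_wait(response_headers))` after each response and then carry on where it left off.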

Install

To install twarc and run a first search:

  1. install Python and pip
  2. pip install twarc
  3. create an app for your program at apps.twitter.com
  4. set CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN and ACCESS_TOKEN_SECRET for your app in your environment.
  5. twarc.py --search ferguson > tweets.json
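Before running twarc it can help to confirm that the four variables from step 4 are actually set. The following is a minimal sketch; the helper name `missing_credentials` is hypothetical and is not part of twarc.

```python
import os

# The four variable names listed in step 4 above.
REQUIRED = ["CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"]


def missing_credentials(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not environ.get(name)]


missing = missing_credentials()
if missing:
    print("Please set: " + ", ".join(missing))
```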

Search

When running in search mode twarc will use Twitter's search API to retrieve tweets that match a particular query. For example, to collect all the tweets mentioning the keyword "ferguson" you would run:

twarc.py --search ferguson > tweets.json

This command will walk through each page of the search results and write each tweet to stdout as line-oriented JSON. Twitter's search API only makes (roughly) the last week's worth of tweets available, so time is of the essence if you are trying to collect tweets for something that has already happened.
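Line-oriented JSON means one complete JSON object per line, so the output can be read back one tweet at a time without loading the whole file. A minimal sketch follows; the helper name `read_tweets` and the two sample lines are made up for illustration.

```python
import json


def read_tweets(lines):
    """Yield one dict per non-blank line of line-oriented JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# With a real archive you would pass an open file instead:
#     with open("tweets.json") as f:
#         for tweet in read_tweets(f):
#             print(tweet["text"])
sample = ['{"id_str": "1", "text": "hello"}', "", '{"id_str": "2", "text": "world"}']
texts = [tweet["text"] for tweet in read_tweets(sample)]
```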

Stream

In stream mode twarc will listen to Twitter's filter stream API for tweets that match a particular filter. As in search mode, twarc will write these tweets to stdout as line-oriented JSON:

twarc.py --stream ferguson > tweets.json

Note that the syntax for Twitter's filter queries is slightly different from the syntax of queries in the search API, so please consult Twitter's documentation on how best to express the filter.

Hydrate

The Twitter API's Terms of Service prevent people from making large amounts of raw Twitter data available on the Web. The data can be used for research and archived for local use, but not shared with the world. Twitter does allow files of tweet identifiers to be shared, which can be useful when you would like to make a dataset of tweets available. You can then use Twitter's API to hydrate the data, or to retrieve the full JSON for each identifier. This is particularly important for verification of social media research.

In hydrate mode twarc will read a file of tweet identifiers and use Twitter's lookup API to fetch the full JSON for each tweet and write it to stdout as line-oriented JSON:

twarc.py --hydrate ids.txt > tweets.json
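The file of identifiers that hydrate mode reads can itself be produced from an existing archive. The sketch below assumes each archived tweet carries the `id_str` field found in Twitter's tweet JSON; the helper name `tweet_ids` is hypothetical.

```python
import json


def tweet_ids(lines):
    """Yield the id of every tweet in line-oriented JSON, as a string."""
    for line in lines:
        if line.strip():
            yield json.loads(line)["id_str"]


# With real files, one id per line could be written like this:
#     with open("tweets.json") as src, open("ids.txt", "w") as dst:
#         for tweet_id in tweet_ids(src):
#             dst.write(tweet_id + "\n")
sample = ['{"id_str": "501064141332029440", "text": "a"}', '{"id_str": "501064171707170816", "text": "b"}']
ids = list(tweet_ids(sample))
```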

Use as a Library

If you want, you can use twarc programmatically as a library to collect tweets. You first need to create a Twarc instance, and then use it to iterate through search results, filter results or lookup results.

from twarc import Twarc

t = Twarc()
for tweet in t.search("ferguson"):
    print(tweet["text"])

You can do the same for a stream of new tweets:

from twarc import Twarc

t = Twarc()
for tweet in t.stream("ferguson"):
    print(tweet["text"])

Similarly you can hydrate tweet identifiers by passing in a list of ids or a generator:

from twarc import Twarc

t = Twarc()
for tweet in t.hydrate(open('ids.txt')):
    print(tweet["text"])

Utilities

In the utils directory there are some simple command line utilities for working with the line-oriented JSON, like printing out the archived tweets as text or html, extracting the usernames, referenced URLs, etc. If you create a script that is handy please send a pull request.

For example, let's say you archive some tweets mentioning "ferguson":

% twarc.py --search ferguson > tweets.json

This is good for one-off collecting, but if you would like to run the same search periodically and have it collect only the tweets you previously missed, try the utils/archive.py utility:

% utils/archive.py ferguson /mnt/tweets/ferguson/

This will search for tweets and write them as:

/mnt/tweets/ferguson/tweets-0001.json

If you run the same command later it will write any tweets that weren't archived previously to:

/mnt/tweets/ferguson/tweets-0002.json
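One way such an incremental archive could work is sketched below. This is an illustration and not the code of utils/archive.py: it assumes the next file name is derived from the highest existing number, and that the newest archived id would be handed to the search API's `since_id` parameter so that only later tweets come back. Both helper names are hypothetical.

```python
import json
import re


def next_archive_name(existing):
    """Given existing file names, return the next tweets-NNNN.json name."""
    matches = (re.match(r"tweets-(\d+)\.json$", name) for name in existing)
    numbers = [int(m.group(1)) for m in matches if m]
    return "tweets-%04d.json" % (max(numbers, default=0) + 1)


def newest_id(lines):
    """Return the largest tweet id in line-oriented JSON, or None if empty."""
    ids = [int(json.loads(line)["id_str"]) for line in lines if line.strip()]
    return max(ids, default=None)
```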

When you've got some tweets you can create a rudimentary wall of them:

% utils/wall.py tweets.json > tweets.html

You can create a word cloud of tweets you collected (for example, about NASA):

% utils/wordcloud.py tweets.json > wordcloud.html

gender.py is a filter which allows you to filter tweets based on a guess about the gender of the author. So for example you can select only the tweets that look like they were from women, and create a word cloud for them:

% utils/gender.py --gender female tweets.json | utils/wordcloud.py > tweets-female.html

You can output GeoJSON from tweets where geo coordinates are available:

% utils/geojson.py tweets.json > tweets.geojson
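The conversion can be sketched as follows. This is not the code of utils/geojson.py: it assumes the tweet's `coordinates` field, which in Twitter's tweet JSON is either null or a GeoJSON Point, and the helper name `to_geojson` and the chosen properties are hypothetical.

```python
def to_geojson(tweets):
    """Build a GeoJSON FeatureCollection from tweets that have coordinates."""
    features = []
    for tweet in tweets:
        point = tweet.get("coordinates")
        if not point:
            continue  # tweets without geo coordinates are skipped
        features.append({
            "type": "Feature",
            "geometry": point,  # already a Point: [longitude, latitude]
            "properties": {"id": tweet["id_str"], "text": tweet.get("text", "")},
        })
    return {"type": "FeatureCollection", "features": features}
```

Passing the result to `json.dumps` gives a document that GeoJSON-aware mapping tools can load.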

If you suspect you have duplicates in your tweets you can dedupe them:

% utils/deduplicate.py tweets.json > deduped.json
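Deduplication amounts to keeping the first tweet seen for each identifier. A minimal sketch, assuming the `id_str` field of Twitter's tweet JSON (the helper name `deduplicate` is hypothetical and this is not the utility's own code):

```python
def deduplicate(tweets):
    """Yield each tweet once, keeping the first occurrence of every id."""
    seen = set()
    for tweet in tweets:
        tweet_id = tweet["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet
```

Only the set of ids is held in memory, so the tweets themselves can still be streamed one line at a time.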

You can sort by ID, which is analogous to sorting by time:

% utils/sort_by_id.py tweets.json > sorted.json
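Sorting by ID tracks time because the high bits of a tweet ID encode when it was created. The sketch below assumes Twitter's Snowflake ID layout (a millisecond timestamp above the lowest 22 bits, counted from the epoch constant shown), which applies to IDs issued since late 2010; both helper names are hypothetical.

```python
# Assumption: Snowflake epoch used by Twitter, in milliseconds since 1970.
TWITTER_EPOCH_MS = 1288834974657


def snowflake_ms(tweet_id):
    """Milliseconds since the Unix epoch encoded in a Snowflake tweet id."""
    return (int(tweet_id) >> 22) + TWITTER_EPOCH_MS


def sort_by_id(tweets):
    """Sort tweets oldest first, comparing ids as integers, not strings."""
    return sorted(tweets, key=lambda tweet: int(tweet["id_str"]))
```

Comparing the ids as integers matters: as strings, "9" would sort after "10".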

You can filter out all tweets before a certain date (for example, if a hashtag was used for another event before the one you're interested in):

% utils/filter_date.py --mindate 1-may-2014 tweets.json > filtered.json
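A date filter of this kind has to parse each tweet's `created_at` string. The sketch below assumes the format used in Twitter's tweet JSON (for example `Sat Aug 09 17:00:00 +0000 2014`) and an English locale for the day and month names; the helper name `created_on_or_after` is hypothetical and this is not the utility's own code.

```python
from datetime import datetime, timezone

# Assumed layout of the created_at field in Twitter's tweet JSON.
CREATED_AT = "%a %b %d %H:%M:%S %z %Y"


def created_on_or_after(tweet, min_date):
    """Return True if the tweet was created at or after min_date.

    min_date must be timezone-aware, because created_at carries an offset.
    """
    created = datetime.strptime(tweet["created_at"], CREATED_AT)
    return created >= min_date


min_date = datetime(2014, 5, 1, tzinfo=timezone.utc)
sample = [
    {"created_at": "Tue Apr 01 00:00:00 +0000 2014"},
    {"created_at": "Sat Aug 09 17:00:00 +0000 2014"},
]
kept = [tweet for tweet in sample if created_on_or_after(tweet, min_date)]
```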

You can get an HTML list of the clients used:

% utils/source.py tweets.json > sources.html

If you want to remove the retweets:

% utils/noretweets.py tweets.json > tweets_noretweets.json
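Retweets can be recognised because Twitter's tweet JSON embeds the original tweet under a `retweeted_status` key. A minimal sketch under that assumption (the helper name `without_retweets` is hypothetical and this is not the utility's own code):

```python
def without_retweets(tweets):
    """Return only the tweets that are not retweets of another tweet."""
    return [tweet for tweet in tweets if "retweeted_status" not in tweet]
```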

Or unshorten urls (requires unshrtn):

% cat tweets.json | utils/unshorten.py > unshortened.json

Once you unshorten your URLs you can get a ranked list of most tweeted URLs:

% cat unshortened.json | utils/urls.py | sort | uniq -c | sort -n > urls.txt
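The same ranking can be done in Python. The sketch below assumes the `entities.urls` list of Twitter's tweet JSON, whose items carry `url` and `expanded_url`; it does not know about any extra field that unshorten.py may add, and the helper name `rank_urls` is hypothetical.

```python
from collections import Counter


def rank_urls(tweets):
    """Return (url, count) pairs, most frequently tweeted first."""
    counts = Counter()
    for tweet in tweets:
        for url in tweet.get("entities", {}).get("urls", []):
            # Prefer the expanded form; fall back to the short link.
            counts[url.get("expanded_url") or url["url"]] += 1
    return counts.most_common()
```

Unlike the shell pipeline, which ends with an ascending `sort -n`, `most_common()` lists the most tweeted URL first.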

twarc-report

Some further utility scripts that generate CSV or JSON output suitable for use with D3.js visualizations are found in the twarc-report project. The utility directed.py, formerly part of twarc, has moved to twarc-report as d3graph.py.

Each script can also generate an HTML demo of a D3 visualization, e.g. timelines or a directed graph of retweets.

License

  • CC0
