Import script for loading Twitter and GNIP data into MongoDB. It inserts entire JSON records and adds some fields so that the two formats can be queried uniformly. The tool includes options to configure the size of each insert (number of tweets) and to check for existing IDs in the database, which is useful for merging datasets, although the check is slow.
Internally, IDs are tracked during insertion to prevent the duplicates that occasionally occur.
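The batching and ID tracking described above can be illustrated with a minimal sketch. It is not the script's actual code: the function name `unique_batches` and the use of a top-level `id` field as the tweet identifier are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def unique_batches(tweets: Iterable[Dict], batch_size: int = 1000) -> Iterator[List[Dict]]:
    """Yield lists of at most batch_size tweets, skipping repeated ids."""
    seen_ids = set()  # ids already placed in some batch during this run
    batch = []
    for tweet in tweets:
        tweet_id = tweet["id"]  # assumed name of the id field
        if tweet_id in seen_ids:
            continue  # duplicate within the input, drop it
        seen_ids.add(tweet_id)
        batch.append(tweet)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch
```

Each yielded batch could then be passed to a bulk insert such as pymongo's `Collection.insert_many`. Keeping the set of seen IDs in memory avoids a database lookup per tweet, at the cost of memory that grows with the number of tweets.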
python import.py <host> <database> <collection>
argument name | description |
---|---|
host | hostname of the database, e.g. localhost or mongo.example.com |
database | name of the database to use, e.g. mydatabase |
collection | name of the collection to insert into, e.g. tweets |
Optional arguments:
shorthand argument | argument | description |
---|---|---|
-l | --limit | limit the number of tweets to import to the given value. |
-f | --filename | filename of the JSON file to import. Wildcards are accepted, e.g. tweets.json or july_*_2016.json |
-e | --encoding | JSON file encoding (default is utf-8) |
-b | --batchsize | number of tweets to insert with a single command (default is 1000). |
-c | --check | check whether a tweet exists (same tweet id) before inserting it. Use this if something goes wrong during an insert, or if you're trying to merge two datasets. CAUTION: this is incredibly slow. |
-r | --no_retweets | do not add embedded retweets. By default the source tweet of a retweet is also inserted as a top-level tweet, but this can bias datasets towards things that are retweeted. |
 | --no_index | do not create an index for tweet ids. By default an ascending index is created on the tweet id. |
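To make the retweet behaviour concrete, here is a minimal sketch of how an embedded source tweet might be promoted to a top-level record. It assumes the standard Twitter JSON layout, where a retweet carries the original tweet under `retweeted_status`; the function name `expand_retweets` and its `include_retweets` parameter are hypothetical, and GNIP-formatted records are not handled here.

```python
from typing import Dict, Iterator


def expand_retweets(tweet: Dict, include_retweets: bool = True) -> Iterator[Dict]:
    """Yield the tweet itself and, optionally, the source tweet it retweets."""
    yield tweet
    # Assumption: the embedded source tweet lives under "retweeted_status".
    source = tweet.get("retweeted_status")
    if include_retweets and source is not None:
        yield source  # treated like any other top-level tweet
```

Passing `include_retweets=False` corresponds to the effect described for `--no_retweets`: only the tweets that appear at the top level of the file are kept.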
To run the tool with all defaults, use the following command. Make sure to replace the first three arguments with the correct values for your system.
python import.py mydatabase.host.com mydatabase tweets
If you're trying to merge datasets, or something went wrong during the first insert, use this:
python import.py mydatabase.host.com mydatabase tweets --check
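For intuition about what an existence check before each insert involves, the following is a rough sketch rather than the script's implementation. It assumes a pymongo-style collection object (using `find_one` and `insert_one`) and a top-level `id` field; the function name `insert_missing` is hypothetical.

```python
from typing import Dict, Iterable


def insert_missing(collection, tweets: Iterable[Dict]) -> int:
    """Insert only tweets whose id is not yet stored; return how many were inserted."""
    inserted = 0
    for tweet in tweets:
        # One lookup per tweet; this per-tweet query is a likely reason
        # such a check is much slower than a plain bulk insert.
        existing = collection.find_one({"id": tweet["id"]}, {"_id": 1})
        if existing is None:
            collection.insert_one(tweet)
            inserted += 1
    return inserted
```

With an index on the tweet id each lookup is cheaper, but the check still issues one query per tweet instead of one command per batch.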