
tweet_import's Introduction

Mongo Tweet Import

Import script for loading Twitter and GNIP data into MongoDB. It inserts entire JSON records and adds some fields so that records from the two sources can be queried uniformly. The tool includes options to configure the size of each insert (number of tweets) and to check for existing IDs in the database, which can be useful for merging datasets (although this check is slow).
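To illustrate what making the two formats uniform can involve, here is a minimal sketch that derives one numeric tweet ID from either kind of record. It assumes Twitter API records carry an integer `id` and GNIP Activity Streams records carry a string `id` ending in `:<tweet id>`. The function name `uniform_tweet_id` is hypothetical, and the fields the script actually adds may differ.

```python
def uniform_tweet_id(record):
    """Return a numeric tweet ID for a Twitter API or GNIP record.

    Hypothetical helper for illustration; not taken from import.py.
    """
    raw_id = record["id"]
    if isinstance(raw_id, int):
        # Twitter API records already use a numeric id.
        return raw_id
    # Assumed GNIP form: "tag:search.twitter.com,2005:<tweet id>".
    # Take whatever follows the last colon and convert it to an int.
    return int(str(raw_id).rsplit(":", 1)[-1])


twitter_record = {"id": 123456789, "text": "hello"}
gnip_record = {"id": "tag:search.twitter.com,2005:987654321", "body": "hello"}

print(uniform_tweet_id(twitter_record))
print(uniform_tweet_id(gnip_record))
```

With a shared numeric ID on every record, a single query or index can cover tweets from both sources.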

Internally, tweet IDs are tracked during insertion to prevent duplicates, which can occasionally occur.
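The ID tracking and batched inserts described above can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: `unique_batches` is a hypothetical name, each tweet is assumed to be a dict with an `id` key, and the default batch size mirrors the documented default of 1000.

```python
def unique_batches(tweets, batch_size=1000):
    """Yield lists of at most batch_size tweets, skipping repeated IDs.

    Hypothetical helper for illustration; not taken from import.py.
    """
    seen_ids = set()  # every tweet ID yielded so far
    batch = []
    for tweet in tweets:
        tweet_id = tweet["id"]
        if tweet_id in seen_ids:
            continue  # duplicate within this import run, skip it
        seen_ids.add(tweet_id)
        batch.append(tweet)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


sample = [{"id": 1}, {"id": 2}, {"id": 1}, {"id": 3}]
batches = list(unique_batches(sample, batch_size=2))
print(batches)
```

Each yielded batch could then be passed to a bulk insert such as PyMongo's `Collection.insert_many`. Note that an in-memory set only catches duplicates within one run; tweets already stored in the database need a separate check.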

Usage and Arguments

python import.py <host> <database> <collection>

| argument name | description |
| --- | --- |
| host | hostname of the database, e.g. localhost or mongo.example.com |
| database | name of the database to use, e.g. mydatabase |
| collection | name of the collection in the database to insert into, e.g. tweets |

Optional arguments:

| shorthand | argument | description |
| --- | --- | --- |
| -l | --limit | limit the number of tweets to import to x. |
| -f | --filename | filename of the JSON file to import. Wildcards are accepted, e.g. tweets.json or july_*_2016.json |
| -e | --encoding | JSON file encoding (default is utf-8). |
| -b | --batchsize | number of tweets to insert with a single command (default is 1000). |
| -c | --check | check whether a tweet with the same tweet ID exists before inserting. Use this if something goes wrong during an insert, or if you're trying to merge two datasets. CAUTION: this is incredibly slow. |
| -r | --no_retweets | do not add embedded retweets. By default the source tweet of a retweet is also inserted as a top-level tweet, but this can bias datasets towards things that are retweeted. |
| | --no_index | do not create an index for tweet IDs. By default an ascending index is created on the tweet ID. |
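The default retweet handling, and what `--no_retweets` turns off, can be pictured with the sketch below. It assumes Twitter API style records in which a retweet embeds its source tweet under the `retweeted_status` key. The function name `expand_retweets` and the order in which the source tweet is emitted are assumptions made for illustration.

```python
def expand_retweets(tweets, include_retweets=True):
    """Yield each tweet and, optionally, the source tweet it retweets.

    Hypothetical helper for illustration; not taken from import.py.
    """
    for tweet in tweets:
        yield tweet
        source = tweet.get("retweeted_status")
        if include_retweets and source is not None:
            # Promote the embedded source tweet to a top-level record.
            yield source


sample = [
    {"id": 2, "retweeted_status": {"id": 1}},
    {"id": 3},
]

with_sources = [t["id"] for t in expand_retweets(sample)]
without_sources = [t["id"] for t in expand_retweets(sample, include_retweets=False)]
print(with_sources, without_sources)
```

Because a popular tweet is embedded in every one of its retweets, promoting sources this way is where the bias towards retweeted content comes from, and it is also one way the same ID can show up more than once in a single import.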

Example Usage

To run the tool with all defaults, use the following command. Make sure to replace the three arguments with the correct values for your system.

python import.py mydatabase.host.com mydatabase tweets

If you're trying to merge datasets, or something went wrong during the first insert, use this:

python import.py mydatabase.host.com mydatabase tweets --check
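Conceptually, an existence check removes tweets whose IDs are already stored before inserting the rest. The sketch below is one possible approach and not necessarily how `--check` is implemented: it looks up a whole batch at once with MongoDB's `$in` operator, whereas the script may query tweet by tweet. The function name `drop_existing` and the `id_field` parameter are hypothetical, and `collection` is assumed to be any object exposing a PyMongo-style `find(filter, projection)` method.

```python
def drop_existing(collection, batch, id_field="id"):
    """Return the tweets in batch whose IDs are not yet in collection.

    Hypothetical helper for illustration; not taken from import.py.
    """
    ids = [tweet[id_field] for tweet in batch]
    # One query per batch: fetch only the ID field of matching documents.
    cursor = collection.find({id_field: {"$in": ids}}, {id_field: 1})
    existing = {doc[id_field] for doc in cursor}
    return [tweet for tweet in batch if tweet[id_field] not in existing]
```

With PyMongo, `collection` would be something like `MongoClient(host)[database][collection_name]`. The lookup benefits from the ascending tweet ID index that the tool creates by default.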

Contributors

geosoco
