reddit-database

This repository contains MongoDB dumps of reddit's subreddit structure, mined with this crawler.

Each document in the subreddits collection contains the following information:

  • The _id is the name of the subreddit (in lowercase).
  • The date and time when the subreddit was analyzed is stored in a timestamp.
  • The number of subscribers the subreddit had at that moment.
  • The type can be one of these:
    • public: most common type, it has the over18 flag off.
    • nsfw: most used for research, has the over18 flag on.
    • private: you need to be invited in by a mod to read it.
    • banned: got hammered down.
    • nonexistent: it was deleted or never created.
  • The description has the contents of wiki/config/description for public/nsfw subreddits, or the landing page message for private subreddits.
  • Finally, a list of keywords is extracted from the description when possible (using rake-nltk).
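As a rough illustration of the document shape described above, here is a minimal sketch in Python. Only `_id` is named explicitly in this README; the other field names (`timestamp`, `subscribers`, `type`, `description`, `keywords`) and the helper `make_subreddit_doc` are assumptions made for the example, so check them against the actual dump before relying on them.

```python
from datetime import datetime, timezone

# The five types listed above.
SUBREDDIT_TYPES = {"public", "nsfw", "private", "banned", "nonexistent"}


def make_subreddit_doc(name, subscribers, sub_type, description="", keywords=None):
    """Build a dict shaped like a document in the subreddits collection.

    Field names other than "_id" are assumed, not taken from the dump.
    """
    if sub_type not in SUBREDDIT_TYPES:
        raise ValueError(f"unknown subreddit type: {sub_type!r}")
    return {
        "_id": name.lower(),                      # subreddit name, lowercase
        "timestamp": datetime.now(timezone.utc),  # when it was analyzed
        "subscribers": subscribers,
        "type": sub_type,
        "description": description,
        "keywords": list(keywords or []),
    }


doc = make_subreddit_doc("AskReddit", 17000000, "public", "Ask and answer questions.")
```

The subscriber count in the usage line is a made-up value, not a figure from the dumps.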

To build the graph, the crawler analyzed the sidebar and wiki of each subreddit, looking for matches of the regex /r/.* . Each string found this way was added to a queue, and a relation was inserted into the relations collection.
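The crawl step described above could look roughly like the following sketch. It is not the crawler's actual code: the narrower pattern `/r/([A-Za-z0-9_]+)` is an assumption (the README only mentions `/r/.*`), and the names `find_related` and `crawl_step`, as well as the in-memory `queue`, `visited` and `relations` containers, are hypothetical stand-ins for the real queue and the MongoDB collection.

```python
import re
from collections import deque

# Assumed pattern: captures just the subreddit name after "/r/".
SUBREDDIT_LINK = re.compile(r"/r/([A-Za-z0-9_]+)")


def find_related(text):
    """Return lowercase subreddit names mentioned in text, in order, deduplicated."""
    found = []
    for name in SUBREDDIT_LINK.findall(text):
        name = name.lower()
        if name not in found:
            found.append(name)
    return found


def crawl_step(source, text, queue, visited, relations):
    """Record a relation for every subreddit linked from source and enqueue new ones."""
    for target in find_related(text):
        if target == source:
            continue  # ignore self-references
        # Store the pair in lexicographic order so (a, b) and (b, a) coincide.
        relations.add(tuple(sorted((source, target))))
        if target not in visited:
            visited.add(target)
            queue.append(target)


queue, visited, relations = deque(), {"python"}, set()
crawl_step("python", "See /r/learnpython and /r/Django, also /r/python.",
           queue, visited, relations)
```

After this call, `queue` holds the two newly discovered subreddits and `relations` holds one ordered pair for each of them.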

Each document in the relations collection has the following fields:

  • sub_a is the first subreddit (lexicographically).
  • sub_b is the second subreddit (lexicographically).
  • The _id is a hash of the concatenation sub_a/sub_b.
  • The date and time when this relation was found is stored in a timestamp.
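A relation document with these fields could be built along the following lines. This is a sketch under assumptions: the README does not say which hash function is used, so SHA-1 from `hashlib` is only a placeholder, and `make_relation_doc` and the `timestamp` field name are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def make_relation_doc(first, second, found_at=None):
    """Build a dict shaped like a document in the relations collection."""
    # Order the pair lexicographically so both directions map to one document.
    sub_a, sub_b = sorted((first.lower(), second.lower()))
    key = f"{sub_a}/{sub_b}"
    return {
        # Assumed hash: the real crawler may use a different function.
        "_id": hashlib.sha1(key.encode("utf-8")).hexdigest(),
        "sub_a": sub_a,
        "sub_b": sub_b,
        "timestamp": found_at or datetime.now(timezone.utc),
    }


forward = make_relation_doc("python", "django")
backward = make_relation_doc("Django", "Python")
```

Because the pair is sorted before hashing, `forward` and `backward` get the same `_id`, so a link found in either direction would map to a single document.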

Database: v2

Date: November 3, 2017

Subreddits: 54151 Relations: 255617

This time I decided NOT to analyze the wiki of subreddits with fewer than 10000 subscribers, because many small subreddits had very large wiki pages linking to many other small subreddits. I also excluded any subreddit with 'reddit' in its name (with the exception of askreddit).
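The two v2 rules can be expressed as small predicates. The sketch below only illustrates the rules as stated; the function names and the constant `WIKI_MIN_SUBSCRIBERS` are hypothetical and not taken from the crawler.

```python
WIKI_MIN_SUBSCRIBERS = 10000  # threshold stated for v2


def is_excluded(name):
    """True for subreddits with 'reddit' in the name, except askreddit."""
    name = name.lower()
    return "reddit" in name and name != "askreddit"


def should_analyze_wiki(subscribers):
    """The wiki is skipped for subreddits with fewer than 10000 subscribers."""
    return subscribers >= WIKI_MIN_SUBSCRIBERS


checks = {
    "subredditoftheday": is_excluded("subredditoftheday"),
    "AskReddit": is_excluded("AskReddit"),
    "python": is_excluded("python"),
}
```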

Database: v1

Date: November 2, 2017

Subreddits: 69600 Relations: 301565

FAQ:

Q: Why would you use MongoDB for storing a graph?
A: Because the crawler runs 24/7 on a Heroku worker, and the MongoDB add-on has a lot of space compared to Heroku Postgres. It's also free, which is perfect for my budget.

Q: Will you push new dumps of this graph in the future?
A: I probably will, as long as I keep tinkering with the crawler to add new information.
