axe

What this tool does

axe splits the Reddit JSON data set available from PushShift into smaller JSON files.

Output is written to disk in parallel to cut down on processing time.

At this time, the data can be split by the following keys:

  • Subreddit: --split-on subreddit
  • Author: --split-on author
  • Day of month: --split-on day
  • Month: --split-on month
  • Day of year: --split-on day-of-year
  • Day of week: --split-on day-of-week
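
The four date-based keys can all be derived from a single Unix timestamp. The sketch below is one possible way to do that with only the standard library; it is not taken from axe's source. It assumes each record carries a timestamp in seconds (such as the `created_utc` field found in PushShift dumps) and that dates are interpreted in UTC. The names `DateKeys`, `civil_from_days` and `date_keys` are hypothetical, as is the choice of 0 for Sunday.

```rust
/// Calendar-derived split keys for one timestamp (hypothetical type, UTC assumed).
#[derive(Debug, PartialEq)]
pub struct DateKeys {
    pub month: u32,       // 1..=12
    pub day: u32,         // day of month, 1..=31
    pub day_of_year: u32, // 1..=366
    pub day_of_week: u32, // 0 = Sunday .. 6 = Saturday (assumed convention)
}

/// Cumulative days before each month in a non-leap year.
const DAYS_BEFORE_MONTH: [u32; 12] = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334];

/// Converts days since 1970-01-01 into (year, month, day) in the proleptic
/// Gregorian calendar, using the well-known era-based "civil from days" method.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year eras
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, counted from March 1
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    // January and February belong to the following calendar year.
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year, month, day)
}

/// Derives all date-based split keys from a Unix timestamp in seconds.
pub fn date_keys(created_utc: i64) -> DateKeys {
    let days = created_utc.div_euclid(86_400);
    let (year, month, day) = civil_from_days(days);
    let leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    let day_of_year =
        DAYS_BEFORE_MONTH[(month - 1) as usize] + day + u32::from(leap && month > 2);
    // 1970-01-01 was a Thursday, which is 4 when Sunday is 0.
    let day_of_week = ((days.rem_euclid(7) + 4) % 7) as u32;
    DateKeys { month, day, day_of_year, day_of_week }
}
```

For example, `date_keys(1_600_000_000)` corresponds to 2020-09-13 UTC, a Sunday, so it yields month 9, day 13, day of year 257 and day of week 0.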

Splitting creates one JSON file for each unique value of the key. For example, splitting on subreddit produces one JSON file per subreddit.
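
The grouping step itself can be pictured as a map from key value to the records that share it. The following is a minimal sketch under stated assumptions, not axe's actual implementation: it assumes the input holds one JSON object per line and pulls the key out with a deliberately naive string search so that it needs no external crates. The helpers `string_field` and `group_by_field` are hypothetical names; a real tool would more likely use a proper JSON parser.

```rust
use std::collections::HashMap;

/// Naively extracts a string-valued field such as `"subreddit":"rust"` from one
/// JSON line. Assumption: the value contains no escaped quotes and the field
/// name does not also appear inside another value.
fn string_field<'a>(line: &'a str, field: &str) -> Option<&'a str> {
    let needle = format!("\"{}\":", field);
    let start = line.find(needle.as_str())? + needle.len();
    let rest = line[start..].trim_start().strip_prefix('"')?;
    rest.find('"').map(|end| &rest[..end])
}

/// Groups lines by the value of `field`; lines without that field are skipped.
fn group_by_field<'a, I>(lines: I, field: &str) -> HashMap<String, Vec<&'a str>>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut groups: HashMap<String, Vec<&'a str>> = HashMap::new();
    for line in lines {
        if let Some(key) = string_field(line, field) {
            groups.entry(key.to_string()).or_default().push(line);
        }
    }
    groups
}
```

Calling `group_by_field(input.lines(), "subreddit")` returns one entry per subreddit, and each entry would then become one output file.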

Example Usage
  • Build the code
~/dev/rust/axe  (master) 
 abhijat $ cargo build --release
  • Run the code
~/dev/rust/axe  (master) 
 abhijat $ cargo run -- --input-path ~/Downloads/R --output-prefix ~/tmp/data-by-sub --split-on subreddit
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/axe --input-path /home/abhijat/Downloads/R --output-prefix /home/abhijat/tmp/data-by-sub --split-on subreddit`
...

Once the run completes, the output files are in ~/tmp/data-by-sub.

Help
~/dev/rust/axe  (master) 
 abhijat $ ./target/release/axe --help
axe 0.1.0
A utility to split a reddit dataset into individual JSON files

USAGE:
    axe [OPTIONS] --input-path <input-path> --output-prefix <output-prefix> --split-on <split-on>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -i, --input-path <input-path>          The path to the data set
    -m, --max-size <max-size>              The maximum size the hashmap will grow to before it is written to disk
                                           [default: 150000]
    -o, --output-prefix <output-prefix>    The parent directory where output JSON files will be written
    -s, --split-on <split-on>              The attribute to split the data set on

Sample runtime (4 GB dataset with around 3 million entries)
~/dev/rust/axe  (master) 
 abhijat $ time ./target/release/axe -s subreddit -i ~/Downloads/R -o ~/tmp/subr -m 600000
dumping 600000 entries to disk
dump finished
dumping 600000 entries to disk
dump finished
dumping 600000 entries to disk
dump finished
dumping 600000 entries to disk
dump finished
dumping 600000 entries to disk
dump finished

real	0m24.009s
user	0m24.364s
sys	0m16.685s
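
The repeated "dumping 600000 entries to disk" lines reflect the --max-size setting: records accumulate in an in-memory hashmap and are written out whenever the limit is reached. The sketch below shows one way such a bounded buffer with parallel writes could look. It is an illustration, not axe's code, and rests on several assumptions: the limit counts buffered entries rather than distinct keys, each key is appended to a file named `<key>.json` under the output directory, and the work is divided across scoped threads from the standard library. `Splitter`, `push` and `flush` are hypothetical names.

```rust
use std::collections::HashMap;
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;
use std::thread;

/// Buffers lines per key and dumps them to disk once `max_size` entries are held.
pub struct Splitter<'p> {
    out_dir: &'p Path,
    max_size: usize,
    buffered: usize,
    groups: HashMap<String, Vec<String>>,
}

impl<'p> Splitter<'p> {
    pub fn new(out_dir: &'p Path, max_size: usize) -> Self {
        Splitter { out_dir, max_size, buffered: 0, groups: HashMap::new() }
    }

    /// Adds one record under `key`, flushing when the buffer is full.
    pub fn push(&mut self, key: &str, line: &str) -> io::Result<()> {
        self.groups.entry(key.to_string()).or_default().push(line.to_string());
        self.buffered += 1;
        if self.buffered >= self.max_size {
            self.flush()?;
        }
        Ok(())
    }

    /// Writes every buffered group to its own file, splitting keys across threads.
    pub fn flush(&mut self) -> io::Result<()> {
        // Messages modeled on the sample run above.
        println!("dumping {} entries to disk", self.buffered);
        fs::create_dir_all(self.out_dir)?;
        let groups: Vec<(String, Vec<String>)> = self.groups.drain().collect();
        self.buffered = 0;

        let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
        let chunk_len = ((groups.len() + workers - 1) / workers).max(1);
        let out_dir = self.out_dir;

        let result = thread::scope(|scope| {
            let handles: Vec<_> = groups
                .chunks(chunk_len)
                .map(|chunk| {
                    scope.spawn(move || -> io::Result<()> {
                        for (key, lines) in chunk {
                            // Assumed naming scheme; keys must be valid file names.
                            let path = out_dir.join(format!("{key}.json"));
                            let mut file =
                                OpenOptions::new().create(true).append(true).open(path)?;
                            for line in lines {
                                writeln!(file, "{line}")?;
                            }
                        }
                        Ok(())
                    })
                })
                .collect();
            // Each key lives in exactly one chunk, so no two threads share a file.
            handles
                .into_iter()
                .try_for_each(|handle| handle.join().expect("writer thread panicked"))
        });
        println!("dump finished");
        result
    }
}
```

A caller would create the splitter with `Splitter::new(out_dir, 600_000)`, call `push` for every record, and call `flush` once more at the end to write whatever remains in the buffer. A larger limit means fewer dumps at the cost of more memory.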
