
rhuffle


rhuffle is a random shuffler for large files with many lines, including files that exceed available RAM.

rhuffle supports:

  • shuffling huge files that do not fit in memory
  • keeping header lines out of the shuffle (e.g. CSV/TSV headers)
  • multiple input files and flexible input formats
  • fast execution (see the benchmark results)


Installation

See lib.rs.

Usage

USAGE:
    rhuffle [OPTIONS]

FLAGS:
        --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -b, --buf <NUMBER>
            Sets buffer size which is smaller than available RAM with bytes (default: 4294967296).

        --dst <PATH>
            Sets destination file path. If not set, destination sets to stdout. (default: None)

        --feed <LF|LF_CRLF>                        Sets acceptable line feed as EOL (default: LF_CRLF).
    -h, --head <NUMBER>
            Sets first `n` lines without shuffling (default: 0). For multiple input sources, take README a look.

        --log <off|error|warn|info|debug|trace>    Sets log level. (default: off)
        --src <[PATH]>
            Sets source file paths (space separated). If not set, source sets to stdin. (default: None)

--head n Option

  • With multiple input sources, the first n lines of the first input source are forwarded to the output without shuffling.
  • For the second and later input sources, the first n lines of each source are skipped.
  • An example is shown below:

in1.txt

head1-1
head2-1
line1-1
line2-1

in2.txt

head1-2
head2-2
line1-2
line2-2

$ rhuffle --src in1.txt in2.txt --dst out.txt --head 2

out.txt

head1-1 // L1-L2: fixed
head2-1 
line2-1 // L3-L6: shuffled globally
line1-2
line2-2
line1-1
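
The --head behavior in the example above can be sketched in a few lines of Rust. This is only an in-memory illustration of the documented semantics, not rhuffle's actual implementation: the function name shuffle_with_head, the seed parameter, and the small xorshift generator are all assumptions introduced for this sketch, and the real tool works on files that may not fit in RAM.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// Illustrative only; this is an assumption, not what rhuffle uses.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical helper: keeps the first `head` lines of the first source,
/// drops the first `head` lines of every later source, and shuffles
/// all remaining lines together.
fn shuffle_with_head(sources: &[Vec<&str>], head: usize, seed: u64) -> Vec<String> {
    let mut fixed: Vec<String> = Vec::new();
    let mut body: Vec<String> = Vec::new();

    for (i, lines) in sources.iter().enumerate() {
        let split = head.min(lines.len());
        if i == 0 {
            // Only the first source contributes its head lines to the output.
            fixed.extend(lines[..split].iter().map(|s| s.to_string()));
        }
        body.extend(lines[split..].iter().map(|s| s.to_string()));
    }

    // Fisher-Yates shuffle over the pooled body lines.
    // The modulo introduces a slight bias, which is acceptable for a sketch.
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    for i in (1..body.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        body.swap(i, j);
    }

    fixed.extend(body);
    fixed
}

fn main() {
    let in1 = vec!["head1-1", "head2-1", "line1-1", "line2-1"];
    let in2 = vec!["head1-2", "head2-2", "line1-2", "line2-2"];
    // Equivalent in spirit to: rhuffle --src in1.txt in2.txt --head 2
    let out = shuffle_with_head(&[in1, in2], 2, 42);
    for line in &out {
        println!("{line}");
    }
}
```

The first two output lines are always head1-1 and head2-1; the remaining four lines appear in an order that depends on the seed.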

--feed Option

  • LF_CRLF(default): accepts LF or CRLF as newline
  • LF: accepts only LF as newline
  • No option for CR
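
As a rough illustration of the two modes, the sketch below splits a buffer into lines under each setting. The enum Feed and the function split_lines are hypothetical names, and the treatment of a stray CR in LF mode (kept as part of the line content) is an assumption; rhuffle's own handling may differ.

```rust
/// Hypothetical mirror of the two documented --feed values.
#[derive(Clone, Copy)]
enum Feed {
    Lf,
    LfCrlf,
}

/// Splits `data` into lines according to `feed`.
/// A lone CR is never treated as a line break in either mode.
fn split_lines(data: &str, feed: Feed) -> Vec<&str> {
    let mut lines: Vec<&str> = data.split('\n').collect();
    // A trailing newline (or empty input) leaves one empty final piece; drop it.
    if lines.last() == Some(&"") {
        lines.pop();
    }
    match feed {
        // LF only: any CR before the LF stays part of the line (assumption).
        Feed::Lf => lines,
        // LF or CRLF: strip one trailing CR so both endings give the same line.
        Feed::LfCrlf => lines
            .into_iter()
            .map(|l| l.strip_suffix('\r').unwrap_or(l))
            .collect(),
    }
}

fn main() {
    let data = "a\r\nb\nc\r\n";
    println!("{:?}", split_lines(data, Feed::LfCrlf)); // ["a", "b", "c"]
    println!("{:?}", split_lines(data, Feed::Lf)); // ["a\r", "b", "c\r"]
}
```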

Benchmarks

The results below focus on execution time when memory is limited. Two datasets are used for testing.

Three tools are compared.

  • GNU shuf
    • command: shuf {src} -o {dst}
  • terashuf
    • command: terashuf < {src} > {dst}
  • rhuffle
    • command: rhuffle --src {src} --dst {dst}

Benchmarks were run on a 2017 MacBook Pro (Core i7 3.1GHz, 16GB RAM). Execution time was measured with the time command.

Kaggle competition dataset

5.3GB size, 55423856 lines

Software   real    user    sys
GNU shuf   0m59s   0m34s   0m14s
terashuf   5m06s   4m43s   0m14s
rhuffle    1m56s   1m06s   0m40s

Custom dataset

9.0GB size, 21550072 lines

Software   real    user    sys
GNU shuf   x       x       x
terashuf   8m12s   7m16s   0m31s
rhuffle    1m47s   0m39s   0m51s

GNU shuf could not be measured on this dataset because it was too slow (marked x in the table).
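
To compare the real times in the tables above, the durations can be converted to seconds and divided. The sketch below does that arithmetic; parse_secs and ratio are hypothetical helpers written for this illustration, and the input strings are copied from the tables. On the custom dataset, terashuf's real time is about 4.6 times rhuffle's (492 s vs. 107 s). On the Kaggle dataset it is about 2.6 times (306 s vs. 116 s), while GNU shuf finishes in roughly half of rhuffle's time (59 s vs. 116 s).

```rust
/// Parses a duration in the `XmYYs` form used in the tables (e.g. "8m12s")
/// into whole seconds. Returns None for anything else, such as "x".
fn parse_secs(s: &str) -> Option<u64> {
    let (mins, rest) = s.split_once('m')?;
    let secs = rest.strip_suffix('s')?;
    Some(mins.parse::<u64>().ok()? * 60 + secs.parse::<u64>().ok()?)
}

/// Ratio of `other` to `base`, both given as table strings.
fn ratio(other: &str, base: &str) -> Option<f64> {
    Some(parse_secs(other)? as f64 / parse_secs(base)? as f64)
}

fn main() {
    // (dataset, tool, real time of tool, real time of rhuffle)
    let rows = [
        ("Kaggle", "GNU shuf", "0m59s", "1m56s"),
        ("Kaggle", "terashuf", "5m06s", "1m56s"),
        ("Custom", "GNU shuf", "x", "1m47s"),
        ("Custom", "terashuf", "8m12s", "1m47s"),
    ];
    for (dataset, tool, real, rhuffle_real) in rows {
        match ratio(real, rhuffle_real) {
            Some(r) => println!("{dataset}: {tool} / rhuffle = {r:.2}"),
            None => println!("{dataset}: {tool} not measured"),
        }
    }
}
```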
