sample


Produce a sample of lines from files. The sample size is either fixed or proportional to the size of the file. Additionally, the header and footer can be included in the sample.

Red tape

  • no dependencies other than a POSIX system and a C99 compiler.
  • licensed under BSD3c

Features

  • proportional sampling of streams and files
  • header and footer can be included in the sample
  • reservoir sampling (fixed sample size) of streams and files
  • stable reservoir sampling (i.e. the order is preserved)
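As a rough illustration of what stable reservoir sampling means, here is a minimal sketch that picks k line numbers out of n with the classic "Algorithm R" and then sorts them so the sample keeps the input order. It is not taken from sample's source: the names rng_next, rng_below and reservoir_indices are hypothetical, and a small xorshift generator stands in for the PCG family so the block stays self-contained.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for PCG: a tiny xorshift64 generator. */
static uint64_t rng_state = 0x9E3779B97F4A7C15ULL;

static uint64_t rng_next(void)
{
	rng_state ^= rng_state << 13;
	rng_state ^= rng_state >> 7;
	rng_state ^= rng_state << 17;
	return rng_state;
}

/* Uniform integer in [0, bound); rejection avoids modulo bias. */
static uint64_t rng_below(uint64_t bound)
{
	assert(bound > 0);
	uint64_t limit = UINT64_MAX - (UINT64_MAX % bound);
	uint64_t r;

	do {
		r = rng_next();
	} while (r >= limit);
	return r % bound;
}

/* Ascending comparison of two size_t values for qsort. */
static int cmp_size(const void *a, const void *b)
{
	size_t x = *(const size_t *)a;
	size_t y = *(const size_t *)b;

	return (x > y) - (x < y);
}

/* Pick up to k of the line numbers 0..n-1, each with equal probability
 * (Algorithm R), then sort the picks so the input order is preserved.
 * Returns how many line numbers were stored in out. */
static size_t reservoir_indices(size_t n, size_t k, size_t *out)
{
	size_t filled = 0;

	for (size_t i = 0; i < n; i++) {
		if (filled < k) {
			/* The first k lines fill the reservoir. */
			out[filled++] = i;
		} else {
			/* Line i replaces a random slot with probability k/(i+1). */
			size_t j = (size_t)rng_below((uint64_t)i + 1);

			if (j < k)
				out[j] = i;
		}
	}
	qsort(out, filled, sizeof *out, cmp_size);
	return filled;
}
```

Calling reservoir_indices(305089, 10, out) fills out with 10 distinct line numbers in ascending order. A streaming tool would not know n in advance; the loop body only ever looks at the current line, so the same logic applies when lines arrive one at a time.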

Motivation

GNU coreutils' shuf -n is practically ubiquitous and, in principle, solves the problem at hand. However, shuf buffers all input and is therefore useless for files that don't fit in memory.

So, looking for alternatives, one may come across paulgb's subsample or earino's fast_sample. They usually do the trick and everyone seems to agree (judging by GitHub stars). However, both tools have shortcomings: first, they try to make sense of the line data semantically, and second, they are slow!

The first issue is such a major problem that their bug trackers are full of reports. subsample needs lines to be UTF-8 strings, and fast_sample wants CSV files whose correctness is checked along the way. This project's tool, sample, on the other hand, does not care about a line's content; all it needs is the line break at the end.

The speed issue is addressed by

  • using the most appropriate programming language for the problem
  • using radix sort
  • using the PCG family to obtain randomness
  • oversampling

Examples

To get 10 random words from the words file:

$ sample -n 10 -H 0 /usr/share/dict/words
...
benzopyrene
calamondins
cephalothorax
copulate
garbology's
Kewadin
Peter's
reassembly
Vienna's
Wagnerism's
...

The -H 0 option suppresses the header output, which defaults to 5 lines.

For proportional sampling use -r|--rate:

$ wc -l /usr/share/dict/words
305089
$ sample -r 1% /usr/share/dict/words | wc -l
3080

which is close to the expected 1% (about 3051 lines), bearing in mind that by default the header and footer of the file are printed as well.
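To see why a count such as 3080 is plausible, here is a minimal sketch of proportional sampling under the assumption that every line is kept independently with probability equal to the rate. Whether sample works exactly this way is not stated here; the names rng_unit, keep_line and count_kept are hypothetical, and a small xorshift generator stands in for the PCG family.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PCG: a tiny xorshift64 generator. */
static uint64_t rng_state = 0x9E3779B97F4A7C15ULL;

static uint64_t rng_next(void)
{
	rng_state ^= rng_state << 13;
	rng_state ^= rng_state >> 7;
	rng_state ^= rng_state << 17;
	return rng_state;
}

/* Uniform double in [0, 1), built from the top 53 bits. */
static double rng_unit(void)
{
	return (double)(rng_next() >> 11) * 0x1.0p-53;
}

/* Decide for one line whether it goes into the sample. */
static int keep_line(double rate)
{
	assert(rate >= 0.0 && rate <= 1.0);
	return rng_unit() < rate;
}

/* Count how many of n lines one pass with the given rate would keep. */
static size_t count_kept(size_t n, double rate)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++)
		kept += (size_t)keep_line(rate);
	return kept;
}
```

Under this assumption count_kept(305089, 0.01) varies from run to run around 3051 with a standard deviation of roughly 55 lines, so a result a few dozen lines off, plus the header and footer lines, is unsurprising.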

Sampling with a rate of 0 prints only the header and footer, replacing awkward scripts that combine multios, head, and tail to produce the same result.

$ sample -r 0 /usr/share/dict/words
A
AA
AAA
Aachen
aah
...
Zyuganov
Zyuganov's
zyzzyva
zyzzyvas
ZZZ

Similar projects

In no particular order and without any claim to completeness:

sample's Issues

Enhancement: Provide binaries

I'm struggling to build this; the README doesn't have any instructions on how to build it.

On macOS I tried:

./configure
make

but I get the following output, including a warning:

/Library/Developer/CommandLineTools/usr/bin/make  all-am
make[2]: Nothing to be done for `all-am'.
Making all in src
/Library/Developer/CommandLineTools/usr/bin/make  all-am
  CC       sample.o
sample.c:98:21: warning: cast from function call of type 'unsigned int' to non-matching type 'double' [-Wbad-function-cast]
        double u = (double)runif32() / 0x1.p32;
                           ^~~~~~~~~
1 warning generated.
  CC       pcg_basic.o
  CC       version.o
  CCLD     sample
Making all in test
/Library/Developer/CommandLineTools/usr/bin/make  all-am
make[2]: Nothing to be done for `all-am'.
Making all in info
/Library/Developer/CommandLineTools/usr/bin/make  all-am
make[2]: Nothing to be done for `all-am'.
  GEN      version.mk
make[2]: Nothing to be done for `all-am'.
make[1]: Nothing to be done for `all-am'.

OK, it seems to have worked; which sample returns:

❯ which sample
/usr/local/bin/sample

Add license

There doesn't seem to be a license, which gets in the way of packaging :/
