phraug

A set of simple Python scripts for pre-processing large files: things like splitting and format conversion. The name phraug comes from a great book, Made to Stick, by Chip and Dan Heath.

See http://fastml.com/processing-large-files-line-by-line/ for the basic idea.

There's always at least one input file and usually one or more output files. An input file always stays unchanged.
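The line-by-line idea can be sketched roughly as follows. This is a minimal illustration, not code taken from the scripts; the names transform_lines and process_file are hypothetical.

```python
def transform_lines(lines, transform):
    """Yield transform(line) for each input line, one line at a time."""
    for line in lines:
        yield transform(line)


def process_file(input_path, output_path, transform):
    """Stream a file through transform without loading it into memory.

    The input file is opened read-only, so it stays unchanged.
    """
    with open(input_path, "r") as src, open(output_path, "w") as dst:
        dst.writelines(transform_lines(src, transform))
```

Because a file object is iterated lazily, only one line is held in memory at a time, whatever the size of the input.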

phraug2 is available. It offers improved handling of command line arguments. Check it out.

Format conversion

[...] means that the parameter is optional.

csv2libsvm.py <input file> <output file> [<label index = 0>] [<skip headers = 0>]

Convert CSV to the LIBSVM format. If there are no labels in the input file, specify label index = -1. If there are headers in the input file, specify skip headers = 1.
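A rough sketch of what such a conversion involves, assuming numeric features. The function names, the placeholder label 0 for unlabeled data, and the choice to omit zero values are assumptions made for illustration; csv2libsvm.py itself may behave differently.

```python
import csv


def csv_row_to_libsvm(row, label_index=0):
    """Convert one parsed CSV row (a list of strings) to a LIBSVM line."""
    if label_index == -1:
        label = "0"  # assumption: placeholder label when the file has none
        features = list(row)
    else:
        label = row[label_index]
        features = row[:label_index] + row[label_index + 1:]
    parts = [label]
    # LIBSVM feature indices conventionally start at 1.
    for i, value in enumerate(features, start=1):
        if value == "" or float(value) == 0.0:
            continue  # assumption: zeros are omitted (sparse format)
        parts.append("%d:%s" % (i, value))
    return " ".join(parts)


def csv_to_libsvm(lines, label_index=0, skip_headers=False):
    """Lazily convert an iterable of CSV lines to LIBSVM lines."""
    reader = csv.reader(lines)
    if skip_headers:
        next(reader, None)  # drop the header row
    for row in reader:
        yield csv_row_to_libsvm(row, label_index)
```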

pivotedcsv2libsvm.py <input file> <output file> [<skip headers = 0>]

Convert pivoted CSV (each line contains sample id, feature index and feature value) to the LIBSVM format. If there are headers in the input file, specify skip headers = 1.

csv2vw.py <input file> <output file> [<label index = 0>] [<skip headers = 0>]

Convert CSV to VW format. Arguments as above.

libsvm2csv.py <input file> <output file> <input file dimensionality>

Convert LIBSVM to CSV. You need to specify the dimensionality, that is, the number of columns (not counting the label).
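The dimensionality is needed because LIBSVM lines list only some feature indices, so the width of a dense row cannot be read off a single line. A minimal sketch, assuming 1-based indices and the label in the first output column (the function name is hypothetical):

```python
def libsvm_line_to_csv(line, dimensionality):
    """Expand one sparse LIBSVM line into a dense CSV line."""
    tokens = line.split()
    label = tokens[0]
    values = ["0"] * dimensionality  # features not listed default to 0
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        values[int(index) - 1] = value  # convert 1-based index to 0-based
    return ",".join([label] + values)
```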

libsvm2vw.py <input file> <output file>

Convert LIBSVM to VW.

tsv2csv.py <input file> <output file>

Convert tab-separated file to comma-separated file.

Column means, standard deviations and standardization

How do you standardize (or shift and scale) your data if it doesn't fit into memory? With these two scripts.

colstats.py <input file> <output file> [<label index>]

Compute column means and standard deviations from data in a CSV file. The label column can be skipped if present. Numeric data only. The first line of the output file contains the means, the second the standard deviations.

This script uses the f_is_headers module, which contains the is_headers() function. The function automatically detects whether the first line of the file contains headers.
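One way to compute means and standard deviations in a single streaming pass is Welford's online algorithm, sketched below. This is an illustration only: colstats.py may use a different formula, and the use of the sample standard deviation (dividing by n - 1) is an assumption.

```python
import math


def column_stats(rows):
    """Return (means, stds) for an iterable of equal-length lists of floats."""
    n = 0
    means = []
    m2 = []  # running sums of squared deviations from the mean
    for row in rows:
        if n == 0:
            means = [0.0] * len(row)
            m2 = [0.0] * len(row)
        n += 1
        for j, x in enumerate(row):
            delta = x - means[j]
            means[j] += delta / n
            m2[j] += delta * (x - means[j])
    if n < 2:
        stds = [0.0] * len(means)
    else:
        # Assumption: sample standard deviation (n - 1 in the denominator).
        stds = [math.sqrt(v / (n - 1)) for v in m2]
    return means, stds
```

Only the running statistics are kept in memory, so the input size does not matter.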

standardize.py <stats file> <input file> <output file> [<label index>]

Standardize (shift and scale to zero mean and unit standard deviation) data from a CSV file. Meant to be used with the column stats file produced by colstats.py. Numeric data only.
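The per-row arithmetic might look like the sketch below. It assumes a comma-separated stats file with means on the first line and standard deviations on the second; mapping constant columns (zero standard deviation) to 0.0 is also an assumption, and both function names are hypothetical.

```python
def parse_stats(lines):
    """Read means (first line) and standard deviations (second line)."""
    means = [float(v) for v in lines[0].strip().split(",")]
    stds = [float(v) for v in lines[1].strip().split(",")]
    return means, stds


def standardize_row(row, means, stds):
    """Shift and scale one row of floats, column by column."""
    return [
        (x - m) / s if s != 0 else 0.0  # assumption: constant columns become 0.0
        for x, m, s in zip(row, means, stds)
    ]
```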

Other operations

chunk.py <input file> <number of output files> [<random seed>]

Split a file randomly, line by line, into a number of smaller files. Might be useful for preparing cross-validation. Output files will have the base name suffixed with a chunk number; for example, data.csv will be chunked into data_0.csv, data_1.csv etc.
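A sketch of the naming scheme and of random line assignment follows. Picking the chunk uniformly at random for each line is an assumption about how the split works, and the function names are hypothetical.

```python
import os
import random


def chunk_names(input_path, n):
    """Build output names such as data_0.csv, data_1.csv from data.csv."""
    base, ext = os.path.splitext(input_path)
    return ["%s_%d%s" % (base, i, ext) for i in range(n)]


def assign_chunks(lines, n, seed=None):
    """Yield (chunk index, line) pairs, choosing a chunk at random per line."""
    rng = random.Random(seed)  # the same seed reproduces the same assignment
    for line in lines:
        yield rng.randrange(n), line
```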

count.py <input file>

Count lines in a file. On Unix you can do the same with wc -l.

delete_cols.py <input file> <output file> <indices of columns to delete>

Example: delete_cols.py train.csv train_del.csv 0 2 3

Delete some columns from a CSV file. Indices start at 0. Separate them with whitespace.
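For a single parsed row, deleting columns 0, 2 and 3 amounts to the following. This is a minimal sketch with a hypothetical function name.

```python
def delete_columns(row, indices):
    """Return the row without the columns at the given 0-based indices."""
    drop = set(indices)
    return [value for i, value in enumerate(row) if i not in drop]
```

With indices 0 2 3, a five-column row keeps only its second and fifth values.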

sample.py <input file> <output file> [<P = 0.5>]

Sample lines from an input file with probability P. Similar to split.py, but there's only one output file. Useful for sampling large datasets.

shuffle.py <input file> <output file> [<max. lines in memory = 25000>] [<random seed>]

Shuffle (randomize the order of) lines in a [big] file. Similar to Unix shuf. Useful for files that don't fit in memory. For fastest operation, set max. lines in memory as high as possible; this results in fewer passes over the input file.
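The trade-off between memory and passes can be illustrated with the sketch below. It is one possible multi-pass scheme, not necessarily the one shuffle.py implements; the function name and the open_lines callable are hypothetical.

```python
import random


def shuffle_in_passes(open_lines, max_lines=25000, seed=None):
    """Yield the input lines in random order, buffering at most max_lines.

    open_lines is a zero-argument callable returning a fresh iterator
    over the input lines (for example, one that reopens the file).
    """
    n = sum(1 for _ in open_lines())  # initial pass: count the lines
    order = list(range(n))            # note: n integers are held in memory
    random.Random(seed).shuffle(order)
    for start in range(0, n, max_lines):
        wanted = order[start:start + max_lines]
        wanted_set = set(wanted)
        buffer = {}
        # One pass over the input per block of max_lines output lines.
        for i, line in enumerate(open_lines()):
            if i in wanted_set:
                buffer[i] = line
        for i in wanted:
            yield buffer[i]
```

In this scheme the number of passes is roughly the line count divided by max_lines, which is why a larger buffer is faster.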

split.py <input file> <output file 1> <output file 2> [<P = 0.9>] [<random seed>]

Randomly split a file into two. Default P (probability of writing a line to the first file) is 0.9. You can specify any string as a seed for the random number generator.
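The per-line decision can be sketched as below, using a string seed for reproducibility. The function name is hypothetical, and the actual script may draw its random numbers differently.

```python
import random


def route_lines(lines, p=0.9, seed=None):
    """Yield (0, line) with probability p, otherwise (1, line)."""
    rng = random.Random(seed)  # any string works as a seed
    for line in lines:
        target = 0 if rng.random() < p else 1
        yield target, line
```

Lines tagged 0 would go to the first output file and lines tagged 1 to the second.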

subset.py <input file> <output file> [<offset = 0>] [<lines = 100>]

Save a subset of lines from an input file to an output file. Start at the given offset (default 0) and save the given number of lines (default 100).
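In Python this kind of slicing over a stream can be expressed with itertools.islice. A minimal sketch, with a hypothetical function name:

```python
from itertools import islice


def subset(lines, offset=0, count=100):
    """Lazily return count lines, skipping the first offset lines."""
    return islice(lines, offset, offset + count)
```

Reading stops once the requested lines have been produced, so the rest of the input is never touched.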

unshuffle.py <input file> <output file> <max. lines in memory> <random seed>

Unshuffle a previously shuffled file (or any file) to the original order. Syntax is the same as for shuffle.py, but the seed is mandatory, and because the arguments are positional, max. lines in memory is mandatory too.
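Unshuffling works because a seeded random number generator reproduces the same permutation, which can then be inverted. The in-memory sketch below shows the principle only; the function names are hypothetical and the real scripts work in passes over files.

```python
import random


def permutation(n, seed):
    """Return the seeded permutation of range(n)."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


def shuffle(lines, seed):
    """Output position pos receives the line at index order[pos]."""
    order = permutation(len(lines), seed)
    return [lines[i] for i in order]


def unshuffle(shuffled, seed):
    """Invert shuffle() by regenerating the same permutation from the seed."""
    order = permutation(len(shuffled), seed)
    original = [None] * len(shuffled)
    for pos, source_index in enumerate(order):
        original[source_index] = shuffled[pos]
    return original
```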
