
xsv

xsv is a command line program for indexing, slicing, analyzing, splitting and joining CSV files. Commands should be simple, fast and composable:

  1. Simple tasks should be easy.
  2. Performance trade-offs should be exposed in the CLI.
  3. Composition should not come at the expense of performance.

This README contains information on how to install xsv, in addition to a quick tour of several commands.


Dual-licensed under MIT or the UNLICENSE.

Available commands

  • cat - Concatenate CSV files by row or by column.
  • count - Count the rows in a CSV file. (Instantaneous with an index.)
  • fixlengths - Force a CSV file to have same-length records by either padding or truncating them.
  • flatten - A flattened view of CSV records. Useful for viewing one record at a time. e.g., xsv slice -i 5 data.csv | xsv flatten.
  • fmt - Reformat CSV data with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)
  • frequency - Build frequency tables of each column in CSV data. (Uses parallelism to go faster if an index is present.)
  • headers - Show the headers of CSV data. Or show the intersection of all headers between many CSV files.
  • index - Create an index for a CSV file. This is very quick and provides constant time indexing into the CSV file.
  • input - Read CSV data with exotic quoting/escaping rules.
  • join - Inner, outer and cross joins. Uses a simple hash index to make it fast.
  • partition - Partition CSV data based on a column value.
  • sample - Randomly draw rows from CSV data using reservoir sampling (i.e., use memory proportional to the size of the sample).
  • reverse - Reverse order of rows in CSV data.
  • search - Run a regex over CSV data. Applies the regex to each field individually and shows only matching rows.
  • select - Select or re-order columns from CSV data.
  • slice - Slice rows from any part of a CSV file. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
  • sort - Sort CSV data.
  • split - Split one CSV file into many CSV files, in chunks of N rows each.
  • stats - Show basic types and statistics of each column in the CSV file (mean, standard deviation, median, range, etc.).
  • table - Show aligned output of any CSV data using elastic tabstops.
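The memory guarantee on sample comes from reservoir sampling (Algorithm R). A minimal sketch of the technique in Python, purely illustrative (xsv itself is written in Rust):

```python
import random

def reservoir_sample(rows, k):
    """Uniformly sample k items from an iterator of unknown length,
    using memory proportional to k (Algorithm R)."""
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)        # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # replace a slot with prob. k/(i+1)
            if j < k:
                reservoir[j] = row
    return reservoir
```

Because each incoming row displaces a reservoir slot with probability k/(i+1), every row ends up in the sample with equal probability, so no second pass (and no advance row count) is needed.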

A whirlwind tour

Let's say you're playing with some of the data from the Data Science Toolkit, which contains several CSV files. Maybe you're interested in the population counts of each city in the world. So grab the data and start examining it:

$ curl -LO https://burntsushi.net/stuff/worldcitiespop.csv
$ xsv headers worldcitiespop.csv
1   Country
2   City
3   AccentCity
4   Region
5   Population
6   Latitude
7   Longitude

The next thing you might want to do is get an overview of the kind of data that appears in each column. The stats command will do this for you:

$ xsv stats worldcitiespop.csv --everything | xsv table
field       type     min            max            min_length  max_length  mean          stddev         median     mode         cardinality
Country     Unicode  ad             zw             2           2                                                   cn           234
City        Unicode   bab el ahmar  Þykkvibaer     1           91                                                  san jose     2351892
AccentCity  Unicode   Bâb el Ahmar  ïn Bou Chella  1           91                                                  San Antonio  2375760
Region      Unicode  00             Z9             0           2                                        13         04           397
Population  Integer  7              31480498       0           8           47719.570634  302885.559204  10779                   28754
Latitude    Float    -54.933333     82.483333      1           12          27.188166     21.952614      32.497222  51.15        1038349
Longitude   Float    -179.983333    180            1           14          37.08886      63.22301       35.28      23.8         1167162

The xsv table command takes any CSV data and formats it into aligned columns using elastic tabstops. You'll notice that it even gets alignment right with respect to Unicode characters.

So, this command takes about 12 seconds to run on my machine, but we can speed it up by creating an index and re-running the command:

$ xsv index worldcitiespop.csv
$ xsv stats worldcitiespop.csv --everything | xsv table
...

Which cuts it down to about 8 seconds on my machine. (And creating the index takes less than 2 seconds.)

Notably, the same type of "statistics" command in another CSV command line toolkit takes about 2 minutes to produce similar statistics on the same data set.

Creating an index gives us more than just faster statistics gathering. It also makes slice operations extremely fast because only the sliced portion has to be parsed. For example, let's say you wanted to grab the last 10 records:

$ xsv count worldcitiespop.csv
3173958
$ xsv slice worldcitiespop.csv -s 3173948 | xsv table
Country  City               AccentCity         Region  Population  Latitude     Longitude
zw       zibalonkwe         Zibalonkwe         06                  -19.8333333  27.4666667
zw       zibunkululu        Zibunkululu        06                  -19.6666667  27.6166667
zw       ziga               Ziga               06                  -19.2166667  27.4833333
zw       zikamanas village  Zikamanas Village  00                  -18.2166667  27.95
zw       zimbabwe           Zimbabwe           07                  -20.2666667  30.9166667
zw       zimre park         Zimre Park         04                  -17.8661111  31.2136111
zw       ziyakamanas        Ziyakamanas        00                  -18.2166667  27.95
zw       zizalisari         Zizalisari         04                  -17.7588889  31.0105556
zw       zuzumba            Zuzumba            06                  -20.0333333  27.9333333
zw       zvishavane         Zvishavane         07      79876       -20.3333333  30.0333333

These commands are instantaneous because they run in time and memory proportional to the size of the slice (which means they will scale to arbitrarily large CSV data).
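Conceptually, the index is just a table of byte offsets, one per record, so a slice can seek straight to its first row instead of parsing everything before it. A naive Python sketch of the idea (hypothetical helper names; xsv's real index lives in a separate file and correctly handles quoted fields containing newlines, which this line-based version does not):

```python
def build_index(path):
    """Record the byte offset at which each record starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():   # EOF
                break
            offsets.append(pos)
    return offsets

def slice_rows(path, offsets, start, count):
    """Seek directly to the first wanted record; rows before it are never read."""
    with open(path, "rb") as f:
        f.seek(offsets[start])
        return [f.readline().rstrip(b"\r\n") for _ in range(count)]
```

With the offsets in hand, slicing the last 10 of 3 million rows costs 10 reads, not 3 million.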

Switching gears a little bit, you might not always want to see every column in the CSV data. In this case, maybe we only care about the country, city and population. So let's take a look at 10 random rows:

$ xsv select Country,AccentCity,Population worldcitiespop.csv \
  | xsv sample 10 \
  | xsv table
Country  AccentCity       Population
cn       Guankoushang
za       Klipdrift
ma       Ouled Hammou
fr       Les Gravues
la       Ban Phadèng
de       Lüdenscheid      80045
qa       Umm ash Shubrum
bd       Panditgoan
us       Appleton
ua       Lukashenkivske

Whoops! It seems some cities don't have population counts. How pervasive is that?

$ xsv frequency worldcitiespop.csv --limit 5
field,value,count
Country,cn,238985
Country,ru,215938
Country,id,176546
Country,us,141989
Country,ir,123872
City,san jose,328
City,san antonio,320
City,santa rosa,296
City,santa cruz,282
City,san juan,255
AccentCity,San Antonio,317
AccentCity,Santa Rosa,296
AccentCity,Santa Cruz,281
AccentCity,San Juan,254
AccentCity,San Miguel,254
Region,04,159916
Region,02,142158
Region,07,126867
Region,03,122161
Region,05,118441
Population,(NULL),3125978
Population,2310,12
Population,3097,11
Population,983,11
Population,2684,11
Latitude,51.15,777
Latitude,51.083333,772
Latitude,50.933333,769
Latitude,51.116667,769
Latitude,51.133333,767
Longitude,23.8,484
Longitude,23.2,477
Longitude,23.05,476
Longitude,25.3,474
Longitude,23.1,459

(The xsv frequency command builds a frequency table for each column in the CSV data. This one only took 5 seconds.)
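Conceptually, a frequency table is one hash map of counts per column. A rough Python sketch of the idea, not xsv's actual implementation (which, among other things, parallelizes over the index):

```python
import csv
from collections import Counter
from io import StringIO

def frequency_tables(reader):
    """One Counter per column, keyed by header name; empty fields
    are reported as (NULL), as in the output above."""
    headers = next(reader)
    tables = {h: Counter() for h in headers}
    for row in reader:
        for h, value in zip(headers, row):
            tables[h][value or "(NULL)"] += 1
    return tables

data = "Country,Population\ncn,\ncn,100\nus,\n"
tables = frequency_tables(csv.reader(StringIO(data)))
```

Printing `tables[field].most_common(5)` for each field would give output shaped like the `--limit 5` listing above.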

So it seems that most cities do not have a population count associated with them at all. No matter—we can adjust our previous command so that it only shows rows with a population count:

$ xsv search -s Population '[0-9]' worldcitiespop.csv \
  | xsv select Country,AccentCity,Population \
  | xsv sample 10 \
  | xsv table
Country  AccentCity       Population
es       Barañáin         22264
es       Puerto Real      36946
at       Moosburg         4602
hu       Hejobaba         1949
ru       Polyarnyye Zori  15092
gr       Kandíla          1245
is       Ólafsvík         992
hu       Decs             4210
bg       Sliven           94252
gb       Leatherhead      43544

Erk. Which country is at? No clue, but the Data Science Toolkit has a CSV file called countrynames.csv. Let's save that random sample to a file, sample.csv, then grab countrynames.csv and do a join so we can see which countries these are:

$ curl -LO https://gist.githubusercontent.com/anonymous/063cb470e56e64e98cf1/raw/98e2589b801f6ca3ff900b01a87fbb7452eb35c7/countrynames.csv
$ xsv headers countrynames.csv
1   Abbrev
2   Country
$ xsv join --no-case  Country sample.csv Abbrev countrynames.csv | xsv table
Country  AccentCity       Population  Abbrev  Country
es       Barañáin         22264       ES      Spain
es       Puerto Real      36946       ES      Spain
at       Moosburg         4602        AT      Austria
hu       Hejobaba         1949        HU      Hungary
ru       Polyarnyye Zori  15092       RU      Russian Federation | Russia
gr       Kandíla          1245        GR      Greece
is       Ólafsvík         992         IS      Iceland
hu       Decs             4210        HU      Hungary
bg       Sliven           94252       BG      Bulgaria
gb       Leatherhead      43544       GB      Great Britain | UK | England | Scotland | Wales | Northern Ireland | United Kingdom

Whoops, now we have two columns called Country and an Abbrev column that we no longer need. This is easy to fix by re-ordering columns with the xsv select command:

$ xsv join --no-case  Country sample.csv Abbrev countrynames.csv \
  | xsv select 'Country[1],AccentCity,Population' \
  | xsv table
Country                                                                              AccentCity       Population
Spain                                                                                Barañáin         22264
Spain                                                                                Puerto Real      36946
Austria                                                                              Moosburg         4602
Hungary                                                                              Hejobaba         1949
Russian Federation | Russia                                                          Polyarnyye Zori  15092
Greece                                                                               Kandíla          1245
Iceland                                                                              Ólafsvík         992
Hungary                                                                              Decs             4210
Bulgaria                                                                             Sliven           94252
Great Britain | UK | England | Scotland | Wales | Northern Ireland | United Kingdom  Leatherhead      43544

Perhaps we can do this with the original CSV data? Indeed we can—because joins in xsv are fast.

$ xsv join --no-case Abbrev countrynames.csv Country worldcitiespop.csv \
  | xsv select '!Abbrev,Country[1]' \
  > worldcitiespop_countrynames.csv
$ xsv sample 10 worldcitiespop_countrynames.csv | xsv table
Country                      City                   AccentCity             Region  Population  Latitude    Longitude
Sri Lanka                    miriswatte             Miriswatte             36                  7.2333333   79.9
Romania                      livezile               Livezile               26      1985        44.512222   22.863333
Indonesia                    tawainalu              Tawainalu              22                  -4.0225     121.9273
Russian Federation | Russia  otar                   Otar                   45                  56.975278   48.305278
France                       le breuil-bois robert  le Breuil-Bois Robert  A8                  48.945567   1.717026
France                       lissac                 Lissac                 B1                  45.103094   1.464927
Albania                      lumalasi               Lumalasi               46                  40.6586111  20.7363889
China                        motzushih              Motzushih              11                  27.65       111.966667
Russian Federation | Russia  svakino                Svakino                69                  55.60211    34.559785
Romania                      tirgu pancesti         Tirgu Pancesti         38                  46.216667   27.1

The !Abbrev,Country[1] syntax means, "remove the Abbrev column and remove the second occurrence of the Country column." Since we joined with countrynames.csv first, the first Country name (fully expanded) is now included in the CSV data.

This xsv join command takes about 7 seconds on my machine. The performance comes from constructing a very simple hash index of one of the two CSV files given. The join command does an inner join by default, but left, right and full outer joins (as well as cross joins) are also supported.
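The hash-index idea can be sketched in a few lines of Python (illustrative only; positional keys and naive case-folding stand in for xsv's actual column selection and --no-case handling):

```python
from collections import defaultdict

def inner_join(left_rows, left_key, right_rows, right_key):
    """Index one table by its join key in a hash map, then stream
    the other table against it: one pass over each input."""
    index = defaultdict(list)
    for row in right_rows:
        index[row[right_key].lower()].append(row)  # .lower() mimics --no-case
    joined = []
    for row in left_rows:
        for match in index.get(row[left_key].lower(), []):
            joined.append(row + match)             # concatenate the records
    return joined
```

Building the index is linear in one input and probing it is linear in the other, which is why the join stays fast even on the full worldcitiespop.csv.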

Installation

Binaries for Windows, Linux and macOS are available from GitHub.

If you're a macOS Homebrew user, then you can install xsv from homebrew-core:

$ brew install xsv

If you're a macOS MacPorts user, then you can install xsv from the official ports:

$ sudo port install xsv

If you're a Nix/NixOS user, you can install xsv from nixpkgs:

$ nix-env -i xsv

Alternatively, you can compile from source by installing Cargo (Rust's package manager) and building xsv with it:

$ cargo install xsv

Compiling from this repository also works similarly:

$ git clone https://github.com/BurntSushi/xsv
$ cd xsv
$ cargo build --release

Compilation will probably take a few minutes depending on your machine. The binary will end up in ./target/release/xsv.

Benchmarks

I've compiled some very rough benchmarks of various xsv commands.

Motivation

Here are several valid criticisms of this project:

  1. You shouldn't be working with CSV data because CSV is a terrible format.
  2. If your data is gigabytes in size, then CSV is the wrong storage type.
  3. Various SQL databases provide all of the operations available in xsv with more sophisticated indexing support. And the performance is a zillion times better.

I'm sure there are more criticisms, but the impetus for this project was a 40GB CSV file that was handed to me. I was tasked with figuring out the shape of the data inside of it and coming up with a way to integrate it into our existing system. It was then that I realized that every single CSV tool I knew about was woefully inadequate: they were just too slow or didn't provide enough flexibility. (Another project of mine comprised a few dozen CSV files. They were smaller than 40GB, but they were each supposed to represent the same kind of data, and yet they all had different and unintuitive column names. Useful CSV inspection tools were critical here, and they had to be reasonably fast.)

The key ingredients for helping me with my task were indexing, random sampling, searching, slicing and selecting columns. All of these things made dealing with 40GB of CSV data a bit more manageable (or dozens of CSV files).

Getting handed a large CSV file once was enough to launch me on this quest. From conversations I've had with others, CSV data files this large don't seem to be a rare event. Therefore, I believe there is room for a tool that has a hope of dealing with data that large.

Naming collision

This project is unrelated to another similar project with the same name: https://mj.ucw.cz/sw/xsv/

Contributors

adevore, alexargoai, alexcrichton, amilajack, astro, atouchet, bruceadams, burntsushi, cbuesser, chills42, dimagog, emk, josephfrazier, kbd, kper, kyegupov, ldcasillas-progreso, llogiq, lperry, lukaskalbertodt, mchesser, mdamien, michiel-de-muynck, mintyplanet, ngirard, tafia, timmmm, yomguithereal


xsv's Issues

csv join example

Hello,

Could you explain what role sample.csv plays in this example of yours?

$ xsv join --no-case Country sample.csv Abbrev countrynames.csv | xsv table

If I run xsv join --no-case Country worldcitiespop.csv Abbrev countrynames.csv | xsv table my shell runs out of memory and dies.

Here's some output of it processing the above:

load: 0.99  cmd: xsv 15196 [runnable] 2.71r 0.17u 0.05s 11% 102328k
load: 0.99  cmd: xsv 15196 [runnable] 6.80r 0.44u 0.12s 21% 243884k
load: 0.99  cmd: xsv 15195 [runnable] 14.83r 0.80u 0.70s 39% 4632k
load: 1.05  cmd: xsv 15196 [runnable] 46.44r 2.55u 0.77s 29% 659956k
zsh: done       xsv join --no-case Country worldcitiespop.csv Abbrev countrynames.csv | 
zsh: killed     xsv table

sample.csv is also mentioned here:

xsv join --no-case  Country sample.csv Abbrev countrynames.csv \
| xsv select 'Country[1],AccentCity,Population' \
| xsv table

Thanks!

Feature request: import table from Excel (xslx/xls)

I am a heavy csvkit user and am gradually transitioning to xsv. The one exception that often sends me back to csvkit is in2csv. A subcommand like import would be very nice for importing CSV from Excel (xlsx or xls).

How do I "unflatten" the data?

I expected something like

for i in /proc/*/status ; do cat "$i"; echo '#'; done | xsv unflatten | xsv sort -s Threads: | xsv select Name:,Threads:

Can I also override row separator to handle /proc/*/environ or find -print0?

How can one add quotes to output

Hi,
I'm fairly new to using xsv so may have missed something,
but I couldn't figure how I could force it to surround all values with double quotes when exporting data.

Is there an option I missed?

Numeric equality and comparisons

Currently, even when foo is detected as integer-valued, search -s foo 42 matches any value of foo that merely contains the digits 42, instead of matching the exact value 42, as one might expect. A workaround is search -s foo '^42$', or even search -s foo '^0*42$' in case of leading zeros in the file. It could also be useful to search for fields greater or less than a given value. Would it make sense to implement these operations?

Support for NULL values

First, thanks for the great tool!

I was wondering if it would be possible to add some rudimentary support for NULL values?

I want to use xsv to get the column types for a CSV file before loading it into a database. This way I would know the correct column types when creating the database table. While xsv is very fast at calculating file column statistics, my files often have int / float columns with NULL values (e.g. "-", ".'", "\N", "NA"), and xsv reports those as Unicode. It would be nice to have an option like --na-values, so that xsv could skip NULL values when calculating statistics, and could report whether a column contains at least one NULL value.

Compilation failure under rustc 1.0.0-beta.2

This could be me doing something wrong...

$ cargo build --release
   Compiling byteorder v0.3.7
   Compiling threadpool v0.1.4
   Compiling streaming-stats v0.1.23
   Compiling regex v0.1.28
   Compiling rustc-serialize v0.3.12
   Compiling libc v0.1.6
   Compiling log v0.3.1
/Users/luis.casillas/.cargo/registry/src/github.com-1ecc6299db9ec823/streaming-stats-0.1.23/src/lib.rs:1:1: 1:41 error: unstable feature
/Users/luis.casillas/.cargo/registry/src/github.com-1ecc6299db9ec823/streaming-stats-0.1.23/src/lib.rs:1 #![feature(collections, core, std_misc)]
                                                                                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
note: this feature may not be used in the beta release channel
error: aborting due to previous error
Build failed, waiting for other jobs to finish...
Could not compile `streaming-stats`.

$ git log | head -n 5
commit c19cf0bf43d001ee0555d943eb6703c058152877
Author: Andrew Gallant <[email protected]>
Date:   Thu Apr 16 17:50:25 2015 -0400

    crates.io merit badge

Support for tsv as output

With xsv sample -d $'\t', TSV (tab-separated values) input is read correctly, but the output is then written as CSV. In my case I want to keep the input format and sample it.

An option to control the output delimiter would be very useful.

Unhelpful error message: "No such file or directory (os error 2)"

In more than one context I've observed that when xsv is pointed a path that doesn't exist, it prints this error message:

No such file or directory (os error 2)

In the context of a big script that crunches through a few hundred files, that means that you can't tell which path was the problem.

Friendlier help facility and unrecognized command errors

When I issue the command xsv help, this happens:

% xsv help
Could not match 'help' with any of the allowed variants: ["Cat", "Count", "FixLengths", "Flatten", "Fmt", "Frequency", "Headers", "Index", "Join", "Sample", "Search", "Select", "Slice", "Sort", "Split", "Stats", "Table"]

Because xsv is structured around the xsv <command> [<args> ...] paradigm, I find myself running into this all the time, and I can't imagine I'm the only one who will.

I'm guessing it's low effort and high value to:

  1. Add an actual xsv help subcommand
  2. Dispatch xsv help <command> to the same code as xsv <command> --help (when <command> exists)
  3. Dispatch to the same code as xsv --help when the user issues xsv help or xsv help <command> for a non-existent command.

Feature request: Add sugar for `| tail -n+2`

I feel almost ashamed to ask for a feature that has such an easy workaround, but sometimes, I need to manipulate a CSV file, select rows and columns based on their headers, and output a table without headers. The --no-headers option is of no help because I rely on the headers of the input file. The workaround on Unix is | tail -n+2, but I believe that an option in xsv to implement this behaviour could make sense as well.

Is it worth considering adding this feature?

What is the recommended approach to align csv files?

Hi, I am the author of a CSV plugin for Vim and am considering using xsv, if available, to align CSV files nicely into tables. At first I thought xsv table would be all I need; however, it seems that xsv table just makes the fields larger than needed and does not treat header lines specially. So my question is: is there a recommended way to align CSV files so that each field is as large as needed but no larger?

add more data to benchmarks page

Hello,

Thank you for xsv and the desire to have one tool do well with a large amount of data.

I noticed you have this handy page:
https://github.com/BurntSushi/xsv/blob/master/BENCHMARKS.md

You may want to consider displaying data for less beefy systems as well, just so you can see if there's any performance differences between revisions of your code, etc.

Here's some timing data from a FreeBSD 10.1 DigitalOcean droplet with 1 GB of RAM:

xsv stats worldcitiespop.csv --everything  12.62s user 1.28s system 16% cpu 1:24.95 total
xsv table  0.00s user 0.00s system 0% cpu 1:24.95 total

Certainly not as noteworthy as your results, and I understand you wouldn't really want to do heavy data science on just a gig of RAM. Do you have any input on what would contribute to the length of time it took? Or do you think this is adequate?

Thanks!

Feature request: Case insensitive regexp search

Amongst the widely used options of grep and other regexp-based search programs is -i, --ignore-case, which ignores case distinctions in both the regexp pattern and the input. I think adding it to the search command would be a nice enhancement, since it implements a feature not easily achievable in other ways, and most people (I guess) are already familiar with it from grep and others (and might even expect its presence in such a context).

Partition into files based on columns?

Wow, xsv is cool!

I was exploring Pachyderm for large-scale data ingestion tasks, and one use-case popped up fairly often. This is basically a map/reduce implementation using an immutable, versioned cluster file system, and worker jobs running inside of Docker containers. You read files from /pfs/$INPUT_NAME/* and write them to /pfs/out.

Imagine an input file which looks like:

value,count
FL,11957
CA,11816
TX,10157
IL,5633
OH,5556
GA,4535
NY,4088
MI,4008
NJ,3890
CO,3690

I would love to be able to write something like:

xsv partition value /pfs/out

...and create a file named /pfs/out/FL containing 11957, a file named /pfs/out/CA containing 11816, and so on. It would also be OK if the files contained the partitioning column: FL,11957 and CA,11816. If there are more than two columns, they should all be included. All rows with the same value column should be in the appropriate output file.

Does xsv have support for anything like this? Would you like support for something like this? If not, no worries, it would be a trivial standalone tool using your csv library. But I thought I'd check whether you wanted it upstream first.

Add count of empty/non empty values in stats

xsv stats gives a lot of useful information, and I use it a lot.

I'm currently missing one thing that may not be too much of a hassle to add.
Would you consider adding the count of empty/non empty values for each field?

Building fails on macos

cargo build --release yields the following error on mac os

/Users/<username>/.cargo/registry/src/github.com-1ecc6299db9ec823/filetime-0.1.2/src/lib.rs:83:29: 83:57 error: unresolved import `std::os::macos::fs::MetadataExt`. Could not find `macos` in `std::os`
/Users/<username>/.cargo/registry/src/github.com-1ecc6299db9ec823/filetime-0.1.2/src/lib.rs:83                         use std::os::$i::fs::MetadataExt;

Cannot build on macos because of this issue. Any tips/hints would be greatly appreciated

Do `xsv stats` and `xsv slice` respect the `--no-headers` option?

The help message printed by xsv stats --help mentions the --no-headers option:

    -n, --no-headers       When set, the first row will NOT be interpreted
                           as column names. i.e., They will be included
                           in statistics.

And yet I'm seeing the tool use the first row as headers in cases where I specify --no-headers:

% head -n 1 2015-3-18_CA_5E63F293-8C7A-4737-A6CE-9E53DFB7DAF7.txt
209,0000000

% xsv stats --no-headers 2015-3-18_CA_5E63F293-8C7A-4737-A6CE-9E53DFB7DAF7.txt
field,type,min,max,min_length,max_length,mean,stddev
209,Integer,209,951,3,3,650.419291,216.581229
0000000,Integer,0,9999999,7,7,5714603.611216,2326579.121568

% xsv slice --no-headers -i 0 2015-3-18_CA_5E63F293-8C7A-4737-A6CE-9E53DFB7DAF7.txt
209,0000000

Contrast with xsv frequency, which behaves as I expect:

% xsv frequency --no-headers 2015-3-18_CA_5E63F293-8C7A-4737-A6CE-9E53DFB7DAF7.txt
field,value,count
1,310,1562992
1,714,1406572
1,916,1322587
1,818,1291127
1,760,1285504
1,408,1185066
1,805,1120246
1,619,1113284
1,510,1104539
1,415,1100438
2,0000000,22
2,9999999,20

[...]

Wrong results with frequency

Hi,

I've been playing with frequency a bit, but got odd results, and here is what I found.

Take this simple csv file

alpha,num
"a",1
"b",2
"b",1
"a",2

run xsv frequency -s alpha,num

result:
field,value,count
alpha,b,2
alpha,a,2
num,2,2
num,1,2

then run xsv frequency -s num,alpha

result:
field,value,count
num,a,2
num,b,2
alpha,1,2
alpha,2,2

I get the correct result only if I put the fields in the same order as the file header

search return sorted results?

Hello,

Does the sample command just return whatever it loads first?

Running xsv search -s Population '[0-9]' worldcitiespop.csv | xsv select Country,AccentCity,Population | xsv sample 10 | xsv table a few times yields different data each time. Is there a way to get sorted output consistently, or is that not really desired?

Thanks!

xsv search should be type aware (e.g., for comparing numeric values)

SELECT, JOIN and ORDER BY seem to be in already, but filtering rows (WHERE) and replacing groups of rows with statistics (GROUP BY with aggregates) seem not to be implemented.

I expect the interface to be like this:

xsv filter Population lt 1000 worldcitiespop.csv | xsv group Region | xsv select Region,Latitude/avg,Longitude/median | xsv table

Can xsv eventually replace q?

For fun: my most complicated q usage is this:

q 'SELECT COUNT(a.c3),SUM((a.c3 - b.c3)/1000/1000),SUM((a.c3 - b.c3)/1000/1000*(a.c3 - b.c3)/1000/1000) FROM l.txt a INNER JOIN l.txt b ON a.c2 = b.c2 WHERE a.c1 = "o" AND (b.c1 = "s" OR b.c1 = "i")'

Option to set a default delimiter

OK, can I just say first of all that I'm IN LOVE with this toolkit!? It is an absolute joy and it fills a gaping void for me! ;]

Anyway, I work with a lot of TSV files (tab-separated) and it's kind of annoying to have to type -d "\t" for every single command I run (but thank you for providing the option!). I'd love to be able to change the default separator from , to \t.

Unfortunately it's not easy to alias since the -d must come AFTER the command. I suppose I could do something like xsvtab() { xsv $1 -d "\t" ${@:2}; } (and maybe even alias xsv=xsvtab if I'm feeling really lazy - but then I lose the ability to override the delimiter if I actually do have a different type of file and I have to run it with \xsv to use the original command instead of the alias (specifying -d again with the above alias results in an Invalid arguments. error because now -d is specified twice)). That all feels a little clunky, although it does kind of work for my use case, it would just be slicker to be able to override the default.

Getting sum of a column

When I do xsv stats -s COLUMN --everything FILE, the metrics I can see are min, max, min_length, max_length, mean, stddev, median, mode, and cardinality. How can I get sum? Is it unimplemented or hiding somewhere?

BTW, thank you for the nice work!

Clarify support (or not) for character encodings other than UTF-8

The documentation in the README.md doesn't explain what is xsv's support or policy for character encodings. I think it really ought to.

Looking through the code for xsv and the csv crate, it looks like there isn't a consistent policy:

  1. Most of the code reads rows with the byte_records() function.
  2. xsv search, however, uses the records() function, which interprets the data as UTF-8.
  3. There are a few places where the code calls str::from_utf8() on byte data.
  4. The select module uses String to represent field names, which is UTF-8. What happens when you try to xsv select from a file that has Latin-1 field names?
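Point 4 is easy to trip over: Latin-1 encodes non-ASCII characters as single bytes at or above 0x80, which are never valid UTF-8 on their own, so any str::from_utf8()/String conversion of such a header fails. A standalone illustration (not xsv code):

```rust
fn main() {
    // "Née" in Latin-1: 0xE9 encodes 'é', which is not valid UTF-8 by itself.
    let latin1 = [b'N', 0xE9, b'e'];
    assert!(std::str::from_utf8(&latin1).is_err());

    // The same header encoded as UTF-8 converts without error.
    assert!(std::str::from_utf8("Née".as_bytes()).is_ok());
}
```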

Option for `xsv cat row` to raise error if input files differ in field names

An error is raised if records have different lengths (differing numbers of columns/fields) across input files, but it is silent when the number of columns/fields is the same even if the names are different.

Could we have a flag that forces an error if the input files have columns of different names, even if the number is the same?

Example:

data1.csv:

f1,f2,f3
 1, 2, 3
 4, 5, 6

data2.csv:

f1,f4,f5
 1, 2, 3
 4, 5, 6

Calling xsv cat rows --check-field-names data1.csv data2.csv raises an error.
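Until such a flag exists, the check can be approximated outside xsv by comparing the raw header lines before concatenating (a sketch; it assumes single-line headers with no embedded newlines):

```shell
printf 'f1,f2,f3\n1,2,3\n4,5,6\n' > data1.csv
printf 'f1,f4,f5\n1,2,3\n4,5,6\n' > data2.csv

# Compare the first line of each file before running xsv cat rows.
if [ "$(head -n1 data1.csv)" = "$(head -n1 data2.csv)" ]; then
  echo "field names match"
else
  echo "field names differ"
fi
# prints: field names differ
```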

exclude command idea

I think it would be handy to have a command to exclude matching columns from 2 csvs and print the remainder. For example:

source.csv:

id,boolean
1,yes
2,no
3,yes

exclusions.csv:

id,boolean
2,yes
4,no

syntax could be similar to join: xsv exclude <columns1> <input1> <columns2> <input2>, so

$ xsv exclude id exclusions.csv id source.csv
id,boolean
1,yes
3,yes
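In the meantime, the same anti-join can be sketched with awk on a single key column (deliberately naive: no quoted-field handling; file names as in the example above):

```shell
printf 'id,boolean\n1,yes\n2,no\n3,yes\n' > source.csv
printf 'id,boolean\n2,yes\n4,no\n' > exclusions.csv

# Pass 1 (NR == FNR): record ids to exclude, skipping the header.
# Pass 2: keep the header and any row whose id was not recorded.
awk -F, 'NR == FNR { if (FNR > 1) seen[$1]; next }
         FNR == 1 || !($1 in seen)' exclusions.csv source.csv
# prints:
# id,boolean
# 1,yes
# 3,yes
```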

join does not respect column order

And maybe that's intended, but it seems like it would be useful to (at least have the option to) say join on "fileA.col1 = fileB.col2 and fileA.col2 = fileB.col1". Right now these both give the same results:

xsv join 1,2 a.txt 1,2 b.txt
xsv join 1,2 a.txt 2,1 b.txt 

I expected the second to match column 1 from a.txt to column 2 from b.txt.

By the way, thanks for a great (and very fast) tool.

Feature request: Filling in blanks

Basically, I want to be able to run xsv impute and have it replace blank values (or values matching an out-of-bounds marker you can specify on the command line, such as "NA"). Obvious choices for the fill would be a fixed value specified on the command line, or the same value as the previous cell in a row or column.

Value: It's hard to do this using sed or such, when a "blank value" could be the strings ,$ ,""$ ,, ,"", and so on.

Not sure how far one wants to go with this, since it'd be easy to feature creep "well why don't we add the option to fill in blanks with an average" "well why don't we add interpolation" "well why don't we let you put in an arbitrary equation" and so on. Which is obviously not something xsv should care about.
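For the fixed-value case, a rough non-xsv sketch with awk shows the idea; it is deliberately naive, splitting on bare commas, so it mishandles exactly the quoted blanks described above:

```shell
# Replace empty fields with NA. Assumes no quoted fields containing
# commas and no quoted empty strings.
printf 'a,,c\n,2,\n' | awk -F, -v OFS=, \
  '{ for (i = 1; i <= NF; i++) if ($i == "") $i = "NA"; print }'
# prints:
# a,NA,c
# NA,2,NA
```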

What should print `search` on an empty search?

I often chain several searches with pipes:

$ echo -e "foo,bar\nbli,blo" | xsv search -s foo bla | xsv search -s bar blo
Selector name 'bar' does not exist as a named header in the given CSV data.

Here, the first search does not find any result, so nothing is sent to the second one, which will issue a warning because the column bar is not defined any more. Isn't it a bit strange, semantically? Especially in comparison to stats or frequency which always print a header line even on an empty input. Shouldn't search always print the headers line of the input file as well?

join is very slow burning system time on seek()

I was comparing xsv join to GNU join and noticed that xsv seems to be doing random seeks on its input. This is very slow for even moderate-sized files. From the docs ("Uses a simple hash index to make it fast") I assumed it reads one file into memory and then streams through the other, but perhaps the design was meant to operate out-of-core? (Explicitly building indexes does not help either.)

#!/bin/bash

export LC_ALL=C LINES=1000000 SORT_MEM=1G
seq $LINES | awk -vOFS=, "{print \$1, int(rand() * $LINES)}" > a.csv
seq $LINES | awk -vOFS=, "{print \$1, int(rand() * $LINES)}" > b.csv
time join -t, -j2 -o 1.1,1.2,2.1,2.2 <(sort -S"$SORT_MEM" -t, -k2,2 a.csv) <(sort -S"$SORT_MEM" -t, -k2,2 b.csv) | wc -l
#time (xsv index a.csv ; xsv index b.csv)
time xsv join -n 2 a.csv 2 b.csv | wc -l
rm -f a.csv b.csv a.csv.idx b.csv.idx
$ ./join.sh 
1999368

real    0m0.830s
user    0m0.480s
sys 0m0.076s
1999368

real    0m27.801s
user    0m3.004s
sys 0m24.836s

If it is meant to operate out-of-core, an option to operate in-core would be awesome.

Could not convert '##' to a single ASCII character

This happens when I run xsv count myfile.csv -d "##". Is there a way to use a multi-character delimiter?

if s.len() != 1 {
    let msg = format!("Could not convert '{}' to a single \
                       ASCII character.", s);
    return Err(d.error(&*msg));
}

I'm not sure whether this could be reworked, and I don't know why multi-character delimiters can't be supported.
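As a workaround outside xsv, a multi-character delimiter can be rewritten to a single character before piping into xsv. awk treats -F as a regex, so this converts ## to commas (assuming the data contains no literal commas or ## sequences inside fields):

```shell
# $1 = $1 forces awk to rebuild each record using OFS.
printf 'a##b##c\n1##2##3\n' | awk -F'##' -v OFS=, '{ $1 = $1; print }'
# prints:
# a,b,c
# 1,2,3
```

The result can then be fed to any xsv command with the default delimiter.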

Search/inverted search get confused by this line

I have a line of valid CSV data which is matched regardless of which columns I select, whether I use inverted matching, or which search term I use.

I installed version 0.11.0 with cargo install.

Line of CSV data (y.csv):

1970-01-01 00:00:51.000000,1970-01-01 00:31:34.000000,222.91.247.73,172.31.13.174,17,"{0,0,0,10}"

I would expect that not all of these invocations will print the line:

xsv search -v 999999999999999 /tmp/y.csv
xsv search 999999999999999 /tmp/y.csv  # prints the line even though there is no match
xsv search 1970 /tmp/y.csv
xsv search -v 1970 /tmp/y.csv # prints the line even though there is a match and inverted filtering is on

[Windows] error: unresolved name `libc::GetSystemInfo`

XSV is failing to build on Windows:

src\util.rs:17:9: 17:30 error: unresolved name `libc::GetSystemInfo` [E0425]
src\util.rs:17         ::libc::GetSystemInfo(&mut sysinfo);

That's from num_cpus. If I hard code num_cpus to return 4 it builds fine.

It looks to me like GetSystemInfo was removed from libc for Windows on this commit.

Windows binary

Hi,
I'm wondering if a windows binary is currently on the agenda. I have tried compiling the project there a while ago, but ran into some issues. I wanted to follow up on that once I understood Rust a bit more.

Last week, I saw you added AppVeyor CI and thought you would release a Windows binary any moment now. It's been almost two weeks, however, so I feel there is a chance you have missed the artifacts feature of AppVeyor that lets you download the binaries it compiles.

Would you be interested in a pull request in that direction? I assume you didn't already hit a wall with that attempt...

Kind regards,
Matthias

add support for datetimes

First, this is a great piece of work.

I wanted to know whether the stats functionality will, in the future, detect date/time data types.
I'm planning to use this library to generate SQL statements for Postgres.

Thanks
Ranjan

Binaries don't work on Ubuntu 14.04

fiatjaf@spooner ~> curl -sOL https://github.com/BurntSushi/xsv/releases/download/0.8.14/xsv-0.8.14-x86_64-unknown-linux-gnu.tar.gz
fiatjaf@spooner ~> tar xf xsv-0.8.14-x86_64-unknown-linux-gnu.tar.gz
fiatjaf@spooner ~> cd xsv-0.8.14-x86_64-unknown-linux-gnu/
fiatjaf@spooner ~/xsv-0.8.14-x86_64-unknown-linux-gnu> ./xsv --version
Failed to execute process './xsv'. Reason:
exec: Exec format error
The file './xsv' is marked as an executable but could not be run by the operating system.

I tested it in two very different machines I have access to with the same result (but coincidentally they both are running Ubuntu 14.04).

Feature suggestion: Insert fields with constant values

A feature that I don't see in 0.9.6 that might be useful to add: insert new fields into a data set with a constant value.

Example use case: you have a set of files, each of which has data that pertains to a different date. The dates are in the file names but not in any field in the file. You would like to merge them into one data set, but still be able to tell which record came from which file. So you'd do something like:

for date in `generate-dates 2015-04-01 2015-06-30`
do
    xsv insert columns file_date:"$date" file_${date}.csv > file_${date}_with_file_date.csv
done

xsv cat rows file_*_with_file_date.csv > file_2015Q2.csv
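Until something like `xsv insert columns` exists (that invocation above is hypothetical), the constant column can be prepended per file with awk. No field splitting is needed since we only prefix each record, so quoted fields are safe:

```shell
date=2015-04-01
printf 'a,b\n1,2\n' > input.csv

# Name the new column on the header row; prefix every data row
# with the constant value.
awk -v d="$date" 'NR == 1 { print "file_date," $0; next }
                  { print d "," $0 }' input.csv
# prints:
# file_date,a,b
# 2015-04-01,1,2
```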

Doesn't compile with latest rustc / cargo

~/s/g/xsv (master=) rustc --version
rustc 0.13.0-nightly (636663172 2014-12-28 16:21:58 +0000)

~/s/g/xsv (master=) cargo --version
cargo 0.0.1-pre-nightly (fd5d7a9 2014-12-25 04:28:40 +0000)
~/s/g/xsv (master=) cargo build --release
    Updating registry `https://github.com/rust-lang/crates.io-index`
 Downloading streaming-stats v0.1.8
 Downloading rustc-serialize v0.1.5
 Downloading regex v0.1.4
 Downloading docopt v0.6.20
 Downloading csv v0.12.0
 Downloading tabwriter v0.1.5
   Compiling streaming-stats v0.1.8
   Compiling regex v0.1.4
   Compiling rustc-serialize v0.1.5
   Compiling csv v0.12.0
   Compiling docopt v0.6.20
   Compiling tabwriter v0.1.5
   Compiling xsv v0.8.7 (file:///Users/evanchan/src/github/xsv)
warning: using multiple versions of crate `regex`
/Users/evanchan/src/github/xsv/src/main.rs:8:1: 8:18 note: used here
/Users/evanchan/src/github/xsv/src/main.rs:8 extern crate csv;
                                             ^~~~~~~~~~~~~~~~~
note: crate name: regex
/Users/evanchan/src/github/xsv/src/main.rs:9:1: 9:21 note: used here
/Users/evanchan/src/github/xsv/src/main.rs:9 extern crate docopt;
                                             ^~~~~~~~~~~~~~~~~~~~
note: crate name: regex
/Users/evanchan/src/github/xsv/src/config.rs:41:25: 41:39 error: type `char` does not implement any method in scope named `to_ascii_opt`
/Users/evanchan/src/github/xsv/src/config.rs:41                 match c.to_ascii_opt() {
                                                                        ^~~~~~~~~~~~~~
/Users/evanchan/src/github/xsv/src/config.rs:42:49: 42:64 error: the type of this value must be known in this context
/Users/evanchan/src/github/xsv/src/config.rs:42                     Some(ascii) => Ok(Delimiter(ascii.as_byte())),
                                                                                                ^~~~~~~~~~~~~~~
error: aborting due to 2 previous errors
Could not compile `xsv`.

Update installation section of README

Right now it says there are only Linux binaries available. Could it be updated to mention that macOS and Windows binaries are available too?
