
csvquote

smart and simple CSV processing on the command line

Dan Brown, May 2013
https://github.com/dbro/csvquote

Are you looking for a way to process CSV data with standard UNIX shell commands?

Are you running into problems with embedded commas and newlines that mess everything up?

Do you wish there was some way to add some CSV intelligence to these UNIX tools?

  • awk, sed
  • cut, join
  • head, tail
  • sort, uniq
  • wc, split

This program can be used at the start and end of a text processing pipeline so that regular UNIX command-line tools can properly handle CSV data that contains commas and newlines inside quoted data fields.

Without this program, embedded special characters would be incorrectly interpreted as separators when they are inside quoted data fields.

By using csvquote, you temporarily replace the special characters inside quoted fields with harmless nonprinting characters that can be processed as data by regular text tools. At the end of processing the text, these nonprinting characters are restored to their previous values.

In short, csvquote wraps the pipeline of UNIX commands to let them work on clean data that is consistently separated, with no ambiguous special characters present inside the data fields.
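The substitution can be sketched as a simple state machine that tracks whether the current character is inside a quoted field. The following Python sketch illustrates the idea only; the actual program is written in C and optimized:

```python
# Sketch of csvquote's substitution idea (illustration only; the real tool is C).
# Inside quoted fields, delimiters and record separators are swapped for
# nonprinting bytes; the -u (restore) direction maps them back.
SUB_DELIM = "\x1f"   # substitute for the field delimiter
SUB_RECORD = "\x1e"  # substitute for the record separator

def sanitize(text, quote='"', delim=',', record='\n'):
    out = []
    inside_quotes = False
    for ch in text:
        if ch == quote:
            # A doubled quote ("") toggles the flag twice, leaving the
            # state unchanged across it, which is the desired behavior.
            inside_quotes = not inside_quotes
            out.append(ch)
        elif inside_quotes and ch == delim:
            out.append(SUB_DELIM)
        elif inside_quotes and ch == record:
            out.append(SUB_RECORD)
        else:
            out.append(ch)
    return "".join(out)

def restore(text, delim=',', record='\n'):
    # The substitutes never occur outside quoted fields after sanitizing,
    # so a plain replace is enough to undo the mapping.
    return text.replace(SUB_DELIM, delim).replace(SUB_RECORD, record)
```

Round-tripping any input through `sanitize` and then `restore` returns the original text, which is what lets csvquote bracket a pipeline without losing information.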

By default, the program expects to use these as special characters:

" quote character  
, field delimiter  
\n record separator  

It is possible to specify different characters for the field and record separators, such as tabs or pipe symbols.

Note that the quote character can be contained inside a quoted field by repeating it twice, e.g.:

field1,"field2, has a comma in it","field 3 has a ""Quoted String"" in it"

Typical usage of csvquote is as part of a command-line pipe, permitting the regular UNIX text-manipulating commands to avoid misinterpreting special characters found inside fields, e.g.:

csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u

or taking input from stdin,

cat foobar.csv | csvquote | cut -d ',' -f 7,4,2 | csvquote -u

Other examples:

csvquote -t foobar.tsv | wc -l

csvquote -q "'" foobar.csv | sort -t, -k3 | csvquote -u

csvquote foobar.csv | awk -F, '{sum+=$3} END {print sum}'

How it works

We can compare the hexadecimal representation of the data to see how csvquote substitutes the nonprinting characters (1e and 1f) for newlines (0a) and commas (2c). The first command shows the original data, the second shows how 'csvquote' sanitizes the delimiters contained inside quoted strings, and the third shows how 'csvquote -u' restores the original data. Note that xxd uses '.' characters to represent unprintable characters in its text representation on the right side below.

$ echo 'ab,"cd,ef","hi
jk"' | xxd -g 1 -c 20

00000000: 61 62 2c 22 63 64 2c 65 66 22 2c 22 68 69 0a 6a 6b 22 0a     ab,"cd,ef","hi.jk".


$ echo 'ab,"cd,ef","hi
jk"' | csvquote | xxd -g 1 -c 20

00000000: 61 62 2c 22 63 64 1f 65 66 22 2c 22 68 69 1e 6a 6b 22 0a     ab,"cd.ef","hi.jk".


$ echo 'ab,"cd,ef","hi
jk"' | csvquote | csvquote -u | xxd -g 1 -c 20

00000000: 61 62 2c 22 63 64 2c 65 66 22 2c 22 68 69 0a 6a 6b 22 0a     ab,"cd,ef","hi.jk".

Installation

$ make
$ sudo make install

This will install the csvquote program as well as csvheader, which is a convenient script that prints out the field numbers next to the first row of data.

Depends on the "build-essential" package (Debian/Ubuntu).

via Homebrew (Mac, Linux)

$ brew install sschlesier/csvutils/csvquote

Known limitations

The program does not correctly handle multi-character delimiters, but this is rare in CSV files. It is able to work with Windows-style line endings that use \r\n as the record separator.

If you need to search for special characters (commas and newlines) within a quoted field, then csvquote will PROBABLY NOT work for you. These delimiter characters get temporarily replaced with nonprinting characters so they would not be found. There are two options you could try:

  1. Use csvquote as normal, and search for the substitute nonprinting characters instead of the regular delimiters.
  2. Instead of using csvquote, try a csv-aware text tool such as csvfix.

Some thoughts on CSV as a data storage format

CSV is easy to read. This is its essential advantage over other formats. Most programs that work with data provide methods to read and write CSV data. It can be understood by humans, which is helpful for validation and auditing.

But CSV has some challenges when used to transfer information among systems and programs. This program exists to handle the challenge of ambiguity in CSV. The separator characters can act as either data (text) or metadata (special characters), depending on whether they are quoted. This context sensitivity complicates parsing, because (for example) a newline cannot be interpreted without its context: if it is surrounded by double quotes then it must be handled as text; otherwise it is a separator.

To repeat: the meaning of each character depends on the entire contents of the file up to that point. This has two negative consequences.

  • CSV files must be parsed sequentially from start to end. They cannot be split up and processed in parallel (without first scanning them to make sure the splits would be in "safe" places).

  • CSV files do not have a mechanism to recover from corruption. If an extra double-quote character gets inserted into the file, then all of the remaining data will be misinterpreted.
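The sequential-parsing constraint is easy to demonstrate: naively splitting on newlines miscounts records whenever a quoted field contains a newline. A Python illustration using the standard csv module:

```python
import csv
import io

data = 'a,"line one\nline two",c\nd,e,f\n'

# Counting newlines treats every newline as a record boundary...
naive_lines = data.count("\n")

# ...but a CSV-aware parser knows the first newline is inside quotes.
records = list(csv.reader(io.StringIO(data)))

print(naive_lines)   # 3
print(len(records))  # 2 -- the embedded newline is data, not a separator
```

This is the same ambiguity that makes a line-oriented split unsafe without first scanning for quote state.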

Another issue is that the character encoding of CSV files is not specified.

Run-time Speed comparison

How fast is csvquote? Here are some data processing rates measured when running csvquote on an Intel i7 CPU model from 2013. Due to recently introduced optimizations in csvquote, the processing speed depends on how common the quote characters are in the source data.

  • 1.9 GB/sec : csvquote reading random csv data with 10% of fields quoted
  • 0.5 GB/sec : csvquote reading random csv data with 100% of fields quoted
  • 3.7 GB/sec : csvquote reading random csv data with no quoted fields (nothing for csvquote to do!)

A common use of csvquote is as one (or two) steps in a pipeline sequence of commands. When each command can run on a separate processor, the time to complete the overall pipeline sequence will be determined by the slowest step in the chain of dependencies. So as long as csvquote is not the slowest step in the sequence, then its relative speed will not affect the overall run time. This seems likely if some of these commands are involved:

  • 3.1 GB/sec : wc -l
  • 1.3 GB/sec : grep 'ZZZ'
  • 1.0 GB/sec : tr 'a' 'b'
  • 0.3 GB/sec : cut -f1

In January 2022, csvquote was rewritten to be approximately 10x faster than before, by optimizing for source data in which at least half of the fields are not quoted. The inspiration to revisit this old code came from a rewrite by skeeto@. His version is especially aimed at modern CPUs (Intel and AMD from about the year 2015) and runs approximately 50% faster using AVX2 SIMD instructions. When running skeeto's version on my CPU (without support for AVX2 SIMD) it processes data at a consistent pace of 0.7 GB/sec, regardless of how many quote characters are in the source data. When running on a CPU with AVX2 SIMD support, it slows down somewhat less than this version of csvquote does as the frequency of quoted fields increases.


csvquote's Issues

Little bug with Progress Export Files

The Progress Database default export format uses a blank space as the field separator and the double-quote character as the string delimiter. If you use:

csvquote -d ' '

you corrupt the file, as every space character inside quoted fields is replaced in every row.

So this is the case:

root@srv-zfs:/samba/public/progress/dump# csvquote a.d
"01010001" 01/01/01 ? "SOCIO SERVICOM C/SOTTOSCRIZIONE" "" "" "" "" "" "" "" "" 0 0
"01010002" 01/01/01 ? "SOCIO ELETTRO 2000 C/SOTTOSCRIZIONE" "" "" "" "" "" "" "" "" 0 0

root@srv-zfs:/samba/public/progress/dump# csvquote a.d -d ' '
"01010001" 01/01/01 ? "SOCIOSERVICOMC/SOTTOSCRIZIONE" "" "" "" "" "" "" "" "" 0 0
"01010002" 01/01/01 ? "SOCIOELETTRO2000C/SOTTOSCRIZIONE" "" "" "" "" "" "" "" "" 0 0

Behavior is changed?

Hi, was the change in how quoted commas are treated intentional?

The past behavior was:

$ echo 'foo,"ba,r"' | csvquote | xxd
00000000: 666f 6f2c 2262 611f 7222 0a              foo,"ba.r".

but now it has changed to:

$ echo 'foo,"ba,r"' | csvquote | xxd
00000000: 666f 6f2c 2262 612c 7222 0a              foo,"ba,r".

Currently, the following example no longer works in some situations, because cut now treats foo,"ba,r" as three fields (foo|"ba|r") instead of two (foo and "ba,r"):

csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u

I'm just wondering whether this is intended. I don't have any suggestions or opinions about this change.

Changing the number of lines

First of all, thank you for this tool.
My issue is that using it changes the total number of lines.

wc -l old.csv
Displays: 344548 old.csv

csvquote old.csv > new.csv
wc -l new.csv
Displays: 344370 temp.csv

Do you have any idea why this is happening?
Thanks

Using non-ASCII separator characters doesn't work

When using a non-ASCII character (i.e. a character that can't be represented as one byte in UTF-8) as a separator, only the first byte of the character's representation is used.

As noted in the README, this is rarely required, but in my case the application I'm using won't accept CSV files with the field separator in a field, even when quoted, so I have to resort to using an alternative character.

Example:

$ echo 'a,b,c,"d,e,f",g,h,"i,j,k",l' | csvquote | cut -d, -f2,4,5 | csvquote -u -d、
b,"d�e�f",g

Add option to flush buffer

Sometimes you may want to pipe the encoded output, take only part of it, decode that, and use the result as part of a larger output (which may include content that was not in the original file passed to csvquote). Normally this won't work due to buffering, but flushing the buffer makes it possible.

I have added such an option to awk implementation here jarnos@2d598c0

Escaped quotes quoted

Not sure how best to solve this. We probably need to implement the escape somehow.

The test string (two lines) is:

1,2,3,"\"hello",4,5,6
7,8,9

The csvquote code today will translate and un-translate properly, but the interim string is not usable by standard unix tools.

What I think is happening is that csvquote isn't treating the backslash as an escape, so it interprets the escaped double quote as the end of the quoted string, and the last double quote as the beginning of a new one. The output I get has the rest of the buffer with no spaces, commas, or even a newline (it looks like: 1,2,3,""hello"456789). The nonprinting characters must be in there, since csvquote -u restores the string to the original.

CSV escaping (typically with backslash) is not standardized, but is common.

Using double quotes rather than a backslash escape seems to work.

I will see about making a patch. I'm not very good with C, but a colleague is, and he may be interested in helping.
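For illustration, the two escaping conventions can be compared with Python's csv module, which implements the doubled-quote convention by default and supports backslash escaping only via a nonstandard dialect setting. This demonstrates the conventions themselves, not csvquote's behavior:

```python
import csv
import io

# RFC 4180 style: a literal quote inside a quoted field is doubled.
doubled = '1,2,3,"""hello",4,5,6\n'
row_doubled = next(csv.reader(io.StringIO(doubled)))
# row_doubled[3] is '"hello'

# Backslash escaping is nonstandard and needs an explicit dialect.
backslash = '1,2,3,"\\"hello",4,5,6\n'
row_backslash = next(csv.reader(io.StringIO(backslash),
                                escapechar="\\", doublequote=False))
# row_backslash[3] is also '"hello'
```

Both rows decode to the same seven fields; only the on-disk representation differs, which is why a tool that understands one convention can misparse the other.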

Eating up comma (,) in output

Hello, this would be a great tool. According to the example, a comma within a quoted field should be no problem, BUT:

$ ./csvquote
"abc","def,g"
"abc","defg"    <-- Look here: comma is missing in "def,g" !!
$

What's going on here? This should work according to the documentation.
Please help.

Other example:

$ echo 'field1,"field2, has a comma in it","field 3 has a ""Quoted String"" in it"' | ./csvquote | cut -d, -f2
"field2 has a comma in it"  <-- Look here, comma is missing !!

System: OS X El Capitan 10.11.6; standard csvquote.c version, built with make.

Add error checks for both input and output

Input may end early due to an error and should be checked. More importantly, a successful fwrite() only means bytes were copied to the output buffer; they may still fail to be written later when the buffer is flushed. Flushing at the end and checking the result ensures the output buffer has been emptied. Without this check, some write errors go undetected:

echo | csvquote >/dev/full && echo success

0001-Add-error-checks-for-both-input-and-output.patch.txt

--- a/csvquote.c
+++ b/csvquote.c
@@ -94,6 +94,8 @@ const char del, const char quo, const char rec) {
         check(fwrite(buffer, sizeof(char), nbytes, stdout) == nbytes,
             "Failed to write %zu bytes\n", nbytes);
     }
+    check(ferror(in) == 0, "Failed to read input\n");
+    check(fflush(stdout) == 0, "Failed to flush output\n");
     return 0;
 
 error:

Document what the quoting mechanism is

After reading only the README, I wasn't able to tell what the sanitized output would be for the given examples.

Could you document that sanitize mode converts field separators (commas by default) inside fields into character 0x1F (US = Unit Separator), and record separators (newlines by default) into 0x1E (RS = Record Separator), and restore mode converts them back?

Note: If the original file has 0x1F or 0x1E characters, they'll be replaced with commas and newlines when the file is converted back.

Note^2: The program operates on bytes, so some UTF-8 characters may get mangled.

Note^3: After reading the other (closed) issues, I see others have asked the same question; this issue still stands, though: if it's documented, people will stop asking. :-)
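The mapping requested in this issue can be checked at the byte level. A Python sketch of the restore half (what 'csvquote -u' effectively does), using the 0x1F/0x1E substitutes named above:

```python
# Byte values used by csvquote, as described in this issue:
US = 0x1F  # Unit Separator: stands in for field delimiters inside quotes
RS = 0x1E  # Record Separator: stands in for record separators inside quotes

# A sanitized line, as csvquote would emit for: ab,"cd,ef","hi<newline>jk"
sanitized = b'ab,"cd\x1fef","hi\x1ejk"\n'

# Restore: map the substitutes back to the real delimiters.
restored = sanitized.replace(bytes([US]), b",").replace(bytes([RS]), b"\n")
print(restored)  # b'ab,"cd,ef","hi\njk"\n'
```

This also makes the first caveat concrete: any 0x1F or 0x1E bytes already present in the original data would be turned into commas and newlines by the restore step.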

working with body

To work with datasets such as country-codes.csv, we need a further workaround. The simple "sort by column 3", keeping the header row first, is:

cat country-codes.csv | (read -r; printf "%s\n" "$REPLY"; csvquote | sort -t , -k 3,3 | csvquote -u) > byCol3.csv

So perhaps you could build a shell complement with a simpler syntax, such as a csvsort command.


PS: I have problems with csvkit and, as suggested here, went back to the usual UNIX commands... but I need a simple command, as suggested above (it does not work perfectly, and I am using this ugly workaround).
