Hello, Could you explain what role sample.


csv join example (xsv issue, closed)

burntsushi commented on May 11, 2024


Comments (2)

BurntSushi commented on May 11, 2024

sample.csv is just a random sample of worldcitiespop.csv. It was generated with this command as described in the README:

$ xsv search -s Population '[0-9]' worldcitiespop.csv \
  | xsv select Country,AccentCity,Population \
  | xsv sample 10 \
  | xsv table

The point of the tour is to simulate an exploratory crawl through some unknown data. Randomly sampling a large, unknown data set is useful because it gives you a small random view of the data. In the tour, I expressed curiosity about what the country abbreviations meant and showed how we could answer that by joining the randomly sampled data with the country names. The step after that was, "Well, we did that for the sample, but can we also do it for the full CSV data set?"

You're running out of memory because xsv table has to buffer the entire CSV data in memory before computing alignments and printing the output. This is inherent in producing aligned data: the width of a column isn't known until every row has been read.
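To make the buffering point concrete, here is a minimal Python sketch of column alignment. It is not xsv's implementation; the function name `format_table` and the `pad` parameter are made up for illustration. It only shows that the first output line cannot be written until the widest field in every column is known, which means reading all rows first.

```python
import csv
import io


def format_table(csv_text: str, pad: int = 2) -> str:
    """Align CSV columns (illustrative sketch, not xsv's code)."""
    # The whole input is buffered here; this is the memory cost.
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""

    # Pass 1: find the widest field in each column across ALL rows.
    widths = [0] * max(len(row) for row in rows)
    for row in rows:
        for i, field in enumerate(row):
            widths[i] = max(widths[i], len(field))

    # Pass 2: only now can any row be padded and emitted.
    lines = []
    for row in rows:
        cells = [field.ljust(widths[i] + pad) for i, field in enumerate(row)]
        lines.append("".join(cells).rstrip())
    return "\n".join(lines)


aligned = format_table("a,bb\nccc,d\n")
```

Memory here grows with the input size, so trimming the input before formatting is what keeps it small.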

The actual join command is very memory efficient. Here's just the join:

[andrew@Serval tmp] /usr/bin/time -v xsv join --no-case Country worldcitiespop.csv Abbrev countrynames.csv > joined.csv
~7.5 MB of max resident memory usage reported by `time -v`
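As a rough illustration of why a join can stay small in memory, here is a Python sketch of a case-insensitive inner join that keeps only the small lookup table in memory and streams the large file row by row. This is an assumption about one possible approach, not a description of how xsv join is actually implemented; `stream_join` and the sample data are hypothetical.

```python
import csv
import io
from typing import Iterator, TextIO


def stream_join(big: TextIO, big_key: str,
                small: TextIO, small_key: str) -> Iterator[list[str]]:
    """Inner join on a key column, ignoring case (illustrative sketch)."""
    # Load only the small side into memory, keyed by the lowercased value.
    small_reader = csv.reader(small)
    small_header = next(small_reader)
    k = small_header.index(small_key)
    lookup: dict[str, list[list[str]]] = {}
    for row in small_reader:
        lookup.setdefault(row[k].lower(), []).append(row)

    # Stream the big side: each row is joined, yielded, and then discarded.
    big_reader = csv.reader(big)
    big_header = next(big_reader)
    b = big_header.index(big_key)
    yield big_header + small_header
    for row in big_reader:
        for match in lookup.get(row[b].lower(), ()):
            yield row + match


# Hypothetical miniature stand-ins for worldcitiespop.csv and countrynames.csv.
cities = io.StringIO("Country,City\nus,Boston\nzz,Nowhere\n")
names = io.StringIO("Abbrev,Name\nUS,United States\n")
joined = list(stream_join(cities, "Country", names, "Abbrev"))
```

In this sketch, memory use depends on the size of the small table, not on the size of the large file being streamed.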

But if we pipe the output to xsv table then:

xsv join --no-case Country worldcitiespop.csv Abbrev countrynames.csv | /usr/bin/time -v xsv table > joined-table.csv
~1.7 GB of max resident memory usage reported by `time -v`
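If you'd rather collect the same number from a script than read it off `time -v`, a minimal Python sketch using the standard `resource` module could look like this. It assumes a Unix system; `peak_rss_of` is a hypothetical helper name. Note that `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS, and that `RUSAGE_CHILDREN` covers every child the process has waited for so far, not just the last one.

```python
import resource
import subprocess
import sys


def peak_rss_of(cmd: list[str]) -> int:
    """Run cmd and return the peak RSS among waited-for children.

    Units are platform-dependent: kilobytes on Linux, bytes on macOS.
    """
    # Output is discarded; only the memory measurement matters here.
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    # Largest resident set size of any child process waited for so far.
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Measure a trivial child process. With xsv installed, the command could
# instead be something like ["xsv", "table", "joined.csv"].
peak = peak_rss_of([sys.executable, "-c", "pass"])
```

Because the value is a running maximum over all children, measure each command in a fresh process if you want to compare two of them.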

The purpose of xsv table is to produce human readable output in your shell. Running an entire corpus through it doesn't make any sense because you're probably not going to read through all of it. Instead, you should slice or sample the data and then pipe that to xsv table.

Note that xsv table documents this behavior:

[andrew@Serval tmp] xsv table --help
Outputs CSV data as a table with columns in alignment.

This will not work well if the CSV data contains large fields.

Note that formatting a table requires buffering all CSV data into memory.
Therefore, you should use the 'sample' or 'slice' command to trim down large
CSV data before formatting it with this command.

N.B. You might wonder why in the world xsv table requires 1.7 GB of memory for a CSV file that is only 145 MB. The answer is that computing the alignments requires a data structure in addition to the buffered data, and that data structure hasn't really been memory optimized because you're not supposed to feed hundreds of megabytes through it. (One easy answer to this is to force xsv table to flush its buffer after X bytes are read.)
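The flush-after-X-bytes idea can be sketched in Python as follows. This is a hypothetical illustration, not something xsv table does today; `format_in_batches`, `_align`, and `max_chars` are invented names, and the sketch counts characters as a stand-in for bytes.

```python
from typing import Iterable, Iterator


def _align(batch: list[list[str]], pad: int) -> list[str]:
    """Pad every column to the widest field seen within this batch only."""
    widths = [0] * max(len(row) for row in batch)
    for row in batch:
        for i, field in enumerate(row):
            widths[i] = max(widths[i], len(field))
    return ["".join(f.ljust(widths[i] + pad) for i, f in enumerate(row)).rstrip()
            for row in batch]


def format_in_batches(rows: Iterable[list[str]],
                      max_chars: int = 1_000_000,
                      pad: int = 2) -> Iterator[str]:
    """Yield aligned lines, flushing once a batch reaches max_chars."""
    batch: list[list[str]] = []
    size = 0
    for row in rows:
        batch.append(row)
        size += sum(len(field) for field in row)
        if size >= max_chars:
            # Flush: align what has been buffered, then start a new batch.
            yield from _align(batch, pad)
            batch, size = [], 0
    if batch:
        yield from _align(batch, pad)


demo_rows = [["a", "bb"], ["ccc", "d"], ["eeeee", "f"]]
demo_lines = list(format_in_batches(demo_rows, max_chars=4))
```

The trade-off is that columns only line up within a batch, so widths can shift at each flush. In exchange, memory stays bounded by roughly `max_chars` plus the row that crosses the limit.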


jungle-boogie commented on May 11, 2024

Hello,

Thank you for your feedback. I'll run time and see what changes.

Thanks!
