
Comments (2)

BurntSushi commented on May 11, 2024

sample.csv is just a random sample of worldcitiespop.csv. It was generated with this command as described in the README:

$ xsv search -s Population '[0-9]' worldcitiespop.csv \
  | xsv select Country,AccentCity,Population \
  | xsv sample 10 \
  | xsv table

The point of the tour is to simulate an exploratory crawl through some unknown data. Randomly sampling a large, unknown data set is useful because it gives you a small random view of it. In the tour, I expressed curiosity about what the country abbreviations meant and showed how we could answer that by joining the randomly sampled data with the country names. The step after that was, "Well, we did that for the sample, but can we also do it for the full CSV data set?"

You're running out of memory because xsv table has to buffer the entire CSV data in memory before computing alignments and printing the output. This is inherent in producing aligned data.
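To see why alignment forces buffering, here is a rough sketch of an aligned-table printer. It is not how xsv table is implemented. The function name `align_csv` is made up for illustration, the two-space column gap is arbitrary, and it assumes simple CSV with no quoted commas. The point is only that nothing can be printed until the last row has been read, because any row might widen a column.

```shell
# Hypothetical helper, not part of xsv. Assumes plain CSV without quoted commas.
align_csv() {
  awk -F, '
    {
      # Buffer every field and track the widest value seen in each column.
      for (i = 1; i <= NF; i++) {
        cell[NR, i] = $i
        if (length($i) > width[i]) width[i] = length($i)
      }
      if (NF > ncols) ncols = NF
      nrows = NR
    }
    END {
      # The final column widths are only known here, so printing starts here.
      for (r = 1; r <= nrows; r++) {
        line = ""
        for (i = 1; i <= ncols; i++) {
          line = line sprintf("%-" width[i] "s", cell[r, i])
          if (i < ncols) line = line "  "
        }
        sub(/ +$/, "", line)   # drop padding after the last column
        print line
      }
    }
  '
}
```

For example, `printf 'Country,City\nus,Boston\nde,Bad Homburg\n' | align_csv` prints the three rows with the second column starting at the same offset on every line.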

The actual join command is very memory efficient. Here's just the join:

[andrew@Serval tmp] /usr/bin/time -v xsv join --no-case Country worldcitiespop.csv Abbrev countrynames.csv > joined.csv
~7.5 MB of max resident memory usage reported by `time -v`

But if we pipe the output to xsv table then:

xsv join --no-case Country worldcitiespop.csv Abbrev countrynames.csv | /usr/bin/time -v xsv table > joined-table.csv
~1.7 GB of max resident memory usage reported by `time -v`

The purpose of xsv table is to produce human readable output in your shell. Running an entire corpus through it doesn't make any sense because you're probably not going to read through all of it. Instead, you should slice or sample the data and then pipe that to xsv table.
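A minimal sketch of that workflow, using the same join as above. The wrapper name `preview_join` and its arguments are hypothetical conveniences, not part of xsv; the xsv subcommands are the ones already used in this thread.

```shell
# Hypothetical wrapper: join the full data, but only format a small sample.
# $1 = cities CSV, $2 = country names CSV, $3 = sample size (defaults to 10).
preview_join() {
  xsv join --no-case Country "$1" Abbrev "$2" \
    | xsv sample "${3:-10}" \
    | xsv table
}
```

Called as `preview_join worldcitiespop.csv countrynames.csv 10`, xsv table only ever sees the sampled rows, so it has very little to buffer.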

Note that xsv table documents this behavior:

[andrew@Serval tmp] xsv table --help
Outputs CSV data as a table with columns in alignment.

This will not work well if the CSV data contains large fields.

Note that formatting a table requires buffering all CSV data into memory.
Therefore, you should use the 'sample' or 'slice' command to trim down large
CSV data before formatting it with this command.

N.B. You might wonder, why in the world does xsv table require 1.7GB of memory for a CSV file that is only 145MB? The answer is that computing the alignments requires a data structure other than just the buffered data and that data structure hasn't really been memory optimized because you're not supposed to feed hundreds of megabytes through it. (One easy answer to this is to force xsv table to flush its buffer after X bytes are read.)
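The flushing idea can be sketched roughly as follows. This is an illustration only, not something xsv table does today. It flushes after a fixed number of rows rather than X bytes to keep the sketch short, the name `align_csv_chunked` is made up, and it assumes simple CSV with no quoted commas. The trade-off is that each chunk is aligned independently, so column widths can change from one chunk to the next.

```shell
# Hypothetical helper, not part of xsv. $1 = rows per chunk (defaults to 1000).
align_csv_chunked() {
  awk -F, -v chunk="${1:-1000}" '
    function emit_chunk(   r, i, line) {
      for (r = 1; r <= n; r++) {
        line = ""
        for (i = 1; i <= ncols; i++) {
          line = line sprintf("%-" width[i] "s", cell[r, i])
          if (i < ncols) line = line "  "
        }
        sub(/ +$/, "", line)
        print line
      }
      # Forget the chunk so memory use stays bounded by the chunk size.
      split("", cell); split("", width)
      n = 0; ncols = 0
    }
    {
      n++
      for (i = 1; i <= NF; i++) {
        cell[n, i] = $i
        if (length($i) > width[i]) width[i] = length($i)
      }
      if (NF > ncols) ncols = NF
      if (n >= chunk) emit_chunk()
    }
    END { if (n > 0) emit_chunk() }
  '
}
```

With a chunk size of 2, a three-row input is printed as two separately aligned blocks: the first two rows share column widths, and the third row is aligned on its own.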


jungle-boogie commented on May 11, 2024

Hello,

Thank you for your feedback. I'll run time and see what changes.

Thanks!

