Comments (2)
`sample.csv` is just a random sample of `worldcitiespop.csv`. It was generated with this command, as described in the README:
$ xsv search -s Population '[0-9]' worldcitiespop.csv \
| xsv select Country,AccentCity,Population \
| xsv sample 10 \
| xsv table
The point of the tour is to simulate an exploratory crawl through some unknown data. Randomly sampling a large, unknown data set is useful because it gives you a small random view of the data. In the tour, I expressed curiosity about what the country abbreviations meant and showed how we could resolve that by joining the randomly sampled data with the country names. The step after that was, "Well, we did that for the sample, but can we also do it for the full CSV data set?"
You're running out of memory because `xsv table` has to buffer the entire CSV data in memory before computing alignments and printing the output. This is inherent in producing aligned data.
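To see why aligned output forces buffering, here is a minimal Python sketch. It is not how `xsv table` is implemented; the function name `format_table` and the `pad` parameter are made up for illustration. The width of a column is only known once the last row has been read, so every row has to be held until then.

```python
import csv
import io


def format_table(csv_text: str, pad: int = 2) -> str:
    """Align CSV columns. Illustrative only, not xsv's implementation."""
    # All rows are read into memory first: a column's final width is
    # unknown until the last row has been seen.
    rows = list(csv.reader(io.StringIO(csv_text)))
    ncols = max((len(row) for row in rows), default=0)

    # First pass: find the widest field in each column.
    widths = [0] * ncols
    for row in rows:
        for i, field in enumerate(row):
            widths[i] = max(widths[i], len(field))

    # Second pass: pad every field to its column width.
    lines = []
    for row in rows:
        cells = [field.ljust(widths[i] + pad) for i, field in enumerate(row)]
        lines.append("".join(cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table("Country,City\nus,Boston\nde,Berlin\n"))
```

The two passes over `rows` are the reason memory grows with the input: nothing can be printed until the first pass is finished.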
The actual `join` command is very memory efficient. Here's just the join:
[andrew@Serval tmp] /usr/bin/time -v xsv join --no-case Country worldcitiespop.csv Abbrev countrynames.csv > joined.csv
~7.5 MB of max resident memory usage reported by `time -v`
But if we pipe the output to `xsv table`, then:
xsv join --no-case Country worldcitiespop.csv Abbrev countrynames.csv | /usr/bin/time -v xsv table > joined-table.csv
~1.7 GB of max resident memory usage reported by `time -v`
The purpose of `xsv table` is to produce human readable output in your shell. Running an entire corpus through it doesn't make any sense because you're probably not going to read through all of it. Instead, you should slice or sample the data and then pipe that to `xsv table`.
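Sampling before formatting keeps memory small because a fixed-size sample can be drawn from a stream without holding the whole input. The sketch below uses reservoir sampling (Algorithm R) in Python to show the idea; it is an assumption that this resembles what a sampling command does internally, and the name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(
    rows: Iterable[List[str]], k: int, seed: Optional[int] = None
) -> List[List[str]]:
    """Pick k rows uniformly at random from a stream, holding at most k rows."""
    rng = random.Random(seed)
    sample: List[List[str]] = []
    for n, row in enumerate(rows):
        if n < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row n replaces a reservoir slot with probability k / (n + 1).
            j = rng.randint(0, n)
            if j < k:
                sample[j] = row
    return sample


if __name__ == "__main__":
    stream = ([str(i), f"city{i}"] for i in range(100_000))
    for row in reservoir_sample(stream, 10, seed=1):
        print(row)
```

Only the `k` sampled rows ever need to be aligned, so the cost of formatting no longer depends on the size of the full data set.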
Note that `xsv table` documents this behavior:
[andrew@Serval tmp] xsv table --help
Outputs CSV data as a table with columns in alignment.
This will not work well if the CSV data contains large fields.
Note that formatting a table requires buffering all CSV data into memory.
Therefore, you should use the 'sample' or 'slice' command to trim down large
CSV data before formatting it with this command.
N.B. You might wonder, why in the world does `xsv table` require 1.7 GB of memory for a CSV file that is only 145 MB? The answer is that computing the alignments requires a data structure other than just the buffered data and that data structure hasn't really been memory optimized because you're not supposed to feed hundreds of megabytes through it. (One easy answer to this is to force `xsv table` to flush its buffer after X bytes are read.)
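The "flush after X bytes" idea can be sketched in Python as follows. This is only an illustration of the approach, not an existing `xsv` feature; `table_chunked`, `_align`, and `max_bytes` are hypothetical names. The trade-off is that columns are aligned within each flushed chunk but not necessarily across chunks.

```python
from typing import Iterable, Iterator, List


def _align(rows: List[List[str]], pad: int = 2) -> Iterator[str]:
    """Yield the buffered rows padded to a common per-column width."""
    ncols = max((len(row) for row in rows), default=0)
    widths = [0] * ncols
    for row in rows:
        for i, field in enumerate(row):
            widths[i] = max(widths[i], len(field))
    for row in rows:
        yield "".join(f.ljust(widths[i] + pad) for i, f in enumerate(row)).rstrip()


def table_chunked(
    rows: Iterable[List[str]], max_bytes: int = 1 << 20
) -> Iterator[str]:
    """Align rows in chunks so at most roughly max_bytes of fields are buffered."""
    buf: List[List[str]] = []
    size = 0
    for row in rows:
        buf.append(row)
        size += sum(len(field.encode("utf-8")) for field in row)
        if size >= max_bytes:
            # Flush: alignment is computed over this chunk only.
            yield from _align(buf)
            buf, size = [], 0
    if buf:
        yield from _align(buf)


if __name__ == "__main__":
    data = [["a", "bb"], ["ccc", "d"], ["eeeee", "f"]]
    for line in table_chunked(data, max_bytes=5):
        print(line)
```

With this approach, peak memory is bounded by the chunk size rather than by the size of the input.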
Hello,
Thank you for your feedback. I'll run time and see what changes.
Thanks!