Comments (6)
I definitely think that line buffering on output was a big issue in the last benchmark. That's fixed in the latest commit; reading is still slower though.
from frawk.
Fascinating! Thanks for filing this issue. I think mawk, frawk, and gawk may all be buffering their file IO a bit differently. I can try to take a look at what mawk is doing and compare it with the Rust standard library.
I assume mawk might not do line buffering in this case.
The original code I had actually decompresses gzipped files and writes gzipped output via those *_cmd commands:
https://github.com/aertslab/single_cell_toolkit/blob/master/barcode_10x_scatac_fastqs.sh
Could CommandReader be used for reading from pipes to solve this issue? https://docs.rs/grep-cli/latest/grep_cli/struct.CommandReader.html
Feel free to try things out on that latest commit: I don't notice any improvement (and wrapping in a BufRead doesn't seem to help either, unfortunately).
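For anyone following along, this is roughly what "wrapping in a buffered reader" means with the Rust standard library. It is only a sketch, not frawk's actual code: `count_lines` is a hypothetical helper and the 64 KiB capacity is an arbitrary assumption. It reuses a single `String` so that no allocation happens per line.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical helper (not part of frawk): reads `input` line by line
/// through a BufReader and returns (line count, total bytes read).
fn count_lines<R: Read>(input: R) -> io::Result<(u64, u64)> {
    // The buffer size is an assumption; the std default is smaller.
    let mut reader = BufReader::with_capacity(64 * 1024, input);
    // One String is reused for every line to avoid per-line allocation.
    let mut line = String::new();
    let mut lines = 0u64;
    let mut bytes = 0u64;
    loop {
        line.clear();
        // read_line returns the number of bytes read, newline included;
        // 0 means end of input.
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break;
        }
        lines += 1;
        bytes += n as u64;
    }
    Ok((lines, bytes))
}
```

Calling it as `count_lines(std::io::stdin().lock())` would read from a pipe. If a loop like this is fast on the same input, the slowdown is more likely in how getline is dispatched than in the buffering itself.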
It is probably not related to reading from a pipe; getline itself seems to be slow (roughly 23 s versus 8 s for the plain pattern-action version in the timings below).
When reading directly from a premade file (with getline) instead of a piped filehandle, the slowdown is the same.
❯ time yes | head -n 100000000 | frawk '{ print $0 }' > /dev/null
real 0m8.219s
user 0m8.313s
sys 0m0.421s
❯ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 } }' > /dev/null
real 0m23.011s
user 0m23.014s
sys 0m0.491s
❯ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 > "/dev/null" } }'
real 0m26.739s
user 0m26.760s
sys 0m0.512s
❯ time yes | head -n 100000000 | frawk 'BEGIN { write_cmd = "cat > /dev/null"; while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 | write_cmd } }'
real 0m26.507s
user 0m26.519s
sys 0m0.564s
# Create file first.
❯ yes | head -n 100000000 > 100000000.txt
❯ time frawk 'BEGIN { while ( (getline line1 < "100000000.txt") > 0 ) { print line1 } }' > /dev/null
real 0m23.025s
user 0m22.778s
sys 0m0.148s
Also, now that CommandReader is used, it should be relatively straightforward to handle compressed text files automatically on request, by constructing a CommandReader with the appropriate decompression tool.
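A rough sketch of that idea, to make it concrete. Everything here is an assumption for illustration: `decompressor_for` and `open_lines` are hypothetical names, the extension table is made up, and the sketch uses plain `std::process::Command` instead of grep-cli's CommandReader so that it stays stdlib-only.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;
use std::process::{Command, Stdio};

/// Hypothetical mapping from file extension to a decompression tool.
/// All four tools accept `-dc` (decompress to stdout).
fn decompressor_for(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "gz" => Some("gzip"),
        "bz2" => Some("bzip2"),
        "xz" => Some("xz"),
        "zst" => Some("zstd"),
        _ => None,
    }
}

/// Opens `path` for line-oriented reading. Files with a known compressed
/// extension are piped through the matching tool; others are read directly.
fn open_lines(path: &Path) -> io::Result<Box<dyn BufRead>> {
    match decompressor_for(path) {
        Some(tool) => {
            let mut child = Command::new(tool)
                .arg("-dc")
                .arg(path)
                .stdout(Stdio::piped())
                .spawn()?;
            // The child is never waited on here, so its exit status and
            // stderr are ignored; a real implementation should handle both.
            let stdout = child.stdout.take().expect("stdout was set to piped");
            Ok(Box::new(BufReader::new(stdout)))
        }
        None => Ok(Box::new(BufReader::new(File::open(path)?))),
    }
}
```

As far as I understand the grep-cli docs, CommandReader takes care of waiting on the child and surfacing its stderr, which is exactly what this sketch skips, so it would be the better building block.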