Comments (22)
I have started implementing support for reading from pipes. Writing should already work.
Automatic detection of compressed data from pipes is very difficult to implement correctly, especially when followed by automatic FASTA/FASTQ detection. To simplify the implementation, I have decided to support compressed input from pipes only when a specific option is given, e.g. --gzip or --bzip2. This is similar to GNU tar: http://www.gnu.org/software/tar/manual/html_node/gzip.html
Uncompressed data from pipes is no problem, and automatic compression detection on files will still work.
A filename consisting of a single dash (-) will be treated as a synonym for /dev/stdin or /dev/stdout.
The progress indicator will be unavailable when reading from a stream.
Writing compressed output files (FASTA and FASTQ) is another matter that we could also support.
from vsearch.
If vsearch reads from and writes to pipes, is there really a need to support compressed data? It is a genuine question, as I cannot imagine a case where streaming compressed data is the only option. In general, I use anonymous pipes (or named pipes when I need to do fancy stuff) and I decompress the data before streaming it.
from vsearch.
I am not sure it would be very useful, except for dereplication. When one tries to dereplicate a bunch of FASTA files, instead of merging them into a big file before the dereplication, it would be more elegant to pipe them into vsearch's stdin (it would save disk space and time too).
Another way to achieve that is to allow multiple filenames for the --derep_fulllength command. It should be possible to parse that, assuming filenames do not start with hyphens.
from vsearch.
Check if the input is a regular file or a stream. If it is a stream, do not provide a progress bar for reading.
from vsearch.
Check if the filename is a dash (-) and read from stdin or write to stdout instead.
from vsearch.
A problem is that we cannot rewind streams. To check whether a file is compressed with gzip or bzip2, we currently read the signature of the file (the first two "magic" bytes) and then rewind. The same is true for auto-detection of FASTA or FASTQ files (the file starts with ">" or "@"). We need to keep the bytes already read so that we do not have to rewind the file.
from vsearch.
I was just about to request that multiple files could be passed via -i, and came across this thread. This would be extremely useful for me for dereplication, as I'm currently cat'ing a bunch of FASTA files together, dereplicating, and then deleting the cat'ed file.
Also, a side note, but we released biom 2.1.5 this week, which now has a from-uc command that facilitates getting your vsearch data into QIIME. I have some notes on what the pipeline looks like here. Feel free to share if it's useful for your users.
from vsearch.
Hi @gregcaporaso, I'll try to figure out a way to support input from a stream so that several files can be dereplicated together.
Thanks for the update to biom. I'll post a note about it.
from vsearch.
Great, thank you!
from vsearch.
Hi @torognes,
Thank you for your fantastic software.
I wanted to voice support for this option. I would like to use it in the context of a Makefile where I can perform some streaming quality filtering and pipe the result into merge_pairs. Right now I trim and write to a file before merging, but pipes/process substitution would be a more performant and elegant way to do it.
Thanks again,
zach cp
from vsearch.
Hi @zachcp ! Thanks for your suggestion. If you want to merge FASTQ files by piping the results of quality filtering into vsearch, how would you supply all the input through one pipe? Do you use a kind of interleaved format where the sequences from the two ends are both found in the same file (121212...)? At the moment, vsearch requires these reads to be in two separate files. - Torbjørn
from vsearch.
Hi @torognes,
Thanks for getting back to me!
I was thinking of using vsearch in the context of a Makefile where quality trimming and merging can be done in a single, streaming step. Compare the following two examples:
    $(clusterdir)/%.fq: $(fastqdir)/%_F.fastq $(fastqdir)/%_R.fastq
    	# trim and create tempfiles
    	seqtk trimfq $< > tempF.fq
    	seqtk trimfq $(word 2,$^) > tempR.fq
    	# run search with tempfiles as input
    	vsearch \
    	  -fastq_mergepairs tempF.fq \
    	  -reverse tempR.fq \
    	  -fastqout $@
    	# remove tempfiles
    	rm tempF.fq tempR.fq

    $(clusterdir)/%.fq: $(fastqdir)/%_F.fastq $(fastqdir)/%_R.fastq
    	# trim and merge in one streaming step with no intermediates
    	vsearch \
    	  -fastq_mergepairs <(seqtk trimfq $<) \
    	  -reverse <(seqtk trimfq $(word 2,$^)) \
    	  -fastqout $@
In the first case I pipe my forward/reverse FASTQ through seqtk prior to merging and save the results to files. I then use these files as input for the merge, and then delete them. In the second case I do the same thing with process substitution to avoid creating files. Setting aside for a second whether this strategy is best for merging (quick tests indicated a much higher percentage of merged reads, but I don't know if I should truncate reads at consistent lengths), it is more performant and less error-prone (no temp files to mix up).
So although I am not using pipes per se, process substitution relies on the same streaming model. The second example will throw an error, as vsearch will not be handed an actual file.
Thanks again,
zach cp
from vsearch.
@zachcp , ok, thanks, now I see what you mean. I'll try to change vsearch so that it will generally accept input from pipes as well and not just ordinary files. But it will require some restructuring and changes to the code so it may take some time. - Torbjørn
from vsearch.
+1 for this feature
Re need for rewinding:
There is no real need for compressed streams. When using pipes, it is easy enough to chain gunzip/bunzip2 outside of vsearch, so just assume uncompressed data if S_ISFIFO() is true.
That leaves testing for FASTQ vs FASTA. Since, without compression, the input is guaranteed to be a FILE * and only the first character is needed, you can use getc() and ungetc() to peek at the first character.
Elmar
from vsearch.
👍 for pipes. Power users would get so much mileage out of this. For example:
    mkfifo pipe
    zcat *fasta.gz > pipe &
    vsearch --derep_fulllength pipe --output pipe_test.fna
Even better, supporting pipes would instantly close several issues: #144 #115
It would even close #100!
    vsearch --derep_fulllength test.fna --output pipe &
    sed 's/;/\|/g' < pipe > no_semicolon.fna
from vsearch.
Hi @colinbrislawn,
    mkfifo pipe
    zcat *fasta.gz > pipe &
    vsearch --derep_fulllength pipe --output pipe_test.fna
nice usage of named pipes, a little-known feature of modern shells.
from vsearch.
Thanks! I'm looking forward to using these pipes, whenever they become supported.
I'm not sure pipes are needed for all inputs and outputs, but they would be
really handy for certain functions.
Colin
from vsearch.
Thanks for all comments. Seems this will be a very useful feature.
The idea of using the ungetc function suggested by @epruesse is brilliant. I hope it will work. Thanks!
from vsearch.
I think adding support for reading from / writing to pipes is a major change, allowing for a new version number, vsearch 2.0.0 :-). It should not break existing scripts and vsearch commands, but it offers the possibility to rewrite and simplify workflows.
from vsearch.
I do not think it is really necessary to support compression; it is just a convenience.
from vsearch.
Available now in version 2.0.0.
from vsearch.
Very nice, @torognes , very nice. I look forward to trying this. Thank you!
from vsearch.