Comments (22)
I have started implementing support for reading from pipes. Writing should already work.
Automatic detection of compressed data from pipes is very difficult to implement correctly, especially when followed by automatic FASTA/FASTQ detection. To simplify the implementation, I have decided to support compressed input from pipes only when a specific option is given, e.g. --gzip or --bzip2. This is similar to GNU tar: http://www.gnu.org/software/tar/manual/html_node/gzip.html
Uncompressed data from pipes is no problem, and automatic compression detection on files will still work.
A filename consisting of a single dash (-) will be treated as a synonym for /dev/stdin or /dev/stdout.
The progress indicator will be unavailable when reading from a stream.
Writing compressed output files (FASTA and FASTQ) is another matter that we could also support.
from vsearch.
If vsearch reads from and writes to pipes, is there really a need to support compressed data? It is a genuine question, as I cannot imagine a case where streaming compressed data is the only option. In general, I use anonymous pipes (or named pipes when I need to do fancy stuff) and I decompress the data before streaming it.
from vsearch.
I am not sure it would be very useful, except for dereplication. When one tries to dereplicate a bunch of FASTA files, instead of merging them into a big file before the dereplication, it would be more elegant to pipe them into vsearch's stdin (it would save disk space and time too).
Another way to achieve that is to allow multiple filenames for the --derep_fulllength command. It should be possible to parse that, assuming filenames do not start with hyphens.
from vsearch.
Check if the input is a regular file or a stream. If it is a stream, do not provide a progress bar for reading.
from vsearch.
Check if the filename is a dash (-) and read from stdin or write to stdout instead.
from vsearch.
A problem is that we cannot rewind streams. To check whether a file is compressed with gzip or bzip2, we currently read the signature of the file (the first two "magic" bytes) and then rewind. The same is true for auto-detection of FASTA or FASTQ files (the file starts with ">" or "@"). We need to keep the bytes already read so that we do not have to rewind the file.
from vsearch.
I was just about to request that multiple files could be passed via -i, and came across this thread. This would be extremely useful for me for dereplication, as I'm currently cat'ing a bunch of FASTA files together, dereplicating, and then deleting the cat'ed file.
Also, a side note, but we released biom 2.1.5 this week, which now has a from-uc command that facilitates getting your vsearch data into QIIME. I have some notes on what the pipeline looks like here. Feel free to share if it's useful for your users.
from vsearch.
Hi @gregcaporaso, I'll try to figure out a way to support input from a stream so that several files can be dereplicated together.
Thanks for the update to biom. I'll post a note about it.
from vsearch.
Great, thank you!
from vsearch.
Hi @torognes,
Thank you for your fantastic software.
I wanted to voice support for this option. I would like to use it in the context of a Makefile where I can perform some streaming quality filtering and pipe the result into merge_pairs. Right now I trim and write to a file before merging, but pipes/process substitution would be a more performant and elegant way to do it.
Thanks again,
zach cp
from vsearch.
Hi @zachcp ! Thanks for your suggestion. If you want to merge FASTQ files by piping the results of quality filtering into vsearch, how would you supply all the input through one pipe? Do you use a kind of interleaved format where the sequences from the two ends are both found in the same file (121212...)? At the moment, vsearch requires these reads to be in two separate files. - Torbjørn
from vsearch.
Hi @torognes,
Thanks for getting back to me!
I was thinking of using vsearch in the context of a Makefile where quality trimming and merging can be done in a single, streaming step. Compare the following two examples:
    $(clusterdir)/%.fq: $(fastqdir)/%_F.fastq $(fastqdir)/%_R.fastq
    	# trim and create tempfiles
    	seqtk trimfq $< > tempF.fq
    	seqtk trimfq $(word 2,$^) > tempR.fq
    	# run search with tempfiles as input
    	vsearch \
    	  -fastq_mergepairs tempF.fq \
    	  -reverse tempR.fq \
    	  -fastqout $@
    	# remove tempfiles
    	rm tempF.fq tempR.fq

    $(clusterdir)/%.fq: $(fastqdir)/%_F.fastq $(fastqdir)/%_R.fastq
    	# trim and merge in one streaming step with no intermediates
    	vsearch \
    	  -fastq_mergepairs <(seqtk trimfq $<) \
    	  -reverse <(seqtk trimfq $(word 2,$^)) \
    	  -fastqout $@
In the first case I pipe my forward/reverse FASTQ through seqtk prior to merging and save the results to files. I then use these files as input for the merge, and then delete them. In the second case I do the same thing with process substitution to avoid creating files. Setting aside for a second whether this strategy is best for merging (quick tests indicated a much higher percentage of merged reads, but I don't know if I should truncate reads at consistent lengths), it is more performant and less error-prone (no temp files to mix up).
So although I am not using pipes per se, process substitution relies on the same streaming model. The second example will throw an error, as vsearch will not be handed an actual file.
Thanks again,
zach cp
from vsearch.
@zachcp , ok, thanks, now I see what you mean. I'll try to change vsearch so that it will generally accept input from pipes as well and not just ordinary files. But it will require some restructuring and changes to the code so it may take some time. - Torbjørn
from vsearch.
+1 for this feature
Re need for rewinding:
There is no real need for compressed streams. When using pipes, it is easy enough to chain gunzip/bunzip2 outside of vsearch, so just assume uncompressed data if S_ISFIFO() is true.
That leaves testing for FASTQ vs FASTA. Since, without compression, the input is guaranteed to be a FILE * and only the first character is needed, you can use getc() and ungetc() to peek at the first character.
Elmar
from vsearch.
👍 for pipes. Power users would get so much mileage out of this. For example:
    mkfifo pipe
    zcat *fasta.gz > pipe &
    vsearch --derep_fulllength pipe --output pipe_test.fna
Even better, supporting pipes would instantly close several issues: #144 #115
It would even close #100!
    vsearch --derep_fulllength test.fna --output pipe &
    sed 's/;/\|/g' < pipe > no_semicolon.fna
from vsearch.
Hi @colinbrislawn,
    mkfifo pipe
    zcat *fasta.gz > pipe &
    vsearch --derep_fulllength pipe --output pipe_test.fna
nice usage of named pipes, a little-known feature of modern shells.
from vsearch.
Thanks! I'm looking forward to using these pipes, whenever they become supported.
I'm not sure pipes are needed for all inputs and outputs, but they would be
really handy for certain functions.
Colin
from vsearch.
Thanks for all comments. Seems this will be a very useful feature.
The idea of using the ungetc function suggested by @epruesse is brilliant. I hope it will work. Thanks!
from vsearch.
I think adding support for reading from / writing to pipes is a major change, allowing for a new version number, vsearch 2.0.0 :-). It should not break existing scripts and vsearch commands, but it offers the possibility to rewrite and simplify workflows.
from vsearch.
I do not think it is really necessary to support compression; it is just a convenience.
from vsearch.
Available now in version 2.0.0.
from vsearch.
Very nice, @torognes , very nice. I look forward to trying this. Thank you!
from vsearch.