bluenote-1577 / sylph

ultrafast genome querying and taxonomic profiling for metagenomic samples by abundance-corrected minhash.

License: MIT License

Rust 97.12% Python 2.88%

average-nucleotide-identity k-mer metagenomics sketching-algorithm taxonomic-classification

sylph's Introduction

sylph - fast and precise species-level metagenomic profiling with ANIs

Introduction

sylph is a program that performs ultrafast (1) ANI querying or (2) metagenomic profiling for metagenomic shotgun samples.

Containment ANI querying: sylph can search a genome, e.g. E. coli, against your sample. If sylph outputs an estimate of 97% ANI, your sample contains an E. coli with 97% ANI to the queried genome.

Metagenomic profiling: sylph can determine the species/taxa in your sample and their abundances, just like Kraken or MetaPhlAn.

Profiling 1 Gbp of mouse gut reads against 85,205 genomes in a few seconds

Why sylph?

Precise species-level profiling: Our tests show that sylph is more precise than Kraken and about as precise and sensitive as marker gene methods (MetaPhlAn, mOTUs).
Ultrafast, multithreaded, multi-sample: sylph can be > 50x faster than MetaPhlAn for multi-sample processing. sylph only takes ~15GB of RAM for profiling against the entire GTDB-R220 database (110k genomes).
Accurate (containment) ANIs down to 0.1x effective coverage: for bacterial ANI queries of > 90% ANI, sylph can often give accurate ANI estimates down to 0.1x coverage.
Easily customized databases: sylph can profile against metagenome-assembled genomes (MAGs), viruses, eukaryotes, and more. Taxonomic information can be incorporated downstream for traditional profiling reports.

How does sylph work?

sylph uses a k-mer containment method, similar to sourmash or Mash. sylph's novelty lies in using a statistical technique to correct ANI for low-coverage genomes within the sample, giving accurate results for low-abundance genomes. See here for more information on what sylph can and cannot do.
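
As a rough illustration of the idea, the relationship between containment, coverage, and ANI can be written as below. This is a simplified sketch, not sylph's exact estimator: it assumes k-mer counts follow a Poisson distribution with mean equal to the effective coverage, and the symbols G, S, C, and lambda are notation introduced here for illustration. How the effective coverage is inferred from the data is described in the paper.

```latex
% k: k-mer length
% G: set of k-mers sketched from the genome
% S: set of k-mers sketched from the sample
% \lambda: effective coverage of the genome in the sample (assumed Poisson)
\begin{align}
C &= \frac{|G \cap S|}{|G|} && \text{(observed containment)} \\
\mathrm{ANI}_{\text{naive}} &= C^{1/k} \\
\Pr[\text{a genome $k$-mer is sampled at least once}] &= 1 - e^{-\lambda} \\
\mathrm{ANI}_{\text{adj}} &= \left( \frac{C}{1 - e^{-\lambda}} \right)^{1/k}
\end{align}
```

For example, at an effective coverage of 0.1x only about 1 - e^{-0.1} ≈ 9.5% of a genome's k-mers are expected to appear in the sample, even if the genome is identical to the one present. Assuming k = 31 for illustration, the uncorrected estimate would be 0.095^(1/31) ≈ 92.7% ANI, while dividing by the sampling probability first brings the estimate back to 100%.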

Very quick start

Profile metagenome sample against GTDB-R220 (113,104 bacterial/archaeal species representative genomes)

# download GTDB-R220 pre-built database (~13 GB)
wget https://storage.googleapis.com/sylph-stuff/gtdb-r220-c200-dbv1.syldb

# multi-sample paired-end profiling (sylph version >= 0.6)
sylph profile gtdb-r220-c200-dbv1.syldb -1 *_1.fastq.gz -2 *_2.fastq.gz -t (threads) > profiling.tsv

# multi-sample single-end profiling
sylph profile gtdb-r220-c200-dbv1.syldb *.fastq -t (threads) > profiling.tsv

Install (current version v0.6.1)

Option 1: conda install

conda install -c bioconda sylph

Warning

conda install may break if AVX2 instructions are not available on your CPU. See the issue here. The binary and source install still work.
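
To see in advance whether this warning applies, one option on Linux is to look for the avx2 flag in /proc/cpuinfo. The sketch below is only a convenience check and not part of sylph; the function name has_avx2 is made up for this example, and on systems without /proc/cpuinfo (e.g. macOS) it simply reports that AVX2 was not detected.

```shell
# Report whether the CPU advertises AVX2 (Linux only: reads /proc/cpuinfo).
# has_avx2 is a hypothetical helper; the optional argument only exists so the
# check can be pointed at a different file.
has_avx2() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches "avx2" as a whole word; -q suppresses output.
    grep -qw avx2 "$cpuinfo" 2>/dev/null
}

if has_avx2; then
    echo "AVX2 detected: the conda package should not hit this issue"
else
    echo "AVX2 not detected: prefer the pre-built binary or a source build"
fi
```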

Option 2: Build from source

Requirements:

Rust (version > 1.63) and its associated tools such as cargo, which are assumed to be in PATH.
A C compiler (e.g. GCC)
make
cmake

Building takes a few minutes (depending on the number of cores).

git clone https://github.com/bluenote-1577/sylph
cd sylph

# If default rust install directory is ~/.cargo
cargo install --path . --root ~/.cargo
sylph query test_files/*

Option 3: Pre-built x86-64 linux statically compiled executable

If you're on an x86-64 system, you can download the binary and use it without any installation.

wget https://github.com/bluenote-1577/sylph/releases/download/latest/sylph
chmod +x sylph
./sylph -h

Note: the binary is compiled with a different set of libraries (musl instead of glibc), which probably affects performance.

Standard usage

Sketching reads/genomes (indexing)

# all fasta -> one *.syldb; fasta are assumed to be genomes
sylph sketch genome1.fa genome2.fa -o database
#EQUIVALENT: sylph sketch -g genome1.fa genome2.fa -o database

# multi-sample sketching of paired reads
sylph sketch -1 A_1.fq B_1.fq -2 A_2.fq B_2.fq -d read_sketch_folder

# multi-sample sketching for single end reads, fastq are assumed to be reads
sylph sketch reads.fq 
#EQUIVALENT: sylph sketch -r reads.fq
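
With paired reads, the shell expands each glob before sylph sees the arguments, so the files passed to -1 and -2 need to line up one-to-one. A pattern that matches both mates (for example *_R*) would place both files under -1. The sketch below checks that every first-mate file has a matching second-mate file before sketching. It assumes a hypothetical naming scheme of <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, and check_pairs is a helper made up for this example.

```shell
# For each first-mate file, derive the second-mate name and check it exists.
# Assumes the hypothetical scheme <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
check_pairs() {
    status=0
    for r1 in "$@"; do
        case "$r1" in
            *_R1.fastq.gz)
                r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
                ;;
            *)
                echo "not a first-mate file: $r1" >&2
                status=1
                continue
                ;;
        esac
        [ -f "$r2" ] || { echo "no mate found for $r1" >&2; status=1; }
    done
    return "$status"
}

# Usage (commented out so that sourcing this block does not run sylph):
# check_pairs *_R1.fastq.gz && \
#     sylph sketch -1 *_R1.fastq.gz -2 *_R2.fastq.gz -d read_sketch_folder
```

Using separate, explicit globs for each mate keeps the two argument lists the same length and in the same sorted order.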

Profiling or querying with sketch files

# ANI querying 
sylph query database.syldb read_sketch_folder/*.sylsp -t (threads) > ani_queries.tsv

# taxonomic profiling 
sylph profile database.syldb read_sketch_folder/*.sylsp -t (threads) > profiling.tsv

Tutorials, manuals, and pre-built databases

Pre-built databases

The pre-built databases available here can be downloaded and used with sylph for profiling and containment querying.

Cookbook

For common use cases and fast explanations, see the above cookbook.

Tutorials

Manuals

sylph-utils

For incorporating taxonomy and manipulating output formats, see the sylph-utils repository.

Changelog

Version v0.6.1 - 2024-04-29.

Made unknown estimation (-u) more robust for low-depth short-read sequencing.

See the CHANGELOG for complete details.

Citing sylph

Jim Shaw and Yun William Yu. Metagenome profiling and containment estimation through abundance-corrected k-mer sketching with sylph (2023). bioRxiv. (Accepted for publication)

sylph's People

Forkers

alienzj hirenbioinfo schaudge nahid18 yhg926 gzhoffie

sylph's Issues

-o option only works for sketching databases, but not samples

Hi @bluenote-1577,

-o option seems to be ignored while sketching samples.

Could you fix this?

Conda executable crashes

Hi @bluenote-1577,

I installed sylph using bioconda but it crashed immediately with the following error "Illegal instruction (core dumped)".
Notably, the executable available on github works.

Best,
Florian

Can it replace kraken2 for pathogen identification?

I tested pathogen identification on a FASTQ file of 20M 50-bp reads, but I didn't get any results.
Is this software only suitable for applications with large data volumes such as gut microbiota?

debug and trace options do nothing with prebuilt binaries

Hi @bluenote-1577,

Maybe it's on purpose, but the debug and trace options do not generate any different output.
Yet, I am interested in some statistics such as the number of deduplicated reads.

Taxonomic profiling for contigs

Hi, I want to use sylph to taxonomically profile a set of contigs against the GTDB v220 database. I installed sylph v.0.6.1 with conda.

I sketched the database from the contigs like this:
sylph sketch contigs.fa -i -o contigs

then I want to do the taxonomic profiling like this:
sylph profile contigs.syldb v0.3-c200-gtdb-r214.syldb -o contigs_gtdb_sylph_results.tsv

But I get the error:
2024-06-28T12:45:16.610Z INFO [sylph::contain] Obtaining sketches...
2024-06-28T12:45:16.610Z ERROR [sylph::contain] No read files found; see sylph query/profile -h for help. Exiting

Apparently sylph is looking for read files, and doesn't understand it's a contig database. Could you help me fix this problem?

Thanks a lot in advance

redundant .clone() call on read_sketch_file

Hello! Working on getting a Docker image built over in StaPH-B/docker-builds#816 and noticed this during compilation:

#10 50.29    Compiling sylph v0.4.1 (/sylph-0.4.1)
#10 50.84 warning: call to `.clone()` on a reference in this situation does nothing
#10 50.84    --> src/contain.rs:485:47
#10 50.84     |
#10 50.84 485 |         let file = File::open(read_sketch_file.clone()).expect(&format!(
#10 50.84     |                                               ^^^^^^^^ help: remove this redundant call
#10 50.84     |
#10 50.84     = note: the type `str` does not implement `Clone`, so calling `clone` on `&str` copies the reference, which does not do anything and can be removed
#10 50.84     = note: `#[warn(noop_method_call)]` on by default

Is this something we need to be concerned with? cc @erinyoung

sylph and long-read

Hello

Does sylph work with long nanopore reads?

Best

Version used for pre-sketched viral database

Hello, thanks for this amazing tool!

Just wanted to confirm which version of the IMG/VR4 database you made the pre-sketched viral database from.
And whether you possess the corresponding metadata file.

Thank you!

Unexpected behaviour on the ZymoBIOMICS Fecal Reference

Hi @bluenote-1577,

I have downloaded Illumina sequencing data of the ZymoBIOMICS Fecal Reference.
Then, I have performed taxonomic profiling with sylph (database: gtdb_r207, parameters: --estimate-unknown --read-length 150)
and with meteor, the tool we are currently developing based on species-specific marker genes.

Then, I have compared the species coverage of both tools in this log-scale scatterplot:

As you can see, both tools agree well for high-coverage (lambda) species, but sylph starts overestimating coverage when it re-estimates lambda with its statistical model.
In addition, sylph fails to detect species below ~0.2x coverage, which is one order of magnitude above the detection limit with simulated data.

There are strain mixtures of the same species in this sample that may explain this strange behaviour.

Your feedback is welcome,
Florian

[FEATURE REQUESTS] - post here for suggestions/feature requests

Feature requests

Purpose: this is a place to easily log suggestions/feature requests. E.g:

"I want to display XXX output as an option!"
"I want to be able to combine database sketches!"

Give a rationale and provide concise/clear instructions if possible. Opinions are welcome too.

You're welcome to email me or open another issue. This thread is to aggregate suggestions without the hassle of opening another issue.

Current feature requests

Here are some current feature requests.

Originally posted by @jolespin in bluenote-1577/skani#23 (comment)

~~Option for renaming samples. Sylph currently fixes each sample sketch to the read names.~~ done in v0.5.0
Command line options for inspecting database sketches.
Command line option to append/merge databases.
~~Line-delimited file for database sketches for sylph profile/query~~ done in v0.5.0

@fplaza #6 (comment)

~~Save read length while sketching so the user does not have to provide it to compute true coverage.~~ done in v0.5.0

Different ways of sketching reads and groupings

Integrating results generated from pre-built databases across domains

Hi -

I'm interested in obtaining estimates of relative abundance across all domains of life. Is this possible with sylph? If so, it's unclear to me if that would entail concatenation of pre-built databases or whether a simple merge (after sylph_to_taxprof.py) would be sufficient.

Thanks very much.

sylph sketch -1 4914_4_cat_R*

The reads files are in form of:

4914_4_cat_R1.fastq.gz 4914_4_cat_R2.fastq.gz

The command fails with output:

2024-05-08T18:01:26.964Z ERROR [sylph::sketch] Different number of paired sequences. Exiting.

I double checked the short read sets, and the number of sequences in the R1 and R2 fastq files are the same - 4,078,148 as confirmed by both seqkit stats and manual awk screening of the read files.

Any clue on what I could do here would be appreciated.

Thank you!
