sylph's Introduction

sylph - fast and precise species-level metagenomic profiling with ANIs

Introduction

sylph is a program that performs ultrafast (1) ANI querying or (2) metagenomic profiling for metagenomic shotgun samples.

Containment ANI querying: sylph can search a genome, e.g. E. coli, against your sample. If sylph outputs an estimate of 97% ANI, your sample contains an E. coli with 97% ANI to the queried genome.

Metagenomic profiling: sylph can determine the species/taxa in your sample and their abundances, just like Kraken or MetaPhlAn.

Profiling 1 Gbp of mouse gut reads against 85,205 genomes in a few seconds

Why sylph?

  1. Precise species-level profiling: Our tests show that sylph is more precise than Kraken and about as precise and sensitive as marker gene methods (MetaPhlAn, mOTUs).

  2. Ultrafast, multithreaded, multi-sample: sylph can be > 50x faster than MetaPhlAn for multi-sample processing. sylph only takes ~15GB of RAM for profiling against the entire GTDB-R220 database (110k genomes).

  3. Accurate (containment) ANIs down to 0.1x effective coverage: for bacterial ANI queries of > 90% ANI, sylph can often give accurate ANI estimates down to 0.1x coverage.

  4. Easily customized databases: sylph can profile against metagenome-assembled genomes (MAGs), viruses, eukaryotes, and more. Taxonomic information can be incorporated downstream for traditional profiling reports.
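To see why low coverage (item 3) is hard, consider a simple Poisson sampling model. It is used here only as an illustration and is not taken from sylph's code. At effective coverage lambda, a given genome k-mer is expected to appear in the reads with probability 1 - e^(-lambda), so at 0.1x coverage most k-mers of a genome that is truly present are never observed. The helper name `seen_fraction` is hypothetical.

```shell
# Expected fraction of a genome's k-mers observed at least once in the reads,
# assuming k-mer counts follow a Poisson distribution with mean lambda.
# Illustrative model only; this is not sylph's implementation.
seen_fraction() {
    LC_ALL=C awk -v lambda="$1" 'BEGIN { printf "%.4f\n", 1 - exp(-lambda) }'
}

seen_fraction 0.1   # roughly a tenth of the k-mers are expected to be seen
seen_fraction 5     # nearly all k-mers are expected to be seen
```

A naive containment estimate at low coverage therefore understates how much of the genome is present, which is the effect a coverage correction has to undo.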

How does sylph work?

sylph uses a k-mer containment method, similar to sourmash or Mash. sylph's novelty lies in using a statistical technique to correct ANI for low-coverage genomes within the sample, giving accurate results for low-abundance genomes. See the project documentation for more information on what sylph can and cannot do.
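As a rough sketch of the idea, a containment index C (the fraction of a genome's k-mers found in the sample) is commonly converted to ANI as C^(1/k), as done by Mash-style tools. The second helper divides the observed containment by the fraction of k-mers expected to be sampled at effective coverage lambda under a Poisson model. Both helpers are hypothetical, k = 31 is an assumed k-mer length, and the correction is a simplified stand-in for illustration, not sylph's actual estimator.

```shell
# Convert a k-mer containment index to an ANI estimate: ANI = containment^(1/k).
# k defaults to 31, an assumed value for this example.
containment_to_ani() {
    LC_ALL=C awk -v c="$1" -v k="${2:-31}" 'BEGIN { printf "%.4f\n", exp(log(c) / k) }'
}

# Same conversion after dividing the observed containment by the expected
# fraction of k-mers sampled at effective coverage lambda (Poisson model).
corrected_ani() {
    LC_ALL=C awk -v c="$1" -v lambda="$2" -v k="${3:-31}" 'BEGIN {
        adjusted = c / (1 - exp(-lambda))
        if (adjusted > 1) adjusted = 1      # containment cannot exceed 1
        printf "%.4f\n", exp(log(adjusted) / k)
    }'
}

containment_to_ani 0.05       # naive estimate from 5% observed containment
corrected_ani 0.05 0.1        # same observation, corrected for 0.1x coverage
```

The corrected value is higher than the naive one because, at 0.1x coverage, a low observed containment is mostly explained by unsampled k-mers rather than by sequence divergence.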

Very quick start

Profile metagenome sample against GTDB-R220 (113,104 bacterial/archaeal species representative genomes)

# download GTDB-R220 pre-built database (~13 GB)
wget https://storage.googleapis.com/sylph-stuff/gtdb-r220-c200-dbv1.syldb

# multi-sample paired-end profiling (sylph version >= 0.6)
sylph profile gtdb-r220-c200-dbv1.syldb -1 *_1.fastq.gz -2 *_2.fastq.gz -t (threads) > profiling.tsv

# multi-sample single-end profiling
sylph profile gtdb-r220-c200-dbv1.syldb *.fastq -t (threads) > profiling.tsv
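The `(threads)` placeholder in the commands above stands for a number. One way to fill it in is to use the number of online CPUs; the helper name `default_threads` is hypothetical and the sylph call is shown only as a comment.

```shell
# Pick a thread count for -t: all online CPUs, or 1 if the count is unavailable.
# getconf _NPROCESSORS_ONLN works on Linux and macOS.
default_threads() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case "$n" in
        ''|*[!0-9]*) n=1 ;;   # fall back if the output is not a plain number
    esac
    echo "$n"
}

# Usage with the single-end command above (not executed here):
#   sylph profile gtdb-r220-c200-dbv1.syldb *.fastq -t "$(default_threads)" > profiling.tsv
```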

Install (current version v0.6.1)

Option 1: conda install

conda install -c bioconda sylph

Warning

conda install may break if AVX2 instructions are not available on your CPU. See the related GitHub issue for details. The binary and source installs still work.
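On Linux, AVX2 support can be checked before installing by looking for the `avx2` flag in `/proc/cpuinfo`. This is a minimal sketch; the helper name `has_avx2` is hypothetical and the check does not apply to macOS or other systems without `/proc/cpuinfo`.

```shell
# Return success if the CPU flags list avx2. Linux only: reads /proc/cpuinfo
# by default; an alternative file can be passed as the first argument.
has_avx2() {
    grep -qw avx2 "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_avx2; then
    echo "AVX2 available"
else
    echo "AVX2 not detected: consider the source build or the static binary"
fi
```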

Option 2: Build from source

Requirements:

  1. The Rust programming language (version > 1.63) and associated tools such as cargo, assumed to be in PATH.
  2. A C compiler (e.g. GCC)
  3. make
  4. cmake

Building takes a few minutes, depending on the number of cores.

git clone https://github.com/bluenote-1577/sylph
cd sylph

# If default rust install directory is ~/.cargo
cargo install --path . --root ~/.cargo
sylph query test_files/*
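Before building, it can help to confirm that the listed requirements are on PATH. A minimal sketch follows; the helper name `missing_tools` is hypothetical, and `cc` stands in for whichever C compiler is installed.

```shell
# Print the names of any required tools that are not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Requirements for the source build; no output means all were found.
missing_tools cargo cc make cmake
```

This only checks that the tools exist; it does not verify the Rust version requirement.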

Option 3: Pre-built x86-64 linux statically compiled executable

If you're on an x86-64 Linux system, you can download the binary and use it without any installation.

wget https://github.com/bluenote-1577/sylph/releases/download/latest/sylph
chmod +x sylph
./sylph -h

Note: the binary is statically compiled against a different C library (musl instead of glibc), which may reduce performance.
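Since the pre-built executable targets x86-64 Linux only, a quick platform check can save a failed download. This is a sketch with a hypothetical helper name, `binary_supported`.

```shell
# Succeed only for the platform the pre-built binary targets (x86-64 Linux).
# Arguments default to the current machine; pass values explicitly to test.
binary_supported() {
    os="${1:-$(uname -s)}"
    arch="${2:-$(uname -m)}"
    [ "$os" = "Linux" ] && [ "$arch" = "x86_64" ]
}

if binary_supported; then
    echo "pre-built binary should run here"
else
    echo "use conda or build from source instead"
fi
```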

Standard usage

Sketching reads/genomes (indexing)

# all fasta -> one *.syldb; fasta are assumed to be genomes
sylph sketch genome1.fa genome2.fa -o database
#EQUIVALENT: sylph sketch -g genome1.fa genome2.fa -o database

# multi-sample sketching of paired reads
sylph sketch -1 A_1.fq B_1.fq -2 A_2.fq B_2.fq -d read_sketch_folder

# multi-sample sketching for single end reads, fastq are assumed to be reads
sylph sketch reads.fq 
#EQUIVALENT: sylph sketch -r reads.fq

Profiling or querying with sketch files

# ANI querying 
sylph query database.syldb read_sketch_folder/*.sylsp -t (threads) > ani_queries.tsv

# taxonomic profiling 
sylph profile database.syldb read_sketch_folder/*.sylsp -t (threads) > profiling.tsv
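The resulting `.tsv` files can be inspected with standard tools. The sketch below pulls out one column by its header name; the helper `tsv_column` is hypothetical, and the column name in the usage comment is an assumption, so check the header line of your own output first.

```shell
# Print the values of a named column from a tab-separated file with a header row.
# Usage: tsv_column FILE COLUMN_NAME
tsv_column() {
    awk -F '\t' -v name="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i
                  if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
                  next }
        { print $col }
    ' "$1"
}

# Example (column name is an assumption; not executed here):
#   tsv_column profiling.tsv Taxonomic_abundance | sort -rn | head
```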

Tutorials, manuals, and pre-built databases

Pre-built databases are available for download and can be used with sylph for profiling and containment querying.

For common use cases and quick explanations, see the cookbook.

Tutorials

Manuals

For incorporating taxonomy and manipulating output formats, see the sylph-utils repository.

Changelog

Version v0.6.1 - 2024-04-29.

  • Made unknown estimation (-u) more robust for low-depth short-read sequencing.

See the CHANGELOG for complete details.

Citing sylph

Jim Shaw and Yun William Yu. Metagenome profiling and containment estimation through abundance-corrected k-mer sketching with sylph (2023). bioRxiv. (Accepted for publication)

sylph's People

Contributors

bluenote-1577

sylph's Issues

Conda executable crashes

Hi @bluenote-1577,

I installed sylph using bioconda but it crashed immediately with the following error "Illegal instruction (core dumped)".
Notably, the executable available on github works.

Best,
Florian

Can it replace kraken2 for pathogen identification?

I have tested sylph on a FASTQ file of 20M 50-bp reads for pathogen identification, but I didn't get any results.
Is this software only suitable for applications with large data volumes such as gut microbiota?
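One general factor with very short reads is how many k-mers each read contributes. A read of length L contains L - k + 1 k-mers, so k-mer coverage is lower than base coverage by a factor of (L - k + 1) / L. The calculation below assumes k = 31, which is an assumption for this example; the helper name `kmer_coverage_factor` is hypothetical. It does not by itself explain the empty result reported above.

```shell
# Ratio of k-mer coverage to base coverage for reads of a given length.
# k defaults to 31, an assumed value for this example.
kmer_coverage_factor() {
    LC_ALL=C awk -v len="$1" -v k="${2:-31}" 'BEGIN {
        if (len < k) { print "0.0000"; exit }   # read too short to hold a k-mer
        printf "%.4f\n", (len - k + 1) / len
    }'
}

kmer_coverage_factor 50    # 50-bp reads
kmer_coverage_factor 150   # 150-bp reads
```

Under this assumption, 50-bp reads yield half as much k-mer coverage per sequenced base as 150-bp reads (0.4 versus 0.8).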

Taxonomic profiling for contigs

Hi, I want to use sylph to taxonomically profile a set of contigs against the GTDB v220 database. I installed sylph v.0.6.1 with conda.

I sketched the database from the contigs like this:
sylph sketch contigs.fa -i -o contigs

then I want to do the taxonomic profiling like this:
sylph profile contigs.syldb v0.3-c200-gtdb-r214.syldb -o contigs_gtdb_sylph_results.tsv

But I get the error:
2024-06-28T12:45:16.610Z INFO [sylph::contain] Obtaining sketches...
2024-06-28T12:45:16.610Z ERROR [sylph::contain] No read files found; see sylph query/profile -h for help. Exiting

Apparently sylph is looking for read files, and doesn't understand it's a contig database. Could you help me fix this problem?

Thanks a lot in advance
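Per the usage section, FASTA inputs are sketched as genomes into a `*.syldb` database, while `sylph profile` expects a database plus read sketches (`*.sylsp`) or read files. Both inputs in the command above are `.syldb` files, which is consistent with the "No read files found" error. The sketch below only classifies inputs by extension; the helper names are hypothetical, and it does not settle whether contigs are an appropriate input for profiling.

```shell
# Classify a sketch file by its extension:
#   *.syldb = database sketch (genomes), *.sylsp = sample sketch (reads).
sketch_kind() {
    case "$1" in
        *.syldb) echo "database" ;;
        *.sylsp) echo "sample" ;;
        *)       echo "unknown" ;;
    esac
}

# Succeed if at least one of the given files is a sample sketch.
has_sample_sketch() {
    for f in "$@"; do
        [ "$(sketch_kind "$f")" = "sample" ] && return 0
    done
    return 1
}

has_sample_sketch contigs.syldb v0.3-c200-gtdb-r214.syldb \
    || echo "no *.sylsp sample sketch among the inputs"
```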

redundant .clone() call on read_sketch_file

Hello! Working on getting a Docker image built over in StaPH-B/docker-builds#816 and noticed this during compilation:

#10 50.29    Compiling sylph v0.4.1 (/sylph-0.4.1)
#10 50.84 warning: call to `.clone()` on a reference in this situation does nothing
#10 50.84    --> src/contain.rs:485:47
#10 50.84     |
#10 50.84 485 |         let file = File::open(read_sketch_file.clone()).expect(&format!(
#10 50.84     |                                               ^^^^^^^^ help: remove this redundant call
#10 50.84     |
#10 50.84     = note: the type `str` does not implement `Clone`, so calling `clone` on `&str` copies the reference, which does not do anything and can be removed
#10 50.84     = note: `#[warn(noop_method_call)]` on by default

Is this something we need to be concerned with? cc @erinyoung

Version used for pre-sketched viral database

Hello, thanks for this amazing tool!

Just wanted to confirm which version of the IMG/VR4 database you made the pre-sketched viral database from.
And whether you possess the corresponding metadata file.

Thank you!

Unexpected behaviour on the ZymoBIOMICS Fecal Reference

Hi @bluenote-1577,

I have downloaded Illumina sequencing data of the ZymoBIOMICS Fecal Reference.
Then, I have performed taxonomic profiling with sylph (database: gtdb_r207, parameters: --estimate-unknown --read-length 150)
and with meteor, the tool we are currently developing based on species-specific marker genes.

Then, I compared the species coverage of both tools in this log-scale scatterplot:
[Figure: log-scale scatterplot comparing species coverage from sylph and meteor]

As you can see, both tools agree well for high-coverage (lambda) species, but sylph starts overestimating coverage when it re-estimates lambda with its statistical model.
In addition, sylph fails to detect species below ~0.2x coverage, which is one order of magnitude above the detection limit with simulated data.

There are strain mixtures of the same species in this sample that may explain this strange behaviour.

Your feedback is welcome,
Florian

[FEATURE REQUESTS] - post here for suggestions/feature requests

Feature requests

Purpose: this is a place to easily log suggestions/feature requests. E.g:

  • "I want to display XXX output as an option!"
  • "I want to be able to combine database sketches!"

Give a rationale and provide concise/clear instructions if possible. Opinions are welcome too.

You're welcome to email me or open another issue. This thread is to aggregate suggestions without the hassle of opening another issue.

Current feature requests

Here are some current feature requests.

Originally posted by @jolespin in bluenote-1577/skani#23 (comment)

  • Option for renaming samples. Sylph currently fixes each sample sketch to the read names. (done in v0.5.0)
  • Command line options for inspecting database sketches.
  • Command line option to append/merge databases.
  • Line-delimited file for database sketches for sylph profile/query. (done in v0.5.0)

@fplaza #6 (comment)

  • Save read length while sketching so the user does not have to provide it to compute true coverage. (done in v0.5.0)

#7

  • Different ways of sketching reads and groupings

Integrating results generated from pre-built databases across domains

Hi -

I'm interested in obtaining estimates of relative abundance across all domains of life. Is this possible with sylph? If so, it's unclear to me if that would entail concatenation of pre-built databases or whether a simple merge (after sylph_to_taxprof.py) would be sufficient.

Thanks very much.

Building custom Database

Hello,
I would like to know if I can create my own database with fungi included.
Also, I am using RefSeq224 data files, which are *.fna.gz. Will the command pick those up instead of the fa.gz in your example?
Thanks
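The usage section shows that `sylph sketch -g` marks its inputs explicitly as genomes, so that flag can be used when in doubt about how a file extension is interpreted. Whether `*.fna.gz` is also detected automatically is not stated here. The sketch below counts matching files as a sanity check before sketching; the helper `count_genomes` and the directory and output names are placeholders.

```shell
# Count genome files with a given extension in a directory (default: *.fna.gz),
# as a sanity check before sketching them into a custom database.
count_genomes() {
    dir="${1:-.}"
    ext="${2:-fna.gz}"
    find "$dir" -maxdepth 1 -type f -name "*.$ext" | wc -l | tr -d ' '
}

# Sketching with -g marks the inputs explicitly as genomes (not executed here;
# the directory and output names are placeholders):
#   sylph sketch -g refseq_fungi/*.fna.gz -o fungi_database
```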

ITS profiling?

Can sylph profile ITS datasets? I downloaded the UNITE FASTA file and converted it to a syldb. Then I tested the tool on some ITS datasets, but didn't get any output. Thanks

Sketch error: different number of paired sequences

Hi,

I've been trying to run the sylph sketch command on a forward and reverse short read set using command:

sylph sketch -1 4914_4_cat_R*

The reads files are in form of:

4914_4_cat_R1.fastq.gz 4914_4_cat_R2.fastq.gz

The command fails with output:

2024-05-08T18:01:26.964Z ERROR [sylph::sketch] Different number of paired sequences. Exiting.

I double checked the short read sets, and the number of sequences in the R1 and R2 fastq files are the same - 4,078,148 as confirmed by both seqkit stats and manual awk screening of the read files.

Any clue on what I could do here would be appreciated.

Thank you!
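In the command above, the shell expands `4914_4_cat_R*` to both files, so both are passed under `-1` and no `-2` is given. This may be the cause of the error, although that is not confirmed in the thread. The sketch below derives the R2 mate name from the R1 name and prints an explicit `-1 ... -2 ...` command of the form shown in the usage section; the helper name `mate_of` is hypothetical.

```shell
# Derive the R2 file name from an R1 file name (expects "_R1." in the name).
mate_of() {
    case "$1" in
        *_R1.*) printf '%s_R2.%s\n' "${1%_R1.*}" "${1##*_R1.}" ;;
        *)      echo "not an R1 file name: $1" >&2; return 1 ;;
    esac
}

r1=4914_4_cat_R1.fastq.gz
r2=$(mate_of "$r1")
# Print the explicit paired-end command rather than running it.
echo "sylph sketch -1 $r1 -2 $r2"
```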
