โš ๏ธ THIS SOFTWARE IS STILL UNDER DEVELOPMENT - USE AT OWN RISK

sixess

Rapid 16S rDNA species identification from isolate FASTQ files

Introduction

sixess is a command-line software tool to identify bacterial species based on 16S rDNA sequence directly from WGS FASTQ data. It includes databases from NCBI (default), RDP and SILVA.

Quick start

# just give it sequences!
% sixess R1.fastq.gz
Staphylococcus epidermidis

# sometimes there is no match
% sixess /dev/null
No matches

# give it as many sequence files as needed
% sixess R1.fq R2.fq
Enterococcus faecium

# we provide different databases you can choose
% sixess -d RDP contigs.fa
Bacillus cereus

# you can pipe to stdin too
% bzcat chernobyl.fq.bz2 | sixess -
Deinococcus radiodurans

Installation

Source

cd $HOME
git clone https://github.com/tseemann/sixess
export PATH=$HOME/sixess/bin:$PATH

Homebrew

brew install brewsci/bio/sixess  # COMING SOON

Bioconda

conda install -c bioconda -c conda-forge sixess  # COMING SOON

Usage

Input

The input can be one or more sequence files, or - denoting stdin. The input data can be FASTQ or FASTA, and may be .gz compressed. Any read length is accepted, even whole chromosomes.

Output

The output is a single line to stdout. If a match was found, it will be Genus species. If no prediction could be made, it will be No matches.
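Because the result is a single line, it is easy to consume from a script. The following is a minimal sketch, not part of sixess: `parse_sixess_line` is a hypothetical helper that splits a `Genus species` line into two tab-separated fields and returns a non-zero status for `No matches`.

```shell
# Hypothetical helper (not part of sixess): interpret the one-line result.
# On a match, prints "Genus<TAB>species" and returns 0.
# For "No matches" (or an empty line), prints nothing and returns 1.
parse_sixess_line() {
  line=$1
  case $line in
    ''|'No matches') return 1 ;;
  esac
  genus=${line%% *}          # text before the first space
  species=${line#"$genus"}   # whatever follows the genus
  species=${species# }       # drop the separating space, if any
  printf '%s\t%s\n' "$genus" "$species"
}

# Example with a literal string; in practice the line would come from
# something like: result=$(sixess R1.fastq.gz)
parse_sixess_line "Staphylococcus epidermidis"
```

Returning a status code rather than printing a sentinel lets the caller branch with a plain `if parse_sixess_line "$result"; then ...`.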

Options

  -q        Quiet mode, no output
  -p DIR    Database folder (/home/tseemann/git/sixess/db)
  -d FILE   Database {NCBI RDP SILVA.gz} (NCBI)
  -t NUM    CPU threads (1)
  -m FILE   Save alignments to FILE in PAF format
  -V        Print version and exit
  • -q enables "quiet mode" which only prints to stderr for errors
  • -p is the location of the sequence databases
  • -d selects the database; they can be .gz compressed (see Databases)
  • -t increases threads; 3 is the suggested value for minimap2
  • -m allows you to save the PAF output of minimap2
  • -V prints the version and exits e.g. sixess 1.0

Databases

NCBI (bundled, default)

The NCBI 16S ribosomal RNA project contains curated 16S ribosomal RNA bacteria and archaea RefSeq entries. It has ~20,000 entries.

esearch -db nucleotide -query '33175[BioProject] OR 33317[BioProject]' \
  | efetch -db nuccore -format fasta \
  > "$(dirname "$(which sixess)")/../db/NCBI"

RDP (bundled)

Bacterial 16S rDNA sequences for "type strains" from the RDP database are included. These are denoted with (T) in the FASTA headers. It contains ~10,000 entries.

wget --no-check-certificate https://rdp.cme.msu.edu/download/current_Bacteria_unaligned.fa.gz
gunzip -c current_Bacteria_unaligned.fa.gz \
  | bioawk -cfastx '/\(T\)/{print ">" $name " " $comment "\n" toupper($seq)}' \
  > "$(dirname "$(which sixess)")/../db/RDP"

SILVA (bundled)

SILVA is a comprehensive on-line resource for quality checked and aligned ribosomal RNA sequence data. The filtered version of the aligned 16S/18S/SSU database contains ~100,000 entries.

# replace "132" with latest version as needed
wget https://www.arb-silva.de/fileadmin/silva_databases/release_132/Exports/SILVA_132_SSURef_Nr99_tax_silva.fasta.gz
gunzip -c SILVA_132_SSURef_Nr99_tax_silva.fasta.gz \
  | bioawk -cfastx \
    '$comment ~ /^Bacteria;|^Archaea;/ \
    && $comment !~ /(;unidentified|Mitochondria;|;Chloroplast|;uncultured| sp\.)/ \
    { sub(/^.*;/,"",$comment);
      gsub("U","T",$seq);
      print ">" $name " " $comment "\n" $seq }' \
  | seqtk seq -l 60 -U \
  > SILVA.tmp1
cd-hit-est -i SILVA.tmp1 -o SILVA.tmp2 -c 1.0 -T 0 -M 2000 -d 250
cp SILVA.tmp2 "$(dirname "$(which sixess)")/../db/SILVA"
rm -f SILVA.tmp1 SILVA.tmp2 SILVA.tmp2.clstr

Custom databases

Assuming you have a FASTA file of 16S DNA sequences called, say, /home/alex/GG.fa, you can use it in one of two ways:

Global installation

cp /home/alex/GG.fa "$(dirname "$(which sixess)")/../db/GG"
sixess -d GG R1.fastq.gz

Local installation

sixess -p /home/alex -d GG.fa R1.fastq.gz
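The recipes for the bundled databases write headers of the form `>ID description`, with the species name in the description. Assuming a custom database should follow the same layout (an assumption, not something sixess documents), a quick check like the sketch below lists headers that carry no description. `fasta_headers_without_description` is a hypothetical helper name.

```shell
# Hypothetical check (not part of sixess): print FASTA header lines that
# have only a sequence ID and no description text after it.
fasta_headers_without_description() {
  # A header such as ">id1 Genus species" has several whitespace-separated
  # fields; a bare ">id2" has exactly one.
  awk '/^>/ && NF < 2 { print }' "$1"
}

# e.g. fasta_headers_without_description /home/alex/GG.fa
```

An empty result means every header has at least one word after the sequence ID.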

Algorithm

  1. Identify reads which look like 16S (minimap2)
  2. Count up how many reads hit each 16S sequence (possibly weighted)
  3. Choose the top hit and report it
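Steps 2 and 3 can be approximated outside the tool from a PAF file saved with -m. This is a simplified sketch, not the author's implementation: it applies no weighting, counts each read at most once per target, and breaks ties alphabetically. `top_paf_hit` is a hypothetical helper name. It relies only on the PAF layout, where column 1 is the query name and column 6 is the target name.

```shell
# Hypothetical helper (not part of sixess): unweighted tally of a PAF file.
# PAF column 1 is the query (read) name and column 6 is the target name.
# Prints the target hit by the most reads as "target<TAB>count",
# or "No matches" if the file contains no alignments.
top_paf_hit() {
  awk -F'\t' '
    # Count each (read, target) pair only once.
    !seen[$1, $6]++ { count[$6]++ }
    END {
      best = ""; max = 0
      for (t in count)
        if (count[t] > max || (count[t] == max && t < best)) {
          max = count[t]; best = t
        }
      if (best == "") print "No matches"
      else print best "\t" max
    }' "$1"
}

# e.g. after: sixess -m hits.paf R1.fastq.gz
#      top_paf_hit hits.paf
```

The result is a database sequence ID rather than a species name. Mapping it to a species would mean looking up that ID in the database's FASTA headers.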

Feedback

Report bugs and give suggestions on the Issues page.

License

GPL Version 3

Author

Torsten Seemann

sixess's Issues

Crazy results

I know this program is under development, but I am trying to identify a bunch of species, and the results I am getting are all over the map. For example, I have a folder of isolates that I was under the impression were all S. pseudintermedius, but when I run them using the following I get things like this:

sixess -d SILVA.gz 8050A_17.contigs.fa

I get: Top hit: CSCB01000016.56.1922
Description: >CSCB01000016.56.1922 Staphylococcus aureus
Species: Staphylococcus aureus

I also have a folder of reads that map 100% to the B. anthracis Ames Ancestor reference genome, and when I try those samples I get:

sixess -d SILVA.gz KNP_51.fa

Top hit: CFLJ01000066.3592.5030
Description: >CFLJ01000066.3592.5030 Streptococcus pneumoniae
Species: Streptococcus pneumoniae

Am I doing something wrong? I can't see how this program could possibly be this wrong, or maybe I'm that wrong? Nevertheless, I am pretty positive, at least about the B. anthracis samples, that they are definitely B. anthracis. Could this just be contamination? Some of these are coming out as Bacillus phocaeensis, Bacillus cereus, etc.

I know that both of my folders (anthracis and pseudintermedius) have pairwise ANIs way under 95%, and CheckM also puts their contamination under 95%.

Just trying to figure out what is going on here: whether I have a bunch of wrong or contaminated samples from PIs who aren't sure what they are doing, whether this program has a lot more development to go, or whether I am an idiot and doing something wrong (this being the most likely scenario).

Any help is appreciated!

Support STDIN

$ gunzip -c -d test/R1.fq.gz | sixess -q -d SILVA.gz -t 4 -
ERROR: Can not read first FASTQ file: /home/tseemann/git/sixess/-

$ gunzip -c -d test/R1.fq.gz | sixess -q -d SILVA.gz -t 4 /dev/stdin
ERROR: Can not read first FASTQ file: /proc/160784/fd/pipe:[25567174]

I think minimap2 supports reading from stdin.

Output all results

Hi,
this is not really an issue, and I haven't tested the software yet, but I'm interested in the output. Is it possible to output all results instead of just the top hit?
We want to get the diversity out of many reads in one FASTQ file after Nanopore sequencing a microbiome.
As far as I understand your software, it does everything we want: mapping the 16S reads against the NCBI, SILVA and RDP databases (without 16S amplification?) and then outputting the hits. So maybe there is an option to output all results, or a simple addition to the source code that we could write?

Thank you in advance.
Jan

Does this trial version already work?

Dear Torsten, sixess seems to be a nice and easy solution for addressing contamination problems in FASTQ files. Is this version already operational, given that we have all the required tools? Thank you!
