gottcha2's Introduction

Genomic Origin Through Taxonomic CHAllenge (GOTTCHA)

GOTTCHA is a gene-independent, signature-based metagenomic taxonomic profiling method with significantly lower false discovery rates (FDR) than other profiling tools, and it can be deployed on a laptop. The algorithm was tested and validated on twenty synthetic and mock datasets ranging in community composition and complexity, was applied successfully to data generated from spiked environmental and clinical samples, and performed better than the other available tools it was compared with.

GOTTCHAv2 is currently under development and in BETA stage. Pre-built databases for v1 are incompatible with v2.


DEPENDENCIES

The GOTTCHA2 profiler is written in Python 3 and uses minimap2 to map reads to signature sequences. To run GOTTCHA2, the following dependencies must be installed on your system. A YAML file for the Conda environment is provided in environment.yml.

  • Python 3.6+
  • minimap2 2.17+
  • pandas
  • samtools
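
The dependency list above can be checked before a run. The following is a minimal sketch, not part of GOTTCHA2: the function name `check_dependencies` is hypothetical, and it only tests that the tools are present, not that minimap2 is version 2.17 or later.

```python
import importlib.util
import shutil
import sys


def check_dependencies(tools=("minimap2", "samtools"),
                       modules=("pandas",),
                       min_python=(3, 6)):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(tool) is None:
            problems.append("executable not found on PATH: %s" % tool)
    for module in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(module) is None:
            problems.append("Python module not installed: %s" % module)
    return problems


if __name__ == "__main__":
    for problem in check_dependencies():
        print(problem)
```

An empty result only means the executables and modules were found; version requirements still need to be verified separately.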

QUICK START

  1. Install the package:

     via conda `conda install -c bioconda gottcha2`
    
     OR
    
     Download or git clone GOTTCHA2 from this repository and run `pip install .`
    
  2. Download the latest version of the GOTTCHA2 database. (This step may take some time)

     https://ref-db.edgebioinformatics.org/gottcha2/RefSeq-r220/
    
  3. Run GOTTCHA2:

     $ gottcha2.py -d RefSeq-r220_BAVxH-cg/gottcha_db.species.fna -t 8 -i <FASTQ>
     
     OR
     
     $ gottcha2 profile -d RefSeq-r220_BAVxH-cg/gottcha_db.species.fna -t 8 -i <FASTQ>
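
When GOTTCHA2 is called from a pipeline script, the command in step 3 can be assembled programmatically. This is a sketch under assumptions: the helper `build_profile_command` and the FASTQ file names are hypothetical, and it assumes `gottcha2 profile` accepts the same `-o` and `-p` options that the USAGE section lists for `gottcha2.py`.

```python
import shlex


def build_profile_command(database, fastq_files, threads=8,
                          outdir=None, prefix=None):
    """Assemble the argument list for `gottcha2 profile`."""
    cmd = ["gottcha2", "profile", "-d", database, "-t", str(threads), "-i"]
    cmd.extend(fastq_files)          # -i accepts one or more input files
    if outdir is not None:
        cmd.extend(["-o", outdir])   # output directory
    if prefix is not None:
        cmd.extend(["-p", prefix])   # prefix of the output files
    return cmd


cmd = build_profile_command(
    "RefSeq-r220_BAVxH-cg/gottcha_db.species.fna",
    ["sample_R1.fastq", "sample_R2.fastq"],  # placeholder file names
    outdir="results",
    prefix="sample",
)
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.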
    

RESULT

GOTTCHA2 can output the profiling results in either CSV, TSV or BIOM format.

  • summary (.tsv or .csv) - A summary of the profiling results (10 columns), broken down by taxonomic rank
  • full (.tsv or .csv) - The full profiling results, including unfiltered results and additional columns
  • lineage (.lineage.tsv) - The lineage and abundance of each profiled taxon, one taxon per line
  • extract (.extract[TAXID].fastq) - Extracted reads for a specific taxon.

Summary and full reports

A full GOTTCHA2 report has 22 columns in tab-delimited format. The summary report is a shorter version that contains the first 10 columns and only the qualified taxa. The report breaks down the profiling results by taxonomic rank, from superkingdom to strain, followed by the other information listed below. By default, the rollup depth of coverage (ROLLUP_DOC) is used to calculate relative abundance (column 10), as well as the other relative abundance calculations (columns 19-21).

| COLUMN | NAME | DESCRIPTION |
|--------|------|-------------|
| 1 | LEVEL | Taxonomic rank |
| 2 | NAME | Taxonomic name |
| 3 | TAXID | Taxonomic ID |
| 4 | READ_COUNT | Number of mapped reads |
| 5 | TOTAL_BP_MAPPED | Total bases of mapped reads |
| 6 | TOTAL_BP_MISMATCH | Total mismatch bases of mapped reads |
| 7 | LINEAR_LENGTH | Number of non-overlapping bases covering the signatures |
| 8 | LINEAR_DOC | Linear depth-of-coverage = TOTAL_BP_MAPPED / LINEAR_LENGTH |
| 9 | ROLLUP_DOC | Rollup depth-of-coverage = ∑ DOC of sub-level |
| 10 | REL_ABUNDANCE | Relative abundance (normalized abundance) = ABUNDANCE / ∑ ABUNDANCE of given level |
| 11 | LINEAR_COV | Proportion of covered signatures to total signatures of mapped organism(s) = LINEAR_LENGTH / SIG_LENGTH_TOL |
| 12 | LINEAR_COV_MAPPED_SIG | Proportion of covered signatures to mapped signatures = LINEAR_LENGTH / SIG_LENGTH_MAPPED |
| 13 | BEST_LINEAR_COV | Best linear coverage of corresponding taxa |
| 14 | DOC | Average depth-of-coverage = TOTAL_BP_MAPPED / SIG_LENGTH_TOL |
| 15 | BEST_DOC | Best DOC of corresponding taxa |
| 16 | SIG_LENGTH_TOL | Length of all signatures in mapped organism(s) |
| 17 | SIG_LENGTH_MAPPED | Length of signatures in mapped signature fragment(s) |
| 18 | ABUNDANCE | Abundance of the taxon: the value of either ROLLUP_DOC, READ_COUNT or TOTAL_BP_MAPPED |
| 19 | ZSCORE | Estimated Z-score of depth of coverage of the mapped region |
| 20 | NOTE | Reason for being filtered out (only set for filtered taxa) |
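
The ratio columns in the report can be recomputed from the raw counts. The sketch below applies the formulas exactly as written in the column descriptions to two made-up rows; the numbers, taxon names and helper functions are illustrative assumptions, not output from GOTTCHA2 and not its internal implementation.

```python
def derive_columns(row):
    """Apply the documented formulas to one report row (a dict of raw values)."""
    return {
        "LINEAR_DOC": row["TOTAL_BP_MAPPED"] / row["LINEAR_LENGTH"],
        "LINEAR_COV": row["LINEAR_LENGTH"] / row["SIG_LENGTH_TOL"],
        "LINEAR_COV_MAPPED_SIG": row["LINEAR_LENGTH"] / row["SIG_LENGTH_MAPPED"],
        "DOC": row["TOTAL_BP_MAPPED"] / row["SIG_LENGTH_TOL"],
    }


def relative_abundance(rows):
    """REL_ABUNDANCE = ABUNDANCE / sum of ABUNDANCE over rows of the same LEVEL."""
    totals = {}
    for row in rows:
        totals[row["LEVEL"]] = totals.get(row["LEVEL"], 0.0) + row["ABUNDANCE"]
    return [row["ABUNDANCE"] / totals[row["LEVEL"]] for row in rows]


# Made-up values for two species-level rows (illustration only).
rows = [
    {"LEVEL": "species", "NAME": "taxon A", "TOTAL_BP_MAPPED": 3000,
     "LINEAR_LENGTH": 1000, "SIG_LENGTH_TOL": 20000,
     "SIG_LENGTH_MAPPED": 4000, "ABUNDANCE": 3.0},
    {"LEVEL": "species", "NAME": "taxon B", "TOTAL_BP_MAPPED": 500,
     "LINEAR_LENGTH": 500, "SIG_LENGTH_TOL": 10000,
     "SIG_LENGTH_MAPPED": 1000, "ABUNDANCE": 1.0},
]

derived = [derive_columns(row) for row in rows]
rel_abundance = relative_abundance(rows)

for row, values, rel in zip(rows, derived, rel_abundance):
    print(row["NAME"], values, "REL_ABUNDANCE=%.2f" % rel)
```

For taxon A this gives a LINEAR_DOC of 3000 / 1000 = 3.0 and a LINEAR_COV of 1000 / 20000 = 0.05; the two relative abundances sum to 1 within the species level.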

USAGE

usage: gottcha2.py [-h] [-i [FASTQ] [[FASTQ] ...]] [-s [SAMFILE]]
                   [-d [MINIMAP2_INDEX]] [-l [LEVEL]] [-ti [FILE]] [-np]
                   [-pm <INT>] [-e [TAXID]] [-fm [STR]] [-r [FIELD]]
                   [-t <INT>] [-o [DIR]] [-p <STR>] [-xm <STR>] [-mc <FLOAT>]
                   [-mr <INT>] [-ml <INT>] [-mz <FLOAT>] [-nc] [-c] [-v]
                   [--silent] [--debug]

Genomic Origin Through Taxonomic CHAllenge (GOTTCHA) is an annotation-
independent and signature-based metagenomic taxonomic profiling tool that has
significantly smaller FDR than other profiling tools. This program is a
wrapper to map input reads to pre-computed signature databases using minimap2
and/or to profile mapped reads in SAM format. (VERSION: 2.1.5 BETA)

optional arguments:
  -h, --help            show this help message and exit
  -i [FASTQ] [[FASTQ] ...], --input [FASTQ] [[FASTQ] ...]
                        Input one or multiple FASTQ/FASTA file(s). Use space
                        to separate multiple input files.
  -s [SAMFILE], --sam [SAMFILE]
                        Specify the input SAM file. Use '-' for standard
                        input.
  -d [MINIMAP2_INDEX], --database [MINIMAP2_INDEX]
                        The path of signature database. The database can be in
                        FASTA format or minimap2 index (5 files).
  -l [LEVEL], --dbLevel [LEVEL]
                        Specify the taxonomic level of the input database. You
                        can choose one rank from "superkingdom", "phylum",
                        "class", "order", "family", "genus", "species" and
                        "strain". The value will be auto-detected if the input
                        database ended with levels (e.g. GOTTCHA_db.species).
  -ti [FILE], --taxInfo [FILE]
                        Specify the path of taxonomy information file
                        (taxonomy.tsv). GOTTCHA2 will try to locate this file
                        when user doesn't specify a path. If '--database'
                        option is used, the program will try to find this file
                        in the directory of specified database. If not, the
                        'database' directory under the location of gottcha.py
                        will be used as default.
  -np, --nanopore       Adjust options for Nanopore reads. The 'mismatch'
                        option will be ignored. [-xm map-ont -mr 1]
  -pm <INT>, --mismatch <INT>
                        Mismatch penalty for the aligner. [default: 10]
  -e [TAXID], --extract [TAXID]
                        Extract reads mapping to a specific TAXID. [default:
                        None]
  -fm [STR], --format [STR]
                        Format of the results; available options include tsv,
                        csv or biom. [default: tsv]
  -r [FIELD], --relAbu [FIELD]
                        The field will be used to calculate relative
                        abundance. You can specify one of the following
                        fields: "LINEAR_LENGTH", "TOTAL_BP_MAPPED",
                        "READ_COUNT" and "LINEAR_DOC". [default: ROLLUP_DOC]
  -t <INT>, --threads <INT>
                        Number of threads [default: 1]
  -o [DIR], --outdir [DIR]
                        Output directory [default: .]
  -p <STR>, --prefix <STR>
                        Prefix of the output file [default:
                        <INPUT_FILE_PREFIX>]
  -xm <STR>, --presetx <STR>
                        The preset option (-x) for minimap2. Default value
                        'sr' for short reads. [default: sr]
  -mc <FLOAT>, --minCov <FLOAT>
                        Minimum linear coverage to be considered valid in
                        abundance calculation [default: 0.005]
  -mr <INT>, --minReads <INT>
                        Minimum number of reads to be considered valid in
                        abundance calculation [default: 3]
  -ml <INT>, --minLen <INT>
                        Minimum unique length to be considered valid in
                        abundance calculation [default: 60]
  -mz <FLOAT>, --maxZscore <FLOAT>
                        Maximum estimated zscore of depths of mapped region
                        [default: 10]
  -nc, --noCutoff       Remove all cutoffs. This option is equivalent to use
                        [-mc 0 -mr 0 -ml 0].
  -c, --stdout          Write on standard output.
  -v, --version         Print version number.
  --silent              Disable all messages.
  --debug               Debug mode. Provide verbose running messages and keep
                        all temporary files.
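
The cutoff options (-mc, -mr, -ml, -mz) can be pictured as a per-taxon filter over the report columns. The sketch below is an illustration only: which report column each cutoff is compared against (LINEAR_COV, READ_COUNT, LINEAR_LENGTH, ZSCORE) is an assumption made here, and the function `failed_cutoffs` is hypothetical rather than taken from gottcha2.py.

```python
def failed_cutoffs(row, min_cov=0.005, min_reads=3, min_len=60, max_zscore=10.0):
    """Return the names of the cutoffs a row fails, using the documented defaults."""
    failed = []
    if row["LINEAR_COV"] < min_cov:        # -mc / --minCov
        failed.append("minCov")
    if row["READ_COUNT"] < min_reads:      # -mr / --minReads
        failed.append("minReads")
    if row["LINEAR_LENGTH"] < min_len:     # -ml / --minLen
        failed.append("minLen")
    if row["ZSCORE"] > max_zscore:         # -mz / --maxZscore
        failed.append("maxZscore")
    return failed


# Made-up row: low coverage and only two reads.
row = {"LINEAR_COV": 0.001, "READ_COUNT": 2, "LINEAR_LENGTH": 150, "ZSCORE": 1.2}

default_result = failed_cutoffs(row)
# --noCutoff is documented as equivalent to -mc 0 -mr 0 -ml 0
no_cutoff_result = failed_cutoffs(row, min_cov=0, min_reads=0, min_len=0)

print(default_result)    # ['minCov', 'minReads']
print(no_cutoff_result)  # []
```

With the default thresholds the example row is rejected on coverage and read count; with the three cutoffs set to zero it passes.
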
usage: pull_database.py [-h] [-u URL] [-r RANK]

This script will pull the latest version of the Gottcha2 database.

optional arguments:
  -h, --help            show this help message and exit
  -u URL, --url URL     specify a URL to pull from (will override the default)
  -r RANK, --rank RANK  taxonomic rank of the database (superkingdom, phylum, class, order, family, genus, species)

gottcha2's Issues

RefSeq-r90.cg.fna.tar issues

I am following the installation protocol but have a problem with decompressing the RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna.tar. I keep getting the following error:

tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors.

I believe the problem is that the downloaded file is actually an HTML document and not .fna.

head RefSeq-r90.cg.BacteriaArchaeaViruses.species.fna.tar 
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" class="ui-mobile" lang="en">
	 	<head>
		<title>
			System Unavailable
		</title>
		<meta content="width=device-width, initial-scale=1, user-scalable=no" name="viewport"/>
		<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
		<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
		<meta content="Los Alamos National Laboratory, Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy" name="author"/>

Any idea what is wrong?
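
A quick way to confirm this kind of diagnosis is to test the download before calling tar. The sketch below uses only the Python standard library; the function name `diagnose_download` and the demo file contents are illustrative, and the HTML check is a simple heuristic on the first bytes of the file.

```python
import os
import tarfile
import tempfile


def diagnose_download(path):
    """Classify a downloaded file as 'tar', 'html' or 'unknown'."""
    if tarfile.is_tarfile(path):
        return "tar"
    with open(path, "rb") as handle:
        head = handle.read(512).lstrip().lower()
    # An error page served instead of the archive starts with an HTML prolog.
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"


with tempfile.TemporaryDirectory() as tmp:
    # Simulate the situation from the report: an HTML page saved as .tar
    fake_archive = os.path.join(tmp, "database.fna.tar")
    with open(fake_archive, "w") as handle:
        handle.write("<!DOCTYPE html>\n<html><head><title>System Unavailable"
                     "</title></head></html>\n")
    kind = diagnose_download(fake_archive)

    # A plain text file is neither a tar archive nor HTML.
    text_file = os.path.join(tmp, "notes.txt")
    with open(text_file, "w") as handle:
        handle.write("ACGT\n")
    other_kind = diagnose_download(text_file)

print(kind)        # html
print(other_kind)  # unknown
```

A result of "html" means the server returned an error or maintenance page, so the file needs to be downloaded again once the site is available.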

System Unavailable

Hello,
I am trying to download the taxonomy dmp species signature index files (https://edge-lanl.gov/rest_of_path). I am redirected to https://lanal.gov/errors/system-notification.php with the following message displayed:
System Unavailable
We are currently performing system maintenance.
I have been receiving this message for about a month now. Is there another download location for these files? This also affects downloading databases/tools for the EDGE and PanGIA as they are in the same root path.
Thanks,
Scott

Problem at start with sam file

Hi,

I tried to use GOTTCHA2 with a SAM file and got this error message:

[00:00:00] Starting GOTTCHA (v2.1.7)
[00:00:00] Arguments and dependencies checked:
Traceback (most recent call last):
File "./gottcha2.py", line 667, in <module>
print_message( " Input reads : %s" % [x.name for x in argvs.input], argvs.silent, begin_t, logfile )
TypeError: 'NoneType' object is not iterable

Could you help me, please?
Alex
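
One plausible reading of this traceback is that no `-i` option was given, so the parsed input value is None and the list comprehension cannot iterate over it. The sketch below reproduces that situation with a simplified, hypothetical parser (plain strings instead of file objects); it is not the code from gottcha2.py, and the guard shown is only one possible way to handle a missing input list.

```python
import argparse

# Simplified stand-in for the real parser: -i takes zero or more files,
# -s takes a SAM file. Neither option has a default, so both start as None.
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input", nargs="*")
parser.add_argument("-s", "--sam")

args = parser.parse_args(["-s", "mapped.sam"])  # SAM input only, no -i

try:
    names = [name for name in args.input]
except TypeError as err:
    error_message = str(err)
    print(error_message)  # 'NoneType' object is not iterable

# Guarded version: fall back to an empty list when -i was not given.
input_names = [name for name in (args.input or [])]
print("Input reads :", input_names)
```

The guarded comprehension prints an empty list instead of raising, which lets a SAM-only run continue past the logging step.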

Add gottcha2 to pypi

In order to create a bioconda recipe, gottcha2 needs to be uploaded into pypi.

The steps to take to complete this are:
[] Develop against the bioconda branch
[] Build gottcha2 dist and push to test-pypi
[] Test against test-pypi install
[] Once successful, push to master
[] From master, build dist and push to pypi

Run failed after processing SAM file

Hello,
I tested GOTTCHA2 on a set of nanopore reads I had. The reads were mapped to the signature database but the run failed with the following error:
Traceback (most recent call last):
File "/home/scott/GOTTCHA2/gottcha2.py", line 717, in <module>
res_df = roll_up_taxonomy(res, db_stats, argvs.relAbu, argvs.dbLevel , argvs.minCov, argvs.minReads, argvs.minLen, argvs.maxZscore)
File "/home/scott/GOTTCHA2/gottcha2.py", line 449, in roll_up_taxonomy
str_df['LVL_NAME'] = str_df['TAXID'].apply(lambda x: gt.taxid2lineageDICT(x, True, True)[rank]['name'])
File "/home/scott/miniconda3/envs/SeqAnalysis/lib/python3.6/site-packages/pandas/core/series.py", line 3848, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
File "/home/scott/GOTTCHA2/gottcha2.py", line 449, in <lambda>
str_df['LVL_NAME'] = str_df['TAXID'].apply(lambda x: gt.taxid2lineageDICT(x, True, True)[rank]['name'])
TypeError: string indices must be integers
I have also attached the log file.
Any help would be appreciated.
Thanks,
Scott
Guppy_v3.gottcha_species.gottcha_species.log

Fungi database

Hello,

It seems that GOTTCHA2 can use two reference databases for classification, namely RefSeq BacteriaArchaeaViruses and RefSeq Fungal. However, I cannot find the RefSeq Fungal database at https://edge-dl.lanl.gov/GOTTCHA2/RefSeq-Release90/

Could you please help with this? Or let me know when the database is uploaded.

Many thanks for your help.

HY
