Taxonize_genbank

Taxonize_genbank is a Python package for downloading, filtering, and curating the GenBank non-redundant protein and nucleotide databases based on a given taxonomy ID (TaxID) and/or a list of keywords.

Installation

Dependencies

To use this tool, make sure you have the following libraries installed:

  • Python 3.7 or higher
  • Biopython: 1.81
  • tqdm: 4.64.1
  • ete3: 3.1.3
  • networkx: 2.6.3
  • six: 1.16.0

Please make sure to install these dependencies before using the tool.

Then, you can clone this repository to your local machine using git.

Open your terminal and run the following command:

git clone https://github.com/msabrysarhan/taxonize_genbank

Alternatively (recommended), you can install taxonize_gb using pip:

pip install taxonize-gb
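
After installation, one way to confirm that the dependencies listed above are present is to query the installed package metadata. The following is a minimal sketch, not part of taxonize_gb: the helper names `installed_versions` and `report` are made up for illustration, and it relies on `importlib.metadata`, which is only in the standard library from Python 3.8 onward.

```python
from importlib import metadata  # standard library from Python 3.8 onward

# PyPI distribution names paired with the versions listed in the README.
EXPECTED = {
    "biopython": "1.81",
    "tqdm": "4.64.1",
    "ete3": "3.1.3",
    "networkx": "2.6.3",
    "six": "1.16.0",
}


def installed_versions(names):
    """Return {name: version string, or None if the distribution is missing}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(expected=EXPECTED):
    """Build one human-readable status line per dependency."""
    lines = []
    for name, version in installed_versions(expected).items():
        status = "missing" if version is None else f"installed {version}"
        lines.append(f"{name}: {status} (README lists {expected[name]})")
    return lines
```

Calling `print("\n".join(report()))` prints one line per dependency, which makes a missing or mismatched package easy to spot before running the tool.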

Usage

Taxonize_gb has three main modules:

  1. get_db.py
usage: get_db.py [-h] --db_name DB_NAME --out OUT

Download NCBI databases.

optional arguments:
  -h, --help         show this help message and exit
  --db_name DB_NAME  Which NCBI database to be downloaded.
                     Possible values are the following:
                     'taxdb': The NCBI taxonomy dump database files.
                     'nr': The non-redundant protein database.
                     'nt': The non-redundant nucleotide database.
                     'prot_acc2taxid': GenBank protein accession number to taxonomy ID mapping file.
                     'pdb_acc2taxid': Protein Database (PDB) accession number to taxonomy ID mapping file.
                     'nucl_gb_acc2taxid': Nucleotide (GenBank, GB) accession number to taxonomy ID mapping file.
                     'nucl_wgs_acc2taxid': Nucleotide (Whole Genome Shotgun, WGS) accession number to taxonomy ID mapping file.
  --out OUT          Path to output directory where the results are to be stored.

  2. taxonize_gb.py
usage: taxonize_gb.py [-h] --db DB [--db_path DB_PATH] [--taxdb TAXDB]
                      [--prot_acc2taxid PROT_ACC2TAXID]
                      [--pdb_acc2taxid PDB_ACC2TAXID]
                      [--nucl_gb_acc2taxid NUCL_GB_ACC2TAXID]
                      [--nucl_wgs_acc2taxid NUCL_WGS_ACC2TAXID]
                      [--taxid TAXID] [--keywords KEYWORDS] --out OUT

Filter NCBI nt/nr database based on a given taxid.

optional arguments:
  -h, --help            show this help message and exit
  --db DB               Which NCBI database to be used. Please use either nt
                        for nucleotide database or nr for protein database
  --db_path DB_PATH     Path to nt/nr gzipped fasta file (if not provided, the
                        latest version will be downloaded from the NCBI (must
                        be provided with --db)
  --taxdb TAXDB         Path to gzipped taxonomy database from the NCBI (if
                        not provided, the latest version will be downloaded
                        from the NCBI
  --prot_acc2taxid PROT_ACC2TAXID
                        Path to gzipped GenBank protein accession number to
                        taxid mapping file from the NCBI; works with --db nr
                        (if not provided, the latest version will be
                        downloaded from the NCBI
  --pdb_acc2taxid PDB_ACC2TAXID
                        Path to gzipped PDB protein accession number to taxid
                        mapping file from the NCBI; works with --db nr (if not
                        provided, the latest version will be downloaded from
                        the NCBI
  --nucl_gb_acc2taxid NUCL_GB_ACC2TAXID
                        Path to gzipped Genbank nucleotide accession number to
                        taxid mapping file from the NCBI; works with --db nt
                        (if not provided, the latest version will be
                        downloaded from the NCBI
  --nucl_wgs_acc2taxid NUCL_WGS_ACC2TAXID
                        Path to gzipped whole genome sequence accession number
                        to taxid mapping file from the NCBI; works with --db
                        nt (if not provided, the latest version will be
                        downloaded from the NCBI
  --taxid TAXID         Target taxonomy ID to filter for
  --keywords KEYWORDS   keywords to be included in the fasta headers of the
                        target taxonomy ID
  --out OUT             Path to output directory where the results are to be
                        stored.

  3. get_taxonomy.py
usage: get_taxonomy.py [-h] --fasta FASTA --map MAP --out OUT

Get taxonomic lineages of FASTA accessions.

optional arguments:
  -h, --help     show this help message and exit
  --fasta FASTA  NCBI FASTA file to be filtered.
  --map MAP      Accession number to taxonomy IDs gzipped mapping file.
  --out OUT      Path to output file to write the taxonomic lineages of the GenBank accession numbers.
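
To make the role of the --map file more concrete, here is a rough sketch of how accessions from FASTA headers could be looked up in a gzipped accession2taxid file. It is not the package's implementation. It assumes the standard NCBI layout of these files (a header line, then tab-separated columns accession, accession.version, taxid, gi) and takes only the first identifier from each FASTA header, which is a simplification. The function names are hypothetical.

```python
import gzip


def fasta_accessions(fasta_lines):
    """Yield the first whitespace-delimited token of every FASTA header line."""
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def load_taxids(map_path, wanted):
    """Stream a gzipped accession2taxid file, keeping only the wanted accessions.

    Streaming line by line avoids holding the whole mapping file in memory.
    """
    wanted = set(wanted)
    taxids = {}
    with gzip.open(map_path, "rt") as handle:
        next(handle, None)  # skip the column header line
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # ignore malformed lines
            acc_version, taxid = fields[1], fields[2]
            if acc_version in wanted:
                taxids[acc_version] = int(taxid)
    return taxids
```

The resulting dictionary maps each accession.version to a TaxID, which can then be resolved to a lineage with the taxonomy database.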

Examples

  1. Plant non-redundant protein database

First, we use the get_db module to download the following files to a directory named databases:

# The nr FASTA file
get_db --db_name nr --out databases

# The NCBI accession to taxonomy ID mapping file
get_db --db_name prot_acc2taxid --out databases

# The PDB accession to taxonomy ID mapping file
get_db --db_name pdb_acc2taxid --out databases

# The NCBI taxonomy database
get_db --db_name taxdb --out databases
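
The four downloads above can also be scripted. The sketch below builds the same get_db command lines in Python and runs them one after another. It assumes the `get_db` entry point is on the PATH, as in the commands above, and uses only the `--db_name` and `--out` flags shown in the help text. The helper names are hypothetical.

```python
import subprocess

# Database names required for the nr workflow, taken from the get_db help text.
NR_DATABASES = ["nr", "prot_acc2taxid", "pdb_acc2taxid", "taxdb"]


def get_db_commands(db_names, out_dir):
    """Return one argument list per database, ready for subprocess.run."""
    return [["get_db", "--db_name", name, "--out", out_dir] for name in db_names]


def download_all(db_names, out_dir):
    """Run the downloads sequentially and stop at the first failure."""
    for cmd in get_db_commands(db_names, out_dir):
        subprocess.run(cmd, check=True)
```

Calling `download_all(NR_DATABASES, "databases")` is then equivalent to typing the four commands by hand. Passing the arguments as a list avoids shell quoting issues.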

Now that we have the databases downloaded, we can use taxonize_gb to filter the nr FASTA to keep only the plant protein records:

# we use the taxid of Viridiplantae (33090) and we write the outputs to a directory "plant_nr"
taxonize_gb --db nr --db_path databases/nr.gz --taxdb databases/taxdump.tar.gz --prot_acc2taxid databases/prot.accession2taxid.gz --pdb_acc2taxid databases/pdb.accession2taxid.gz --taxid 33090 --out plant_nr
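
Filtering by a taxid such as 33090 means keeping records assigned to that taxon or to any taxon below it. As an illustration of the idea only, the sketch below collects all descendant taxids from lines in the nodes.dmp format of the NCBI taxonomy dump, where fields are separated by a tab, a pipe, and a tab, and the first two fields are the taxid and its parent. This is a standard-library sketch and not the package's implementation, which may rely on ete3 or networkx. The function names and the toy taxids in the usage note are made up.

```python
from collections import defaultdict, deque


def parse_nodes(lines):
    """Build a parent -> children map from nodes.dmp-style lines."""
    children = defaultdict(set)
    for line in lines:
        fields = line.split("\t|\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        taxid, parent = int(fields[0]), int(fields[1])
        if taxid != parent:  # the root node lists itself as its parent
            children[parent].add(taxid)
    return children


def descendants(children, root):
    """Return the root taxid plus every taxid below it (breadth-first)."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

With a real taxonomy dump, `descendants(parse_nodes(open("nodes.dmp")), 33090)` would give the set of taxids to keep. Any sequence whose mapped TaxID falls in that set belongs to the target group.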

  2. Insects non-redundant nucleotide database

As in the previous example, we first use get_db to download the database files:

# The nt FASTA file
get_db --db_name nt --out databases

# The GenBank accession to taxonomy ID mapping file
get_db --db_name nucl_gb_acc2taxid --out databases

# The WGS accession to taxonomy ID mapping file
get_db --db_name nucl_wgs_acc2taxid --out databases

# The NCBI taxonomy database
get_db --db_name taxdb --out databases

Now we can use taxonize_gb to filter the nt FASTA to keep only the insect nucleotide records:

# we use the taxid of Insecta (50557) and we write the outputs to a directory "insect_nt"
taxonize_gb --db nt --db_path databases/nt.gz --taxdb databases/taxdump.tar.gz --nucl_gb_acc2taxid databases/nucl_gb.accession2taxid.gz --nucl_wgs_acc2taxid databases/nucl_wgs.accession2taxid.gz --taxid 50557 --out insect_nt

  3. Nematodes mitochondrial genomes

We can reuse the databases downloaded in the previous steps and run taxonize_gb with the flag --keywords mitochondrion,'complete genome' to find the headers that contain both keywords, as follows:

# we use the taxid of Nematoda (6231) and we write the outputs to a directory "nematoda_mito"
taxonize_gb --db nt --db_path databases/nt.gz --taxdb databases/taxdump.tar.gz --nucl_gb_acc2taxid databases/nucl_gb.accession2taxid.gz --nucl_wgs_acc2taxid databases/nucl_wgs.accession2taxid.gz --taxid 6231 --keywords mitochondrion,'complete genome' --out nematoda_mito
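
To show what the keyword filter does conceptually, the sketch below splits a comma-separated --keywords value and keeps a FASTA header only if it contains every keyword. It is a simplified illustration, not the package's code. Case-insensitive matching is an assumption, and the function names and example headers are hypothetical.

```python
def parse_keywords(arg):
    """Split a comma-separated keyword string, dropping empty entries."""
    return [k.strip() for k in arg.split(",") if k.strip()]


def header_matches(header, keywords):
    """Return True if the header contains all keywords (case-insensitive)."""
    text = header.lower()
    return all(k.lower() in text for k in keywords)


# The shell strips the quotes, so the program receives this single string.
keywords = parse_keywords("mitochondrion,complete genome")

# Hypothetical headers used only for illustration.
mito = ">XX000001.1 Example species mitochondrion, complete genome"
nuclear = ">XX000002.1 Example species chromosome 1, complete genome"
kept = [h for h in (mito, nuclear) if header_matches(h, keywords)]
```

Here `kept` contains only the mitochondrial header, because the second header lacks the keyword mitochondrion.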

License

This package is distributed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Please see the LICENSE file for the full license text.

taxonize_genbank's People

Contributors

msabrysarhan


taxonize_genbank's Issues

Out of memory issue

The nucl_wgs_acc2taxid will bw downloaded to Mito_fun
The nucl_wgs_acc2taxid has been downloaded to Mito_fun
Reading nodes.dmp: 100%|██████████| 2564321/2564321 [00:04<00:00, 559641.73 lines/s]
Reading names.dmp: 100%|██████████| 3998998/3998998 [00:04<00:00, 866842.55 lines/s]
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=37946086.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
