aureme / mpwt

Pathway Tools multiprocessing wrapper (for PathoLogic).

License: GNU Lesser General Public License v3.0

multiprocessing pathway-tools metabolic-network

mpwt's Introduction

mpwt: Multiprocessing Pathway Tools

mpwt is a Python package for running Pathway Tools [PathwayToolsarXiv] on multiple genomes using multiprocessing. More precisely, it launches one PathoLogic [Karp2011] process for each organism (PathoLogic and Pathway Tools pathway prediction are described in this blog entry). This speeds up draft metabolic network reconstruction when working on multiple organisms.

The latest version of Pathway Tools supported by mpwt is shown in the badge named "Pathway Tools".

mpwt: Pipeline summary

The following picture shows the main arguments of mpwt:

Installation

Requirements

mpwt needs at least Python 3.8. It has been tested on Ubuntu and macOS but does not work on Windows. mpwt requires three Python dependencies (biopython, chardet and gffutils) and Pathway Tools. For multiprocessing, mpwt uses the multiprocessing library of Python 3.

You must have an environment where Pathway Tools is installed. Pathway Tools can be obtained here.

Pathway Tools needs Blast, so it must be installed on your system. Depending on your system, Pathway Tools needs a file named .ncbirc to locate Blast; for more information, look at this page.

/!\ For all OS, Pathway-Tools must be in $PATH.

On Linux and MacOS: export PATH=$PATH:/your/install/directory/pathway-tools.

Consider adding Pathway Tools to $PATH permanently by using the following command and then sourcing bashrc:

echo 'export PATH="$PATH:/your/install/directory/pathway-tools"' >> ~/.bashrc
source ~/.bashrc

If your OS doesn't support Pathway Tools, you can use a Docker container. In that case, look at Pathway Tools Multiprocessing Docker. It is a Dockerfile that creates a container with Pathway Tools, its dependencies and this package. You just need to give a Pathway Tools installer as input.

You can also look at Pathway Tools Multiprocessing Singularity. It requires more steps than Docker, but it lets you create a Singularity image.

Using pip

pip install mpwt

Use

Input data

The script takes a folder containing sub-folders as input. Each sub-folder contains a Genbank/GFF file or multiple PathoLogic Format (PF) files.

Folder_input
├── species_1
│   └── species_1.gbk
├── species_2
│   └── species_2.gff
│   └── species_2.fasta
├── species_3
│   └── species_3.gbk
├── species_4
│   └── scaffold_1.pf
│   └── scaffold_1.fasta
│   └── scaffold_2.pf
│   └── scaffold_2.fsa
├── taxon_id.tsv
..

Input files must have the same name as the folder in which they are located and must end with .gbk/.gbff or .gff. The name must not be entirely uppercase, otherwise this can cause issues with Pathway Tools such as this one: Error: Cannot use the organism identifier ORGID as a genetic element ID.

For PF files, there is one file for each scaffold/contig and one corresponding fasta file.

Pathway Tools will run on each Genbank/GFF/PF file. By default it creates the results in the ptools-local folder, but you can also choose an output folder.
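The naming rules above can be checked before launching a run. The following is a minimal sketch using only the standard library; `classify_species_folder` and `check_input_folder` are hypothetical helpers written for this illustration and are not part of mpwt.

```python
from pathlib import Path


def classify_species_folder(species_dir):
    """Return 'genbank', 'gff', 'pf' or None for one species sub-folder.

    Hypothetical helper (not part of mpwt): Genbank/GFF files must be named
    after their folder, PF files are simply detected by their extension.
    """
    species_dir = Path(species_dir)
    name = species_dir.name
    if (species_dir / f"{name}.gbk").is_file() or (species_dir / f"{name}.gbff").is_file():
        return "genbank"
    if (species_dir / f"{name}.gff").is_file():
        return "gff"
    if any(species_dir.glob("*.pf")):
        return "pf"
    return None


def check_input_folder(input_folder):
    """Map each sub-folder name to the input type detected inside it."""
    return {
        sub.name: classify_species_folder(sub)
        for sub in sorted(Path(input_folder).iterdir())
        if sub.is_dir()
    }
```

A sub-folder mapped to `None` contains no file that follows the naming rules and would probably need to be fixed before running mpwt.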

Genbank

Folder_input
├── species_1
│   └── species_1.gbk
..

Genbank file example:

LOCUS       scaffold1         XXXXXX bp    DNA     linear   INV DD-MMM-YYYY
DEFINITION  My species genbank.
ACCESSION   scaffold1
VERSION     scaffold1
KEYWORDS    Key words.
SOURCE      Source
ORGANISM  Species name
            Taxonomy; Of; My; Species; With;
            The; Genus.
FEATURES             Location/Qualifiers
    source          1..XXXXXX
                    /scaffold="scaffold1"
                    /db_xref="taxon:taxonid"
    gene            START..STOP
                    /locus_tag="gene1"
    mRNA            START..STOP
                    /locus_tag="gene1"
    CDS             START..STOP
                    /locus_tag="gene1"
                    /db_xref="InterPro:IPRXXXXXX"
                    /go_component="GO:XXXXXXX"
                    /EC_number="X.X.X.X"
                    /translation="AMINOAACIDSSEQUENCE"

Look at the NCBI GBK format for more information. You can also look at the example provided on the Pathway Tools site.

GFF

Folder_input
├── species_2
│   └── species_2.gff
│   └── species_2.fasta
..

GFF file example:

##gff-version 3
##sequence-region scaffold_1 1 XXXXXX
scaffold_1  RefSeq  region  1   XXXXXXX .   +   .   ID=region_id;Dbxref=taxon:XXXXXX
scaffold_1  RefSeq  gene    START   STOP    .   -   .   ID=gene_id
scaffold_1  RefSeq  CDS START   STOP    .   -   0   ID=cds_id;Parent=gene_id;ec_number=X.X.X.X

Warning: metabolic networks built from a GFF file seem to have fewer reactions/pathways/compounds than metabolic networks built from a Genbank file or a PathoLogic file. Missing annotations (EC, GO) may explain these differences.

Look at the NCBI GFF format for more information.

You have to provide a nucleotide sequence file (either '.fasta' or '.fsa' extensions) associated with the GFF file containing the chromosome/scaffold/contig sequence.

>scaffold_1
ATGATGCTGATACTGACTTAGCAT

PathoLogic Format

Folder_input
├── species_4
│   └── scaffold_1.pf
│   └── scaffold_1.fasta
│   └── scaffold_2.pf
│   └── scaffold_2.fsa
├── taxon_id.tsv
..

PF file example:

;;;;;;;;;;;;;;;;;;;;;;;;;
;; scaffold_1
;;;;;;;;;;;;;;;;;;;;;;;;;
ID  gene_id
NAME    gene_id
STARTBASE   START
ENDBASE STOP
FUNCTION    ORF
PRODUCT-TYPE    P
PRODUCT-ID  prot gene_id
EC  X.X.X.X
DBLINK  GO:XXXXXXX
INTRON  START1-STOP1
//

Look at the PathoLogic format for more information.

You have to provide one nucleotide sequence file (either '.fasta' or '.fsa' extension) for each PathoLogic file, which contains one scaffold/contig. This has been optional since mpwt 0.7.0.

>scaffold_1
ATGATGCTGATACTGACTTAGCAT

You also need to add the taxon ID to taxon_id.tsv (a TSV file with two values: the name of the folder containing the PF files and the corresponding taxon ID).

taxon_id.tsv file

This tab-separated file is required when using PathoLogic Format as input. It can also be used to give more information to Pathway Tools.

A simple file looks like this:

species	taxon_id
species_4	4

If you don't have a taxon ID in your Genbank or GFF file, you can add one in this file for the corresponding species.

You can also add more information on the genetic elements, such as circularity of the genome (Y or N), type of genetic element (:CHRSM, :PLASMID, :MT (mitochondrial chromosome), :PT (chloroplast chromosome), or :CONTIG) or codon table (see the corresponding codes below).

You can also specify a reference PGDB. This can be useful if you have a manually curated PGDB, especially one with reactions or pathways not present in MetaCyc. These reactions or pathways will be added into MetaCyc before reaction and pathway prediction (if the reactions or pathways are supported by evidence other than computational evidence).

Example:

species	taxon_id	circular	element_type	codon_table	corresponding_file	reference_pgdb
species_1	10	Y	:CHRSM	1		pgdb_id
species_4	4	N	:CHRSM	1	scaffold_1
species_4	4	N	:MT	1	scaffold_2

As you can see for the PF files (species_4), you can use the corresponding_file column to add information for each PF file.
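A file with these columns can be written with Python's csv module. This is a minimal sketch: the column names and example rows are copied from the example above, while `write_taxon_file` is a hypothetical helper and not an mpwt function.

```python
import csv

# Column names copied from the example table above.
COLUMNS = ["species", "taxon_id", "circular", "element_type",
           "codon_table", "corresponding_file", "reference_pgdb"]

# Rows copied from the example above; keys left out become empty cells.
EXAMPLE_ROWS = [
    {"species": "species_1", "taxon_id": "10", "circular": "Y",
     "element_type": ":CHRSM", "codon_table": "1", "reference_pgdb": "pgdb_id"},
    {"species": "species_4", "taxon_id": "4", "circular": "N",
     "element_type": ":CHRSM", "codon_table": "1", "corresponding_file": "scaffold_1"},
    {"species": "species_4", "taxon_id": "4", "circular": "N",
     "element_type": ":MT", "codon_table": "1", "corresponding_file": "scaffold_2"},
]


def write_taxon_file(path, rows):
    """Write rows (dicts keyed by column name) as a tab-separated file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                                restval="", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_taxon_file("Folder_input/taxon_id.tsv", EXAMPLE_ROWS)` would produce a file with the same layout as the example (the path is a placeholder).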

Genetic code for Pathway Tools:

Corresponding number	Genetic code
0	Unspecified
1	The Standard Code
2	The Vertebrate Mitochondrial Code
3	The Yeast Mitochondrial Code
4	The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
5	The Invertebrate Mitochondrial Code
6	The Ciliate, Dasycladacean and Hexamita Nuclear Code
9	The Echinoderm and Flatworm Mitochondrial Code
10	The Euplotid Nuclear Code
11	The Bacterial, Archaeal and Plant Plastid Code
12	The Alternative Yeast Nuclear Code
13	The Ascidian Mitochondrial Code
14	The Alternative Flatworm Mitochondrial Code
15	Blepharisma Nuclear Code
16	Chlorophycean Mitochondrial Code
21	Trematode Mitochondrial Code
22	Scenedesmus obliquus Mitochondrial Code
23	Thraustochytrium Mitochondrial Code

Input files created by mpwt

Three input files are created by mpwt. Information is extracted from the Genbank/GFF/PF file. myDBName corresponds to the name of the folder and of the Genbank/GFF/PF file. taxonid corresponds to the taxon ID in the db_xref of the source feature in the Genbank/GFF/PF file. The species_name is extracted from the Genbank/GFF/PF files.

**organism-params.dat**
ID  myDBName
STORAGE FILE
NCBI-TAXON-ID   taxonid
NAME    species_name

**genetic-elements.dat**
NAME    
ANNOT-FILE  gbk_pathname
//

**flat_files_creation.lisp**
(in-package :ecocyc)
(select-organism :org-id 'myDBName)
(let ((*progress-noter-enabled?* NIL))
        (create-flat-files-for-current-kb))
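mpwt writes these files itself, so nothing has to be created by hand. Purely as an illustration of the key/value layout of organism-params.dat, the sketch below renders the four fields shown above; the tab separator and the `render_organism_params` helper are assumptions made for this example.

```python
def render_organism_params(db_name, taxon_id, species_name):
    """Return text laid out like the organism-params.dat example above.

    Hypothetical helper; the tab separator between key and value is an
    assumption, not taken from the mpwt source code.
    """
    fields = [
        ("ID", db_name),
        ("STORAGE", "FILE"),
        ("NCBI-TAXON-ID", str(taxon_id)),
        ("NAME", species_name),
    ]
    return "".join(f"{key}\t{value}\n" for key, value in fields)


print(render_organism_params("species_1", 10, "Species name"), end="")
```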

Command Line and Python arguments

Using the Python multiprocessing library, mpwt launches parallel PathoLogic processes on physical cores. Memory requirements depend on the genome, but we advise using at least 2 GB per core.

mpwt can be used with the command lines:

mpwt -f=FOLDER [-o=FOLDER] [--patho] [--hf] [--op] [--tp] [--nc] [--flat] [--md] [--mx] [--mo] [--mc] [-p=FLOAT] [--cpu=INT] [-r] [-v] [--clean] [--log=FOLDER] [--taxon-file]
mpwt --flat [-f=FOLDER] [-o=FOLDER] [--md] [--mx] [--mo] [--mc] [--cpu=INT] [-v]
mpwt -o=FOLDER [--md] [--mx] [--mo] [--mc] [--cpu=INT] [-v]
mpwt --clean [--cpu=INT] [-v]
mpwt --delete=STR [--cpu=INT]
mpwt --list
mpwt --version
mpwt topf -f=FOLDER -o=FOLDER [--cpu=INT] [--clean]

Optional arguments are identified by [].

mpwt can be used in a python script with an import:

import mpwt

folder_input = "path/to/folder/input"
folder_output = "path/to/folder/output"

mpwt.multiprocess_pwt(input_folder=folder_input,
          output_folder=folder_output,
          patho_inference=optional_boolean,
          patho_hole_filler=optional_boolean,
          patho_operon_predictor=optional_boolean,
          patho_transporter_inference=optional_boolean,
          patho_complex_inference=optional_boolean,
          no_download_articles=optional_boolean,
          flat_creation=optional_boolean,
          dat_extraction=optional_boolean,
          xml_extraction=optional_boolean,
          owl_extraction=optional_boolean,
          col_extraction=optional_boolean,
          size_reduction=optional_boolean,
          number_cpu=int,
          patho_log=optional_folder_pathname,
          pathway_score=pathway_score,
          taxon_file=optional_str,
          verbose=optional_boolean,
          permission=optional_str)

Command line argument	Python argument	description
-f	input_folder(string: folder pathname)	Input folder as described in Input data
-o	output_folder(string: folder pathname)	Output folder containing PGDB data or flat files (see --flat arguments)
--patho	patho_inference(boolean)	Launch PathoLogic inference on input folder
--hf	patho_hole_filler(boolean)	Launch PathoLogic Hole Filler with Blast
--op	patho_operon_predictor(boolean)	Launch PathoLogic Operon Predictor
--tp	patho_transporter_inference(boolean)	Launch PathoLogic Transport Inference Parser
--cp	patho_complex_inference(boolean)	Use with --patho and at least Pathway Tools 26.0. Run the Complex Inference of Pathway Tools.
--nc	no_download_articles(boolean)	Launch PathoLogic without loading PubMed citations (not working)
-p	pathway_score(float)	Launch PathoLogic using a specified pathway prediction score cutoff
--flat	flat_creation(boolean)	Create BioPAX/attribute-value flat files
--md	dat_extraction(boolean)	Move the dat files into the output folder
--mx	xml_extraction(boolean)	Move the metabolic-reactions.xml file into the output folder
--mo	owl_extraction(boolean)	Move owl files into the output folder
--mc	col_extraction(boolean)	Move tabular files into the output folder
--cpu	number_cpu(int)	Number of cpu used for the multiprocessing
-r	size_reduction(boolean)	Delete PGDB in ptools-local to reduce size and return compressed files
--log	patho_log(string: folder pathname)	Folder where log files for PathoLogic inference will be stored
--delete	mpwt.remove_pgdbs(string: pgdb name)	Delete a specific PGDB
--clean	mpwt.cleaning()	Delete all PGDBs in ptools-local folder or only PGDB from input folder
--taxon-file	taxon_file(string: file pathname)	Force mpwt to use the taxon ID in the taxon_id.tsv file
--permission	permission(string: 'all', 'group')	Choose permission access to PGDB in ptools-local and output files
-v	verbose(boolean)	Print some information about the processing of mpwt
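For illustration, a concrete call could look like the sketch below. The keyword names are those listed in the table; the paths, the chosen values and the `run_mpwt` wrapper are placeholders introduced here, and the wrapper only calls mpwt when the package is importable.

```python
import importlib.util

# Keyword names come from the table above; paths and values are placeholders.
run_options = {
    "input_folder": "path/to/folder/input",
    "output_folder": "path/to/folder/output",
    "patho_inference": True,   # --patho
    "flat_creation": True,     # --flat
    "dat_extraction": True,    # --md
    "number_cpu": 2,           # --cpu
    "verbose": True,           # -v
}


def run_mpwt(options):
    """Call mpwt.multiprocess_pwt(**options) if mpwt is installed.

    Hypothetical wrapper: returns True when the call was made and False
    when mpwt could not be found.
    """
    if importlib.util.find_spec("mpwt") is None:
        print("mpwt is not installed; nothing was run.")
        return False
    import mpwt
    mpwt.multiprocess_pwt(**options)
    return True
```

Running `run_mpwt(run_options)` needs a working Pathway Tools installation and real folders in place of the placeholder paths.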

There is also a subcommand, topf:

mpwt topf -f input_folder -o output_folder --cpu cpu_number

import mpwt
mpwt.to_pathologic.create_pathologic_file(input_folder, output_folder, cpu_number)

topf reads the input data inside the input folder, then converts Genbank and GFF files into PathoLogic Format files. If PathoLogic files are already present, it copies them.

It can be used to avoid issues with parsing Genbank and GFF files. But it is an early work in progress: at this moment, the PathoLogic files created do not produce the same PGDB as the corresponding GenBank/GFF files. In particular, some genes are missing in the PGDB.

PathoLogic Hole Filler

The --hf/patho_hole_filler option uses the Hole Filler [Karp2019arXiv]:

The pathway hole-filling program PHFiller (a component of PathoLogic) generates hypotheses as to which genes code for these missing enzymes by using the following method. Given a reaction that is a pathway hole, the program first queries the UniProt database to find all known sequences for enzymes that catalyze that same reaction in other organisms. The program then uses the BLAST tool to compare that set of sequences against the full proteome of the organism in which we are seeking hole fillers. It scores the resulting BLAST hits using a Bayesian classifier that considers information such as genome localization (that is, is a potential hole filler in the same operon as another gene in the same metabolic pathway?). At a stringent probability-score cutoff, our method finds potential hole fillers for approximately 45% of the pathway holes in a microbial genome [59].

This option is more precisely described in [Green2004]:

Sequence retrieval – Retrieve from Swiss-Prot and PIR sequences for enzymes that catalyze the desired reaction in other organisms. Because these sequences are not necessarily homologs, we will refer to enzymes with the same function in a variety of organisms as isozymes. For Swiss-Prot, the program retrieves Swiss-Prot IDs directly from the ENZYME database. For PIR sequences, the program retrieves IDs from the MetaCyc PGDB. Sequences are then retrieved directly from the most recent version of each database.
Homology search – BLAST each query isozyme sequence against the genome of the organism of interest.
Data consolidation – Congruence analysis of the resulting BLAST hits to consolidate data reported for sequences that align with one or more query isozymes.
Candidate evaluation – Determine the probability that each candidate protein has the activity required by the missing reaction.

Operon Predictor

The --op/patho_operon_predictor option identifies operons [Karp2019arXiv]:

The Pathway Tools operon predictor identifies operon boundaries by examining pairs of adjacent genes A and B and using information such as intergenic distance, and whether it can identify a functional relationship between A and B, such as membership in the same pathway, membership in the same multimeric protein complex, or whether A is a transporter for a substrate within a metabolic pathway in which B is an enzyme.

Transport Inference

The --tp/patho_transporter_inference tries to answer the question "What chemicals can the organism import or export?" [Karp2019arXiv]:

To answer such queries, Pathway Tools uses an ontology-based representation of transporter function in which transport events are represented as reactions in which the transported compound(s) are substrates. Each substrate is labeled with the cellular compartment in which it resides, and each substrate is a controlled-vocabulary term from the extensive set of chemical compounds in MetaCyc. The TIP program converts the free-text descriptions of transporter functions found in genome annotations (examples: “phosphate ABC transporter” and “sodium/proline symporter”) into computable transport reactions.

Pathway prediction score cutoff

The -p/pathway_score determines the cutoff for pathway prediction.

This cutoff is defined in ptools-init.dat:

During the pathway prediction process, pathways are assigned a score between 0 and 1 based on the evidence for the presence of that pathway. Pathways whose score does not exceed this cutoff value will usually be rejected (although certain rules may cause them to be predicted as present).

This pathway prediction score has also been explained in [Karp2018]:

A very strict pathway score cutoff of 1.0 was supplied to PathoLogic to predict into BlongCyc (from MetaCyc) only the pathways that have gene annotations associated with all pathway reactions, to minimize the effects of pathway inference on biomass goal reachability. PathoLogic inference of a metabolic pathway causes all reactions within the pathway to be imported from the MetaCyc database into the new PGDB, including reactions lacking gene assignments — using the 1.0 cutoff means that no reactions lacking gene assignments were imported from MetaCyc during pathway inference. The resulting PGDB was subjected to the following manual refinement steps. That is, some manual refinement occurred before gap filling began.

Examples

Possible uses of mpwt:

Create PGDBs of studied organisms inside ptools-local:

Convert Genbank and GFF files into PathoLogic files then create PGDBs of studied organisms inside ptools-local:

Create PGDBs of studied organisms inside ptools-local with Hole Filler, Operon Predictor, Transport Inference Parser and create logs:

Create PGDBs of studied organisms inside ptools-local with pathway prediction score of 1:

Create PGDBs of studied organisms inside ptools-local and create flat files:

Create PGDBs of studied organisms inside ptools-local. Then move all the PGDB files to the output folder.

Create PGDBs of studied organisms inside ptools-local and create flat files. Then move the dat files to the output folder.

Create flat files for the PGDB inside ptools-local. And move them to the output folder.

Move PGDB from ptools-local to the output folder:

Move dat files from ptools-local to the output folder:
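As a rough sketch, some of the use cases listed above can be assembled from the options documented in the usage lines and the option table. The combinations below are inferred from that documentation rather than copied from the project, and folder names such as `input_folder` or `log_folder` are placeholders.

```python
import shlex

# Option combinations inferred from the usage lines and the option table above.
USE_CASES = {
    "create PGDBs": ["-f", "input_folder", "--patho"],
    "create PGDBs with extra inference and logs":
        ["-f", "input_folder", "--patho", "--hf", "--op", "--tp", "--log", "log_folder"],
    "create PGDBs with a pathway score of 1":
        ["-f", "input_folder", "--patho", "-p", "1.0"],
    "create PGDBs and flat files": ["-f", "input_folder", "--patho", "--flat"],
    "create PGDBs and flat files, then move dat files":
        ["-f", "input_folder", "-o", "output_folder", "--patho", "--flat", "--md"],
    "move dat files only": ["-o", "output_folder", "--md"],
}


def command_line(use_case):
    """Return the shell command for one of the use cases above."""
    return shlex.join(["mpwt", *USE_CASES[use_case]])


for name in USE_CASES:
    print(f"{name}: {command_line(name)}")
```

The resulting strings can be pasted into a terminal or, split back into a list, passed to `subprocess.run`.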

Useful functions

Run the multiprocess Pathway Tools on input folder

Delete all the previous PGDB and the metadata files

Delete a specific PGDB

Return the path of ptools-local

Return a list containing all the PGDBs inside ptools-local folder

Errors

If you encounter errors (and it is highly likely), here is some information that can help you resolve them.

For errors during PathoLogic inference, you can use the log argument. The log contains the summary of the build and the errors for each species. There is also a pathologic.log (created by Pathway Tools), a pwt_terminal.log (log of the terminal during the PathoLogic process) and a flat_files_creation.log (log of the terminal during attribute-value file creation) in each sub-folder.

If the build passed, you can also see the result of the inference in the file resume_inference.tsv. For each species, it contains the number of genes/proteins/reactions/pathways/compounds in the metabolic network.

If Pathway Tools crashed, mpwt can print some useful information in verbose mode. It will show the terminal in which Pathway Tools has crashed. Also, if there is an error in pathologic.log, it will be shown after === Error in Pathologic.log ===.

There is a Pathway Tools forum where you can find information on Pathway Tools errors.

Output

If you did not use the output argument, results (PGDB with/without BioPAX/flat files) will be inside your ptools-local folder, ready to be used with Pathway Tools. Keep in mind that mpwt does not create the cellular overview, so if you want it you have to generate it afterwards.

The different file formats created are described on the Pathway Tools data-file format site.

If you use the output argument, mpwt will copy each of the PGDB folders to the output folder:

Folder_output
├── species_1
│   └── default-version
│   └── 1.0
│       └── data
│           └── contains BioPAX/flat files if you used the --flat/flat_creation option.
│       └── input
│           └── species_1.gbk
│           └── genetic-elements.dat
│           └── organism-init.dat
│           └── organism.dat
│       └── kb
│           └── species_1.ocelot
│       └── reports
│           └── contains Pathway Tools reports.
├── species_2
..
├── species_3
..

If you want specific files, you can use the --mX/XXX_extraction options.

--md/dat_extraction will only copy the attribute-values dat files:

Folder_output
├── species_1
│   └── classes.dat
│   └── compounds.dat
│   └── dnabindsites.dat
│   └── enzrxns.dat
│   └── genes.dat
│   └── pathways.dat
│   └── promoters.dat
│   └── protein-features.dat
│   └── proteins.dat
│   └── protligandcplxes.dat
│   └── pubs.dat
│   └── reactions.dat
│   └── regulation.dat
│   └── regulons.dat
│   └── rnas.dat
│   └── species.dat
│   └── terminators.dat
│   └── transunits.dat
│   └── ..
├── species_2
..
├── species_3
..
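Once --md/dat_extraction has filled the output folder, the extracted files can be inspected with the standard library. This is a minimal sketch that assumes the layout shown above (one sub-folder per species); `list_dat_files` is a hypothetical helper, not an mpwt function.

```python
from pathlib import Path


def list_dat_files(output_folder):
    """Map each species sub-folder to the sorted names of its .dat files."""
    return {
        sub.name: sorted(path.name for path in sub.glob("*.dat"))
        for sub in sorted(Path(output_folder).iterdir())
        if sub.is_dir()
    }
```

A species mapped to an empty list has a folder but no dat files, which may indicate that flat file creation did not complete for it.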

--mx/xml_extraction will only copy the metabolic-reactions.xml file of each PGDB (created by MetaFlux) and rename it:

Folder_output
├── species_1.xml
├── species_2.xml
├── species_3.xml
..

--mo/owl_extraction will only copy the biopax-level2.owl and the biopax-level3.owl files of each PGDB and rename them:

Folder_output
├── species_1-level2.owl
├── species_1-level3.owl
├── species_2-level2.owl
├── species_2-level3.owl
├── species_3-level2.owl
├── species_3-level3.owl
..

--mc/col_extraction will only copy the tabular files of each PGDB:

Folder_output
├── species_1
│   └── enzymes.col
│   └── genes.col
│   └── pathways.col
│   └── protcplxs.col
│   └── transporters.col
├── species_2
..
├── species_3
..

It is also possible to use a combination of these arguments:

mpwt -f input_folder -o output_folder --patho --flat --md --mx --mo --mc

Folder_output
├── species_1
│   └── biopax-level2.owl
│   └── biopax-level3.owl
│   └── classes.dat
│   └── compounds.dat
│   └── dnabindsites.dat
│   └── enzrxns.dat
│   └── enzymes.col
│   └── genes.col
│   └── genes.dat
│   └── metabolic-reactions.xml
│   └── pathways.col
│   └── pathways.dat
│   └── promoters.dat
│   └── protcplxs.col
│   └── protein-features.dat
│   └── proteins.dat
│   └── protligandcplxes.dat
│   └── pubs.dat
│   └── reactions.dat
│   └── regulation.dat
│   └── regulons.dat
│   └── rnas.dat
│   └── species.dat
│   └── terminators.dat
│   └── transporters.col
│   └── transunits.dat
│   └── ..
├── species_2
..
├── species_3
..

By using the -r/size_reduction argument, you will have compressed zip files (and PGDBs inside ptools-local will be deleted):

Folder_output
├── species_1.zip
├── species_2.zip
├── species_3.zip
..

For developers

mpwt uses logging so you need to create the handler configuration if you want mpwt's log in your application:

import logging

from mpwt import multiprocess_pwt

logging.basicConfig()

multiprocess_pwt(...)
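If `logging.basicConfig()` is too coarse for your application, a dedicated handler can be attached instead. The sketch below assumes that mpwt's loggers are named after its modules and therefore sit under a parent logger called "mpwt"; this is the usual convention with `logging.getLogger(__name__)` but has not been checked against the mpwt source code.

```python
import logging

# Assumption: mpwt's loggers are children of a logger named "mpwt".
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
)

mpwt_logger = logging.getLogger("mpwt")
mpwt_logger.setLevel(logging.INFO)
mpwt_logger.addHandler(handler)
```

With this configuration, only messages at INFO level or above from that logger hierarchy are formatted by the handler, independently of the root logger's own handlers.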

Release Notes

Changes between versions are listed on the release page.

Bibliography

Citation

Belcour* A, Frioux* C, Aite M, Bretaudeau A, Hildebrand F, Siegel A. Metage2Metabo, microbiota-scale metabolic complementarity for the identification of key species. eLife 2020, 9, e61968 https://doi.org/10.7554/eLife.61968.

mpwt depends on the following tools:

Pathway Tools for the reconstruction of draft metabolic networks (the article may not be up to date; look at the Publications on the BioCyc site):

Karp P D, Midford P E, Billington R, Kothari A, Krummenacker M, Latendresse M, Ong W K, Subhraveti P, Caspi R, Fulcher C, Keseler I M, Paley SM. Pathway Tools version 23.0 update: software for pathway/genome informatics and systems biology. Briefings in Bioinformatics 2021, 22, 109–126 https://doi.org/10.1093/bib/bbz104.

Biopython for GenBank parsing:

Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., de Hoon, M.J.L. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25, 1422–1423 https://doi.org/10.1093/bioinformatics/btp163.

gffutils for GFF parsing:

GitHub repository: https://github.com/daler/gffutils

chardet for character encoding detection:

GitHub repository: https://github.com/chardet/chardet

Acknowledgements

Mézaine Aite for his work on the first draft of this package.

Clémence Frioux for her work and feedback.

Peter Karp, Suzanne Paley, Markus Krummenacker, Richard Billington and Anamika Kothari from the Bioinformatics Research Group of SRI International for their help on Pathway Tools and on Genbank format.

GenOuest bioinformatics (https://www.genouest.org/) core facility for providing the computing infrastructure to test this tool.

All the users that have tested this tool.

License

This package is licensed under the GNU LGPL-3.0-or-later - see the LICENCE file for details.

Green2004: Green, M.L., Karp, P.D. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 5, 76 (2004). https://doi.org/10.1186/1471-2105-5-76
Karp2011: Karp, P. D., Latendresse, M., & Caspi, R. The pathway tools pathway prediction algorithm. Standards in genomic sciences 5(3), 424–429 (2011). https://doi.org/10.4056/sigs.1794338
Karp2018: Karp, P. D., Weaver, D. & Latendresse, M. How accurate is automated gap filling of metabolic models?. BMC Systems Biology 12(1), 73 (2018). https://doi.org/10.1186/s12918-018-0593-7
Karp2019arXiv: Karp, P. D., Paley, S. M., Midford, P. E., Krummenacker, M., Billington, R., Kothari, A., Ong, W. K., Subhraveti, P., Keseler, I. M. & Caspi R. Pathway Tools version 23.0: Integrated Software for Pathway/Genome Informatics and Systems Biology. arXiv (2019). https://arxiv.org/abs/1510.03964v3
PathwayToolsarXiv: Karp, P. D., Paley, S. M., Midford, P. E., Krummenacker, M., Billington, R., Kothari, A., Ong, W. K., Subhraveti, P., Keseler, I. M. & Caspi R. Pathway Tools: Integrated Software for Pathway/Genome Informatics and Systems Biology. arXiv. https://arxiv.org/abs/1510.03964

mpwt's Issues

No join and close of Pool if an error occurred.

If an error occurs in mpwt_workflow, the package exits with sys.exit(), but the multiprocessing Pool is not closed.

Furthermore, modifications of ptools-init.dat are not cleaned up.

Replace print with logging.

Add version to mpwt.

To see the version of mpwt in python, mpwt should have: mpwt.__version__

Added in 0391347.

Accept GFF file as input file.

Since version 22.0, PathoLogic accepts GFF file (release note). So mpwt should also be able to accept GFF as input file.

Add an option to use Pathway-Tools Hole-Filler.

During PathoLogic call, Hole Filler could be used with an option (associated with --patho).

--md should not be used without -o

At this moment, it is possible to use:

mpwt -f input_folder --patho --dat --md

mpwt should return an error because --md option has no folder where it can move the attribute-values files.
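A check of this kind could look like the sketch below. `check_move_options` is a hypothetical function written for this issue, not the code that was eventually added to mpwt; it also covers the other move options documented in the README.

```python
def check_move_options(output_folder, dat=False, xml=False, owl=False, col=False):
    """Raise ValueError if a move option is requested without an output folder.

    Hypothetical validation helper; the flag names mirror the command line.
    """
    flags = (("--md", dat), ("--mx", xml), ("--mo", owl), ("--mc", col))
    requested = [flag for flag, used in flags if used]
    if requested and output_folder is None:
        raise ValueError(
            f"{', '.join(requested)} requires an output folder (-o)."
        )


# Valid: a move option together with an output folder.
check_move_options("output_folder", dat=True)
```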

Split multipwt.py into multiple scripts.

The script multipwt.py contains around 1000 lines. To make it easier to read, it should be split into multiple scripts.

We could have:

input_data.py: create input data for PathoLogic.
pwt_wrapper.py: wrapper around Pathway Tools calls.
check_results.py: check results (inference and dat files).

Multiprocess when moving file.

Add an option to use Transporter Inference Parser with Pathologic.

Issue with encoding of flat_files_creation.log.

With the 'flat_files_creation.log' in check_log function, there is an encoding error when reading the file.

Add support to gbff.

Reduce size argument could return compressed PGDB.

Better error message if Pathway-Tools is not in PATH.

Accept Pathologic file as input.

Add a --version option to mpwt

Just to be compliant with many tools and be able to get the version of mpwt at a glance

Add support for .fsa extension.

Add an argument to select pathway prediction score.

Add an option to use Pathway Tools operon predictor.

Add a license to the project

Add argument to delete only PGDB redundant with input data.

When using the --clean argument with mpwt before a run on an input folder, all PGDBs are deleted.

With a new option (like --delete), only PGDBs redundant with the input files will be deleted.

Example:

Folder_input
├── species_1
│   └── species_1.gbk
├── species_2
│   └── species_2.gff
│   └── species_2.fasta
├── species_3
│   └── species_3.gbk

And species_1, species_2 and species_4 are in ptools-local.

Currently, by using mpwt -f Folder_input --patho --dat --clean
species_1, species_2 and species_4 are deleted inside ptools-local.

The new argument will only delete species_1 and species_2.
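The selection described here is a set intersection between the species in the input folder and the PGDBs already present. A minimal sketch, with `pgdbs_to_delete` as a hypothetical helper name:

```python
def pgdbs_to_delete(input_species, existing_pgdbs):
    """Return the PGDBs that are both in ptools-local and in the input folder."""
    return sorted(set(input_species) & set(existing_pgdbs))


# Values from the example above.
in_input = ["species_1", "species_2", "species_3"]
in_ptools_local = ["species_1", "species_2", "species_4"]
print(pgdbs_to_delete(in_input, in_ptools_local))
```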

Error when trying to run mpwt on Ubuntu with Python3.7

Hi,
I ran into this issue trying to run mpwt.
Could it be because of the Python version?

my command line call is:

mpwt -f path/to/input -o path/to/output --patho --hf --dat --cpu 4 --log path/to/logs --ignore-error -v

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/ptools_venv/lib/python3.7/site-packages/mpwt/pathologic_input.py", line 417, in pwt_input_files
    check_datas_lisp = create_dats_and_lisp(run_folder, taxon_file)
  File "/ptools_venv/lib/python3.7/site-packages/mpwt/pathologic_input.py", line 304, in create_dats_and_lisp
    region_feature = [feature for feature in DataIterator(gff_pathname) if feature.featuretype == 'region'][0]
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ptools_venv/bin/mpwt", line 10, in <module>
    sys.exit(run_mpwt())
  File "/ptools_venv/lib/python3.7/site-packages/mpwt/__main__.py", line 113, in run_mpwt
    verbose=verbose)
  File "/ptools_venv/lib/python3.7/site-packages/mpwt/mpwt_workflow.py", line 102, in multiprocess_pwt
    mpwt_pool.map(pwt_input_files, multiprocess_inputs)
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 290, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/lib/python3.7/multiprocessing/pool.py", line 683, in get
    raise self._value
IndexError: list index out of range
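The traceback points at a list comprehension indexed with `[0]`, which raises IndexError when the GFF file contains no feature of type 'region'. A possible way to turn this into a clearer error is sketched below; it uses plain `SimpleNamespace` objects as stand-ins for parsed GFF features, and `first_region` is a hypothetical helper, not mpwt code.

```python
from types import SimpleNamespace


def first_region(features, gff_pathname):
    """Return the first feature of type 'region' or raise a clear error."""
    region = next((f for f in features if f.featuretype == "region"), None)
    if region is None:
        raise ValueError(
            f"No 'region' feature found in {gff_pathname}; "
            "it is needed to read the taxon ID."
        )
    return region


# Stand-in features; real ones would come from a GFF parser.
features = [
    SimpleNamespace(featuretype="gene", id="gene_id"),
    SimpleNamespace(featuretype="region", id="region_id"),
]
print(first_region(features, "species_2.gff").id)
```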

Issue when checking input files.

When checking the presence of input folders/files, mpwt does not check if there is the corresponding '.gbk' or '.gff' in the input folder.

Manage issue with PubMed citations.

If there are too many runs of Pathway Tools, some of them can fail because their requests to load PubMed citations have been stopped. One fix is to avoid loading these citations.

Encoding issues when creating terminal logs.

There are multiple issues with encoding when writing the terminal log files.

Fixed in:

bc77418 (add chardet to find encoding linked to molecule typing)
dae2c46 (replace error in decoded)
39f34bc (for singularity)
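The second fix in this list replaces undecodable bytes instead of failing. The idea can be sketched with the standard library alone; `decode_terminal_output` is a hypothetical helper, and the encoding detection with chardet mentioned in the first commit is not reproduced here.

```python
def decode_terminal_output(raw_bytes, encoding="utf-8"):
    """Decode bytes, replacing undecodable bytes instead of raising.

    Invalid sequences become the Unicode replacement character U+FFFD.
    """
    return raw_bytes.decode(encoding, errors="replace")


print(decode_terminal_output(b"PathoLogic build done"))
```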

Add an option to create taxon_id.tsv.

If there is no taxon_id.tsv file in the input folder, mpwt could create it.

For example by using the command:

mpwt -f input_folder --taxon-file

Pop-up showing after attribute-value files creations.

With Singularity, pop-ups appear after attribute-value files creations.

It seems that the child processes of Pathway Tools are not killed correctly.

Issue with Pathway Tools 25.0 when checking log.

There is an error with the check_pwt function with Pathway Tools 25.0.

In the pathologic.log file, the term 'proteins' has been replaced by 'polypeptides'. This word is used to indicate the number of proteins in the draft metabolic network.

Error when trying to create dat files without input folder.

If the user uses the option to create only dat files: mpwt --dat --md -o output_folder -v

This will lead to an error:

  File "/usr/local/bin/mpwt", line 11, in <module>
    load_entry_point('mpwt', 'console_scripts', 'mpwt')()
  File "/mnt/c/Users/Arnaud/Downloads/Work_directory/programs/mpwt/mpwt/__main__.py", line 149, in run_mpwt
    verbose=verbose)
  File "/mnt/c/Users/Arnaud/Downloads/Work_directory/programs/mpwt/mpwt/mpwt_workflow.py", line 217, in multiprocess_pwt
    input_folder_path = input_folder + '/' + dat_run_id + '/'
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
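The traceback shows a string concatenation on an input folder that is None. One possible guard is sketched below. It assumes that a missing input folder simply means there is no per-run input path to build. The helper name `build_input_folder_path` is hypothetical.

```python
import os


def build_input_folder_path(input_folder, run_id):
    """Return the per-run input path, or None when no input folder was given."""
    if input_folder is None:
        # With only --dat/--md and no -f, there is nothing to join with.
        return None
    return os.path.join(input_folder, run_id)
```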

Handle error if PathoLogic build is aborted.

Add a better error if Pathway Tools failed.

More flexible tool.

Split the PathoLogic run and the dat creation, so that each one can be run separately.

Run BioPAX/dat creation on PGDBs inside ptools-local but not in output folder.

Add more information in taxon_id.tsv.

taxon_id.tsv could store more information, such as genome type or genome circularity.

Add error message when there are multiple input files.

mpwt should raise an error when it detects multiple input files in the same input folder.
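A minimal sketch of this detection, assuming that only '.gbk' and '.gff' files count as input files. The function name `check_single_input_file` and the choice of ValueError are illustrative, not mpwt's actual behaviour.

```python
import os

# Assumed set of annotation file extensions.
INPUT_EXTENSIONS = (".gbk", ".gff")


def check_single_input_file(species_folder):
    """Return the single input file name, or None if there is none.

    Raise ValueError when several candidate input files are present.
    """
    candidates = sorted(
        name
        for name in os.listdir(species_folder)
        if name.lower().endswith(INPUT_EXTENSIONS)
    )
    if len(candidates) > 1:
        raise ValueError(
            f"Multiple input files in {species_folder}: {', '.join(candidates)}"
        )
    return candidates[0] if candidates else None
```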

Error with --clean when used with -v and --cpu.

Print error in pathologic.log in the terminal.

mpwt --dat kills pathway-tools command if there is a non-fatal error

When creating attribute-value files, mpwt can kill pathway-tools if it detects an 'Error'. But for pathway-tools, not all errors are fatal.

mpwt --dat must be able to distinguish between fatal and non-fatal errors.

Moving output files with mpwt should be more flexible.

At the moment, mpwt can only move dat files. It should also be able to move other files, such as xml (--mx) or owl (--mo) files.

Like:
mpwt -f input_folder --patho --dat --mx -o output_folder

input_folder
├── species_1
│ └── species_1.gbk
├── species_2
│ └── species_2.gbk

mpwt will create:

output_folder
├── species_1.xml
├── species_2.xml

With the xml files being the metabolic-reactions.xml file of each PGDB.

With:
mpwt -f input_folder --patho --dat --mx --md -o output_folder

mpwt will create:

output_folder
├── species_1
│ └── classes.dat
│ └── compounds.dat
│ └── dnabindsites.dat
│ └── enzrxns.dat
│ └── genes.dat
│ └── metabolic-reactions.xml
│ └── pathways.dat
│ └── promoters.dat
│ └── protein-features.dat
│ └── proteins.dat
│ └── protligandcplxes.dat
│ └── pubs.dat
│ └── reactions.dat
│ └── regulation.dat
│ └── regulons.dat
│ └── rnas.dat
│ └── species.dat
│ └── terminators.dat
│ └── transunits.dat
│ └── ..
├── species_2
..
├── species_3
..

Add gbk2pf converter.

Because GenBank files created by different tools differ, errors can occur during PathoLogic. To avoid these errors, we could convert GenBank files into PathoLogic Format.

The idea is to add a new argument:

mpwt gbk2pf -i input_folder -o output_folder --cpu number_cpu

This command takes as arguments a folder containing the GenBank files (with the same structure as the input_folder of mpwt -f) and an output folder. It will then generate the PathoLogic Format files from the GenBank files.

Refactor how paths are handled.

Paths are not well formatted (they are built with a lot of '/'). This is incompatible with some operating systems (like Windows).
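One common way to avoid hard-coded separators is the standard `pathlib` module. The sketch below is an assumption about how such a refactoring could look. The helper `species_input_file` is hypothetical and only mirrors the input layout described in this document.

```python
from pathlib import Path


def species_input_file(input_folder, species, extension=".gbk"):
    """Build '<input_folder>/<species>/<species><extension>' portably."""
    # The '/' operator of Path inserts the separator of the current OS.
    return Path(input_folder) / species / f"{species}{extension}"
```

`os.path.join` achieves the same result with plain strings if changing the types passed between functions is not desirable.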

Make each run of Pathway Tools independent.

Right now, mpwt runs PathoLogic, checks for errors, creates flat files (if the option has been used) and then moves the output. Each of these steps is performed separately by sending all the inputs to one function with a starmap call.

But it could be useful to launch all of these steps independently for each organism. In this way, if there is an error, only the organism with the error will stop and the others will continue.

This could also allow deleting the PGDBs already created in ptools-local (if their builds have passed) to reduce disk usage when dealing with a large number of organisms.
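The per-organism idea can be sketched as follows. This is an illustration under stated assumptions, not mpwt's implementation: `steps` is a list of hypothetical callables (for example a build step and an export step), and the organisms are processed sequentially here for clarity, whereas a real run would dispatch `run_organism` through a multiprocessing pool.

```python
def run_organism(species, steps):
    """Run every step for one organism.

    Return (species, None) on success, or (species, message) for the
    first step that fails, so one failure does not stop the others.
    """
    for step in steps:
        try:
            step(species)
        except Exception as error:  # keep the failure local to this organism
            return species, f"{step.__name__}: {error}"
    return species, None


def run_all(species_list, steps):
    """Run all organisms and map each species to its error message (or None)."""
    return dict(run_organism(species, steps) for species in species_list)
```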

PathoLogic error can occur after PGDB creation.

In the pathologic.log file, errors can occur after the line "PGDB build done", leading mpwt to report that the PathoLogic run has both failed and succeeded.

Replace Pool map by starmap.

To have understandable function inputs for run_pwt, run_pwt_dat and some other functions, we can use Pool starmap instead of map.

Pool starmap allows a function to take multiple arguments, whereas map allows only one argument.

But starmap is only available in Python 3.3 and later. With the release of version 0.5.7, mpwt drops support for Python versions earlier than 3.3.

Begin in 4384094.
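The difference between the two calls can be illustrated with a small sketch. The worker functions and their arguments are hypothetical and do not reproduce the signature of run_pwt. The `pool_factory` parameter is only there so that another pool class with the same interface can be substituted.

```python
from multiprocessing import Pool


def describe_run(species, patho_inference):
    """starmap-style worker: explicit, named positional arguments."""
    return f"{species}: patho={patho_inference}"


def describe_run_packed(packed):
    """map-style worker: one tuple argument that must be unpacked by hand."""
    species, patho_inference = packed
    return describe_run(species, patho_inference)


def run_both(inputs, pool_factory=Pool):
    """Return the results of map and starmap on the same list of tuples."""
    with pool_factory(processes=2) as pool:
        with_map = pool.map(describe_run_packed, inputs)
        with_starmap = pool.starmap(describe_run, inputs)
    return with_map, with_starmap
```

When using the default process pool, `run_both` should be called under an `if __name__ == "__main__":` guard so that it also works with start methods that re-import the main module.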

Add better error handling.

Error with dat creation if PGDB already there.

The creation of BioPAX/attribute-value files is not launched when a PGDB of an input species already exists inside ptools-local.

Add a check to delete unfinished PathoLogic build.

If mpwt is killed during a run, Pathway Tools may not have finished the build of some species (they do not contain all the reactions/pathways in their metabolic network).

After this, if mpwt is re-run, it will not launch Pathway Tools again for these PGDBs. To avoid this issue, mpwt should check the pathologic.log file of each existing PGDB to see whether the previous build was successful.
If not, mpwt should delete these unfinished PGDBs and re-run Pathway Tools on them.
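A minimal sketch of this check, assuming that a finished build can be recognised by the line "PGDB build done" in pathologic.log (the marker quoted in an earlier issue of this list). The function names and the (log path, PGDB folder) pairing are hypothetical. A real check would probably also look for errors logged after the marker.

```python
import os
import shutil

# Assumed success marker in pathologic.log.
BUILD_DONE_MARKER = "PGDB build done"


def is_build_finished(log_path):
    """Return True if the log exists and contains the build-done marker."""
    if not os.path.isfile(log_path):
        return False
    with open(log_path, encoding="utf-8", errors="replace") as log_file:
        return any(BUILD_DONE_MARKER in line for line in log_file)


def remove_unfinished_builds(builds):
    """Delete the PGDB folder of every unfinished build.

    builds: iterable of (log_path, pgdb_folder) pairs.
    Return the list of deleted folders, which should be rebuilt.
    """
    removed = []
    for log_path, pgdb_folder in builds:
        if not is_build_finished(log_path) and os.path.isdir(pgdb_folder):
            shutil.rmtree(pgdb_folder)
            removed.append(pgdb_folder)
    return removed
```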
