mpc-bioinformatics / protgraph

ProtGraph - A Graph-Generator for Proteins

License: Other

Python 98.98% Shell 1.02%

graph peptide fasta protein python3 pypi bioconda uniprot

protgraph's Introduction

ProtGraph - A Protein-Graph-Generator

Summary

ProtGraph is a Python package that converts protein-entries from the UniProtKB into so-called protein-graphs. We use the SP-EMBL-entries, provided by UniProtKB as *.txt or *.dat-files, and parse the available feature-information. A SP-EMBL-entry is more detailed than the FASTA-entry of the same protein. SP-EMBL-files not only contain the canonical sequence, but also isoform-sequences, specifically cleaved peptides such as SIGNAL-peptides and PROPEPtides (or simply: PEPTIDEs, e.g. ABeta 40/42), variationally altered aminoacid-sequences (VARIANTs, mutations or sequence-CONFLICTs) and more. The SP-EMBL-files are parsed with BioPython, and the corresponding protein-graphs, which contain the additional feature-information, are generated with igraph. The resulting graphs contain all protein-sequences that can result from the canonical sequence and from the feature-information, and they can be processed further (e.g. digested, so that they contain peptides).

So, what can we do with ProtGraph? First, ProtGraph can save the protein-graphs in various formats. These can then be opened with Python, R, C++ or any other tool (e.g. Gephi) to apply algorithms, statistics, visualizations and so on. In this case, ProtGraph acts solely as a converter of protein-entries into a graph-structure. Second, while generating protein-graphs, a statistics-file is written, containing various pieces of information about each protein-graph that can be retrieved on the fly. We can calculate the number of nodes/edges/features within a protein-graph as well as the number of protein-/peptide-sequences it contains, optionally binned by specific attributes. These numbers give a quick overview, e.g. of the tryptic search space when considering all feature-information of the provided species-proteome. Lastly, the protein-graphs by themselves cannot be used directly for identification. Therefore, we extended ProtGraph to optionally convert protein-graphs into FASTA-entries, which can be used for identification and enable searches for feature-induced peptides.

Curious what we do with ProtGraph and its output? Check out materials_and_posters for an explanation of the protein-graph-generation and further materials. Below in the README.md, additional examples are provided.

Setting up ProtGraph

You can download ProtGraph directly from PyPI. It is also available in bioconda. The installation instructions can be found here. ProtGraph can also be installed directly from this repository:

pip install protgraph

# OR

conda install -c bioconda protgraph 

# OR

git clone git@github.com:mpc-bioinformatics/ProtGraph.git
cd ProtGraph/
pip install .

# OR 

pip install git+https://github.com/mpc-bioinformatics/ProtGraph.git

You can use [full] if you want to install ProtGraph with all its dependencies. To see an overview of possible parameters and options use: protgraph --help. The help-output contains brief descriptions for each flag. Also check out the other available commands, which start with protgraph_ (e.g. protgraph_print_sums).

Troubleshooting

ProtGraph has many dependencies, most of which are needed for the export functionalities. For the dependency psycopg, it may be necessary to install PostgreSQL on your operating system, so that the wheel can be built. It should be sufficient to install it as follows: sudo pacman -S postgresql (adapt this for apt and alike).

NOTE: UniProt updated the SP-EMBL-Format (August 3, 2022), which is no longer compatible with biopython versions up to 1.79. If you encounter errors, please try this version, which can be installed with:

pip install git+https://github.com/biopython/biopython.git@947868c487a12799d51173c5f651a44ecb3fb6fa

For the SQLite-databases, we use the dependency apsw. It can be installed from its repository via:

pip install git+https://github.com/rogerbinns/apsw.git

If the command protgraph cannot be found, make sure that you have included the python executable-folder in your environment variable PATH. (If pip was executed with the flag --user, then the binaries should be available in the folder: ~/.local/bin)

Generating and retrieving Statistics from a SP-EMBL-Example-Entry

Assume we have downloaded the SP-EMBL-Entry of the protein with the accession QXXXXX as QXXXXX.txt. A look into this text file shows the following:

$ cat examples/QXXXXX.txt
ID   MY_CUSTOM_PROTEIN             Reviewed;         8 AA.
AC   QXXXXX

...

FT   INIT_MET        1
FT                   /note="Removed"
FT                   /evidence="EXAMPLE Initiator Methionine"
FT   SIGNAL          1..4
FT                   /evidence="EXAMPLE Signal Peptide"
FT   VARIANT         3
FT                   /note="Missing (in Example)"
FT                   /evidence="EXAMPLE Variant 1"
FT                   /id="VAR_XXXXX1"
FT   VARIANT         4
FT                   /note="O -> L (in Example)"
FT                   /evidence="EXAMPLE Variant 2"
FT                   /id="VAR_XXXXX2"
FT   VARIANT         5
FT                   /note="T -> K (in Example)"
FT                   /evidence="EXAMPLE Variant 3"
FT                   /id="VAR_XXXXX3"
SQ   SEQUENCE   8 AA;  988 MW;  XXXXXXXXXXXXXXXX CRC64;
     MPROTEIN

We can see that this entry contains the canonical sequence (described in section SQ). This would be the sequence which would be present in the FASTA-file. Besides the canonical sequence, we can see a detailed overview of all feature-information for this protein (described in section FT). In this example, a SIGNAL peptide is present, which may lead to a peptide, which is not covered when cleaving (e.g. via the digestion enzyme Trypsin). Additionally, 3 VARIANTs are present in this entry, informing about substituted aminoacids.
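To make the structure of the FT section more concrete, the following is a minimal, stdlib-only sketch that reads feature lines like the ones shown above. ProtGraph itself parses SP-EMBL-entries with BioPython; the function name parse_ft_lines and the two regular expressions are assumptions made for this illustration and only cover single-line qualifiers and simple positions.

```python
import re

# Excerpt of the FT section of the example entry shown above.
EXAMPLE_ENTRY = """\
FT   INIT_MET        1
FT                   /note="Removed"
FT   SIGNAL          1..4
FT   VARIANT         4
FT                   /note="O -> L (in Example)"
FT                   /id="VAR_XXXXX2"
"""

# A feature line: "FT", three spaces, the feature key, then "start..end" or one position.
FEATURE_RE = re.compile(r"^FT\s{3}(\S+)\s+(\d+)(?:\.\.(\d+))?\s*$")
# A qualifier line: "FT", indentation, then /name="value".
QUALIFIER_RE = re.compile(r'^FT\s+/(\w+)="(.*)"\s*$')


def parse_ft_lines(text):
    """Return a list of dicts with the keys: key, start, end, qualifiers."""
    features = []
    for line in text.splitlines():
        feature = FEATURE_RE.match(line)
        if feature:
            key, start, end = feature.groups()
            features.append({
                "key": key,
                "start": int(start),
                "end": int(end or start),  # single positions get start == end
                "qualifiers": {},
            })
            continue
        qualifier = QUALIFIER_RE.match(line)
        if qualifier and features:
            # Qualifiers belong to the most recently seen feature.
            name, value = qualifier.groups()
            features[-1]["qualifiers"][name] = value
    return features


features = parse_ft_lines(EXAMPLE_ENTRY)
for feature in features:
    print(feature["key"], feature["start"], feature["end"], feature["qualifiers"])
```

Real entries can contain qualifiers that span several lines and uncertain positions, which this sketch does not handle.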

If we execute protgraph examples/QXXXXX.txt, we generate a protein-graph which, if drawn manually, would look like the following:

(Figure: manually drawn protein-graph of QXXXXX)

The visualization shows that the information from the sections FT and SQ was added. The chain in the middle describes the canonical sequence. Additional nodes illustrate the aminoacid-substitutions of the VARIANTs, and additional edges have been added to mimic the SIGNAL-peptide. ProtGraph additionally adds a start node and an end node (internally denoted by: __start__ and __end__) and digests the graph with Trypsin (which is the default for each protein-graph).

All graphs generated by ProtGraph are directed and acyclic. These properties allow us to calculate interesting statistics about this protein:

For example, executing protgraph -cnp examples/QXXXXX.txt adds the number of possible (non-repeating) paths between the start- and end-node to the statistics-output. We find that there are 46 possible paths, or in other words peptides, in this graph.
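Counting paths is cheap because the graph is directed and acyclic: the number of paths from a node to the end-node is the sum over its successors. The following is a minimal sketch of this idea on a hypothetical toy graph (it is not the graph of QXXXXX, and count_paths is a name chosen for this illustration, not ProtGraph's implementation).

```python
from functools import lru_cache

# Hypothetical toy DAG given as node -> list of successors.
EDGES = {
    "__start__": ["A", "B"],
    "A": ["C"],
    "B": ["C", "D"],
    "C": ["__end__"],
    "D": ["__end__"],
    "__end__": [],
}


def count_paths(edges, source, target):
    """Count all paths from source to target in a directed acyclic graph."""

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == target:
            return 1
        # Every path through `node` continues through exactly one successor.
        return sum(paths_from(successor) for successor in edges[node])

    return paths_from(source)


num_paths = count_paths(EDGES, "__start__", "__end__")
print(num_paths)
```

Each node is evaluated only once thanks to the cache, so the count is obtained without enumerating the paths themselves.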

For example, executing protgraph -cnpm examples/QXXXXX.txt would include the number of possible paths, binned by miscleavages, in the statistics-output. This protein-graph contains 23 peptides with no miscleavages, 19 peptides with exactly 1 miscleavage and 4 peptides with exactly 2 miscleavages.

As seen in the image, ProtGraph merges some nodes into a single node (e.g. EIN). We can prevent this behavior with the --no-merge-flag. Combining this flag with another, we can bin the number of paths by the path-length itself with protgraph -nm -cnph examples/QXXXXX.txt. The statistics-output would then contain the number of peptides for each peptide-length within this protein-graph:
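The binned count can be obtained with the same kind of dynamic programming as the plain path count, by carrying a small list of counts instead of a single number. The sketch below assumes that every edge carries a boolean flag that marks a skipped cleavage site; the toy graph, the flag and the name count_paths_by_flag are assumptions for this illustration and do not mirror ProtGraph's internal data model.

```python
from functools import lru_cache

# Hypothetical toy DAG: node -> list of (successor, edge_is_flagged).
EDGES = {
    "__start__": [("A", False), ("B", False)],
    "A": [("B", True), ("__end__", False)],
    "B": [("C", True), ("__end__", False)],
    "C": [("__end__", False)],
    "__end__": [],
}


def count_paths_by_flag(edges, source, target):
    """Return a list where index k is the number of paths using k flagged edges."""

    @lru_cache(maxsize=None)
    def bins_from(node):
        if node == target:
            return (1,)
        totals = []
        for successor, flagged in edges[node]:
            bins = bins_from(successor)
            if flagged:
                # One more flagged edge: shift all counts by one bin.
                bins = (0,) + bins
            totals.extend([0] * (len(bins) - len(totals)))
            for index, count in enumerate(bins):
                totals[index] += count
        return tuple(totals)

    return list(bins_from(source))


bins = count_paths_by_flag(EDGES, "__start__", "__end__")
print(bins)
```

The sum over all bins equals the total number of paths, which is the same relation as between the 46 paths and the 23 + 19 + 4 peptides reported above.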

Peptide Length	1	2	3	4	5	6	7	8
#Peptides	3	5	8	8	6	4	8	4

Exporting Protein-Graphs to various graph-file-formats

By default, ProtGraph does not save the generated graphs: each protein-graph is discarded after it has been processed. To export each protein-graph into a file, simply set the flags -edot, -egraphml and/or -egml, which will create the corresponding dot, GraphML or GML files. Other tools, like Gephi, could then visualize this graph. With the flag -epickle it is also possible to generate a binary pickle file of a protein-graph (which can be used by other python programs). This is illustrated with the example protein QXXXXX:

$ protgraph -epickle examples/QXXXXX.txt 
1proteins [00:02,  2.05s/proteins]

To load the protein-graph back into python, it is enough to do the following:

In [1]: import pickle

In [2]: with open("exported_graphs/QXXXXX.pickle", "rb") as in_file:
   ...:     qxxxxx = pickle.load(in_file)

In [3]: qxxxxx
Out[3]: <igraph.Graph at 0x7f98b848e6b0>

In [4]: qxxxxx.vcount()
Out[4]: 10

In [5]: qxxxxx.ecount()
Out[5]: 24

In [6]: list(qxxxxx.vs[:])
Out[6]: 
[igraph.Vertex(<igraph.Graph object at 0x7f98b848e6b0>, 0, {'aminoacid': '__start__', 'position': 0, 'accession': 'QXXXXX'}),
 igraph.Vertex(<igraph.Graph object at 0x7f98b848e6b0>, 1, {'aminoacid': 'M', 'position': 1, 'accession': 'QXXXXX'}),
 igraph.Vertex(<igraph.Graph object at 0x7f98b848e6b0>, 2, {'aminoacid': 'P', 'position': 2, 'accession': 'QXXXXX'}),
 igraph.Vertex(<igraph.Graph object at 0x7f98b848e6b0>, 3, {'aminoacid': 'R', 'position': 3, 'accession': 'QXXXXX'}),
 igraph.Vertex(<igraph.Graph object at 0x7f98b848e6b0>, 4, {'aminoacid': 'O', 'position': 4, 'accession': 'QXXXXX'}),
 igraph.Vertex(<igraph.Graph object at 0x7f98b848e6b0>, 5, {'aminoacid': 'T', 'position': 5, 'accession': 'QXXXXX'}),
 igraph.Vertex(<igraph.Graph object at 0x7f98b848e6b0>, 6, {'aminoacid': '__end__', 'position': 9, 'accession': 'QXXXXX'}),
 igraph.Vertex(<igraph.Graph object at 0x7f98b848e6b0>, 7, {'aminoacid': 'L', 'position': None, 'accession': 'QXXXXX'}),
 igraph.Vertex(<igraph.Graph object at 0x7f98b848e6b0>, 8, {'aminoacid': 'K', 'position': None, 'accession': 'QXXXXX'}),
 igraph.Vertex(<igraph.Graph object at 0x7f98b848e6b0>, 9, {'aminoacid': 'EIN', 'position': 6, 'accession': 'QXXXXX'})]

Retrieving Statistics over the Tryptic Search Space of a Species (E.Coli)

In the examples-folder you can find the already downloaded SP-EMBL-entries of Escherichia coli (strain K12). Generating the protein-graphs and a basic statistics-file over all entries takes only seconds:

$ protgraph examples/e_coli.dat 
9434proteins [00:15, 627.51proteins/s]

ProtGraph shows us that 9434 proteins have been processed. An excerpt of the basic statistics file informs us that ProtGraph included 28 isoforms, 815 signal peptides and 5034 pieces of mutagen information. To include and retrieve the number of peptides for E.Coli, we have to set the corresponding flags in ProtGraph and can then sum up the corresponding columns in the statistics file:

$ protgraph -nm -cnp -cnpm -cnph examples/e_coli.dat 
9434proteins [00:27, 348.70proteins/s]

$ protgraph_print_sums -cidx 12 protein_graph_statistics.csv 
9433rows [00:00, 135774.65rows/s]
Results from column 'num_paths':

Sum of each entry
0:  14842148316403611
....

$ protgraph_print_sums -cidx 13 protein_graph_statistics.csv 
9433rows [00:00, 19692.29rows/s]
Results from column 'list_paths_miscleavages':

Sum of each entry
  0:          24884044
  1:          80527605
  2:         126597315
  3:         152794276
...

$ protgraph_print_sums -cidx 14 protein_graph_statistics.csv 
9433rows [00:00, 19696.70rows/s]
Results from column 'list_paths_hops':

Sum of each entry
   0:                 0
   1:             32342
   2:             31252
   3:             31277
   4:             31303
   5:             30733
   6:             29118
   7:             30827
   8:             30653
   9:             30127
  10:             30780
...

From the output, it can be seen that already the proteins in E.Coli can yield 14842148316403611 peptides, if including all feature-information. The other outputs show, that we can bin those peptides by the number of miscleavages (e.g. exactly 24 884 044 peptides with 0 miscleavages) and by the peptide length (e.g. exactly 31 252 peptides with 2 aminoacids).
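Summing a column of the statistics file needs little more than Python's csv module, since Python integers do not overflow on such large numbers. The following sketch sums the num_paths column of a hypothetical in-memory excerpt; the accession column, the row values and the helper sum_column are assumptions, and the real file contains many more columns.

```python
import csv
import io

# Hypothetical excerpt of a statistics file (values are made up).
STATISTICS_CSV = """\
accession,num_paths
P00001,46
P00002,3
P00003,100000000000000000000
"""


def sum_column(handle, column):
    """Sum the integer values of one named column, skipping empty cells."""
    reader = csv.DictReader(handle)
    return sum(int(row[column]) for row in reader if row[column])


total = sum_column(io.StringIO(STATISTICS_CSV), "num_paths")
print(total)
```

For a file on disk, the io.StringIO object would be replaced by an opened file handle.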

Replacing Aminoacids (J->I,L)

In some protein-entries provided by UniProtKB, several aminoacids are summarized by a single letter code. For example, the letter B stands for either D (Aspartic acid) or N (Asparagine). ProtGraph can replace such letters with the actual aminoacids, as depicted below:

$ protgraph -cnp -raa "B->D,N" examples/e_coli.dat
9434proteins [00:16, 588.38proteins/s] 
$ protgraph_print_sums -cidx 12 protein_graph_statistics.csv 
9433rows [00:00, 142756.67rows/s]
Results from column 'num_paths':

Sum of each entry
0:  14842148316404547
...

$ protgraph -cnp -raa "J->I,L" examples/e_coli.dat
9434proteins [00:14, 671.71proteins/s] 
$ protgraph_print_sums -cidx 12 protein_graph_statistics.csv 
9433rows [00:00, 180680.48rows/s]
Results from column 'num_paths':

Sum of each entry
0:  14842148316403617
...

$ protgraph -cnp -raa "Z->Q,E" examples/e_coli.dat
9434proteins [00:06, 1560.85proteins/s]
$ protgraph_print_sums -cidx 12 protein_graph_statistics.csv 
9433rows [00:00, 197856.01rows/s]
Results from column 'num_paths':

Sum of each entry
0:  14842148316373256
...

$ protgraph -cnp -raa "X->A,C,D,E,F,G,H,I,K,L,M,N,O,P,Q,R,S,T,U,V,W,Y" examples/e_coli.dat
9434proteins [00:06, 1560.41proteins/s]
$ protgraph_print_sums -cidx 12 protein_graph_statistics.csv 
9433rows [00:00, 199515.24rows/s]
Results from column 'num_paths':

Sum of each entry
0:  14842148528266967
...

It can be observed that the number of peptides for E.Coli (tryptically digested) changes, and increases in most of the runs above, since ProtGraph now considers all possibilities of the substitutes in a sequence. The aminoacid-replacements (-raa) can also be chained and work for all aminoacids.
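The effect of a replacement rule can be illustrated on plain strings. The sketch below parses rules in the same "B->D,N" notation as the -raa flag and expands a short, made-up sequence into all concrete sequences. ProtGraph applies the replacement on the graph rather than on strings, so the helpers parse_rule and expand are only an assumption-laden illustration of the combinatorics.

```python
from itertools import product


def parse_rule(rule):
    """Split a rule such as "B->D,N" into ("B", ["D", "N"])."""
    source, targets = rule.split("->")
    return source.strip(), [target.strip() for target in targets.split(",")]


def expand(sequence, rules):
    """Return every sequence obtained by resolving all ambiguous letters."""
    mapping = dict(parse_rule(rule) for rule in rules)
    # Each position contributes either its own letter or all of its substitutes.
    choices = [mapping.get(aminoacid, [aminoacid]) for aminoacid in sequence]
    return ["".join(combination) for combination in product(*choices)]


variants = expand("ABJ", ["B->D,N", "J->I,L"])
print(variants)
```

Two ambiguous positions with two substitutes each already yield four sequences, which is why chained replacements can enlarge the search space quickly.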

Generation of E.Coli-FASTA-databases, containing only specific information

There are other formats, like spectral libraries, but we first focused on the FASTA-format, since it is broadly used. ProtGraph traverses the protein-graphs in a depth-first manner, and limits can be specified globally over a set of SP-EMBL-entries.
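A depth-first traversal with a global limit can be sketched as follows. The toy graph, the node labels and the function iter_peptides are hypothetical and only illustrate the idea of emitting one sequence per start-to-end path while pruning paths that exceed a maximum number of hops (edges); this is not ProtGraph's exporter.

```python
# Hypothetical toy graph: node -> successors, plus the aminoacid(s) of each node.
EDGES = {
    "__start__": ["M"],
    "M": ["P"],
    "P": ["R"],
    "R": ["O", "L"],
    "O": ["__end__"],
    "L": ["__end__"],
    "__end__": [],
}
LABELS = {"M": "M", "P": "P", "R": "R", "O": "O", "L": "L"}


def iter_peptides(edges, labels, source, target, max_hops=None):
    """Yield the concatenated labels of every source-to-target path (iterative DFS)."""
    stack = [(source, "", 0)]
    while stack:
        node, sequence, hops = stack.pop()
        if node == target:
            yield sequence
            continue
        if max_hops is not None and hops >= max_hops:
            continue  # prune: the target cannot be reached within the limit
        for successor in edges[node]:
            # Start and end nodes have no label and contribute nothing.
            stack.append((successor, sequence + labels.get(successor, ""), hops + 1))


peptides = sorted(iter_peptides(EDGES, LABELS, "__start__", "__end__"))
print(peptides)
```

An explicit stack avoids Python's recursion limit on long proteins, and the generator never holds more than the current frontier in memory.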

First, we generate a FASTA-file from the SP-EMBL-entries of E.Coli. It should only contain the canonical sequence (so no feature-information applied on the protein-graphs) and should be undigested:

$ protgraph -ft NONE -d skip -epepfasta --pep_fasta_out e_coli.fasta examples/e_coli.dat 
9434proteins [00:05, 1869.91proteins/s]

$ head e_coli.fasta 
>pg|ID_9|P13445(1:330,mssclvg:-1)
MSQNTLKVHDLNEDAEFDENGVEVFDEKALVEQEPSDNDLAEEELLSQGATQRVLDATQL
YLGEIGYSPLLTAEEEVYFARRALRGDVASRRRMIESNLRLVVKIARRYGNRGLALLDLI
EEGNLGLIRAVEKFDPERGFRFSTYATWWIRQTIERAIMNQTRTIRLPIHIVKELNVYLR
TARELSHKLDHEPSAEEIAEQLDKPVDDVSRMLRLNERITSVDTPLGGDSEKALLDILAD
EKENGPEDTTQDDDMKQSIVKWLFELNAKQREVLARRFGLLGYEAATLEDVGREIGLTRE
RVRQIQVEGLRRLREILQTQGLNIEALFRE
>pg|ID_3|P0A840(1:253,mssclvg:-1)
MRILLSNDDGVHAPGIQTLAKALREFADVQVVAPDRNRSGASNSLTLESSLRTFTFENGD
IAVQMGTPTDCVYLGVNALMRPRPDIVVSGINAGPNLGDDVIYSGTVAAAMEGRHLGFPA

$ cat e_coli.fasta | grep "^>" | wc -l
9434

NOTE: The generated FASTA-files using -epepfasta can have same sequences for multiple entries!

The FASTA-file contains the same number of entries as the SP-EMBL-file. Note that the header was reformatted to contain all the information along the path: pg stands for ProtGraph, ID_XXX is a unique identifier for this FASTA-entry, and it is followed by the information along the path. In this case, the header tells us that the entry covers the whole sequence of a protein and that it had no miscleavages (shown as mssclvg:-1, since the protein-graph was not digested). Next, it would be interesting to export a digested FASTA-file (peptide-FASTA), containing SIGNAL-, PEPTIDE- and PROPEP-information:

$ protgraph -ft SIGNAL -ft PROPEP -ft PEPTIDE -d trypsin -epepfasta --pep_fasta_out e_coli_signal_propep_pep.fasta examples/e_coli.dat 
9434proteins [00:29, 324.64proteins/s]

$ head e_coli_signal_propep_pep.fasta 
>pg|ID_0|P23857(4:4,mssclvg:0)
K
>pg|ID_11|P23857(4:104,mssclvg:12)
KGLLALALVFSLPVFAAEHWIDVRVPEQYQQEHVQGAINIPLKEVKERIATAVPDKNDTV
KVYCNAGRQSGQAKEILSEMGYTHVENAGGLKDIAMPKVKG
>pg|ID_22|P23857(4:103,mssclvg:11)
KGLLALALVFSLPVFAAEHWIDVRVPEQYQQEHVQGAINIPLKEVKERIATAVPDKNDTV
KVYCNAGRQSGQAKEILSEMGYTHVENAGGLKDIAMPKVK
>pg|ID_33|P23857(4:101,mssclvg:10)
KGLLALALVFSLPVFAAEHWIDVRVPEQYQQEHVQGAINIPLKEVKERIATAVPDKNDTV

$ cat e_coli_signal_propep_pep.fasta | grep "SIGNAL" | head -n 1
>pg|ID_143|P23857(4:19,mssclvg:1,SIGNAL[1:19])

$ cat e_coli_signal_propep_pep.fasta | grep "^>" | wc -l
6660482
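The headers shown in the outputs above follow a regular pattern and can be read back with a small parser. The following sketch is based only on the header examples in this README; the regular expressions and the function parse_header are assumptions and may not cover every header ProtGraph can emit (e.g. qualifiers that themselves contain a closing parenthesis).

```python
import re

# ">pg|ID_<number>|<path description>"
HEADER_RE = re.compile(r"^>pg\|ID_(\d+)\|(.+)$")
# "<accession>(<start>:<end>,mssclvg:<n>[,<further qualifiers>])"
PATH_RE = re.compile(r"([\w-]+)\((\d+):(\d+),mssclvg:(-?\d+)(?:,([^)]*))?\)")


def parse_header(line):
    """Return (entry_id, list of path descriptions) for one FASTA header line."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a ProtGraph peptide header: {line!r}")
    entry_id, description = match.groups()
    paths = []
    for accession, start, end, missed, extra in PATH_RE.findall(description):
        paths.append({
            "accession": accession,
            "start": int(start),
            "end": int(end),
            "miscleavages": int(missed),
            "extra": extra,  # remaining qualifiers, kept as raw text
        })
    return int(entry_id), paths


entry_id, paths = parse_header(">pg|ID_143|P23857(4:19,mssclvg:1,SIGNAL[1:19])")
print(entry_id, paths)
```

The remaining qualifiers are kept as raw text on purpose, because they may contain commas of their own.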

This FASTA-file is significantly larger (1.7 GB, containing ~6.6 million peptides), but it contains all peptides resulting from the selected features while allowing an unlimited number of miscleavages. The output shows that the smallest peptides (like K) are also considered. Additionally, one entry resulting from a SIGNAL-peptide-feature is showcased. Since this FASTA was generated without any limitation, we now limit the peptide length to 6-60 aminoacids and allow up to 2 miscleavages:

$ protgraph -nm -ft SIGNAL -ft PROPEP -ft PEPTIDE -d trypsin --pep_miscleavages 2 --pep_min_pep_length 6 --pep_hops 60 -epepfasta --pep_fasta_out e_coli_with_selected_features_limited.fasta examples/e_coli.dat 
9434proteins [00:14, 671.77proteins/s] 

$ head e_coli_with_selected_features_limited.fasta
>pg|ID_8|P0A840(1:24,mssclvg:2)
MRILLSNDDGVHAPGIQTLAKALR
>pg|ID_19|P0A840(1:21,mssclvg:1)
MRILLSNDDGVHAPGIQTLAK
>pg|ID_30|P0A840(22:38,mssclvg:2)
ALREFADVQVVAPDRNR
>pg|ID_41|P0A840(22:36,mssclvg:1)
ALREFADVQVVAPDR
>pg|ID_52|P0A840(130:151,mssclvg:2)
HYDTAAAVTCSILRALCKEPLR

$ cat e_coli_with_selected_features_limited.fasta| grep "^>" | wc -l
648551
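The effect of these limits can be reproduced on a plain sequence. The sketch below digests the first 38 aminoacids of P0A840 (taken from the FASTA output further above) with a simplified trypsin rule that cleaves after every K or R, and then applies the miscleavage and length limits. The function tryptic_peptides and the simplified rule are assumptions for illustration; ProtGraph performs the digestion on the graph and additionally considers the selected features.

```python
# First 38 aminoacids of P0A840, as shown in the FASTA output above.
SEQUENCE = "MRILLSNDDGVHAPGIQTLAKALREFADVQVVAPDRNR"


def tryptic_peptides(sequence, max_missed=2, min_length=6, max_length=60):
    """Yield (start, end, miscleavages, peptide) with 1-based inclusive positions."""
    # Simplified rule (assumption): cleave after every K or R.
    sites = [index + 1 for index, aminoacid in enumerate(sequence) if aminoacid in "KR"]
    if not sites or sites[-1] != len(sequence):
        sites.append(len(sequence))  # the sequence end always terminates a peptide
    starts = [0] + sites[:-1]
    for index, start in enumerate(starts):
        for missed in range(max_missed + 1):
            if index + missed >= len(sites):
                break
            end = sites[index + missed]
            peptide = sequence[start:end]
            if min_length <= len(peptide) <= max_length:
                yield start + 1, end, missed, peptide


peptides = list(tryptic_peptides(SEQUENCE))
for start, end, missed, peptide in peptides:
    print(f"({start}:{end},mssclvg:{missed}) {peptide}")
```

The positions and miscleavage counts produced for this fragment agree with the corresponding P0A840 entries in the head output above, e.g. (1:24,mssclvg:2) and (22:36,mssclvg:1).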

We can see that the number of peptides within this FASTA is significantly reduced (to ~650 000 entries). NOTE: to set an upper length-limit for peptides, we also need to set -nm. If this parameter is not set, longer peptides may be exported. Finally, since some entries of the FASTA-file still share the same sequence, we can merge these by using a different (and more sophisticated) FASTA-exporter:

$ protgraph -nm -ft SIGNAL -ft PROPEP -ft PEPTIDE -d trypsin --pep_miscleavages 2 --pep_min_pep_length 6 --pep_hops 60 -epepsqlite --pep_sqlite_database e_coli_database.db examples/e_coli.dat 
9434proteins [00:16, 587.41proteins/s] 

$ protgraph_pepsqlite_to_fasta -o e_coli_compact.fasta e_coli_database.db 
100%|███████████████| 404212/404212 [00:00<00:00, 511718.70entries/s]

$ head e_coli_compact.fasta
>pg|ID_0|P0A840(1:24,mssclvg:2),A0A4S5AWU9(1:24,mssclvg:2)
MRILLSNDDGVHAPGIQTLAKALR
>pg|ID_1|P0A840(1:21,mssclvg:1),A0A4S5AWU9(1:21,mssclvg:1)
MRILLSNDDGVHAPGIQTLAK
>pg|ID_2|P0A840(22:38,mssclvg:2),A0A4S5AWU9(22:38,mssclvg:2)
ALREFADVQVVAPDRNR
>pg|ID_3|P0A840(22:36,mssclvg:1),A0A4S5AWU9(22:36,mssclvg:1)
ALREFADVQVVAPDR
>pg|ID_4|P0A840(130:151,mssclvg:2),A0A4S5AWU9(130:151,mssclvg:2)
HYDTAAAVTCSILRALCKEPLR

$ cat e_coli_compact.fasta | grep "^>" | wc -l
404212

Instead of directly generating a FASTA-file, we first create a database, which summarizes identical sequences and their headers. As a post-processing step, the database-entries are exported into FASTA. The generated FASTA has unique sequences as entries and offers some additional insights. In the first entries, we see peptides shared by exactly 2 proteins. Looking at the difference between the compact and the non-compact FASTA, we see that ~244 000 entries could be merged into entries already included in the FASTA. This generated FASTA-file can be used for identification.

NOTE: Protein-Graphs can contain large amounts of peptides/proteins. Do a dry run with the flags -cnp, -cnpm or -cnph (or all of them) WITHOUT the export functionality first, and examine the statistics output to check whether it is feasible to generate a FASTA-file. Without a dry run, it may happen that a protein like P04637 (P53 Human) is exported with all possible peptides and variants, which will very likely take up all your disk space.
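The merging step can be pictured as grouping header descriptions by sequence. The following in-memory sketch uses three entries taken from the outputs above; ProtGraph performs this step through an SQLite database, so the dictionary-based helper compact is only an assumption-level illustration of the idea.

```python
from collections import defaultdict

# (path description, peptide sequence) pairs as they appear in the outputs above.
ENTRIES = [
    ("P0A840(1:24,mssclvg:2)", "MRILLSNDDGVHAPGIQTLAKALR"),
    ("A0A4S5AWU9(1:24,mssclvg:2)", "MRILLSNDDGVHAPGIQTLAKALR"),
    ("P0A840(22:36,mssclvg:1)", "ALREFADVQVVAPDR"),
]


def compact(entries):
    """Merge entries with identical sequences into one FASTA record each."""
    grouped = defaultdict(list)
    for description, sequence in entries:
        grouped[sequence].append(description)
    lines = []
    # Dictionaries keep insertion order, so IDs follow the first occurrence.
    for index, (sequence, descriptions) in enumerate(grouped.items()):
        lines.append(f">pg|ID_{index}|{','.join(descriptions)}")
        lines.append(sequence)
    return lines


fasta_lines = compact(ENTRIES)
print("\n".join(fasta_lines))
```

Holding all sequences in memory only works for small inputs, which is one plausible reason to use a database for this step.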


protgraph's Issues

New Export Format PEFF

If we export Fasta, then we should also somehow include PEFF.

New Statistic: Get all possible Weights from a Protein Graph

We could use Dynamic Programming by Rev. Top Sort and sets to propagate possible weights to the start node.

the final set for proteins should be small with some Test Scripts (around 1/60)
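The proposed approach could look roughly like the following sketch: the nodes are visited in reverse topological order, and each node stores the set of all weights that can be accumulated from it to the end. The toy graph, the integer node weights and the function possible_weights are hypothetical; graphlib requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# Hypothetical toy DAG (node -> successors) with made-up integer node weights.
EDGES = {
    "__start__": ["A", "B"],
    "A": ["C"],
    "B": ["C"],
    "C": ["__end__"],
    "__end__": [],
}
WEIGHTS = {"__start__": 0, "A": 71, "B": 57, "C": 87, "__end__": 0}


def possible_weights(edges, weights, source):
    """Return the set of total weights over all paths starting at source."""
    # TopologicalSorter expects node -> predecessors. Passing node -> successors
    # therefore yields a reverse topological order (successors come first).
    order = TopologicalSorter(edges).static_order()
    reachable = {}
    for node in order:
        successors = edges[node]
        if not successors:
            reachable[node] = {weights[node]}
        else:
            reachable[node] = {
                weights[node] + weight
                for successor in successors
                for weight in reachable[successor]
            }
    return reachable[source]


total_weights = possible_weights(EDGES, WEIGHTS, "__start__")
print(sorted(total_weights))
```

Using sets removes duplicate weights at every node, which keeps the propagated collections small when many paths share the same weight.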

Bioconda add License to recipe

The License-File is currently missing in the recipe for BioConda as well as for our PyPI package. We need to add this

Enhance Dyn Programming and Top Order and include other interesting statistics

Extend VC-Count in BPCSR Output to also consider not only VARIANTs but user-chosen Features

See title.

Specifically: we want to parameterize the following line:

https://github.com/mpc-bioinformatics/ProtGraph/blob/master/protgraph/export/pcsr.py#L82

Generating the number of Paths depending on the number of miscleavages

It would be interesting to see the distribution of the number of possible paths for the exact number of miscleavages (cumulative?)

Updating PTMs

Currently we only provide n-terminal (and c-terminal) peptide PTMs as well as PTMs on aminoacids. There are also other and different PTMs which ProtGraph should be able to map

No Isoform generation for Q9QXS1 possible

It is currently not possible to generate isoforms for the protein https://www.uniprot.org/uniprot/Q9QXS1.txt

This should be fixed. Instead of handcrafted parsing (using ,) we should use other mechanisms to check whether the variational sequences are present in FT and in CC.

FASTA-Exporter also exports spaces in the Headers. We should remove them

Code Reference:

https://github.com/mpc-bioinformatics/ProtGraph/blob/master/protgraph/export/peptides/pep_fasta.py#L67

Export: JanusGraph

We currently only have exports to files and databases (redis and Postgres).

Those are not well suited to traverse and process graphs (Postgres is actually very performant if the recursion depth is not large).

We need to use some dedicated graph processing algorithms/databases for such large graphs. Here JanusGraph is tested

Error in PepFasta, when retrieving substitution information

An error exists in variants, where we actually try to retrieve the substitution

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "../../Luxxii/ProtGraph/utilities/generate_fasta_from_peppg_pickle_export.py", line 121, in execute
    peptide, part_header = get_pep_and_header_def(row[0], row[1], base_folder)
  File "../../Luxxii/ProtGraph/utilities/generate_fasta_from_peppg_pickle_export.py", line 141, in get_pep_and_header_def
    l_str_qualifiers = PF._get_qualifiers(graph, edges)
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/export/peptides/pep_fasta.py", line 100, in _get_qualifiers
    + str(f.location.end) + "," + self._get_variant_qualifier(f) + "]"
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/export/peptides/pep_fasta.py", line 118, in _get_variant_qualifier
    message = message[:message.index("(")-1]
ValueError: substring not found
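The traceback shows that str.index raises as soon as the note contains no opening parenthesis. One possible way to avoid the exception is str.partition, which returns the whole text when the separator is missing. The helper name strip_parenthetical is hypothetical and this is not necessarily the fix that was applied in ProtGraph.

```python
def strip_parenthetical(message):
    """Return the text before the first "(", or the whole text if there is none."""
    head, _separator, _tail = message.partition("(")
    # rstrip() removes the space that usually precedes the parenthesis.
    return head.rstrip()


print(strip_parenthetical("O -> L (in Example)"))
print(strip_parenthetical("Missing"))
```

Unlike the original slice, this variant strips trailing whitespace instead of exactly one character, so the results differ if the parenthesis directly follows a letter.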

Project Name Suggestions

Suggestions:

Prograph (ProGraph)
Protgraph (ProtGraph)

Feel free to add other suggestions!

Potential Bottlenecks in ProtGraph

I ran a line profiler on the generate_graph_consumer method on the human_review dataset (20k proteins)

It gives me the following output:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   114                                           def generate_graph_consumer(entry_queue, graph_queue, common_out_queue, proc_id, **kwargs):
   115                                               """
   116                                               TODO
   117                                               describe kwargs and consumer until a graph is generated and digested etc ...
   118                                               """
   119                                               # Set proc id
   120         1          3.0      3.0      0.0      kwargs["proc_id"] = proc_id
   121                                           
   122                                               # Set feature_table dict boolean table
   123         1          1.0      1.0      0.0      ft_dict = dict()
   124         1          1.0      1.0      0.0      if kwargs["feature_table"] is None or len(kwargs["feature_table"]) == 0 or "ALL" in kwargs["feature_table"]:
   125         1          6.0      6.0      0.0          ft_dict = dict(VARIANT=True, VAR_SEQ=True, SIGNAL=True, INIT_MET=True, MUTAGEN=True, CONFLICT=True)
   126                                               else:
   127                                                   for i in kwargs["feature_table"]:
   128                                                       ft_dict[i] = True
   129                                           
   130                                               # Initialize the exporters for graphs
   131         1         58.0     58.0      0.0      graph_exporters = Exporters(**kwargs)
   132                                           
   133                                               while True:
   134                                                   # Get next entry
   135     20387    7568810.0    371.3      0.7          entry = entry_queue.get()
   136                                           
   137                                                   # Stop if entry is None
   138     20387      18156.0      0.9      0.0          if entry is None:
   139                                                       # --> Stop Condition of Process
   140         1          3.0      3.0      0.0              break
   141                                           
   142                                                   # Beginning of Graph-Generation
   143                                                   # We also collect interesting information here!
   144                                           
   145                                                   # Generate canonical graph (initialization of the graph)
   146     20386    4416910.0    216.7      0.4          graph = _generate_canonical_graph(entry.sequence, entry.accessions[0])
   147                                           
   148                                                   # FT parsing and appending of Nodes and Edges into the graph
   149                                                   # The amount of isoforms, etc.. can be retrieved on the fly
   150     20386      22188.0      1.1      0.0          num_isoforms, num_initm, num_signal, num_variant, num_mutagens, num_conficts =\
   151     20386  312577641.0  15333.0     29.4              _include_ft_information(entry, graph, ft_dict)
   152                                           
   153                                                   # Replace Amino Acids based on user defined rules: E.G.: "X -> A,B,C"
   154     20386      83272.0      4.1      0.0          replace_aa(graph, kwargs["replace_aa"])
   155                                           
   156                                                   # Digest graph with enzyme (unlimited miscleavages)
   157     20386  457306111.0  22432.4     43.0          num_of_cleavages = digest(graph, kwargs["digestion"])
   158                                           
   159                                                   # Merge (summarize) graph if wanted
   160     20386      29893.0      1.5      0.0          if not kwargs["no_merge"]:
   161     20386  268518281.0  13171.7     25.3              merge_aminoacids(graph)
   162                                           
   163                                                   # Collapse parallel edges in a graph
   164     20386      29694.0      1.5      0.0          if not kwargs["no_collapsing_edges"]:
   165     20386   10804029.0    530.0      1.0              collapse_parallel_edges(graph)
   166                                           
   167                                                   # Annotate weights for edges and nodes (maybe even the smallest weight possible to get to the end node)
   168     20386     948172.0     46.5      0.1          annotate_weights(graph, **kwargs)
   169                                           
   170                                                   # Calculate statistics on the graph:
   171     20386      11921.0      0.6      0.0          (
   172     20386      12094.0      0.6      0.0              num_nodes, num_edges, num_paths, num_paths_miscleavages, num_paths_hops,
   173     20386       9768.0      0.5      0.0              num_paths_var, num_path_mut, num_path_con
   174     20386     297176.0     14.6      0.0          ) = get_statistics(graph, **kwargs)
   175                                           
   176                                                   # Verify graphs if wanted:
   177     20386      11624.0      0.6      0.0          if kwargs["verify_graph"]:
   178                                                       verify_graph(graph)
   179                                           
   180                                                   # Persist or export graphs with speicified exporters
   181     20386      38415.0      1.9      0.0          graph_exporters.export_graph(graph, common_out_queue)
   182                                           
   183                                                   # Output statistics we gathered during processing
   184     20386      10500.0      0.5      0.0          if kwargs["no_description"]:
   185                                                       entry_protein_desc = None
   186                                                   else:
   187     20386      37338.0      1.8      0.0              entry_protein_desc = entry.description.split(";", 1)[0]
   188     20386      37142.0      1.8      0.0              entry_protein_desc = entry_protein_desc[entry_protein_desc.index("=") + 1:]
   189                                           
   190     40772     312422.0      7.7      0.0          graph_queue.put(
   191     20386      12337.0      0.6      0.0              (
   192     20386      11818.0      0.6      0.0                  entry.accessions[0],  # Protein Accesion
   193     20386      10432.0      0.5      0.0                  entry.entry_name,  # Protein displayed name
   194     20386       9196.0      0.5      0.0                  num_isoforms,  # Number of Isoforms
   195     20386       9231.0      0.5      0.0                  num_initm,  # Number of Init_M (either 0 or 1)
   196     20386       9244.0      0.5      0.0                  num_signal,  # Number of Signal Peptides used (either 0 or 1)
   197     20386       9232.0      0.5      0.0                  num_variant,  # Number of Variants applied to this protein
   198     20386       9227.0      0.5      0.0                  num_mutagens,  # Number of applied mutagens on the graph
   199     20386       9231.0      0.5      0.0                  num_conficts,  # Number of applied conflicts on the graph
   200     20386       9274.0      0.5      0.0                  num_of_cleavages,  # Number of cleavages (marked edges) this protein has
   201     20386       9240.0      0.5      0.0                  num_nodes,  # Number of nodes for the Protein/Peptide Graph
   202     20386       9269.0      0.5      0.0                  num_edges,  # Number of edges for the Protein/Peptide Graph
   203     20386       9311.0      0.5      0.0                  num_paths,  # Possible (non repeating paths) to the end of a graph. (may conatin repeating peptides)
   204     20386       9318.0      0.5      0.0                  num_paths_miscleavages,  # As num_paths, but binned to the number of miscleavages (by list idx, at 0)
   205     20386       9288.0      0.5      0.0                  num_paths_hops,  # As num_paths, only that we bin by hops (E.G. useful for determine DFS or BFS depths)
   206     20386       9363.0      0.5      0.0                  num_paths_var,  # Num paths of feature variant
   207     20386       9519.0      0.5      0.0                  num_path_mut,  # Num paths of feature mutagen
   208     20386       9508.0      0.5      0.0                  num_path_con,  # Num paths of feature conflict
   209     20386       9476.0      0.5      0.0                  entry_protein_desc,  # Description name of the Protein (can be lengthy)
   210                                                       )
   211                                                   )
   212                                           
   213                                               # Close exporters (maybe opened files, database connections, etc... )
   214         1         13.0     13.0      0.0      graph_exporters.close()

Bottlenecks are:

Merge Aminoacids (~25%)
Apply Features (~29%)
Digestion (~43%)

Add New Feature: MUTAGEN

Cannot parse Isoform Mutagens and Conflicts for specific Proteins

Some proteins with isoform mutagens and conflicts are currently skipped.

Example for Conflict: https://www.uniprot.org/uniprot/P52744.txt

Example for Mutagen: https://www.uniprot.org/uniprot/P35613.txt

Add new Feature: CONFLICT

Dependency apsw is not provided in PyPI

It looks like apsw is not officially in PyPI.

Last working version in ProtGraph: pip install apsw==3.8.11.1-r1

Error of reading SP-EMBL files via Windows

It seems that the new reader cannot read files created on Windows. This may be due to the \r\n line endings.
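One way to test the line-ending hypothesis is to normalize the file content before it reaches the reader. The following is only a sketch of such a workaround, not the reader ProtGraph uses; `normalize_newlines` is a hypothetical helper and the entry fragment is made up.

```python
def normalize_newlines(raw: bytes) -> bytes:
    """Convert Windows (CRLF) and bare CR line endings to LF."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Made-up fragment of an SP-EMBL entry saved with Windows line endings.
windows_entry = b"ID   TEST_HUMAN\r\nAC   X00000;\r\n//\r\n"
unix_entry = normalize_newlines(windows_entry)
```

If the reader works on the normalized bytes but fails on the original ones, the line endings are the likely cause.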

Pep Export depending on mass

Instead of using the number of AAs, we could (more precisely) use a range of allowed masses.

This could be a new CLI Parameter in ProtGraph
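A minimal sketch of what a mass-based filter could look like, assuming monoisotopic masses and the 20 standard amino acids only. The function names and the `min_mass`/`max_mass` parameters are hypothetical and do not correspond to existing ProtGraph options.

```python
# Monoisotopic residue masses in Dalton for the 20 standard amino acids.
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per peptide (free N- and C-terminus)


def peptide_mass(peptide: str) -> float:
    """Sum the residue masses and add one water molecule."""
    return sum(MONO_MASS[aa] for aa in peptide) + WATER


def in_mass_range(peptide: str, min_mass: float, max_mass: float) -> bool:
    """Hypothetical filter: keep peptides whose mass lies in the range."""
    return min_mass <= peptide_mass(peptide) <= max_mass


peptides = ["G", "PEPTIDE", "WWWWWWWWWW"]
kept = [p for p in peptides if in_mass_range(p, 500.0, 1500.0)]
```

Compared with a length cutoff, this keeps short but heavy peptides and drops long but light ones according to their actual mass.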

Split the AminoAcids (B, Z maybe even X or others) into the concrete AminoAcids

We could add a small script that adds or changes features/sequences to expand letters that stand for more than one amino acid.

E.g.: B -> D or N, J -> I or L (and X -> A, C, ...)
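The expansion itself can be sketched on plain sequence strings as below. This is an illustration under assumptions, not the proposed script: it works on strings rather than on features or graphs, `expand_ambiguous` is a hypothetical name, and X is left out because it multiplies the number of sequences 20-fold per position.

```python
from itertools import product

# Ambiguity codes and the concrete amino acids they stand for.
AMBIGUOUS = {
    "B": "DN",  # aspartic acid or asparagine
    "Z": "EQ",  # glutamic acid or glutamine
    "J": "IL",  # isoleucine or leucine
}


def expand_ambiguous(sequence: str) -> list:
    """Return every concrete sequence encoded by the ambiguous input."""
    choices = [AMBIGUOUS.get(aa, aa) for aa in sequence]
    return ["".join(combo) for combo in product(*choices)]


expanded = expand_ambiguous("ABJ")
```

The number of results doubles with every ambiguous letter, which is one reason to express the alternatives as graph features rather than as separate sequences.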

Include Feature CHAIN

The CHAIN feature-information can also be "cleaved" as stated in the documentation: https://www.uniprot.org/help/chain

ProtGraph should therefore also set those points as specific cleavage points (similar to PEPTIDE and PROPEP).

It was first noticed in https://www.uniprot.org/uniprotkb/P05067/entry (for Amyloid-Beta 40/42)

Optimization of Merge Aminoacids

The current implementation needs a lot of time to process huge graphs (e.g. for the protein Titin).

We should at some point look into the implementation of merging nodes/aminoacids here, and further optimize it.

Bug? Edges can go directly from start to end

There are some proteins where this happens.

Example is needed!

Protein P20729 seems to be generated wrong

Generating a graph from protein P20729 yields an edge that connects the start and the end node directly.

This is something that should not happen.

Reimplementing (Refactoring) of Signal Peptides

The Protein which causes a Problem: P20729 (can produce "null"-Peptides/Proteins)

This should be easily fixable if using the new implementation of PEPTIDE or PROPEP

INIT_MET has errors on Proteins: A0A4X1VEZ3 and F1SN05

Here are the actual error Messages:

Accession: F1SN05, Aminoacid(s): None, Position: 1
Additional Context: No M found to skip for the feature INIT_MET for the given cases
Message: type: INIT_MET
location: [0:1]
qualifiers:
    Key: evidence, Value: ECO:0000256|HAMAP-Rule:MF_03009
    Key: note, Value: Removed

Accession: A0A4X1VEZ3, Aminoacid(s): None, Position: 1
Additional Context: No M found to skip for the feature INIT_MET for the given cases
Message: type: INIT_MET
location: [0:1]
qualifiers:
    Key: evidence, Value: ECO:0000256|HAMAP-Rule:MF_03009
    Key: note, Value: Removed

Only apply variants which are significant (or unknown or other)

The Feature Viewer in UniProt for a single protein offers options to select between "Likely Disease", "Predicted Consequences", etc.

This information is parsed from UniProt via the note= information.

If we want to apply only significant (or otherwise interesting) variants to a protein, we should implement such a filter as well.

As a general rule: everything whose note contains "in XXX" is a variant that likely causes the disease XXX.

Everything that contains Unknown or something similar is then categorized separately.

Maybe we should ask the UniProt team how they bin those variants (do they use specific keywords?).
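The keyword heuristic described above could be prototyped as follows. This is a rough sketch: the note layout, the bin names and the handling of dbSNP-only notes are assumptions, the example notes are made up, and UniProt's actual binning rules are not known here.

```python
import re

# Matches the first item inside the parentheses of a note such as
# "V -> I (in XXX; dbSNP:rs0000000)". This note layout is an assumption.
IN_PATTERN = re.compile(r"\(in ([^;)]+)")


def classify_variant(note: str) -> str:
    """Bin a VARIANT note into 'disease', 'unknown' or 'other'."""
    if "unknown" in note.lower():
        return "unknown"
    match = IN_PATTERN.search(note)
    # A note that only names a dbSNP identifier is not counted as a disease.
    if match and not match.group(1).startswith("dbSNP:"):
        return "disease"
    return "other"


# Made-up notes for illustration only.
notes = [
    "V -> I (in XXX; dbSNP:rs0000000)",
    "E -> K (in dbSNP:rs0000000)",
    "A -> T (variant of unknown significance)",
]
bins = [classify_variant(n) for n in notes]
```

A filter could then apply only the variants whose bin was requested by the user.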

FASTA Export & Full Cleavage

PEPTIDE and PROPEP (CHAIN, maybe even SIGNAL-peptides) may have an unique identifier

Currently these features (except for CHAINs) do not provide these IDs in the FASTA headers. We should expand this!

It may be enough to change the following lines:

(SIGNAL)
https://github.com/mpc-bioinformatics/ProtGraph/blob/master/protgraph/export/peptides/pep_fasta.py#L103

(PEPTIDE)
https://github.com/mpc-bioinformatics/ProtGraph/blob/master/protgraph/export/peptides/pep_fasta.py#L116

(PROPEP)
https://github.com/mpc-bioinformatics/ProtGraph/blob/master/protgraph/export/peptides/pep_fasta.py#L112

Conda channel priority problems

At least in my (clean) miniconda environment, I needed to run
conda install -c bioconda --no-channel-priority protgraph
to get protgraph installed, with Python 3.9.16 as the only additional package installed beforehand.
Maybe add a hint about this to the documentation.

Protgraph PR 0.30 consumes too much RAM

The newer, unreleased version of ProtGraph consumes roughly 5-7 GB of RAM per process. This can be an issue on servers with many cores and little available RAM.

However, there is a workaround: simply reduce the number of threads via -np.

Split Protgraph into possibly two Projects?

Currently we have one large project, which reads SwissProt-EMBL files and generates graphs from them.

The generated graphs cannot be retrieved directly from Python. The only way is to save them via Pickle and load them separately.

There are many consumer threads and one producer thread, which perform very basic operations. These operations could be separated into another project, so that ProtGraph imports it as a library.

FT: INIT_MET currently ignores ref (isoforms) and duplicates edges from the start node to the second canonical node

This is not correct and should be fixed. INIT_MET references the amino acid M explicitly, either from an isoform (via the reference information) or from the canonical sequence (empty reference).

Protein Q9R1E6 generates parallel edges, how do we want to handle parallel edges?

FT   MUTAGEN         12..27
FT                   /note="Missing: Complete inhibition of secretion."
FT                   /evidence="ECO:0000269|PubMed:17208043"
FT   MUTAGEN         12..22
FT                   /note="Missing: Complete inhibition of secretion."
FT                   /evidence="ECO:0000269|PubMed:17208043"
FT   MUTAGEN         23..27
FT                   /note="Missing: No effect on secretion."
FT                   /evidence="ECO:0000269|PubMed:17208043"

These entries were picked from this protein because they cause ProtGraph to generate parallel edges from 12 to 27 through three features.

What should we do about parallel edges in general in ProtGraph?
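One possible answer is to collapse parallel edges and keep the information of every feature that produced them. The sketch below shows the idea on a plain edge list. The tuple layout, the feature labels and `merge_parallel_edges` are hypothetical and do not reflect how ProtGraph stores edges.

```python
from collections import defaultdict


def merge_parallel_edges(edges):
    """Collapse edges with the same (source, target) and collect their labels."""
    merged = defaultdict(list)
    for source, target, label in edges:
        merged[(source, target)].append(label)
    return dict(merged)


# Two parallel edges between the same nodes and one unrelated edge.
edges = [
    (12, 27, "feature A"),
    (12, 27, "feature B"),
    (27, 30, "feature C"),
]
merged = merge_parallel_edges(edges)
```

Merging keeps path counts free of duplicates, while keeping the edges separate preserves which feature led to which path. Which of the two is preferable depends on how the statistics and exports should treat such cases.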

ProtGraph might not generate files while called through Python

The calls (like in the functional tests) might not work properly.

Here is an example:

        args = protgraph.parse_args([] + self.procs_num + self.example_files)
        protgraph.prot_graph(**args)

Update Readme with new available Options and changing CLI

This should also include

MUTAGEN
CONFLICT
Replacement Syntax
FT and Digestion new CLI options

Utilities should import Protgraph if methods are used from them

We should import the methods used by ProtGraph from the utilities folder. This reduces redundant code.

Error occurs when digesting via full

An uncaught error occurs when digesting with the digestion method full.

Here is the stacktrace:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/graph_generator.py", line 119, in generate_graph_consumer
    num_of_cleavages = digest(graph, kwargs["digestion"])
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/digestion.py", line 11, in digest
    return dict(
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/digestion.py", line 114, in _digest_via_full
    end_out.remove(i)
ValueError: list.remove(x): x not in list

This happens with the recent version of ProtGraph, 0.1.0. It seems that a case is not considered here.
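The exception itself is the standard behaviour of `list.remove` when the value is missing. The sketch below shows a guarded removal as a stopgap. It is not the actual code of `_digest_via_full`; `remove_if_present` and the index values are made up for illustration.

```python
def remove_if_present(items: list, value) -> bool:
    """Remove value from items if it is there; report whether it was removed."""
    try:
        items.remove(value)
    except ValueError:
        # list.remove raises ValueError when the value is missing,
        # which is the exception shown in the traceback.
        return False
    return True


end_out = [3, 5, 8]  # made-up edge indices for illustration
removed_existing = remove_if_present(end_out, 5)
removed_missing = remove_if_present(end_out, 42)
```

Guarding the call only hides the symptom. The case in which the index is missing from the list would still need to be understood.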

PyPi Integration and Documentation

PyPi Integration (automatically via action)
Documentation cleanup
Adding a cheat sheet for the arguments (since we have a lot of them!) to README.md
Available via CLI (protgraph

Outsource Parsing into Processes

ProtGraph currently has a reading process, which also parses each entry via Biopython. We could move this parsing (as well as the blacklist) into the graph-generating processes.

This could speed up graph generation even further, since with a high number of processes we can currently observe that the reading thread is not fast enough for the consumers.
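A sketch of the cheap reader side of this idea: the reader only splits the file into raw entry strings and leaves parsing to the consumers. It relies on the flat-file convention that a line starting with `//` terminates an entry. `iter_raw_entries` and the example lines are hypothetical.

```python
def iter_raw_entries(lines):
    """Yield each entry as one raw string without parsing its content."""
    buffer = []
    for line in lines:
        buffer.append(line)
        # A line starting with "//" terminates an entry in the flat file.
        if line.startswith("//"):
            yield "".join(buffer)
            buffer = []


# Made-up lines standing in for an open *.dat or *.txt file.
example = [
    "ID   FIRST_TEST\n", "AC   X00001;\n", "//\n",
    "ID   SECOND_TEST\n", "AC   X00002;\n", "//\n",
]
raw_entries = list(iter_raw_entries(example))
```

Each raw string could then be put on the queue and parsed inside a consumer, for example with Biopython's `SwissProt.read` on an `io.StringIO` wrapper, so that the parsing cost is spread over all processes.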

Generated FASTAs contain OR[|None] entries, which contain no information

This is probably due to cleavages directly at the ending or beginning of a protein. Removing such entries could slightly reduce the exported FASTA.
