sorgerlab / indra

INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system interfacing with NLP systems and databases to collect knowledge, and through a process of assembly, produce causal graphs and dynamical models.

Home Page: http://indra.bio

License: BSD 2-Clause "Simplified" License

Python 98.78% JavaScript 0.37% HTML 0.80% CSS 0.01% Dockerfile 0.05%

systems-biology modeling biology computational-biology bioinformatics pysb nlp indra sbml

indra's Issues

Combine RasGef and RasGap queries with ActivityActivity

ActivityActivity statements are of the same form as the RasGef and RasGap queries. These queries should be collapsed into one, and the BelPy processor will get an instance of the appropriate BelPy statement based on the Activity of the object (if the object is a RasGTPase, then an increase means that the statement is about a RasGEF; if a decrease, then about a RasGAP).
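A minimal sketch of the dispatch described above, assuming a single query result has already been reduced to a subject, an object and a direction. The classes and the `RAS_GTPASES` set are stand-ins invented for this example, not the BelPy classes themselves.

```python
from dataclasses import dataclass


@dataclass
class ActivityActivity:
    subj: str
    obj: str
    relation: str  # 'increases' or 'decreases'


@dataclass
class RasGef:
    gef: str
    ras: str


@dataclass
class RasGap:
    gap: str
    ras: str


# Hypothetical set of names treated as Ras GTPases in this example
RAS_GTPASES = {'KRAS', 'HRAS', 'NRAS'}


def make_statement(subj, obj, relation):
    """Pick the statement type from the object and the direction."""
    if obj in RAS_GTPASES:
        if relation == 'increases':
            return RasGef(subj, obj)
        if relation == 'decreases':
            return RasGap(subj, obj)
    # Everything else stays a generic activity-activity statement
    return ActivityActivity(subj, obj, relation)
```

With this shape, one query feeds all three statement types and the branching lives in a single place.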

Protein names containing hyphens cause an error in the BioPax API

For example, the protein name RAF1-BRAF is propagated through the PysbAssembler and into the PySB model, where it triggers an InvalidComponentNameError, e.g.

from indra.biopax import biopax_api

bp = biopax_api.process_pc_neighborhood(['BRAF'])
bp.get_phosphorylation()
# pa is a PysbAssembler instance created beforehand
pa.add_statements(bp.statements)
pa.make_model()
...
InvalidComponentNameError: Not a valid component name: 'RAF1-BRAF'
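One possible fix is to sanitize names before they reach the assembler. The sketch below assumes that replacing non-word characters with underscores is acceptable; `canonicalize_name` is a hypothetical helper name, and where such a step should live (processor or assembler) is left open.

```python
import re


def canonicalize_name(name):
    """Turn an arbitrary entity name into a valid Python identifier."""
    # Replace every character that is not a letter, digit or underscore
    name = re.sub(r'[^\w]', '_', name)
    # Identifiers cannot start with a digit, so add a letter prefix
    if re.match(r'[0-9]', name):
        name = 'p' + name
    return name
```

Under this scheme 'RAF1-BRAF' becomes 'RAF1_BRAF', which is a legal component name.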

Separate test code into "tests" and "diagnostics"

To use a test suite to measure our progress in implementing various features of the parser, assemblers, etc., it would be nice to have these tests separate from the tests of the correctness of basic functionality, which we want to run on Travis.

Add get_ubiquitination to Biopax processor

MEK dephosphorylates ERK in Biopax

Query:

from indra.biopax import biopax_api as ba
bp = ba.process_pc_neighborhood(['MAP2K1', 'MAP2K2', 'MAPK1', 'MAPK3'])
bp.get_dephosphorylation()

This yields, among others, the statements:

Dephosphorylation(MAP2K1(), MAPK3(), PhosphorylationTyrosine, 203),
 Dephosphorylation(MAP2K1(), MAPK3(), PhosphorylationThreonine, 201),
 Dephosphorylation(MAP2K2(), MAPK3(), PhosphorylationTyrosine, 203),
 Dephosphorylation(MAP2K2(), MAPK3(), PhosphorylationThreonine, 201),
 Dephosphorylation(MAP2K1(), MAPK1(), PhosphorylationTyrosine, 184),
 Dephosphorylation(MAP2K1(), MAPK1(), PhosphorylationThreonine, 182),
 Dephosphorylation(MAP2K2(), MAPK1(), PhosphorylationTyrosine, 184),
 Dephosphorylation(MAP2K2(), MAPK1(), PhosphorylationThreonine, 182),
 Dephosphorylation(MAP2K1(), MAPK3(), PhosphorylationTyrosine, 203),
 Dephosphorylation(MAP2K1(), MAPK3(), PhosphorylationThreonine, 201),
 Dephosphorylation(MAP2K2(), MAPK3(), PhosphorylationTyrosine, 203),
 Dephosphorylation(MAP2K2(), MAPK3(), PhosphorylationThreonine, 201),
 Dephosphorylation(MAP2K1(), MAPK1(), PhosphorylationTyrosine, 184),
 Dephosphorylation(MAP2K1(), MAPK1(), PhosphorylationThreonine, 182),
 Dephosphorylation(MAP2K2(), MAPK1(), PhosphorylationTyrosine, 184),
 Dephosphorylation(MAP2K2(), MAPK1(), PhosphorylationThreonine, 182),

These appear to be artifactual. Strangely, they don't seem to turn up in the corresponding pathsbetween query among the same genes.

Resource files should not be version controlled

We should have a scripts folder at the top level folder, which contains scripts to create some of the resource files that are now under version control. This includes the different ID mappings: HGNC, UniProt, ChEBI, PubChem, and also the hierarchy RDFs.
After installing or cloning INDRA, the user could just run a script like

python scripts/get_resources.py

which would download/create the required resource files.

ActivityModifications describe a protein already modified at site X as being activated by modification at site X

If you run the following script:

from indra.biopax import biopax_api as bpa
from indra.preassembler import Preassembler, render_stmt_graph
from indra.preassembler.hierarchy_manager import \
        modification_hierarchy as mh, entity_hierarchy as eh

bpp = bpa.process_pc_pathsfromto(['MAP2K1'], ['MAPK1'])
bpp.get_phosphorylation()
bpp.get_activity_modification()
pa1 = Preassembler(eh, mh, bpp.statements)
pa1.combine_duplicates()

You get a number of statements including the following:

[ActivityModification(MAPK1(mods: (phosphorylation, Y, 184), (active), (phosphorylation, T, 182)), [(phosphorylation, T, 182), (phosphorylation, Y, 184)], increases, Activity),
 ActivityModification(MAPK1(mods: (phosphorylation, Y, 184), (active), (phosphorylation, T, 182)), [(phosphorylation, Y, 184), (phosphorylation, T, 182)], increases, Activity),
 ActivityModification(MAPK1(mods: (active), (phosphorylation, Y, 185), (phosphorylation, T, 183)), [(phosphorylation, Y, 185), (phosphorylation, T, 183)], increases, Activity),
 ActivityModification(MAPK1(mods: (phosphorylation, Y, 187), (active), (phosphorylation, T, 185)), [(phosphorylation, T, 185), (phosphorylation, Y, 187)], increases, Activity),
 ActivityModification(MAPK1(mods: (phosphorylation, Y, 187), (active), (phosphorylation, T, 185)), [(phosphorylation, Y, 187), (phosphorylation, T, 185)], increases, Activity),
 ActivityModification(MAPK3(mods: (phosphorylation, Y, 203), (phosphorylation, T, 201), (active)), [(phosphorylation, T, 201), (phosphorylation, Y, 203)], increases, Activity),
 ActivityModification(MAPK3(mods: (phosphorylation, Y, 203), (phosphorylation, T, 201), (active)), [(phosphorylation, Y, 203), (phosphorylation, T, 201)], increases, Activity),
 ActivityModification(MAPK3(mods: (phosphorylation, T, 202), (active), (phosphorylation, Y, 204)), [(phosphorylation, T, 202), (phosphorylation, Y, 204)], increases, Activity),

These statements in many cases include the "active" flag on the agent, and in addition have the activating modifications as operating on the already modified protein. Should we change this to indicate that activating modification is relative to the unmodified protein?

PMIDs associated with evidence should be bare PMIDs, not bio2rdf URLs

This appears to be a problem only in the BEL API, not Biopax.
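A small sketch of the normalization, assuming the URLs follow the `http://bio2rdf.org/pubmed:<digits>` form seen in BEL evidence. `bare_pmid` is a hypothetical helper, not an existing INDRA function.

```python
import re


def bare_pmid(citation):
    """Return the PMID digits from a bio2rdf-style URL or a bare PMID.

    Returns None when the input does not look like either form.
    """
    match = re.fullmatch(r'(?:.*pubmed[:/])?(\d+)', citation.strip())
    return match.group(1) if match else None
```

Returning None for unrecognized input keeps free-text citations from being silently mistaken for PMIDs.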

Refactor Dephosphorylations to be a subclass of Modification

Dephosphorylation adds no new functionality on top of Modification; the only difference is that it has a field 'phos' instead of 'enz'. Changing this will require all code that touches Dephosphorylation to use the field name 'enz' instead of 'phos'.
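The refactoring could look roughly like the sketch below. It is a simplified stand-in rather than the actual INDRA class definitions: agents are plain strings and evidence handling is omitted.

```python
class Modification:
    """Generic modification of a substrate by an enzyme."""

    def __init__(self, enz, sub, mod, mod_pos):
        self.enz = enz
        self.sub = sub
        self.mod = mod
        self.mod_pos = mod_pos

    def __repr__(self):
        return '%s(%s, %s, %s, %s)' % (type(self).__name__, self.enz,
                                       self.sub, self.mod, self.mod_pos)


class Phosphorylation(Modification):
    pass


class Dephosphorylation(Modification):
    # No separate 'phos' field: the phosphatase is stored in 'enz'
    pass
```

Code that handles modifications generically can then read `stmt.enz` without checking the statement type first.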

Agent modifications should be objects

We should implement a class called ModificationCondition similar to what we have for BoundCondition to represent Agent modifications. We will then be able to conveniently encode modification types, sites and negation flags.
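As a rough sketch of what such a class might hold, assuming the three pieces of information named above (type, site, negation flag). The field names are illustrative guesses, not a settled design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModificationCondition:
    mod_type: str                   # e.g. 'phosphorylation'
    residue: Optional[str] = None   # e.g. 'Y'; None if unknown
    position: Optional[int] = None  # e.g. 1068; None if unknown
    is_modified: bool = True        # False encodes the negated condition

    def site_name(self):
        """Build a site label such as 'Y1068' from whatever is known."""
        if self.residue is None and self.position is None:
            return self.mod_type
        return '%s%s' % (self.residue or '', self.position or '')
```

Making the object frozen keeps it hashable, which is convenient if conditions are later compared or placed in sets.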

Data structure for specifying policies for assembly

As a simple use case for alternative assembly methods, consider the case where the purpose of assembly is only to produce a contact map. Here the goal is not to produce kinetics, but rather to provide a kind of graphical interactome of the relationships between the members. Here the rules produced by the assembly methods would be binding rules rather than, e.g., one-step conversion rules. Note however, that each statement might still need to specify its own assembly approach, since the binding relationships inferred by a Phosphorylation statement would likely limit all binding to a kinase active site, for example.

As a result, each assemble and monomers method would need to gain information on which approach to use. Since statement instantiation should be independent of assembly, this information should not be passed into each constructor at Statement instantiation time, but rather passed in as an argument to each assembly method. Thus perhaps a data structure at the level of the BelProcessor instance is most appropriate.

The data structure would likely be some kind of dict, with an entry for each statement type and a value indicating the type of assembly method to use. An additional possibility is for the simple specification of global settings, e.g., "contact map", "one-step conversion", etc. that would apply uniform policies for all statement types.

Perhaps this could start out as a dict and then, if necessary, graduate to a dedicated class.
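The dict-based version might be sketched as follows. The policy names ('one_step', 'contact_map'), the 'default' key and the two assembly functions are placeholders made up for this example.

```python
# Hypothetical assembly methods; real ones would emit PySB rules
def assemble_one_step(stmt):
    return 'one-step conversion rule for %s' % stmt


def assemble_contact_map(stmt):
    return 'binding rule for %s' % stmt


ASSEMBLY_METHODS = {
    'one_step': assemble_one_step,
    'contact_map': assemble_contact_map,
}


def assemble(stmt_type, stmt, policies):
    """Assemble one statement according to a policy specification."""
    if isinstance(policies, str):
        # A plain string acts as a global setting for all statement types
        policy = policies
    else:
        # Otherwise look up the statement type, then fall back to a default
        policy = policies.get(stmt_type, policies.get('default', 'one_step'))
    return ASSEMBLY_METHODS[policy](stmt)
```

Because the policies are passed in at assembly time, the statements themselves carry no assembly information.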

Hierarchy for activity types

Might be worthwhile having a hierarchy of activity types so that statements involving activities at various levels of resolution could also be subject to refinements.

For example, the statement 'X increases the kinase activity of Y' should be a refinement of 'X increases the activity of Y'.

Not sure how important this will be in practice--it's not clear how often the distinctions will be made in database entries or text.
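A toy version of such a hierarchy, assuming a simple child-to-parent mapping. The activity names here are examples only and do not reflect an agreed vocabulary.

```python
# Hypothetical hierarchy: each activity type points to its parent
ACTIVITY_PARENTS = {
    'kinase': 'activity',
    'phosphatase': 'activity',
    'transcription': 'activity',
}


def is_refinement(specific, general):
    """True if 'specific' equals 'general' or descends from it."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = ACTIVITY_PARENTS.get(current)
    return False
```

With this check, 'X increases the kinase activity of Y' refines 'X increases the activity of Y', but not the other way around.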

Implement monomers and assemble method for ActivatingSubstitution statements

Currently these are not implemented at all.

Biopax processor should capture multi-member complexes

Currently the Biopax processor queries for statements involving complexes and then returns the results as a series of two-member Complexes. This should instead return Complex statements with the full list of members, indicating that the members can be part of the same connected component.

Modified species in complex

The Complex statement constructor takes a list of members (member names). We don't have a way of specifying that a species is in a modified form.

Separating Bind from Complex

It would make sense to have a separate statement called Bind which describes direct binding through a site or domain. The current Complex([A, B, C]) statement means "A, B and C can be in a complex" or "There exists an A-B-C complex". From BioPax and BEL that is exactly how we extract complexes - we don't extract them from processes but rather from complex entities. Because of these semantics, Complex statements can naturally have multiple members even though molecules typically don't bind all at once. So historically, this makes sense.

But we should also be able to express the process that "The SH2 domain of GRB2 binds EGFR on a phosphorylated tyrosine residue". The main question is: how do we represent the binding sites? We could have 2 binding site arguments in the statement which could be strings.
This would work fine for "SH2" or "RBD", but it is not clear how best to do it for "phosphorylated tyrosine residue". I'm thinking of course that we ought to be able to then correctly PySB assemble this into EGFR(Y='p').

Need to aggregate/infer information about site modifications during assembly

Consider the sentences:

"XXX results in phosphorylation of EGFR and tyrosine 1068 and tyrosine 1172."
"GRB2 binds EGFR phosphorylated at residue 1068."

In parsing the second sentence about GRB2, a site name is ultimately required for the modification that is a precondition for the binding event. However, that sentence does not explicitly specify that the residue is a tyrosine. Thus, while the first sentence will result in a Phosphorylation statement that specifies a site named "Y1068" on EGFR, the second sentence will not carry enough information to identify the name of the appropriate site.

This points to the need to have an aggregation/inference step at assembly time for these issues to be resolved. If we have information anywhere about what the residue at 1068 is, then we use it in all references to that site.

Similarly, we may see the statement "GRB2 binds phosphorylated EGFR." Again, at assembly time the binding statement should key this back to the appropriate site modifications. This one is more advanced because the assembly process would need to identify the issue that it is unresolved whether phospho 1068 or 1172 is sufficient, or both must be phosphorylated.
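The simpler of the two cases (filling in a missing residue) could be sketched as a two-pass aggregation. Sites are represented here as plain (protein, residue, position) tuples, which is an assumption made for illustration only.

```python
def infer_residues(sites):
    """Fill in missing residues from other references to the same site.

    sites: list of (protein, residue_or_None, position) tuples.
    """
    # First pass: record every residue that is stated explicitly
    known = {}
    for protein, residue, position in sites:
        if residue is not None:
            known[(protein, position)] = residue
    # Second pass: reuse that knowledge wherever the residue is missing
    return [(protein,
             residue if residue is not None else known.get((protein, position)),
             position)
            for protein, residue, position in sites]
```

A site that is never given a residue anywhere stays unresolved (None), so nothing is guessed.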

Biopax processor modification positions are incorrect

The query

bp = biopax_api.process_pc_pathsbetween(['MAP2K1', 'MAPK1'], neighbor_limit=1)
bp.get_phosphorylation()

returns

Phosphorylation(YWHAE, MAPK1, PhosphorylationThreonine, 182),
Phosphorylation(YWHAE, MAPK1, PhosphorylationTyrosine, 184),

which are TxY positions on mouse MAPK1 (http://www.uniprot.org/uniprot/P63085) whereas the TxY motif is at 185-187 on human MAPK1 (http://www.uniprot.org/uniprot/P28482).

Some modification positions are invalid (or are printed incorrectly):

Phosphorylation(MEK, ERK, Phosphorylation, -2147483648)
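The value -2147483648 equals the minimum 32-bit signed integer, which suggests it is an "unknown" sentinel coming from the Java side; that reading is an assumption. Under it, a guard like the following hypothetical helper would keep the sentinel out of statements.

```python
# Minimum 32-bit signed integer, assumed here to mean 'position unknown'
INT32_MIN = -2**31


def clean_position(pos):
    """Return a usable 1-based sequence position, or None if invalid."""
    if pos is None or pos == INT32_MIN or pos < 1:
        return None
    return pos
```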

Biopax API/pyjnius won't run on Travis, causing tests to time out and fail

Typical output from Travis:

$ nosetests --with-doctest
EFSLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

The build has been terminated

Changing ActivityModification to ActivityCondition

It could make sense to generalize ActivityModification to a statement called ActivityCondition. ActivityConditions could have an arbitrary mixture of ModCondition, BoundCondition and MutCondition conditions associated with them. We could then describe, for instance, that "KRAS bound to GTP is active". Or that "BRAF-V600E not bound to Vemurafenib is active". Does this make sense? If yes, it's an easy addition.

Check for duplicate BelPy statements

So far the corpora we've looked at don't have identically overlapping rules, though these are certainly a possibility that should be checked for. In particular these should be checked for after case normalization, to make sure that the same statement doesn't go in twice with the different capitalizations of the gene names (see issue #2).

This would be the first step in a more general process of comparing the content of statements for possible inconsistencies (e.g., "phosphoserine is activating" in conflict with "phosphoserine is inactivating").
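A minimal sketch of the check, assuming statements can be reduced to tuples of strings and numbers. Upper-casing every string part is one possible normalization and would need care if any field is legitimately case-sensitive.

```python
def dedupe_case_insensitive(stmts):
    """Drop statements that differ only in capitalization.

    stmts: list of tuples such as ('Phosphorylation', 'Mek1', 'ERK2').
    """
    seen = set()
    unique = []
    for stmt in stmts:
        # Normalize case on string fields only; keep numbers as they are
        key = tuple(part.upper() if isinstance(part, str) else part
                    for part in stmt)
        if key not in seen:
            seen.add(key)
            unique.append(stmt)
    return unique
```

The first spelling encountered is the one that is kept.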

Need to unpack BEL protein family statements into the relevant family members

In parsing statements in the BEL large corpus, we frequently find statements involving protein families, e.g. "AKT_FAMILY". Currently these enter the model as monomers named by the family. Instead, we should locate the mapping between the family and its specific members (and their HGNC names) and replace the statements involving families with all of the relevant family members.
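The expansion step might look like the sketch below. The `FAMILY_MEMBERS` table is a hand-written example; in practice the mapping would have to come from a curated resource.

```python
import itertools

# Illustrative mapping only; a real table would be loaded from a resource
FAMILY_MEMBERS = {
    'AKT_FAMILY': ['AKT1', 'AKT2', 'AKT3'],
}


def expand_families(stmt):
    """Replace family names in (enz, sub, ...) with each family member."""
    enz, sub, *rest = stmt
    enzs = FAMILY_MEMBERS.get(enz, [enz])
    subs = FAMILY_MEMBERS.get(sub, [sub])
    # One statement per enzyme/substrate combination
    return [(e, s, *rest) for e, s in itertools.product(enzs, subs)]
```

Names without an entry in the table pass through unchanged, so unknown families still appear in the model under their family name.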

test_preassembler.test_combine_duplicates sometimes fails

When running the tests repeatedly, the line

assert(pa.unique_stmts[0] == p5) # MEK phos ERK

sometimes fails and other times passes.

Policies in make_model instead of PysbAssembler constructor?

At some point policies became an argument in the constructor of the PysbAssembler. Wouldn't it make sense to put that back as an argument of make_model? The assembler collects statements and we would want to be able to generate multiple models from the same statements using the assembler by applying different policies.

Preassembler changes statements passed as argument

Assume we have a processor called rp with len(rp.statements) > 0.

pa = Preassembler(eh, mh)
pa.add_statements(rp.statements)

At this point rp.statements is unchanged.

duplicate_stmts = pa.combine_duplicates()

At this point rp.statements has changed! This obviously shouldn't happen.
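One way to rule this out is to copy statements on the way in. The class below is a toy stand-in, not the real Preassembler: statements are dicts with a 'key' and an 'evidence' list, purely for illustration.

```python
import copy


class CopyingPreassembler:
    """Toy preassembler that never touches the caller's statements."""

    def __init__(self):
        self.stmts = []

    def add_statements(self, stmts):
        # Deep-copy so later merging cannot alter the caller's objects
        self.stmts += copy.deepcopy(stmts)

    def combine_duplicates(self):
        unique = {}
        for stmt in self.stmts:
            if stmt['key'] in unique:
                # Merge evidence into the first copy of this statement
                unique[stmt['key']]['evidence'] += stmt['evidence']
            else:
                unique[stmt['key']] = stmt
        return list(unique.values())


original = [{'key': 'MEK phos ERK', 'evidence': ['a']},
            {'key': 'MEK phos ERK', 'evidence': ['b']}]
pa = CopyingPreassembler()
pa.add_statements(original)
merged = pa.combine_duplicates()
```

Deep copies cost memory on large statement sets, so copying only at the point of merging may be a better trade-off.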

Error when an HGNC ID lookup fails

In processing the results from the query generated by the following script:

from indra.biopax import biopax_api as ba
bp = ba.process_pc_neighborhood(['TP63'])
bp.get_phosphorylation()

The following error occurs:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Sending Pathway Commons query...
Pathway Commons query returned model...
indra/biopax/processor.py:251: UserWarning: Cannot handle complex enzymes.
  warnings.warn('Cannot handle complex enzymes.')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/p63_test.py in <module>()
      2 
      3 bp = ba.process_pc_neighborhood(['TP63'])
----> 4 bp.get_phosphorylation()

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/biopax/processor.pyc in get_phosphorylation(self, force_contains)
     53     def get_phosphorylation(self, force_contains=None):
     54         stmts = self._get_generic_modification('phospho', 
---> 55                                                force_contains=force_contains)
     56         for s in stmts:
     57             self.statements.append(Phosphorylation(*s))

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/biopax/processor.pyc in _get_generic_modification(self, mod_filter, mod_gain, force_contains)
    274             enzs = BiopaxProcessor._get_agents_from_entity(controller_pe)
    275             subs = BiopaxProcessor._get_agents_from_entity(input_spe,
--> 276                                                            expand_pe=False)
    277             for enz, sub in itertools.product(listify(enzs), listify(subs)):
    278                 # If neither the required enzyme nor the substrate is

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/biopax/processor.pyc in _get_agents_from_entity(bpe, expand_pe, expand_er)
    413         # If it is a single entity, we get its name and database
    414         # references
--> 415         name = BiopaxProcessor._get_element_name(bpe)
    416         db_refs = BiopaxProcessor._get_db_refs(bpe)
    417         agent = Agent(name, db_refs=db_refs, mods=mods, mod_sites=mod_sites)

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/biopax/processor.pyc in _get_element_name(bpe)
    503 
    504         # Canonicalize name
--> 505         name = re.sub(r'[^\w]', '_', name)
    506         if re.match('[0-9]', name) is not None:
    507             name = 'p' + name

/Users/johnbachman/.virtualenvs/p27env/lib/python2.7/re.pyc in sub(pattern, repl, string, count, flags)
    153     a callable, it's passed the match object and must return
    154     a replacement string to be used."""
--> 155     return _compile(pattern, flags).sub(repl, string, count)
    156 
    157 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or buffer

The immediate source of the error is that the string in the call to sub is None; however, the reason for this is that in the _get_element_name static method of the Biopax processor, the HGNC lookup fails. The HGNC ID reported by the call to _get_hgnc_id is 13473, but the subsequent call to _get_hgnc_name(hgnc_id) using that ID returns None, which is assigned to name and is then passed to the regex.

The display name for the element causing the problem is "p63". A lookup at genenames.org indicates that the actual HGNC ID for TP63 is 15979.

Perhaps the right thing to do here is to simply check if the HGNC ID lookup fails, and if so, use the getDisplayName string instead?

We might want to keep a record of cases where the HGNC IDs have turned out to be erroneous so we can fix them.
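The fallback and the record-keeping could be combined roughly as follows. This is a standalone sketch: `hgnc_names` stands in for the HGNC lookup table and `failed_lookups` for whatever log is kept, and neither is part of the current processor.

```python
import re


def get_element_name(hgnc_id, display_name, hgnc_names, failed_lookups):
    """Resolve a name via HGNC, falling back to the display name."""
    name = hgnc_names.get(hgnc_id) if hgnc_id is not None else None
    if name is None:
        # Remember the failure so bad IDs can be reviewed later
        failed_lookups.append((hgnc_id, display_name))
        name = display_name
    # Canonicalize the name in the same way the traceback above shows
    name = re.sub(r'[^\w]', '_', name)
    if re.match(r'[0-9]', name):
        name = 'p' + name
    return name
```

For the failing case above, the ID 13473 is not found, so the display name "p63" is used and the pair is recorded for follow-up.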

Phosphorylation sites for different protein isoforms

I get the following phosphorylation statements from Pathway Commons:

 Phosphorylation(EGFR(), SHC1(), PhosphorylationTyrosine, 194),
 Phosphorylation(EGFR(), SHC1(), PhosphorylationTyrosine, 195),
 Phosphorylation(EGFR(), SHC1(), PhosphorylationTyrosine, 272),
 Phosphorylation(EGFR(), SHC3(), PhosphorylationTyrosine, 218),
 Phosphorylation(EGFR(), SHC3(), PhosphorylationTyrosine, 219),
 Phosphorylation(EGFR(), SHC3(), PhosphorylationTyrosine, 256),
 Phosphorylation(EGFR(), SHC3(), PhosphorylationTyrosine, 257),
 Phosphorylation(EGFR(), SHC3(), PhosphorylationTyrosine, 283),
 Phosphorylation(EGFR(), SHC3(), PhosphorylationTyrosine, 301)]

These all raise errors against the Uniprot reference sequence. However, for SHC3, there is another isoform, the "p52" isoform, with a different upstream sequence:

http://www.uniprot.org/uniprot/Q92529#sequences

In phosphosite it's clear that all of the tyrosines mentioned are correct for this isoform:

http://www.phosphosite.org/proteinAction.action?id=4936&showAllSites=true (click "show isoforms" in the chart below).

The same is true for SHC1, but this time it's isoform p46:

http://www.uniprot.org/uniprot/P29353#sequences

I also found the same thing for RPS6KB1, where the T389 listed in Pathway Commons is correct, but for the "Isoform Alpha 2" which is missing the first 23 amino acids.

Is it possible that Pathway Commons has information on the isoforms involved in the statements and we are not extracting it?

If not, we could write check_statements to not only determine if the sequence is not a match, but also to see if it is a match for a different isoform of the protein.
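The isoform check could be sketched as below, assuming the isoform sequences are already available as strings. `matching_isoforms` is a hypothetical helper, and the sequences in the usage note are toy strings, not real protein sequences.

```python
def matching_isoforms(isoform_seqs, residue, position):
    """Return the names of isoforms that have 'residue' at 'position'.

    isoform_seqs: dict mapping isoform name to amino acid sequence.
    position: 1-based position, as used in the statements.
    """
    return [name for name, seq in isoform_seqs.items()
            if 1 <= position <= len(seq) and seq[position - 1] == residue]
```

For example, with `{'canonical': 'MASTY', 'short': 'STY'}`, a tyrosine at position 3 matches only the 'short' isoform, which is the situation described for SHC1 and SHC3.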

Complex statements appear to be extracted from BEL/Biopax without evidence objects

An observation from the Ras 220 model. None of the Complex statements have evidence associated with them.

Unicode/ASCII conversion error in BEL API


In [1]: from indra.bel import bel_api as ba
INFO:rdflib:RDFLib Version: 4.2.1

In [2]: bp = ba.process_ndex_neighborhood(['BRAF', 'MAP2K1'])
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): services.bigmech.ndexbio.org
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
http://www.openbel.org/vocabulary/Phosphorylation
Getting all direct statements...

Checking for 'degenerate' statements...

Protein -> Protein/Activity statements:
---------------------------------------

Total indirect statements: 166
Total direct statements: 40
Converted statements: 36
Degenerate statements: 0
>> Total unhandled statements: 4

--- Unhandled statements ---------
complex_p_HGNC_BRAF_p_HGNC_PRKCE_p_HGNC_RPS6KB2_DirectlyIncreases_kin_p_HGNC_PRKCE
p_HGNC_BRAF_sub_V_600_E_DirectlyIncreases_kin_p_HGNC_BRAF
a_CHEBI_sorafenib_tosylate_DirectlyDecreases_kin_p_HGNC_BRAF
complex_p_HGNC_BRAF_p_HGNC_RAF1_DirectlyIncreases_kin_p_HGNC_RAF1

--- Converted BelPy Statements -------------
0: Complex(['EIF4G1', 'EIF4E'], [])
1: Complex(['BRAF', 'RAF1'], [])
2: Complex(['MAP2K1', 'PTPN7'], [])
3: Complex(['RPS6KB2', 'PRKCE', 'BRAF'], [])
4: Complex(['MAP2K1', 'GIT1'], [])
5: Complex(['MAP2K1', 'RAF1'], [])
6: Phosphorylation(PRKA_FAMILY, BRAF, Phosphorylation, 429, [Evidence(bel, http://bio2rdf.org/pubmed:15212693, [], 11510412;10869359)])
7: Phosphorylation(AKT_FAMILY, BRAF, Phosphorylation, 429, [Evidence(bel, http://bio2rdf.org/pubmed:15212693, [], 11510412;10869359)])
8: Phosphorylation(MAP2K1, MAPK3, Phosphorylation, 202, [Evidence(bel, http://bio2rdf.org/pubmed:15212693, [], 8626767;93330262;15466476;11971971)])
9: Phosphorylation(RAF1, MAP2K1, Phosphorylation, 218, [Evidence(bel, http://bio2rdf.org/pubmed:12937126, [], MEK1/2 is activated through phosphorylation at Ser 217/Ser 221 catalyzed by Raf.)])
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-2-a69375dadcfa> in <module>()
----> 1 bp = ba.process_ndex_neighborhood(['BRAF', 'MAP2K1'])

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/bel/bel_api.py in process_ndex_neighborhood(gene_names, rdf_out)
     13     with open(rdf_out, 'wt') as fh:
     14         fh.write(rdf.encode('utf-8'))
---> 15     bp = process_belrdf(rdf)
     16     bp.print_statements()
     17     return bp

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/bel/bel_api.py in process_belrdf(rdf_str)
     35     bp.print_statement_coverage()
     36     print "\n--- Converted BelPy Statements -------------"
---> 37     bp.print_statements()
     38     return bp
     39 

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/bel/processor.py in print_statements(self)
    662     def print_statements(self):
    663         for i, stmt in enumerate(self.statements):
--> 664             print "%s: %s" % (i, stmt)

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/statements.py in __str__(self)
    215         s = ("%s(%s, %s, %s, %s, %s)" %
    216                   (type(self).__name__, self.enz.name, self.sub.name, self.mod,
--> 217                    self.mod_pos, self.evidence))
    218         return s
    219 

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 127: ordinal not in range(128)
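The character u'\u2013' is an en dash, presumably from the evidence text. The sketch below shows one defensive option, escaping non-ASCII characters before printing. It is written for Python 3, whereas the traceback above comes from Python 2, and `ascii_safe` is a hypothetical helper.

```python
def ascii_safe(text):
    """Return text with non-ASCII characters replaced by escapes."""
    return text.encode('ascii', 'backslashreplace').decode('ascii')
```

Escaping keeps the output printable on ASCII-only terminals while still showing which character was present.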

ActivityModification picks up a site already modified

Here's a statement I came across from Biopax:

ActivityModification(MAP2K1(mods: ['PhosphorylationSerine'], mod_sites: [298]),
['PhosphorylationSerine', 'PhosphorylationSerine', 'PhosphorylationSerine'], ['218', '298', '222'],
increases, Activity)

MAP2K1 is listed as starting out as being phosphorylated at serine 298, but the activity modification includes that site again as being activating. Presumably this results from an error in excluding serine 298 from the modification list when examining the BioPax reaction.
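The exclusion step presumed above could be as simple as the sketch below, with sites written as (residue, position) tuples for illustration. `activating_sites` is a hypothetical helper, not the processor's current logic.

```python
def activating_sites(agent_sites, reaction_sites):
    """Keep only sites that the agent does not already carry."""
    already = set(agent_sites)
    return [site for site in reaction_sites if site not in already]
```

Applied to the example, starting from S298 and a reaction listing S218, S298 and S222, only S218 and S222 remain as activating modifications.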

Activity flags remaining in monomers

Activity flags are still generated into monomer signatures, even if they don't appear in any rules.
E.g., in all rules where MAP2K1 is active, it is represented as
MAP2K1(S218='p',S222='p'),
but the monomer has the signature
Monomer(u'MAP2K1', ['S218', 'S222', 'Kinase', 'Active'], {'Active': ['inactive', 'active'], 'S218': ['u', 'p'], 'Kinase': ['inactive', 'active'], 'S222': ['u', 'p']})

BEL processor should handle statements with complexes as enzymes

Currently the BEL processor ignores statements about protein modifications where the left-hand side (enzyme term) is the abundance of a complex (rather than a protein abundance or the activity of a protein abundance). This should be appropriately parsed into a BEL term with the agent having the bound conditions indicating the members of the complex.

However, one challenge of doing this is that the INDRA statement representation currently requires one of the proteins to be indicated as the enzyme, whereas BEL statements involving complexes do not indicate which protein is the enzyme responsible for the modification.

TRIPS tests failing on Travis

======================================================================
FAIL: test_trips.test_phosphorylation
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/johnbachman/indra/indra/tests/test_trips.py", line 12, in test_phosphorylation
    assert(len(tp.statements) == 1)
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
All events by type
------------------
------------------
--------------------- >> end captured stdout << ----------------------

======================================================================
FAIL: test_trips.test_phosphorylation_noresidue
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/johnbachman/indra/indra/tests/test_trips.py", line 20, in test_phosphorylation_noresidue
    assert(len(tp.statements) == 1)
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
All events by type
------------------
------------------
--------------------- >> end captured stdout << ----------------------

======================================================================
FAIL: test_trips.test_phosphorylation_nosite
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/johnbachman/indra/indra/tests/test_trips.py", line 28, in test_phosphorylation_nosite
    assert(len(tp.statements) == 1)
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
All events by type
------------------
------------------
--------------------- >> end captured stdout << ----------------------

======================================================================
FAIL: test_trips.test_actmod
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/johnbachman/indra/indra/tests/test_trips.py", line 36, in test_actmod
    assert(len(tp.statements) == 1)
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
All events by type
------------------
------------------
--------------------- >> end captured stdout << ----------------------

======================================================================
FAIL: test_trips.test_actmods
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/johnbachman/indra/indra/tests/test_trips.py", line 57, in test_actmods
    assert(len(tp.statements) == 1)
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
All events by type
------------------
------------------
--------------------- >> end captured stdout << ----------------------

----------------------------------------------------------------------
Ran 252 tests in 43.860s

FAILED (failures=5)

Parsing Biopax reactions into Statements introduces errors in identity of enzyme/substrate

Example::

In [16]: bp = biopax_api.process_pc_pathsfromto(['KSR1'], ['BRAF'])

In [17]: bp.get_phosphorylation()
/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/biopax/processor.py:164: UserWarning: Unknown modification type residue modification, inactive
warnings.warn('Unknown modification type %s' % mf_type)

In [18]: bp.statements
Out[18]:
[Phosphorylation(KSR1, BRAF, PhosphorylationSerine, 151, [Evidence(biopax, [], [], None)]),
Phosphorylation(KSR1, BRAF, PhosphorylationSerine, 753, [Evidence(biopax, [], [], None)]),
Phosphorylation(KSR1, BRAF, PhosphorylationSerine, 578, [Evidence(biopax, [], [], None)]),
Phosphorylation(KSR1, BRAF, PhosphorylationSerine, 750, [Evidence(biopax, [], [], None)]),
Phosphorylation(KSR1, BRAF, PhosphorylationThreonine, 401, [Evidence(biopax, [], [], None)])]

This makes it appear that KSR1 is the kinase involved in the phosphorylation. Is this because KSR1 appears as part of a complex on the left hand side but is not itself the kinase? Can we disambiguate which one is the kinase?

Protein families to add to the entity_hierarchy

From Pathway Commons:

RAS_family
MKK4_MKK7
RAF1_BRAF
MEK1_2
MEK1_2_active
Raf
RAC1_CDC42
MAPK
p14_3_3_family

Handling duplicate statements in the various processors, or in the PySB assembler

Queries of BEL or BioPax can (and do) return multiple equivalent statements, as in the example below. These duplicates should be handled in some way, either by incrementing a count on a single statement or by collecting the papers, references, contexts, or other information that differs between the instances.

bp = biopax_api.process_pc_neighborhood(['BRAF'])
bp.get_phosphorylation()
...
Phosphorylation(NRAS, RAF1, PhosphorylationThreonine, 268)
Phosphorylation(NRAS, RAF1, PhosphorylationThreonine, 268)
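One way the count-or-collect idea could look is sketched below. This is only an illustration, not code from INDRA: `combine_duplicates` is a hypothetical helper, statements are keyed by their `repr` (an assumption; a real implementation would need a proper equivalence key), and the `evidence` attribute is assumed to be a list when present.

```python
from collections import OrderedDict


def combine_duplicates(statements, key_func=repr):
    """Group equivalent statements, keeping one representative per group.

    key_func decides what counts as "equivalent"; repr is a stand-in here.
    """
    grouped = OrderedDict()
    for stmt in statements:
        key = key_func(stmt)
        if key not in grouped:
            grouped[key] = {'statement': stmt, 'count': 0, 'evidence': []}
        grouped[key]['count'] += 1
        # Keep per-instance information that may differ between duplicates
        # (assumes an optional list-valued `evidence` attribute).
        grouped[key]['evidence'].extend(getattr(stmt, 'evidence', []))
    return list(grouped.values())


# Plain strings stand in for statement objects in this example.
stmts = ['Phosphorylation(NRAS, RAF1, PhosphorylationThreonine, 268)',
         'Phosphorylation(NRAS, RAF1, PhosphorylationThreonine, 268)',
         'Phosphorylation(HRAS, RAF1, PhosphorylationSerine, 338)']
unique = combine_duplicates(stmts)
```

Whether this belongs in each processor or in a single step before the PySB assembler is the open design question; doing it once downstream would avoid repeating the logic per processor.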

Wrapper around literature clients to find full text

We can currently get full text for any PMC OAI paper using the PMC client, but only abstracts from the PubMed client. The Elsevier client gives us full text by DOI for papers published by Elsevier. It would be good to write a top-level wrapper that maps between IDs (PMCID, PMID, DOI), determines whether the full text is accessible through any of these resources, and retrieves it from whichever one has it.
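A rough sketch of the wrapper's control flow is below. Everything in it is hypothetical: `get_full_text`, the `id_mapper` callable, and the `(name, id_type, fetch)` source tuples are placeholders for whatever the real clients expose, and the IDs in the example are made up.

```python
def get_full_text(paper_ids, id_mapper, sources):
    """Return (source_name, text) from the first source that has the paper.

    paper_ids: dict with any of the keys 'pmcid', 'pmid', 'doi'
    id_mapper: callable returning a dict of additional IDs (hypothetical)
    sources:   list of (name, id_type, fetch) tuples, tried in order;
               fetch(paper_id) returns text or None
    """
    ids = dict(paper_ids)
    # Fill in missing IDs without overwriting the ones we were given.
    for id_type, value in id_mapper(paper_ids).items():
        if value and id_type not in ids:
            ids[id_type] = value
    for name, id_type, fetch in sources:
        paper_id = ids.get(id_type)
        if paper_id is None:
            continue
        text = fetch(paper_id)
        if text:
            return name, text
    return None, None


# Stub mapper and clients standing in for the real ones.
def fake_mapper(ids):
    return {'pmcid': 'PMC0000000', 'doi': None}


fake_sources = [
    ('pmc', 'pmcid', lambda pmcid: 'full text for ' + pmcid),
    ('pubmed', 'pmid', lambda pmid: 'abstract for ' + pmid),
]
result = get_full_text({'pmid': '0000000'}, fake_mapper, fake_sources)
```

Ordering the sources from full text to abstract means the abstract is only used as a fallback.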

Preassembler issues

Some of the statements in the groups below should have been preassembled out.

Phosphorylation(MAP2K1(mods: (phosphorylation), (phosphorylation)), MAPK1()),
Phosphorylation(MAP2K1(mods: (phosphorylation, S, 218), (phosphorylation, S, 298), (phosphorylation, S, 222)), MAPK1(), T, 185),
Phosphorylation(MAP2K1(mods: (phosphorylation, S, 218), (phosphorylation, S, 298), (phosphorylation, S, 222)), MAPK1(), Y, 187),

Phosphorylation(RAF1(), MAP2K1(mods: (phosphorylation, S, 298)), S, 218),
Phosphorylation(RAF1(), MAP2K1(mods: (phosphorylation, S, 298)), S, 222),
Phosphorylation(RAF1(mods: (phosphorylation), (phosphorylation), (phosphorylation)), MAP2K1()),
Phosphorylation(RAF1(mods: (phosphorylation), (phosphorylation)), MAP2K1()),
Phosphorylation(RAF1(mods: (phosphorylation, S, 296), (phosphorylation, T, 269), (phosphorylation, S, 471), (phosphorylation, S, 289), (phosphorylation, S, 338), (phosphorylation, S, 301)), MAP2K1(), S, 218),
Phosphorylation(RAF1(mods: (phosphorylation, S, 296), (phosphorylation, T, 269), (phosphorylation, S, 471), (phosphorylation, S, 289), (phosphorylation, S, 338), (phosphorylation, S, 301)), MAP2K1(), S, 222),
Phosphorylation(RAF1(mods: (phosphorylation, T, 269), (phosphorylation, S, 621), (phosphorylation, S, 471), (phosphorylation, S, 338)), MAP2K1(), S, 218),
Phosphorylation(RAF1(mods: (phosphorylation, T, 269), (phosphorylation, S, 621), (phosphorylation, S, 471), (phosphorylation, S, 338)), MAP2K1(), S, 222),
Phosphorylation(RAF1(mods: (phosphorylation, S, 621), (phosphorylation, T, 268)), MAP2K1(), S, 218),
Phosphorylation(RAF1(mods: (phosphorylation, S, 621), (phosphorylation, T, 268)), MAP2K1(), S, 222),
Phosphorylation(RAF1(mods: (phosphorylation, S, 621), (phosphorylation, T, 268)), MAP2K1(), T, 286),
Phosphorylation(RAF1(mods: (phosphorylation, S, 621), (phosphorylation, T, 268)), MAP2K1(), T, 292),
Phosphorylation(RAF1(mods: (phosphorylation, S, 621), (phosphorylation, T, 268)), MAP2K1(), T, 386),
Phosphorylation(RAF1(mods: (phosphorylation, T, 269), (phosphorylation, T, 268)), MAP2K1(), S, 218),
Phosphorylation(RAF1(mods: (phosphorylation, T, 269), (phosphorylation, T, 268)), MAP2K1(), S, 222),
Phosphorylation(RAF1(mods: (phosphorylation, S, 338)), MAPK1(), T, 185),
Phosphorylation(RAF1(mods: (phosphorylation, S, 338)), MAPK1(), Y, 187),

Puzzling or problematic statements pulled out of Pathway Commons by biopax_api

I'm going to use this issue to keep an ongoing record of suspicious statements pulled out of Pathway Commons. Many (most, all?) of these may be either errors or deliberate curation shortcuts put in the database itself, but I'll list these as an issue in case they are appearing because we are doing something incorrectly in the api.

GTP as a kinase:

Phosphorylation(GTP, MAPK10, PhosphorylationThreonine, 221),
 Phosphorylation(GTP, MAPK10, PhosphorylationTyrosine, 223),
 Phosphorylation(GTP, MAPK8, PhosphorylationThreonine, 183),
 Phosphorylation(GTP, MAPK8, PhosphorylationTyrosine, 185),
 Phosphorylation(GTP, MAPK9, PhosphorylationThreonine, 183),
 Phosphorylation(GTP, MAPK9, PhosphorylationTyrosine, 185),

H, K, and NRAS as kinases (more understandable):

Phosphorylation(HRAS, BRAF, PhosphorylationSerine, 578),
 Phosphorylation(HRAS, BRAF, PhosphorylationThreonine, 373),
 Phosphorylation(HRAS, RAF1, PhosphorylationSerine, 338),
 Phosphorylation(HRAS, RAF1, PhosphorylationSerine, 621),
 Phosphorylation(HRAS, RAF1, PhosphorylationThreonine, 268),
 Phosphorylation(HRAS, RAF1, PhosphorylationThreonine, 269),
Phosphorylation(KRAS, BRAF, PhosphorylationSerine, 578),
 Phosphorylation(KRAS, BRAF, PhosphorylationThreonine, 373),
 Phosphorylation(KRAS, RAF1, PhosphorylationSerine, 621),
 Phosphorylation(KRAS, RAF1, PhosphorylationThreonine, 268),
 Phosphorylation(KRAS, RAF1, PhosphorylationThreonine, 269),
Phosphorylation(NRAS, BRAF, PhosphorylationSerine, 578),
 Phosphorylation(NRAS, BRAF, PhosphorylationThreonine, 373),
 Phosphorylation(NRAS, RAF1, PhosphorylationSerine, 621),
 Phosphorylation(NRAS, RAF1, PhosphorylationThreonine, 268),
 Phosphorylation(NRAS, RAF1, PhosphorylationThreonine, 269),

IL2 (ligand) as a kinase (perhaps this is meant as indirect?):

Phosphorylation(IL2, MAPK1, PhosphorylationThreonine, 182),
 Phosphorylation(IL2, MAPK1, PhosphorylationTyrosine, 184),
 Phosphorylation(IL2, MAPK3, PhosphorylationThreonine, 201),
 Phosphorylation(IL2, MAPK3, PhosphorylationTyrosine, 203),

KSR1 as a kinase:

Phosphorylation(KSR1, MAPK1, PhosphorylationThreonine, 182),
 Phosphorylation(KSR1, MAPK1, PhosphorylationTyrosine, 184),
 Phosphorylation(KSR1, MAPK3, PhosphorylationThreonine, 201),
 Phosphorylation(KSR1, MAPK3, PhosphorylationTyrosine, 203),

14-3-3 proteins as kinases. Could it be these are involved in some kind of reaction describing negative regulation that's not getting parsed correctly?

Phosphorylation(YWHAB, MAPK1, PhosphorylationThreonine, 182),
 Phosphorylation(YWHAB, MAPK1, PhosphorylationTyrosine, 184),
 Phosphorylation(YWHAB, MAPK3, PhosphorylationThreonine, 201),
 Phosphorylation(YWHAB, MAPK3, PhosphorylationTyrosine, 203),
 Phosphorylation(YWHAE, MAPK1, PhosphorylationThreonine, 182),
 Phosphorylation(YWHAE, MAPK1, PhosphorylationTyrosine, 184),
...
Phosphorylation(p14_3_3_family, MAPK1, PhosphorylationThreonine, 182),
 Phosphorylation(p14_3_3_family, MAPK1, PhosphorylationTyrosine, 184),
 Phosphorylation(p14_3_3_family, MAPK3, PhosphorylationThreonine, 201),
 Phosphorylation(p14_3_3_family, MAPK3, PhosphorylationTyrosine, 203),

Inconsistency in sequence numbering for MAPK1. Note that these two statements are inconsistent not only with each other, but also with the phosphorylation statements shown above for MAPK1, which list residues 182 and 184:

ActivityModification(MAPK1, ['PhosphorylationThreonine', 'PhosphorylationTyrosine'], [183, 185], increases, Activity),
 ActivityModification(MAPK1, ['PhosphorylationThreonine', 'PhosphorylationTyrosine'], [185, 187], increases, Activity),

RAS mediates dephosphorylation of RAF1?

Dephosphorylation(HRAS, RAF1, PhosphorylationSerine, 621),
 Dephosphorylation(KRAS, RAF1, PhosphorylationSerine, 621),
 Dephosphorylation(NRAS, RAF1, PhosphorylationSerine, 621),
 Dephosphorylation(RAS_family, RAF1, PhosphorylationSerine, 621),

Amino acids available in central data structure

Currently, amino acids and their various abbreviations (e.g., serine, ser, S) are used as unrestricted strings. It would be better to have a predefined dictionary in statements.py containing all amino acid names and their abbreviations.

In processors: if an amino-acid name in a new Statement that a processor constructs is invalid, a warning or error could be thrown from statements.py.

In assemblers: All assemblers should refer to amino acid abbreviations in statements.py and not use their own copy of the same lookup table.
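A minimal sketch of what such a central table might look like follows. The names `AMINO_ACIDS` and `normalize_residue` are made up for illustration, and raising `ValueError` on an unknown name is just one of the two options (warning or error) mentioned above.

```python
# One-letter code -> (full name, three-letter abbreviation)
AMINO_ACIDS = {
    'A': ('alanine', 'Ala'), 'R': ('arginine', 'Arg'),
    'N': ('asparagine', 'Asn'), 'D': ('aspartic acid', 'Asp'),
    'C': ('cysteine', 'Cys'), 'E': ('glutamic acid', 'Glu'),
    'Q': ('glutamine', 'Gln'), 'G': ('glycine', 'Gly'),
    'H': ('histidine', 'His'), 'I': ('isoleucine', 'Ile'),
    'L': ('leucine', 'Leu'), 'K': ('lysine', 'Lys'),
    'M': ('methionine', 'Met'), 'F': ('phenylalanine', 'Phe'),
    'P': ('proline', 'Pro'), 'S': ('serine', 'Ser'),
    'T': ('threonine', 'Thr'), 'W': ('tryptophan', 'Trp'),
    'Y': ('tyrosine', 'Tyr'), 'V': ('valine', 'Val'),
}

# Case-insensitive lookup from any alias to the one-letter code.
_ALIASES = {}
for one_letter, (full_name, three_letter) in AMINO_ACIDS.items():
    for alias in (one_letter, three_letter, full_name):
        _ALIASES[alias.lower()] = one_letter


def normalize_residue(name):
    """Return the one-letter code for any known amino acid name."""
    try:
        return _ALIASES[name.strip().lower()]
    except KeyError:
        raise ValueError('Invalid amino acid name: %r' % name)
```

Processors would call `normalize_residue` when constructing a Statement, and assemblers would read `AMINO_ACIDS` instead of keeping their own copies.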

In CX assembler, links between agents in Complex statements should be undirected if possible

Currently the links between agents are directed, presumably by the order of the agents in the Complex statement.
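As a sketch of the order-independent alternative, each unordered pair of members could be emitted exactly once. The helper name `complex_edges` is hypothetical, and how an edge is actually marked as undirected in CX is not covered here.

```python
from itertools import combinations


def complex_edges(member_names):
    """Return one (a, b) pair per unordered pair of complex members.

    Members are sorted first so the result does not depend on the order
    in which agents appear in the Complex statement.
    """
    return list(combinations(sorted(set(member_names)), 2))


edges_ab = complex_edges(['CDK4', 'CCND1'])
edges_ba = complex_edges(['CCND1', 'CDK4'])
```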

Fix BEL Complex queries that are not picking up all members

It appears that the query for complexes is not picking up all protein members, leading to Complex statements that contain only one member. Anecdotally, the problem seems to occur when one member of the complex is a ModifiedProteinAbundance rather than a ProteinAbundance.

Mutations represented as Agent condition

I started implementing MutConditions - a named tuple to represent amino acid substitutions on INDRA Agents. I suggest the following format:
MutCondition = namedtuple('MutCondition', ['pos', 'residue_from', 'residue_to'])

There are several important questions that we need to consider. For instance:

Is it enough to represent mutations that involve a single amino acid change? Or is a more general representation worth the effort?
Should we allow more than one MutCondition on an Agent?
What is the assumption if mutation is not given? Should we assume that it refers to WT or should it be interpreted as applying to both WT and all mutated forms?
How to handle refinements (in preassembler)? Is a mutated form a refinement of an agent with unspecified mutation status?
If a statement refers explicitly to WT, how do we represent it?
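For concreteness, here is how the proposed tuple might be used for a single substitution, with BRAF V600E as the example. The list-of-conditions container and the `format_mutation` helper are assumptions for illustration only; they do not settle any of the questions above.

```python
from collections import namedtuple

MutCondition = namedtuple('MutCondition',
                          ['pos', 'residue_from', 'residue_to'])


def format_mutation(mut):
    """Render a MutCondition in the usual short form, e.g. V600E."""
    return '%s%s%s' % (mut.residue_from, mut.pos, mut.residue_to)


v600e = MutCondition(pos=600, residue_from='V', residue_to='E')
# Holding the conditions in a list would allow more than one per Agent
# (an assumption; whether to allow this is one of the open questions).
braf_mutations = [v600e]
labels = [format_mutation(m) for m in braf_mutations]
```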

BEL API fails on term "Phosphatidylinositol_3,4,5_trisphosphate"

When attempting to process the rdf from data/large_corpus.rdf:

In [5]: bp = bel_api.process_belrdf(rdf)
---------------------------------------------------------------------------
InvalidNameError                          Traceback (most recent call last)
<ipython-input-5-d77cea19fd57> in <module>()
----> 1 bp = bel_api.process_belrdf(rdf)

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/bel/bel_api.py in process_belrdf(rdf_str)
     27     # Build BelPy statements from RDF
     28     bp = BelProcessor(g)
---> 29     bp.get_complexes()
     30     bp.get_activating_subs()
     31     bp.get_modifications()

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/bel/processor.py in get_complexes(self)
    350         for stmt in res_cmplx:
    351             cmplx_name = term_from_uri(stmt[0])
--> 352             child_name = gene_name_from_uri(stmt[1])
    353             child = Agent(child_name)
    354             cmplx_dict[cmplx_name].append(child)

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/bel/processor.py in gene_name_from_uri(uri)
     53 
     54 def gene_name_from_uri(uri):
---> 55     return name_from_uri(uri).upper()
     56 
     57 def term_from_uri(uri):

/Users/johnbachman/Dropbox/1johndata/Knowledge File/Biology/Research/Big Mechanism/indra/indra/bel/processor.py in name_from_uri(uri)
     48         pass
     49     else:
---> 50         raise InvalidNameError(name)
     51 
     52     return name

InvalidNameError: Not a valid name: Phosphatidylinositol_3,4,5_trisphosphate
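One possible workaround is to sanitize names before they are validated, rather than raising. The sketch below is not the existing `name_from_uri` logic; `sanitize_name` is a hypothetical helper, and the rule it applies (letters, digits, and underscores only, not starting with a digit) is an assumption about what counts as a valid name.

```python
import re


def sanitize_name(name):
    """Replace characters that are not letters, digits, or underscores.

    A 'p' prefix is added when the result would not start with a letter
    or underscore (for example, 14-3-3 becomes p14_3_3).
    """
    cleaned = re.sub(r'[^A-Za-z0-9_]', '_', name)
    if not re.match(r'[A-Za-z_]', cleaned):
        cleaned = 'p' + cleaned
    return cleaned


pip3 = sanitize_name('Phosphatidylinositol_3,4,5_trisphosphate')
```

The same approach would also cover hyphenated names such as RAF1-BRAF. The trade-off is that distinct original names can collapse onto the same sanitized name, so the original string should probably be kept alongside it.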

Update biopax processor to use correct object model for evidence information

Currently the processor fills in a few fields of a single Evidence object, whereas in reality a single biopax reaction might have a large number of references. The list of evidence information returned from biopax should be converted into a list of Evidence objects that are added to the statement.
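The conversion could look roughly like the sketch below. The `Evidence` tuple here is a stand-in with made-up field names, not the actual class in statements.py, and the reaction ID and PMIDs in the example are placeholders.

```python
from collections import namedtuple

# Hypothetical stand-in for the real Evidence class; field names are assumed.
Evidence = namedtuple('Evidence', ['source_api', 'source_id', 'pmid', 'text'])


def evidence_from_references(source_id, pmids):
    """Build one Evidence object per reference attached to a reaction."""
    if not pmids:
        # Still record where the statement came from if no references exist.
        return [Evidence('biopax', source_id, None, None)]
    return [Evidence('biopax', source_id, pmid, None) for pmid in pmids]


evidence = evidence_from_references('reaction_1', ['0000001', '0000002'])
```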

ActiveForm statements are not sitemapped

 ActiveForm(MAPK1(mods: (phosphorylation, T, 185)), kinase, True),
 ActiveForm(MAPK1(mods: (phosphorylation, Y, 185)), kinase, True),

Gene/protein names should be case normalized

Gene names are parsed from BELscript/RDF in whatever case they appear, so some rules end up referring to, e.g., "Braf" while others refer to "BRAF". When converted into PySB, these produce different monomers.

Possible solutions:

Store all names as ALL CAPS
Store all names as first letter Caps
Use the first appearance of the gene name as the convention, store it, and then canonicalize all other-case appearances of the gene to that
Look up the name in a database somewhere.
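The third option (first appearance wins) can be sketched in a few lines. `NameCanonicalizer` is a hypothetical helper, not part of INDRA.

```python
class NameCanonicalizer(object):
    """Map every case variant of a name to its first-seen spelling."""

    def __init__(self):
        self._canonical = {}

    def canonicalize(self, name):
        # The upper-cased name is the lookup key; the stored value is
        # whichever spelling was encountered first.
        return self._canonical.setdefault(name.upper(), name)


canon = NameCanonicalizer()
first = canon.canonicalize('Braf')
second = canon.canonicalize('BRAF')
```

Note that the result depends on processing order, which the all-caps option avoids.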

Preassembler gets confused by unicode

Here are two Statements with their matches_keys:

Complex(CCND1(), CDK4()) 
"(<class 'indra.statements.Complex'>, ("('CCND1', [], [], None, 0, ())", "('CDK4', [], [], None, 0, ())"))"
Complex(CCND1(), CDK4())
"(<class 'indra.statements.Complex'>, ("(u'CCND1', [], [], None, 0, ())", "(u'CDK4', [], [], None, 0, ())"))"

The only difference is that in one case the member names are unicode and in the other they are not. This breaks duplicate detection, which works by comparing matches_key strings.
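One way around this is to normalize names to a single text type before building the key, and to compare the values themselves rather than their `repr`. The sketch below is not the actual `matches_key` implementation; `normalize_name` and `complex_key` are hypothetical, and in Python 3 the analogous mismatch is between `bytes` and `str`.

```python
def normalize_name(name):
    """Decode byte strings so all names share one text type."""
    if isinstance(name, bytes):
        return name.decode('utf-8')
    return name


def complex_key(member_names):
    """Build a comparable key from normalized member names."""
    return ('Complex', tuple(normalize_name(n) for n in member_names))


key_bytes = complex_key([b'CCND1', b'CDK4'])
key_text = complex_key([u'CCND1', u'CDK4'])
```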

Handle RegulateActivity statements in biopax processor

There appear to be many reactions in BioPax of the form

From: MEK()
Catalyzed by: Raf(Active state,R)
To: MEK(Active state,R)

i.e., an ActivityActivity statement involving Raf and Mek. We should be able to process these in BioPax as we do for BEL.

The ActivityActivity statements may prove to be particularly useful when assembling them alongside other statements where the identity of the modifying enzyme is actually not clear, for example when the reaction is catalyzed by a complex (e.g., 14-3-3 % BRAF % KSR).

Model annotations

We keep track of evidence at the level of Statements but the generated model itself is not annotated. We could annotate each PySB rule by, for instance, the natural language sentence, the BioPax reaction, or the BEL statement that it was processed from.
