geneontology / pathways2go

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Java 100.00%

ontology geneontology pathways go-cam biopax

pathways2go's Issues

bug - 🐛- MF to localization process rule

The reaction 'Phosphorylated MKK3/MKK6 migrates to nucleus' in the 'activated TAK1 mediates p38 MAPK activation' pathway ought to be converted to a localization process, but it's not. Investigate and fix.

generate ontology inference diff report with explanations

Input: 2 ontologies (e.g. GOPlus and GOPlus with the addition of axioms for MFs generated from RHEA)
Output: table showing new inferences resulting from addition of new axioms along with explanations

Remove labels from chebi entities

When labels exist, Noctua doesn't show labels from imported ontologies. Don't put them in unless this is desired...

Automate review of manual GO term assignments

For @ukemi :
The high level goal is to check a biological process node in a GO-CAM model with a manually assigned type to see if the manual assignment agrees with the term's logical definition in GO.

If the definition is complete and the model is complete, then it ought to be possible to simply take out the manually assigned type and see if the reasoner could recapitulate it or not. If not, then we either have an insufficient definition (very likely and common) or a problem with the Reactome data (model or content).

Provides input for OR positive regulates

Currently, the converter code maps each 'nextStep' relation asserted in the input BioPAX to a 'provides_direct_input_for' relation between the reactions/MFs. nextStep is too generic to justify a mapping that specific. For example, when the relationship is more of a signaling relation than a metabolic relation, directly_positively_regulates is a more appropriate relation.

convert nextStep relations from BioPAX into RO 'Causally Upstream Of'.
Add a new rule: If reaction1 has_output E and reaction2 has_input E and reaction1 nextStep reaction2 then reaction1 provides_direct_input_for reaction2
Add another new rule: If reaction1 has_output E and reaction2 enabled_by E and reaction1 nextStep reaction2 then reaction1 directly_positively_regulates reaction2
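The two rules above can be sketched roughly as follows. This is a minimal illustration, not the converter's code: the `Reaction` class, the `relationFor` method and the entity names in `main` are hypothetical, and checking the has_input rule before the enabled_by rule is an assumption the issue does not settle.

```java
import java.util.Set;

/** Hypothetical, simplified reaction model; not the converter's actual classes. */
class NextStepRuleSketch {

    static final class Reaction {
        final Set<String> inputs;    // entities linked by has_input
        final Set<String> outputs;   // entities linked by has_output
        final Set<String> enablers;  // entities linked by enabled_by

        Reaction(Set<String> inputs, Set<String> outputs, Set<String> enablers) {
            this.inputs = inputs;
            this.outputs = outputs;
            this.enablers = enablers;
        }
    }

    /** Chooses the relation to assert for "upstream nextStep downstream". */
    static String relationFor(Reaction upstream, Reaction downstream) {
        // Rule 1: an output of the upstream reaction is an input of the downstream one.
        for (String entity : upstream.outputs) {
            if (downstream.inputs.contains(entity)) {
                return "provides_direct_input_for";
            }
        }
        // Rule 2: an output of the upstream reaction enables the downstream one.
        for (String entity : upstream.outputs) {
            if (downstream.enablers.contains(entity)) {
                return "directly_positively_regulates";
            }
        }
        // Neither rule fired: keep only the generic causal relation.
        return "causally_upstream_of";
    }

    public static void main(String[] args) {
        // Illustrative entity names only.
        Reaction first = new Reaction(Set.of("MKK3"), Set.of("p-MKK3"), Set.of("TAK1"));
        Reaction second = new Reaction(Set.of("p38"), Set.of("p-p38"), Set.of("p-MKK3"));
        System.out.println(relationFor(first, second));
    }
}
```

Keeping the fallback as the generic relation means a nextStep edge is never dropped, only refined when the entity evidence supports it.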

add provided by statement on the model level

Fix auto-generated coordinates - with folding in Noctua in mind.

See ABC Transporter Disorders
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:-1448293181

Add complexes with protein and small molecule part_ofs back into GO-CAMs

In the GO-CAMs generated from Reactome that are currently (circa Sept 2018) visible in the noctua-dev server, the relationships between complexes and their parts have been removed from the OWL models (though you can see them in the comments). This was done because some complexes have a very large number of parts and models containing these complexes became illegible and effectively useless in the Noctua editor. (see geneontology/noctua#581 )

In order to support continued development of the biopax import/export code as well as generally supporting models that incorporate complexes, these need to be added back into the GO-CAMs.

add test and filter for enabled_by relations

Ensure that only macromolecules (protein, RNA) (not small molecules) can be the domain of the RO:enables relation.

close the issue by:

Adding a test to make sure enabled_by relations are valid. (Maybe this should be done in an RO definition for a subProperty of 'enabled by'?)
Ensuring that all Reactome-generated GO-CAMs pass the test.
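One possible shape for such a test, as a sketch only: the `Edge` and `EntityType` types are hypothetical stand-ins for whatever the converter exposes. Complexes are deliberately left out of the enum because this issue does not say how they should be treated.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical check that enabled_by edges only point at macromolecules. */
class EnabledByCheckSketch {

    enum EntityType { PROTEIN, RNA, SMALL_MOLECULE }

    static final class Edge {
        final String function;       // the MF node
        final String relation;       // e.g. "enabled_by", "has_input"
        final String entity;         // the physical entity node
        final EntityType entityType;

        Edge(String function, String relation, String entity, EntityType entityType) {
            this.function = function;
            this.relation = relation;
            this.entity = entity;
            this.entityType = entityType;
        }
    }

    // Entity types allowed as the target of enabled_by (assumed from the issue text).
    static final Set<EntityType> ALLOWED = EnumSet.of(EntityType.PROTEIN, EntityType.RNA);

    /** Returns every enabled_by edge whose target is not an allowed macromolecule. */
    static List<Edge> invalidEnabledBy(List<Edge> edges) {
        List<Edge> invalid = new ArrayList<>();
        for (Edge edge : edges) {
            if (edge.relation.equals("enabled_by") && !ALLOWED.contains(edge.entityType)) {
                invalid.add(edge);
            }
        }
        return invalid;
    }
}
```

A model passes when `invalidEnabledBy` returns an empty list; the returned edges double as the report of what needs filtering.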

🐛 in conversion of bmp related to regulation rule

http://noctua-dev.berkeleybop.org/editor/graph/gomodel:1847456024

Plan for integration with BEL

See e.g.
https://github.com/OpenBEL
http://openbel.org
http://openbel.org/language/version_2.0/bel_specification_version_2.0.html#_terms
https://www.biorxiv.org/content/early/2018/09/03/353235

create reactome model collection viewable in Noctua1.0

Likely means that most entities need to be taken out of generated data. (locations, inputs, outputs, etc.) (Maybe try first by dropping locations and making inputs and outputs shared across linked reactions into single nodes instead of the one node per reaction pattern we have now.)
Pack parts that await a Noctua 2.0 display… into annotations or just leave out.

Infer transport reactions as biological processes

For transport reactions, like “beta catenin translocates to the nucleus”, you can identify these since they have the same input and output entity, but presumably the entity is in different locations. If there’s no controller, you should not try to infer an enabled_by, and you can label it as a GO biological process instead of a molecular function. In this particular case, they’ve annotated it to the GO process term “regulation of canonical Wnt…”, which is a strange choice but you can still use it in the conversion.

And previous thoughts on this:
If transport reactions can be detected, convert has_location relations to either has_target_end_location or has_target_start_location and add a general class along the lines of ‘establishment of protein localization’. This will allow for the inference of biological process classes such as ‘establishment of protein localization to mitochondrial membrane’. For example,

The GO OWL Class: Establishment of protein localization to mitochondrial membrane is equivalent to the intersection of classes:
Establishment of protein localization
And (‘has target end location’ some ‘mitochondrial membrane’)

Consider a transport reaction from Reactome
Could make assertions according to the rule:

R instance_of ‘establishment of protein localization’
R has_target_end_location location2
R has_target_start_location location1
<=
R has_output P2,
P2 occurs_in location2,
R has_input P1,
P1 occurs_in location1,
P1 = P2,
location1 != location2

Then reasoner should add more specific biological process - e.g.
R instance_of ‘establishment of protein localization to mitochondrial membrane’
Based on its definition
'establishment of protein localization'
and ('has target end location' some 'mitochondrial membrane')
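The assertion rule above could be sketched like this. The `Participant` and `Localization` types are hypothetical; the sketch only produces the general class plus start and end locations, and leaves the more specific class to the reasoner as described.

```java
import java.util.Optional;

/** Hypothetical sketch of the transport-reaction rule. */
class TransportRuleSketch {

    static final class Participant {
        final String entityId;    // e.g. a UniProt accession
        final String locationId;  // GO cellular component of this participant

        Participant(String entityId, String locationId) {
            this.entityId = entityId;
            this.locationId = locationId;
        }
    }

    static final class Localization {
        final String processType = "establishment of protein localization";
        final String startLocation;  // has_target_start_location
        final String endLocation;    // has_target_end_location

        Localization(String startLocation, String endLocation) {
            this.startLocation = startLocation;
            this.endLocation = endLocation;
        }
    }

    /** Fires only when input and output are the same entity in different locations. */
    static Optional<Localization> inferLocalization(Participant input, Participant output) {
        boolean sameEntity = input.entityId.equals(output.entityId);
        boolean moved = !input.locationId.equals(output.locationId);
        if (sameEntity && moved) {
            return Optional.of(new Localization(input.locationId, output.locationId));
        }
        return Optional.empty();
    }
}
```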

Document information loss in conversion

Capture the useful information that is lost when moving from BioPAX to GO-CAM and vice versa.

add contributed_by statements as Reactome wants them

This is pending decisions about what Reactome curators want to see here.

Convert pathway models into GO-CAMs ready for use without further curation

The objective here is the automatic assembly of a GO-CAM knowledge base from BioPAX (and later BEL) pathway resources. Limitations of Noctua are ignored; all the focus is on getting the models as close to what we want as possible, automatically.

So far the spec is as we've done before plus this most recent declaration (which needs some clarification): "This will not attempt to decompose complexes, and for binding functions, exactly one input must be selected to be the enabler (all other complexes or proteins that are inputs, will be connected via the has_input relation). Only one occurs_in edge can be connected to an activity, and complexes cannot have relationships to locations. If the inputs and outputs of a reaction have different locations, please collect the set of combinations, as well as examples of each combination, and we can try to see if there are rules for inferring a single location, from a combination of input and output locations."

Confirm correct directionality of Physical Entity regulates Function assertions

I’m not sure where the “involved in positive regulation of [entity]” triplets are coming from. In GO-CAM a process/function can be regulated but not an entity, so we’d need a way to fix that if we want to keep it. But I can’t see how this comes out of the Reactome representation.

See e.g. 'Beta-catenin is released from the destruction complex' in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:-80976963

Implement pattern for stable URIs

If at some point we share the OWL models generated by the pathways2GO code, it would be ideal if we could do so following good linked data practices. Resolvable, stable URIs are the key here.

get rid of rdfs:labels in go-cam output

Wherever the label should be coming from neo, ensure it's not put directly into the go-cam, as this results in a random selection of labels. Check especially for uniprot ids e.g. http://noctua-dev.berkeleybop.org/editor/graph/gomodel:-1280847518 .

Infer Molecular Function = Binding

If Reactome did not provide a GO molecular function term, and there’s either 1) an output protein complex but no input protein complex, or 2) there’s an output protein complex that is different from all the input complexes, you can infer that the activity is a protein binding function. So you can label it that way (GO:0005515) instead of the root molecular function (GO:0003674).
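A minimal sketch of this inference, assuming complexes are compared by identifier. The class and method names are hypothetical. Both conditions reduce to one check: some output complex is not among the input complexes, which is trivially true when there are no input complexes.

```java
import java.util.Set;

/** Hypothetical sketch of the "infer binding" rule. */
class BindingInferenceSketch {

    static final String PROTEIN_BINDING = "GO:0005515";
    static final String ROOT_MOLECULAR_FUNCTION = "GO:0003674";

    /**
     * @param reactomeGoTerm  GO MF term supplied by Reactome, or null if none
     * @param inputComplexes  identifiers of complexes among the reaction inputs
     * @param outputComplexes identifiers of complexes among the reaction outputs
     */
    static String inferFunction(String reactomeGoTerm,
                                Set<String> inputComplexes,
                                Set<String> outputComplexes) {
        if (reactomeGoTerm != null) {
            return reactomeGoTerm;  // never override a term Reactome provided
        }
        for (String output : outputComplexes) {
            if (!inputComplexes.contains(output)) {
                return PROTEIN_BINDING;  // a new complex was formed
            }
        }
        return ROOT_MOLECULAR_FUNCTION;
    }
}
```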

Systematic Review of Pathway Conversion

@vanaukenk, @huaiyumi, and I will work with @deustp01 on a curatorial review of the conversion of specific pathways to see if we can identify systematic issues.
The goal will be to compare the Reactome representation, the Reactome-based GO-CAM and potentially a curator generated GO-CAM to see if the models are faithfully represented. We will have virtual meetings to discuss the models each week. For each meeting one person will report their findings and then we can discuss any issues. I will start with glycolysis.

There is a folder in the GO-CAM and Noctua folder for the reactome project. I suggest we put notes and presentations there.

Pathways claimed:

@huaiyumi - Signaling by BMP (R-HSA-201451.4); BMP signaling pathway (GO:0030509)
@huaiyumi - MAPK1/MAPK3 signaling (R-HSA-5684996.4)
@ukemi and @deustp01 - Glycolysis (R-HSA-70171); canonical glycolysis (GO:0061621)
@ukemi and @deustp01 - Gluconeogenesis (R-HSA-70263); gluconeogenesis (GO:0006094)
@vanaukenk - TCF dependent signaling in response to WNT (R-HSA-201681); canonical Wnt signaling pathway (GO:0060070) Not xref'd but generic pathway is.
@vanaukenk - Unfolded Protein Response (UPR) (R-HSA-381119); endoplasmic reticulum unfolded protein response (GO:0030968)
@ukemi and @deustp01 - GABA degradation (R-HSA-916853); gamma-aminobutyric acid catabolic process (GO:0009450)
@ukemi and @deustp01 - PINK/PARKIN Mediated Autophagy (R-HSA-5205685); mitophagy (GO:0000423) or macroautophagy (GO:0016236).

focus enabled_by edges on gene products rather than complexes

If a complex has only one gene product in it, you can remove the small molecules/ions in the complex and just make the node a single gene product instead of a complex. For example, see the pathway “glycolysis”, the reaction “PGM:Mg2+ isomerise G6P to G1P” should just be enabled_by PGM (one gene product).
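As a rough sketch of this simplification (the `Part` type and the identifiers are hypothetical): count distinct gene products among the parts, and if there is exactly one, use it as the enabler instead of the complex. Treating several copies of the same gene product as one is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: collapse a single-gene-product complex to that gene product. */
class EnablerSimplificationSketch {

    static final class Part {
        final String id;
        final boolean geneProduct;  // false for small molecules and ions

        Part(String id, boolean geneProduct) {
            this.id = id;
            this.geneProduct = geneProduct;
        }
    }

    /** Returns the node that the enabled_by edge should point to. */
    static String enablerFor(String complexId, List<Part> parts) {
        Set<String> geneProducts = new LinkedHashSet<>();
        for (Part part : parts) {
            if (part.geneProduct) {
                geneProducts.add(part.id);
            }
        }
        // Exactly one distinct gene product: drop the complex wrapper.
        return geneProducts.size() == 1 ? geneProducts.iterator().next() : complexId;
    }

    public static void main(String[] args) {
        List<Part> parts = List.of(new Part("PGM", true), new Part("Mg2+", false));
        System.out.println(enablerFor("PGM:Mg2+", parts));
    }
}
```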

do away with complexes altogether?

@thomaspd has proposed that we take all of the complexes out of the GO-CAMs generated from Reactome and in their place, put all of the gene products that they contain. So, if a complex 'enables' a reaction r and the complex has proteins a,b,c as parts, then create the triples a enables r, b enables r, and c enables r.
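To make the proposal concrete, here is a small sketch with hypothetical names that expands one complex-level enables assertion into one triple per protein part. Triples are rendered as plain strings purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: replace "complex enables r" with one triple per protein part. */
class ComplexExpansionSketch {

    static List<String> enablesTriples(String reactionId, List<String> proteinParts) {
        List<String> triples = new ArrayList<>();
        for (String protein : proteinParts) {
            triples.add(protein + " enables " + reactionId);
        }
        return triples;
    }

    public static void main(String[] args) {
        // Prints one "x enables r" line for each of a, b and c.
        for (String triple : enablesTriples("r", List.of("a", "b", "c"))) {
            System.out.println(triple);
        }
    }
}
```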

The underlying logic is that GO-CAMs are gene-product centric. Even when a complex is the active entity, the thinking is that there is a gene product within it that is more causal and should be the focus of the annotation. By not showing complexes, it forces the curators to think about which of the genes are crucial to executing the function in question. The curator's workflow would then be to review the assertions and delete the ones that weren't really causal.

This partially alleviates the pain of using the Noctua interface as the folding problems related to complexes go away. Unfortunately it doesn't completely as we still have the issue of complexes with lots of pieces a la geneontology/noctua#581
It's also quite a bit farther afield from the Reactome view, making it harder to provide them with useful feedback in an automated way. For example, one of the nice kinds of inferences we can give them is about the kinds of protein complexes they are working with (from the GO CC perspective), e.g., automatic detection of 'DNA Repair Complex' https://docs.google.com/presentation/d/1T6QmR7MeXWH6Q3hMgpp-A09zDwTUplWcrVSZRhS_InU/edit#slide=id.g39eed66d0b_0_11

how to represent post translational modifications

Reactome has entities like p-2T-MAP2K1 that are combinations of proteins with different modifications. For p-2T-MAP2K1 it is actually a set of two proteins with the same id (UniProt:Q02750) but with different PTMs.

Currently the GO-CAM view of this entity is a single instance of a UniProt:Q02750 without any PTM information. This leads to representations of function/reaction nodes that have the same entities as both input and output (apart from ATP->ADP).

Is this a desirable representation for GO-CAM models?

Note that the BioPAX export does contain information about the specific sequence locations that are modified. In principle, this information could be captured in the GO-CAM model.

Build go-rhea-reactome xref table(s)

Something along the lines of:
GO:ID|GO term name|Reactome:ID|Reactome term name|Rhea ID|Rhea term name

(but noting that many Reactome entries will point to a few GO terms.)

Add active unit information from Reactome

For some reactions, like 'activated human TAK1 phosphorylates MKK3/MKK6', Reactome says that a protein complex (or set) catalyzes the reaction, and then indicates a specific member of that complex (or set) as the 'active unit'. In this case, the modified protein 'p-T184,T187-MAP3K7' is the active unit and 'Activated TAK complexes' is the 'complex'. Note that several of the members of 'Activated TAK complexes' contain a MAP3K7.

To close this issue:

identify the active unit (which will require accessing Reactome in another way than the BioPAX as this information is not in the BP level 3 export)
Make the 'enabled by' edge for the reaction (MF) link to this active unit protein.
Link the catalytic complex(es) to the reaction (MF) via 'contributes to' relationship(s).

Make protein sets OWL Union entities in GO-CAMs

Reactome represents group entities like 'glucokinase and hexokinases'. So far these have been treated as complexes in the GO-CAM converter which is incorrect.

To close issue:

in the GO-CAM, create an OWL union class containing each member to represent the group
add any relations previously linked to the complex representation to the union entity
confirm that this renders in Noctua, if not, generate a new Noctua ticket

Define and implement pathway2go curation workflow

Presuming that the overarching goal here is a migration of most knowledge from external pathway databases into GO-CAM models.

We assume that the conversion from Reactome pathways to GO-CAMs cannot be completely automated to our satisfaction, hence we need to decide how the work of adapting these models to suit the GO will be accomplished.

Define what the goal is. How should pathways look as completed GO-CAMs? Generate manual exemplars for several. (link to them here). This is a general need for all GO-CAM work, not limited to the pathway problem. We need more exemplar models.
Define a curatorial workflow that allows curators to efficiently do the work needed for the conversion to be complete. Define what the individual tasks are (e.g. select a specific member of a protein complex to assign as the enabler of a reaction/function). Decide if these require new UI components for Noctua and, if so, spec them out.
Define what happens (automated updates, alerts for humans, etc.) when pathways are modified at their sources.

clean up extraneous edges in BioPAX import

examples in the pathway: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:-80976963

Only make an inference about which has_input edge should be enabled_by, when there’s no controller. For example, the reaction “phosphorylation of LRP5/6 cytoplasmic domain by CSNKI” currently has multiple enabled_by edges, and it should only have one, pointing to CSNKI (P78368)
There are extraneous “enabled_by (protein-containing complex)” triplets. Same example as #1 above. I’m guessing this comes from the reasoner. These should be removed.
When there’s a GO term from Reactome, you can remove the placeholder root molecular function (GO:0003674). Right now there are two function terms for the same activity in these cases, same example as 1 above.

cleanse generated models of tbox insertions wherever possible - work with neo..

I've put it off long enough. Now that inferences are starting to matter more, time to get rid of the tbox assertions in the converted models (just subclassOfs..) and use goplus and neo etc. directly. Early tests look promising.

Need to make some more test cases to make sure this doesn't screw up the sparql-based rules.

test GO OWL inference for mapping unknown terms

See related issue geneontology/go-ontology#14984 , approach (in slides) for importing reactions from Rhea https://docs.google.com/presentation/d/1QZ96mL1PRE0cLw0pPT5K-R9wdfd07HM4b2OFpCSSELU/edit#slide=id.p24

Initial work on Rhea conversion http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/experimental/rhea2go/

Also discussion towards the bottom of mapping document
https://docs.google.com/document/d/1xKr4sMBMioEYnHFpnFl0lL2gQzpeHX-eY3wT0fyZXEA/edit?ts=5ac40c2e#heading=h.5pt8ui1xdcc

complexes (without active sites) can enable functions again

When no active site is annotated, and the complex (or protein set) catalyzes a MF, then add the complex as the enabler of that function. The members of the complex (part_ofs) should get contributes_to relations to the MF, though these may be inferable via geneontology/minerva#110

validate consistency of reactome location curation

One of the rules applied during import to go-cam infers reaction location from entity locations.

Possible QA bonus: Reactome is supposed to do this manually and record the result as an attribute of the reaction, so the inference should match the annotation and discrepancies are likely curation mistakes.

The work in progress at Reactome to sort out "encloses" relationships among GO cell_component terms used to annotate locations of entities and reactions should be usable also to check plausibility of reactions with multiple locations.

assemble sparql endpoint for pathway membership query

Objective is to provide SPARQL access to results of important conversions to GO-CAM (starting with Reactome and then perhaps PathwayCommons)

Add additional info to the README

E.g. link to Rocky presentation, link to noctua-dev

rewrite the conversion process to Python and implement using straightforward conversion rules

The mapping process follows a set of rules that could probably be expressed in a logic programming framework such as prolog or SWRL rules. This would be beneficial as it would be easier to collaboratively develop the rules if they were first class entities in github rather than chunks of code buried in the converter app. Would also make it feasible to apply them in other contexts.

Investigating formalizations is the ticket here.

Notes from @balhoff on how to get started:
SWRLTab (in Protege) comes with its own reasoner; you don’t really need it. Just edit in the Rules tab; it’s built in.
Look at some of the rules that are in RO already. Then you need to run them: you can use the Arachne Protege plugin, which is kind of hacky, or just save your rules in an OWL file and use it like the other OWL files in your program.

Convert Demo Pathways from Pathway Commons

Want to move beyond Reactome to incorporate pathways from other databases. Pathway Commons provides an aggregated resource available in BioPAX.

Pathway that originated in Reactome
Pathway that originated in WikiPathways
Pathway that originated in KEGG

take out cross pathway links

This is an interim step before finishing #43

The view in Noctua will only show contents from the individual pathway. No links to super/sub pathways nor links to events in other pathways will be shown. These will all eventually be captured in the bridge model graph(s). Fairly significant work on the client will be needed to support the use of the bridge model in the UI. (Won't happen unless it is prioritized.)

Prepare lists of missing reactome-go mappings

Pathway->BP
Reaction->MF

Convert Entity_involved_in_regulation_MF to MF_regulates_MF

Reactome often connects physical entities (complexes, compounds, etc.) to reactions using regulates relationships. E.g. see https://reactome.org/PathwayBrowser/#/R-HSA-201681&SEL=R-HSA-201685&PATH=R-HSA-162582,R-HSA-195721 . Currently these come through the first pass of go-camification as: Entity involved_in_regulation_of MF triples (seen in grey). There is currently a rule that converts these into MF regulates MF relationships when the upstream MF that produces the physical entity as an output is present in the pathway. See e.g.: 'Beta-catenin is released from the destruction complex' is positively regulated by 'Phosphorylation of LRP5/6 cytoplasmic domain by CSNKI'. (See also #18 )

But sometimes, the upstream function is either unknown or not in the current pathway. In these cases, create a simple 'binding' function node in its place and assert that that function regulates the targeted downstream function. Remove the Entity involved_in_regulation_of MF relationship.
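Both cases (producer present, producer missing) might be handled along these lines. Everything here is a hypothetical sketch: the map from entity to producing function, the `binding:` prefix used to mint a placeholder node id, and the string rendering of triples are all assumptions for illustration.

```java
import java.util.Map;

/** Hypothetical sketch: rewrite "Entity involved_in_regulation_of MF" as "MF regulates MF". */
class RegulationRewriteSketch {

    /**
     * @param producerByEntity maps an entity to the in-pathway function that has it as output
     * @return the function node that should act as the regulator
     */
    static String regulatingFunction(String entity, Map<String, String> producerByEntity) {
        String producer = producerByEntity.get(entity);
        if (producer != null) {
            return producer;  // upstream MF is present in this pathway
        }
        // Upstream MF unknown or outside the pathway: stand in a binding function node.
        return "binding:" + entity;
    }

    /** Builds the replacement triple; the entity-level triple should then be removed. */
    static String rewrite(String entity, String mfRelation, String targetFunction,
                          Map<String, String> producerByEntity) {
        return regulatingFunction(entity, producerByEntity) + " " + mfRelation + " " + targetFunction;
    }
}
```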

Build BioPAX exporter

Input any valid GO-CAM model
Output a BioPAX level 3 file

test usefulness of this output using BioPAX compliant tools (BioPAX validator, ChiBE, Cytoscape import, more?)

Generate report for Reactome

Show what GO classes can be inferred - show where they match existing annotations, where they differ, and where they differ if they are 'deeper'.

ping Nathan Dunn when reactome conversion is in a public rdf endpoint

Test 'Bridge Model' pattern for cross-model linking.

For edges between nodes in different pathways (nextEvent, PrevEvent) add these to a separate 'Bridge' model. Test how this can be used for cross-model queries in complete RDF store.

This can be used by @kltm and @balhoff to develop client code and reasoner code respectively.

add startsWith and endsWith to models

This will make it possible to automate type inference and checking with OWL definitions for biological processes like 'GO:0007213' 'G protein-coupled acetylcholine receptor signaling pathway'. Defined as:
'signal transduction'
and ('starts with' some 'G protein-coupled acetylcholine receptor activity')

each edge in a reactome model should have one identical evidence statement linking it to the source pathway

Each edge in the GO-CAM should be linked to an evidence statement with the source being the reactome unique identifier for the corresponding pathway record. e.g. reactome:R-HSA-450302
(Noctua ought to expand that the way it does for PMIDs to e.g. https://reactome.org/content/detail/R-HSA-450302 ).

Define Victory

How should we decide when biopax 2 go-cam or go-cam 2 biopax is good enough to stop (and then move on to providing access to the results)?

Infer locations for activities based on their actors

Try to add locations for each activity, and connect them with the occurs_in edge. If Reactome labels the reactants and products with a GO cellular component term, and it’s the same for all reactants and products, then you can use that GO term for the activity.
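A minimal sketch of this location rule, with hypothetical names: the activity gets an occurs_in location only when every reactant and product carries the same GO cellular component term. Treating a participant with no location (null here) as blocking the inference is an assumption.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch: infer occurs_in for an activity from its participants. */
class LocationInferenceSketch {

    /**
     * @param participantLocations GO cellular component term of each reactant and product;
     *                             null marks a participant with no location annotation
     */
    static Optional<String> inferOccursIn(Collection<String> participantLocations) {
        Set<String> distinct = new HashSet<>();
        for (String location : participantLocations) {
            if (location == null) {
                return Optional.empty();  // unlabelled participant: do not guess
            }
            distinct.add(location);
        }
        // Only a single shared location is safe to copy onto the activity.
        return distinct.size() == 1
                ? Optional.of(distinct.iterator().next())
                : Optional.empty();
    }
}
```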

Convert BioPAX to GO-CAMs intended for curation in Noctua

Input BioPAX model (Reactome, Pathway Commons, etc.)
Output GO-CAM model

Add info about biopax properties mapped to RO to RO

e.g. ro:has_input = bp:left. ro:has_output = bp:right
