
pathways2go's Introduction


Pathways to GO

Code for converting between biological pathways (e.g. Reactome, Wikipathways) expressed using the BioPAX standard and the Gene Ontology Causal Activity OWL Model (GO-CAM) structure.

Additional documentation

Related Code Repositories

Related Websites

Related Presentations

Related Publications

Building

  • Clone the repository and build it with Maven (development is done in Eclipse with the Maven plugin).

How it works - ontologies

  • GO-CAMs are "OWL instance graphs" aka little ontologies where instances are created and linked to one another to represent how we think gene products enable biology to function. In semantic web terminology, these little ontologies define the "Abox" or "assertional component" of the GO knowledge base. They depend on a "Tbox" or "terminological component". The Tbox contains all of the definitions of the classes and properties that the Abox refers to. In other words, the Tbox provides the language and the Abox provides the sentences. For GO-CAMs, the Tbox is defined by a conglomerate ontology called go-lego http://purl.obolibrary.org/obo/go/extensions/go-lego.owl which contains GO, a number of species-specific anatomy ontologies, portions of CHEBI, parts of NCBI taxonomy, and various other bits and pieces needed to express GO-CAMs. The import list for go-lego can be found here: https://github.com/geneontology/go-ontology/blob/master/src/ontology/extensions/go-lego-edit.ofn
  • Right now, go-lego does not contain class-level representations of all of the physical entities that are used in pathway databases (namely, gene products, complexes, and chemicals), Reactome in particular. So, to generate semantically complete GO-CAMs, this base ontology needs to be extended with ontologies that capture the physical entities used.
    • The NEO ontology is meant to capture all gene products that the GO consortium members may want to use in their GO-CAMs. It is available at http://purl.obolibrary.org/obo/go/noctua/neo.owl and is built with code from https://github.com/geneontology/neo . Note that its size (>3 GB at last count) makes it challenging to work with in the context of standard semantic web tools.
    • The 1.0.0 pathways2go framework is primarily oriented towards the conversion of pathways from the Reactome knowledge base (https://reactome.org). The Reactome entity representations include knowledge that NEO does not, including the locations of the physical entities, modifications to proteins, and a large number of unique complexes. To accommodate these entities within GO-CAM models, pathways2go produces a new ontology called 'reacto.owl' that captures all of this information and needs to be imported for Reactome-generated GO-CAMs to be terminologically (Tbox) complete. It is generated automatically and can be accessed at http://purl.obolibrary.org/obo/go/extensions/reacto.owl . Note that reacto contains references to ChEBI, GO, PRO, and MOD from the OBO collection. For complete reasoning, these could be imported but are left out of the build because of size issues.
  • In 1.1.0 the pathways2go framework was extended to support conversion of BioPAX pathways from https://yeastgenome.org . If the parameter -e YeastCyc is added to the command line execution, the conversion will not use the REACTO entity ontology pattern described above. Instead, it will make direct reference to OWL classes in the go-lego and neo ontologies and, where these are missing, use appropriate upper-level classes as defaults. In addition, this release introduces the -sssom parameter. If an SSSOM (https://github.com/OBOFoundry/SSSOM/blob/master/SSSOM.md) mapping file is provided, the converter will use the mapping to add class assignments where they are missing from the BioPAX file, using the best match with confidence above 0.5.
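
The "best match with confidence above 0.5" selection can be illustrated with a minimal Python sketch. This is not the converter's actual code (which is Java): the function name and the example rows are hypothetical, and it assumes the mapping file uses the standard SSSOM columns subject_id, object_id, and confidence.

```python
import csv

def best_mappings(sssom_lines, threshold=0.5):
    """Map each subject_id to its highest-confidence object_id above threshold."""
    # SSSOM TSV files may start with '#'-prefixed metadata lines; skip them.
    rows = [ln for ln in sssom_lines if ln.strip() and not ln.startswith("#")]
    best = {}
    for row in csv.DictReader(rows, delimiter="\t"):
        confidence = float(row["confidence"])
        if confidence <= threshold:
            continue  # only matches strictly above the threshold are considered
        subject = row["subject_id"]
        if subject not in best or confidence > best[subject][1]:
            best[subject] = (row["object_id"], confidence)
    return {subject: obj for subject, (obj, _) in best.items()}

# Hypothetical mapping rows, not taken from a real mapping file.
example = [
    "# mapping_set_id: example",
    "subject_id\tobject_id\tconfidence",
    "EX:entity1\tEX:classA\t0.9",
    "EX:entity1\tEX:classB\t0.6",
    "EX:entity2\tEX:classC\t0.4",
]
print(best_mappings(example))
```

In this example EX:entity2 gets no class assignment because its only candidate falls below the threshold.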

Running

  • Build using Maven install, or download a release jar file from https://github.com/geneontology/pathways2GO/releases
  • The biopax2go.jar executable has two purposes. It can generate an OWL ontology containing the physical entities in a BioPAX file (e.g. reacto.owl), and it can convert a BioPAX file (e.g. https://reactome.org/download/current/biopax.zip) into a set of GO-CAMs corresponding to the pathways in the BioPAX file.
  • For example, to generate a physical entity ontology called reacto.owl from a biopax file:
  • To generate GO-CAMs from a BioPAX file, minimally:
    • java -jar biopax2go.jar -b some_biopax.owl -o ./output_dir/ -lego go-lego.owl -e REACTO
  • To generate GO-CAMs from a YeastCyc BioPAX file, with a provided SSSOM mapping file:
    • java -jar biopax2go.jar -b some_biopax.owl -o ./output_dir/ -lego go-lego.owl -e YeastCyc -sssom ./yeastpathway.sssom.tsv

pathways2go's People

Contributors

balhoff, cmungall, dependabot[bot], dustine32, goodb


pathways2go's Issues

Build BioPAX exporter

Input any valid GO-CAM model
Output a BioPAX level 3 file

  • test usefulness of this output using BioPAX compliant tools (BioPAX validator, ChiBE, Cytoscape import, more?)

Infer transport reactions as biological processes

For transport reactions, like “beta catenin translocates to the nucleus”, you can identify these since they have the same input and output entity, but presumably the entity is in different locations. If there’s no controller, you should not try to infer an enabled_by, and you can label it as a GO biological process instead of a molecular function. In this particular case, they’ve annotated it to the GO process term “regulation of canonical Wnt…”, which is a strange choice but you can still use it in the conversion.

And previous thoughts on this:
If transport reactions can be detected, convert has_location relations to either has_target_end_location or has_target_start_location and add a general class along the lines of ‘establishment of protein localization’. This will allow for the inference of biological process classes such as ‘establishment of protein localization to mitochondrial membrane’. For example:

The GO OWL Class: Establishment of protein localization to mitochondrial membrane is equivalent to the intersection of classes:
Establishment of protein localization
And (‘has target end location’ some ‘mitochondrial membrane’)

Consider a transport reaction from Reactome. We could make assertions according to the rule:

R instance_of ‘establishment of protein localization’
R has_target_end_location location2
R has_target_start_location location1
<=
R has_output P2,
P2 occurs_in location2,
R has_input P1,
P1 occurs_in location1,
P1 = P2,
location1 != location2

The reasoner should then add a more specific biological process, e.g.
R instance_of ‘establishment of protein localization to mitochondrial membrane’
based on its definition:
'establishment of protein localization'
and ('has target end location' some 'mitochondrial membrane')
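
As a rough illustration only, the rule can be written as a small Python function. The Participant type and the function name are hypothetical and the converter itself does not work this way; the sketch just shows when the rule fires and which assertions it would produce.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Participant:
    entity: str    # identifier of the physical entity (hypothetical field name)
    location: str  # cellular component the participant occurs in

def infer_transport(inputs, outputs):
    """Return (relation, value) assertions for reaction R, or [] if the rule does not fire."""
    assertions = []
    for p1 in inputs:
        for p2 in outputs:
            # Same entity on both sides of the reaction, but in different locations.
            if p1.entity == p2.entity and p1.location != p2.location:
                assertions.append(("instance_of", "establishment of protein localization"))
                assertions.append(("has_target_start_location", p1.location))
                assertions.append(("has_target_end_location", p2.location))
    return assertions

result = infer_transport(
    [Participant("beta catenin", "cytosol")],
    [Participant("beta catenin", "nucleus")],
)
print(result)
```

A reasoner would then be responsible for classifying R under any more specific process class whose definition matches these assertions.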

Add active unit information from Reactome

For some reactions, like 'activated human TAK1 phosphorylates MKK3/MKK6', Reactome says that a protein complex (or set) catalyzes the reaction, and then indicates a specific member of that complex (or set) as the 'active unit'. In this case, the modified protein 'p-T184,T187-MAP3K7' is the active unit and 'Activated TAK complexes' is the 'complex'. Note that several of the members of 'Activated TAK complexes' contain a MAP3K7.

To close this issue:

  • Identify the active unit (which will require accessing Reactome in another way than the BioPAX, as this information is not in the BioPAX level 3 export).
  • Make the 'enabled by' edge for the reaction (MF) link to this active unit protein.
  • Link the catalytic complex(es) to the reaction (MF) via 'contributes to' relationship(s).

take out cross pathway links

This is an interim step before finishing #43

The view in Noctua will only show contents from the individual pathway. No links to super/sub pathways nor links to events in other pathways will be shown. These will all eventually be captured in the bridge model graph(s). Fairly significant work on the client will be needed to support the use of the bridge model in the UI. (Won't happen unless it is prioritized.)

clean up extraneous edges in BioPAX import

examples in the pathway: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:-80976963

  • Only make an inference about which has_input edge should be enabled_by, when there’s no controller. For example, the reaction “phosphorylation of LRP5/6 cytoplasmic domain by CSNKI” currently has multiple enabled_by edges, and it should only have one, pointing to CSNKI (P78368)
  • There are extraneous “enabled_by (protein-containing complex)” triplets. Same example as #1 above. I’m guessing this comes from the reasoner. These should be removed.
  • When there’s a GO term from Reactome, you can remove the placeholder root molecular function (GO:0003674). Right now there are two function terms for the same activity in these cases, same example as 1 above.

Implement pattern for stable URIs

If at some point we share the OWL models generated by the pathways2GO code, it would be ideal if we could do so following good linked data practices. Resolvable, stable URIs are the key here.

Build go-rhea-reactome xref table(s)

Something along the lines of:
GO:ID|GO term name|Reactome:ID|Reactome term name|Rhea ID|Rhea term name

(but noting that many Reactome entries will point to a few GO terms.)

Confirm correct directionality of Physical Entity regulates Function assertions

I’m not sure where the “involved in positive regulation of [entity]” triplets are coming from. In GO-CAM a process/function can be regulated but not an entity, so we’d need a way to fix that if we want to keep it. But I can’t see how this comes out of the Reactome representation.

See e.g. 'Beta-catenin is released from the destruction complex' in http://noctua-dev.berkeleybop.org/editor/graph/gomodel:-80976963

add startsWith and endsWith to models

This will make it possible to automate type inference and checking with OWL definitions for biological processes like 'GO:0007213' 'G protein-coupled acetylcholine receptor signaling pathway'. Defined as:
'signal transduction'
and ('starts with' some 'G protein-coupled acetylcholine receptor activity')

Automate review of manual GO term assignments

For @ukemi :
The high level goal is to check a biological process node in a GO-CAM model with a manually assigned type to see if the manual assignment agrees with the term's logical definition in GO.

If the definition is complete and the model is complete, then it ought to be possible to simply take out the manually assigned type and see if the reasoner could recapitulate it or not. If not, then we either have an insufficient definition (very likely and common) or a problem with the Reactome data (model or content).

Make protein sets OWL Union entities in GO-CAMs

Reactome represents group entities like 'glucokinase and hexokinases'. So far these have been treated as complexes in the GO-CAM converter which is incorrect.

To close issue:

  • in the GO-CAM, create an OWL union class containing each member to represent the group
  • add any relations previously linked to the complex representation to the union entity
  • confirm that this renders in Noctua, if not, generate a new Noctua ticket

do away with complexes altogether?

@thomaspd has proposed that we take all of the complexes out of the GO-CAMs generated from Reactome and in their place, put all of the gene products that they contain. So, if a complex 'enables' a reaction r and the complex has proteins a,b,c as parts, then create the triples a enables r, b enables r, and c enables r.

The underlying logic is that GO-CAMs are gene-product centric. Even when a complex is the active entity, the thinking is that there is a gene product within it that is more causal and should be the focus of the annotation. By not showing complexes, it forces the curators to think about which of the genes are crucial to executing the function in question. The curator's workflow would then be to review the assertions and delete the ones that weren't really causal.

  • This partially alleviates the pain of using the Noctua interface as the folding problems related to complexes go away. Unfortunately it doesn't completely as we still have the issue of complexes with lots of pieces a la geneontology/noctua#581
  • It's also quite a bit farther afield from the Reactome view, making it harder to provide them with useful feedback in an automated way. For example, one of the nice kinds of inferences we can give them are inferences about the kinds of protein complexes they are working with (from the GO CC perspective), e.g. automatic detection of 'DNA Repair Complex' https://docs.google.com/presentation/d/1T6QmR7MeXWH6Q3hMgpp-A09zDwTUplWcrVSZRhS_InU/edit#slide=id.g39eed66d0b_0_11
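
The proposed expansion (if a complex enables r and has parts a, b, c, assert that a, b, and c each enable r) could be sketched as follows. The function name and data shapes are hypothetical and only illustrate the proposal; this is not code from the converter.

```python
def expand_enablers(enabled_by, complex_parts):
    """Replace each complex enabler with one 'enables' triple per gene product it contains.

    enabled_by: list of (entity, reaction) pairs.
    complex_parts: dict mapping a complex to the gene products it contains.
    """
    triples = []
    for entity, reaction in enabled_by:
        # Entities that are not known complexes are kept as they are.
        for member in complex_parts.get(entity, [entity]):
            triples.append((member, "enables", reaction))
    return triples

triples = expand_enablers(
    [("complex1", "r"), ("d", "r2")],
    {"complex1": ["a", "b", "c"]},
)
print(triples)
```

Under the proposed workflow, a curator would then review these triples and delete the ones that are not really causal.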

Convert pathway models into GO-CAMs ready for use without further curation

The objective here is the automatic assembly of a GO-CAM knowledge base from BioPAX (and later BEL) pathway resources. Limitations of Noctua are ignored; all the focus is on getting the models as close as possible to what we want without manual intervention.

So far the spec is as we've done before plus this most recent declaration (which needs some clarification): "This will not attempt to decompose complexes, and for binding functions, exactly one input must be selected to be the enabler (all other complexes or proteins that are inputs, will be connected via the has_input relation). Only one occurs_in edge can be connected to an activity, and complexes cannot have relationships to locations. If the inputs and outputs of a reaction have different locations, please collect the set of combinations, as well as examples of each combination, and we can try to see if there are rules for inferring a single location, from a combination of input and output locations."

Define and implement pathway2go curation workflow

Presuming that the overarching goal here is a migration of most knowledge from external pathway databases into GO-CAM models.

We assume that the conversion from Reactome pathways to GO-CAMs can not be completely automated to our satisfaction, hence we need to decide how the work of adapting these models to suit the GO will be accomplished.

  • Define what the goal is. How should pathways look as completed GO-CAMs? Generate manual exemplars for several. (link to them here). This is a general need for all GO-CAM work, not limited to the pathway problem. We need more exemplar models.
  • Define a curatorial workflow that allows curators to efficiently do the work needed for the conversion to be complete. Define what the individual tasks are (e.g. select a specific member of a protein complex to assign as the enabler of a reaction/function). Decide if these require new UI components for Noctua and, if so, spec them out.
  • Define what happens (automated updates, alerts for humans, etc.) when pathways are modified at their sources.

rewrite the conversion process to Python and implement using straightforward conversion rules

The mapping process follows a set of rules that could probably be expressed in a logic programming framework such as prolog or SWRL rules. This would be beneficial as it would be easier to collaboratively develop the rules if they were first class entities in github rather than chunks of code buried in the converter app. Would also make it feasible to apply them in other contexts.

The ticket here is to investigate formalizations.

Notes from @balhoff on how to get started:
SWRLTab (in Protege) comes with its own reasoner; you don’t really need it, just edit in the Rules tab, which is built in.
Look at some of the rules that are in RO already. Then you need to run them: you can use the Arachne Protege plugin, which is kind of hacky, or just save your rules in an OWL file and use it like the other OWL files in your program.

focus enabled_by edges on gene products rather than complexes

If a complex has only one gene product in it, you can remove the small molecules/ions in the complex and just make the node a single gene product instead of a complex. For example, see the pathway “glycolysis”, the reaction “PGM:Mg2+ isomerise G6P to G1P” should just be enabled_by PGM (one gene product).
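
A minimal sketch of this simplification, assuming each part of a complex has already been labelled as a gene product or a small molecule (the labels and the function name are hypothetical):

```python
def simplify_enabler(complex_id, part_kinds):
    """Return the single gene product of a complex, or the complex itself otherwise.

    part_kinds: dict mapping each part of the complex to "gene_product" or "small_molecule".
    """
    gene_products = [part for part, kind in part_kinds.items() if kind == "gene_product"]
    if len(gene_products) == 1:
        return gene_products[0]  # drop small molecules/ions, keep the lone gene product
    return complex_id            # zero or several gene products: leave the complex as is

enabler = simplify_enabler("PGM:Mg2+", {"PGM": "gene_product", "Mg2+": "small_molecule"})
print(enabler)
```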

Infer Molecular Function = Binding

If Reactome did not provide a GO molecular function term, and there’s either 1) an output protein complex but no input protein complex, or 2) there’s an output protein complex that is different from all the input complexes, you can infer that the activity is a protein binding function. So you can label it that way (GO:0005515) instead of the root molecular function (GO:0003674).
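
The rule can be sketched in a few lines of Python. Condition 1 (an output complex but no input complex) is a special case of condition 2 (an output complex that differs from every input complex), so a single check covers both. The function name and argument shapes are hypothetical.

```python
ROOT_MF = "GO:0003674"          # root molecular function
PROTEIN_BINDING = "GO:0005515"  # protein binding

def infer_function(reactome_go_mf, input_complexes, output_complexes):
    """Pick the molecular function term for a reaction."""
    if reactome_go_mf is not None:
        return reactome_go_mf  # Reactome already supplied a GO molecular function
    inputs = set(input_complexes)
    # Fires when some output complex matches none of the input complexes,
    # which includes the case where there are no input complexes at all.
    if any(c not in inputs for c in output_complexes):
        return PROTEIN_BINDING
    return ROOT_MF

print(infer_function(None, [], ["A:B"]))
```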

validate consistency of reactome location curation

One of the rules applied during import to GO-CAM infers reaction location from entity locations.

Possible QA bonus: Reactome is supposed to do this manually and record the result as an attribute of the reaction, so the inference should match the annotation and discrepancies are likely curation mistakes.

The work in progress at Reactome to sort out "encloses" relationships among GO cell_component terms used to annotate locations of entities and reactions should be usable also to check plausibility of reactions with multiple locations.

how to represent post translational modifications

Reactome has entities like p-2T-MAP2K1 that are combinations of proteins with different modifications. For p-2T-MAP2K1 it is actually a set of two proteins with the same id (UniProt:Q02750) but with different PTMs.

Currently the GO-CAM view of this entity is a single instance of a UniProt:Q02750 without any PTM information. This leads to representations of function/reaction nodes that have the same entities as both input and output (apart from ATP->ADP).

Is this a desirable representation for GO-CAM models?

Note that the BioPAX export does contain information about the specific sequence locations that are modified. In principle, this information could be captured in the GO-CAM model.

Systematic Review of Pathway Conversion

@vanaukenk, @huaiyumi, and I will work with @deustp01 on a curatorial review of the conversion of specific pathways to see if we can identify systematic issues.
The goal will be to compare the Reactome representation, the Reactome-based GO-CAM and potentially a curator generated GO-CAM to see if the models are faithfully represented. We will have virtual meetings to discuss the models each week. For each meeting one person will report their findings and then we can discuss any issues. I will start with glycolysis.

There is a folder in the GO-CAM and Noctua folder for the reactome project. I suggest we put notes and presentations there.

Pathways claimed:

  • @huaiyumi - Signaling by BMP (R-HSA-201451.4); BMP signaling pathway (GO:0030509)
  • @huaiyumi - MAPK1/MAPK3 signaling (R-HSA-5684996.4)
  • @ukemi and @deustp01 - Glycolysis (R-HSA-70171); canonical glycolysis (GO:0061621)
  • @ukemi and @deustp01 - Gluconeogenesis (R-HSA-70263); gluconeogenesis (GO:0006094)
  • @vanaukenk - TCF dependent signaling in response to WNT (R-HSA-201681); canonical Wnt signaling pathway (GO:0060070) Not xref'd but generic pathway is.
  • @vanaukenk - Unfolded Protein Response (UPR) (R-HSA-381119); endoplasmic reticulum unfolded protein response (GO:0030968)
  • @ukemi and @deustp01 - GABA degradation (R-HSA-916853); gamma-aminobutyric acid catabolic process (GO:0009450)
  • @ukemi and @deustp01 - PINK/PARKIN Mediated Autophagy (R-HSA-5205685); mitophagy (GO:0000423) or macroautophagy (GO:0016236).

test GO OWL inference for mapping unknown terms

create reactome model collection viewable in Noctua1.0

Likely means that most entities need to be taken out of generated data. (locations, inputs, outputs, etc.) (Maybe try first by dropping locations and making inputs and outputs shared across linked reactions into single nodes instead of the one node per reaction pattern we have now.)
Pack parts that await a Noctua 2.0 display… into annotations or just leave out.

Generate report for Reactome

Show what GO classes can be inferred - show where they match existing annotations, where they differ, and where they differ if they are 'deeper'.

Convert Demo Pathways from Pathway Commons

Want to move beyond Reactome to incorporate pathways from other databases. Pathway Commons provides an aggregated resource available in BioPAX.

  • Pathway that originated in Reactome
  • Pathway that originated in WikiPathways
  • Pathway that originated in KEGG

Infer locations for activities based on their actors

Try to add locations for each activity, and connect them with the occurs_in edge. If Reactome labels the reactants and products with a GO cellular component term, and it’s the same for all reactants and products, then you can use that GO term for the activity.
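
A minimal sketch of this rule, assuming the GO cellular component term of each reactant and product has already been collected into a list (with None for unlabelled participants); the function name is hypothetical:

```python
def infer_occurs_in(participant_locations):
    """Return the shared GO cellular component term, or None if there is no single one."""
    locations = set(participant_locations)
    # Fire only when every reactant and product carries the same, known location.
    if len(locations) == 1 and None not in locations:
        return next(iter(locations))
    return None

print(infer_occurs_in(["GO:0005829", "GO:0005829", "GO:0005829"]))
```

When the participants disagree, the sketch returns None rather than guessing, which leaves the activity without an occurs_in edge.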

Define Victory

How should we decide when biopax 2 go-cam or go-cam 2 biopax is good enough to stop (and then move on to providing access to the results)?

Provides input for OR positive regulates

Currently, the converter code maps each 'nextStep' relation asserted in the input BioPAX to a 'provides_direct_input_for' relation between the reactions/MFs. nextStep is too generic for so specific a mapping. For example, when the relationship is more of a signaling relation than a metabolic relation, directly_positively_regulates is a more appropriate relation.

  • convert nextStep relations from BioPAX into RO 'Causally Upstream Of'.
  • Add a new rule: If reaction1 has_output E and reaction2 has_input E and reaction1 nextStep reaction2 then reaction1 provides_direct_input_for reaction2
  • Add another new rule: If reaction1 has_output E and reaction2 enabled_by E and reaction1 nextStep reaction2 then reaction1 directly_positively_regulates reaction2
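
The three rules above could be combined roughly as in the following Python sketch. The function name is hypothetical, and falling back to 'causally_upstream_of' only when neither specific rule fires is an assumption; the issue does not say whether the generic relation should be kept alongside the specific ones.

```python
def relations_for_next_step(r1_outputs, r2_inputs, r2_enablers):
    """Relations to assert between reaction1 and reaction2, given reaction1 nextStep reaction2."""
    produced = set(r1_outputs)
    relations = set()
    if produced & set(r2_inputs):
        # reaction1 has_output E and reaction2 has_input E
        relations.add("provides_direct_input_for")
    if produced & set(r2_enablers):
        # reaction1 has_output E and reaction2 enabled_by E
        relations.add("directly_positively_regulates")
    # Assumption: use the generic relation only when no specific rule applies.
    return relations or {"causally_upstream_of"}

print(relations_for_next_step(["E"], ["E"], []))
```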

Convert Entity_involved_in_regulation_MF to MF_regulates_MF

Reactome often connects physical entities (complexes, compounds, etc.) to reactions using regulates relationships. E.g. see https://reactome.org/PathwayBrowser/#/R-HSA-201681&SEL=R-HSA-201685&PATH=R-HSA-162582,R-HSA-195721 . Currently these come through the first pass of go-camification as: Entity involved_in_regulation_of MF triples (seen in grey). There is currently a rule that converts these into MF regulates MF relationships when the upstream MF that produces the physical entity as an output is present in the pathway. See e.g.: "Beta-catenin is released from the destruction complex" is positively regulated by 'Phosphorylation of LRP5/6 cytoplasmic domain by CSNKI'. (See also #18.)

But sometimes, the upstream function is either unknown or not in the current pathway. In these cases, create a simple 'binding' function node in its place and assert that that function regulates the targeted downstream function. Remove the entity involved in function relationship.

Add complexes with protein and small molecule part_ofs back into GO-CAMs

In the GO-CAMs generated from Reactome that are currently (circa Sept 2018) visible in the noctua-dev server, the relationships between complexes and their parts have been removed from the OWL models (though you can see them in the comments). This was done because some complexes have a very large number of parts and models containing these complexes became illegible and effectively useless in the Noctua editor. (see geneontology/noctua#581 )

In order to support continued development of the biopax import/export code as well as generally supporting models that incorporate complexes, these need to be added back into the GO-CAMs.

Test 'Bridge Model' pattern for cross-model linking.

For edges between nodes in different pathways (nextEvent, PrevEvent) add these to a separate 'Bridge' model. Test how this can be used for cross-model queries in complete RDF store.
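
One way to picture the split is the sketch below, which partitions edges into per-pathway models plus a single bridge list. The function name and data shapes are hypothetical, and how the bridge model is actually stored and queried is left open here.

```python
def split_edges(edges, pathway_of):
    """Separate within-pathway edges from cross-pathway ('bridge') edges.

    edges: list of (source, relation, target) triples.
    pathway_of: dict mapping each node to the pathway model it belongs to.
    """
    within = {}  # pathway -> edges whose two ends are both in that pathway
    bridge = []  # edges whose ends are in different pathways
    for source, relation, target in edges:
        if pathway_of[source] == pathway_of[target]:
            within.setdefault(pathway_of[source], []).append((source, relation, target))
        else:
            bridge.append((source, relation, target))
    return within, bridge

within, bridge = split_edges(
    [("r1", "nextEvent", "r2"), ("r2", "nextEvent", "r3")],
    {"r1": "pathwayA", "r2": "pathwayA", "r3": "pathwayB"},
)
print(within, bridge)
```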

This can be used by @kltm and @balhoff to develop client code and reasoner code respectively.

add test and filter for enable_by relations

Ensure that only macromolecules (protein, RNA) (not small molecules) can be the domain of the RO:enables relation.

close the issue by:

  • Adding a test to make sure enabled_by relations are valid. (Maybe this should be done in an RO definition for a subProperty of 'enabled by'?)
  • Ensuring that all Reactome-generated GO-CAMs pass the test.
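
Such a test might look like the following sketch. The entity type labels and function name are hypothetical, and treating only "protein" and "rna" as valid follows the wording of this issue; whether complexes should also pass is not settled here.

```python
MACROMOLECULE_TYPES = {"protein", "rna"}

def invalid_enablers(enabled_by_edges, entity_types):
    """Return the (activity, entity) enabled_by edges whose entity is not a macromolecule."""
    return [
        (activity, entity)
        for activity, entity in enabled_by_edges
        # Entities with an unknown type are flagged too, since .get() returns None.
        if entity_types.get(entity) not in MACROMOLECULE_TYPES
    ]

bad = invalid_enablers(
    [("mf1", "PGM"), ("mf2", "Mg2+")],
    {"PGM": "protein", "Mg2+": "small_molecule"},
)
print(bad)
```

A generated model would pass the test when the returned list is empty.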

cleanse generated models of tbox insertions wherever possible - work with neo..

I've put it off long enough. Now that inferences are starting to matter more, time to get rid of the tbox assertions in the converted models (just subclassOfs..) and use goplus and neo etc. directly. Early tests look promising.

Need to make some more test cases to make sure this doesn't screw up the sparql-based rules.
