geneontology / neo
noctua entity ontology
Hi - I'm unable to find a C. elegans gene, cyk-7, that I'm trying to annotate in Noctua.
It is not available in the autocomplete in the form or graph editor.
Here is its entry in our gpi file:
WB WBGene00015591 cyk-7 CYtoKinesis defect CELE_C08C3.4 gene taxon:6239 UniProtKB:P34325
That can be found here:
ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz
Is the C. elegans gpi file being loaded into NEO?
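One quick way to confirm that an entry really is present in the source file before suspecting the NEO load is to scan the GPI directly. The sketch below assumes the tab-separated, ten-column GPI 1.x layout; the column names in `GPI_COLUMNS` and the helper names are illustrative and not part of NEO, and the tab positions in the sample line are a guess at how the flattened line above was originally split.

```python
from typing import Iterable, Optional

# Assumed GPI 1.x column order; adjust if the file declares another version.
GPI_COLUMNS = [
    "db", "object_id", "symbol", "name", "synonyms",
    "object_type", "taxon", "parent_id", "xrefs", "properties",
]


def parse_gpi_line(line: str) -> Optional[dict]:
    """Return a dict for one GPI data line, or None for headers and blanks."""
    if not line.strip() or line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so trailing optional columns become empty strings.
    fields += [""] * (len(GPI_COLUMNS) - len(fields))
    return dict(zip(GPI_COLUMNS, fields))


def find_in_gpi(lines: Iterable[str], object_id: str) -> Optional[dict]:
    """Return the first record whose object_id matches, else None."""
    for line in lines:
        record = parse_gpi_line(line)
        if record is not None and record["object_id"] == object_id:
            return record
    return None


sample = [
    "!gpi-version: 1.2\n",
    "WB\tWBGene00015591\tcyk-7\tCYtoKinesis defect\tCELE_C08C3.4\tgene\ttaxon:6239\t\tUniProtKB:P34325\n",
]
record = find_in_gpi(sample, "WBGene00015591")
```

For the real file, the same function can be fed an open handle, e.g. `gzip.open(path, "rt")`, instead of the in-memory list.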
We have discussed a similar issue with another gene in a separate ticket, but I don't think this ever got resolved:
#580
I'm not sure which is the best tracker for this issue, but am starting with NEO.
If a mouse ncRNA gene is used as an enabling entity or an input to a BP or MF, the nodes are being flagged by the ShEx validator because these ncRNA gene identifiers are not recognized as valid annotation objects.
In the MGI gpi file, these genes are typed as ncRNA genes using SO:0001263 according to the GPI2.0 spec.
@hdrabkin
@ukemi
@kltm
@balhoff
Note: if I check one of these gene ids, e.g. MGI:2676885, in noctua-amigo, the graph view seems to show the correct parentage.
To be compliant with the ShEx specifications for 'happens during' and to allow for curation using various life stage ontologies, we want to make sure that we are importing external life stage ontologies (e.g. WBls, PO) where needed.
For reference, see: geneontology/go-shapes#137
In the Makefile, it looks like rgd is included in the list of sources for annotatable entity identifiers.
However, not all identifiers in the RGD GAF seem to be available for annotation in Noctua.
For example, RGD:1309181 has eight annotations in the current RGD GAF, but isn't in the autocomplete menu as an option. The eight annotations are seven IEAs and one ISO.
Do all entries for a group's GAF get included in NEO or is there some filtering step somewhere that excludes some?
Note that RGD does not submit gpad/gpi yet according to the rgd.yaml file.
This repo has code for taking all GPIs, converting to obo/rdf/owl and publishing via https://build.berkeleybop.org/job/build-noctua-entity-ontology/
This should all be done as part of the main go annotation pipeline
I want to create an annotation to linc.terminator (ZFIN:ZDB-LINCRNAG-190911-1), but this gene is not available in Noctua. I am using the new Noctua form, but the issue is the same in the graph editor.
Due to recent upstream changes, NEO may now be getting a much larger number of GOA tRNA IEA annotations. This is related to the immediate handling of the GOA tRNA increase: geneontology/go-site#1185
It may be that NEO needs to pre-filter or we get a fix upstream sooner rather than later.
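If the pre-filter route were taken, it could look roughly like the following. This is only a sketch: it assumes GAF 2.x column positions (evidence code in column 7, object type in column 12) and assumes the affected rows carry the literal type string `tRNA`; neither the function names nor the filtering rule come from the NEO codebase.

```python
from typing import Iterable, Iterator


def is_trna_iea(gaf_line: str) -> bool:
    """True for a GAF data line that is an IEA annotation on a tRNA."""
    if gaf_line.startswith("!"):
        return False  # header/comment lines are never filtered
    cols = gaf_line.rstrip("\n").split("\t")
    if len(cols) < 12:
        return False  # malformed or truncated line; leave it for other checks
    evidence_code = cols[6]   # column 7 in the GAF 2.x layout (assumed)
    object_type = cols[11]    # column 12 in the GAF 2.x layout (assumed)
    return evidence_code == "IEA" and object_type == "tRNA"


def prefilter(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line except tRNA IEA annotations."""
    for line in lines:
        if not is_trna_iea(line):
            yield line
```

Whether such rows should be dropped entirely, or only excluded from the entity list, is the open question in this ticket.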
Tagging @cmungall
Sometime between Nov 21st and Nov 27th, a change occurred in NEO (or something it brings in) that prevents the build with:
12:15:44 Exception in thread "main" org.semanticweb.owlapi.model.OWLOntologyStorageException: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:Q06787-11 id( UniProtKB:Q06787-11)name( FMR1 Hsap)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProductIsoform)synonym( FMR1 RELATED)synonym( FMR1 BROAD)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)name( Fmr1 isoform 11 Rnor)synonym( Q06787-11 RELATED)relationship( in_taxon NCBITaxon:9606)relationship( in_taxon NCBITaxon:10116)relationship( has_gene_template UniProtKB:Q06787)is_a( RGD:2623)is_a( CHEBI:36080))
12:15:44 at org.semanticweb.owlapi.oboformat.OBOFormatRenderer.render(OBOFormatRenderer.java:90)
12:15:44 at org.semanticweb.owlapi.oboformat.OBOFormatStorer.storeOntology(OBOFormatStorer.java:42)
12:15:44 at org.semanticweb.owlapi.util.AbstractOWLStorer.storeOntology(AbstractOWLStorer.java:155)
12:15:44 at org.semanticweb.owlapi.util.AbstractOWLStorer.storeOntology(AbstractOWLStorer.java:119)
12:15:44 at uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.saveOntology(OWLOntologyManagerImpl.java:1525)
12:15:44 at uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.saveOntology(OWLOntologyManagerImpl.java:1502)
12:15:44 at owltools.io.ParserWrapper.saveOWL(ParserWrapper.java:289)
12:15:44 at owltools.io.ParserWrapper.saveOWL(ParserWrapper.java:209)
12:15:44 at owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:3712)
12:15:44 at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:76)
12:15:44 at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:68)
12:15:44 at owltools.cli.CommandLineInterface.main(CommandLineInterface.java:12)
12:15:44 Caused by: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:Q06787-11 id( UniProtKB:Q06787-11)name( FMR1 Hsap)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProductIsoform)synonym( FMR1 RELATED)synonym( FMR1 BROAD)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)name( Fmr1 isoform 11 Rnor)synonym( Q06787-11 RELATED)relationship( in_taxon NCBITaxon:9606)relationship( in_taxon NCBITaxon:10116)relationship( has_gene_template UniProtKB:Q06787)is_a( RGD:2623)is_a( CHEBI:36080))
12:15:44 at org.obolibrary.oboformat.model.Frame.checkMaxOneCardinality(Frame.java:424)
12:15:44 at org.obolibrary.oboformat.model.Frame.check(Frame.java:405)
12:15:44 at org.obolibrary.oboformat.model.OBODoc.check(OBODoc.java:390)
12:15:44 at org.obolibrary.oboformat.writer.OBOFormatWriter.write(OBOFormatWriter.java:183)
12:15:44 at org.semanticweb.owlapi.oboformat.OBOFormatRenderer.render(OBOFormatRenderer.java:88)
12:15:44 ... 11 more
12:15:44 Makefile:27: recipe for target 'neo.obo' failed
12:15:44 make: *** [neo.obo] Error 1
https://build.geneontology.org/job/geneontology/job/pipeline/job/issue-35-neo-test/97/console
Tagging @balhoff
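The frame in the trace carries two `name` tags for the same identifier (`FMR1 Hsap` and `Fmr1 isoform 11 Rnor`), which is what the OBO writer rejects. A small pre-flight check along these lines could flag such identifiers before the merge step; it is a sketch with hypothetical names, and it assumes the (id, name) pairs have already been extracted from the per-source OBO files.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_name_conflicts(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each entity id to its distinct names, keeping only ids with more than one."""
    names: Dict[str, List[str]] = defaultdict(list)
    for entity_id, name in pairs:
        if name not in names[entity_id]:
            names[entity_id].append(name)
    return {eid: ns for eid, ns in names.items() if len(ns) > 1}


pairs = [
    ("UniProtKB:Q06787-11", "FMR1 Hsap"),
    ("UniProtKB:Q06787-11", "Fmr1 isoform 11 Rnor"),
    ("UniProtKB:Q06787", "FMR1 Hsap"),
]
conflicts = find_name_conflicts(pairs)
```

Here `conflicts` contains only `UniProtKB:Q06787-11`, the identifier that appears in the failing frame.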
Notice to @vanaukenk
From @pgaudet
Patrick pointed out that while the chains are available, their labels don’t show up – only the identifier. Can this be fixed?
From @balhoff :
This will be successful if the NEO ontology contains a term with label "nsp4 Scov2" and its IRI looks like http://identifiers.org/uniprot/P0DTD1-PRO_0000449622 instead of http://purl.obolibrary.org/obo/UniProtKB#_P0DTD1-PRO_0000449622.
Also, when this is loaded in Minerva, this model should no longer have gene products without labels.
RNAC provides various downloads. The GFF seems most complete. However, this doesn't seem to include MOD mappings. Where do these come from?
When transforming gpi to neo, we normalize the labels. This is because, in the past, unusual non-ASCII characters have slipped in and broken things downstream (we need to report these upstream).
This is currently too strict, e.g. we strip /, resulting in:
id: PR:000037785
name: mEPRSPhos1 Mmus
There should be a slash in the label
For now @ukemi, just type the string without the slash (sorry)
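A less strict normalizer might look like the sketch below. It is not the rule used by gpi2obo.pl; the allowed character set is an assumption chosen only to show that non-ASCII and control characters can be removed while `/` is kept, and the labels in the example are made up.

```python
import re
import unicodedata

# Assumed whitelist: letters, digits, whitespace, and common label punctuation,
# including the slash that the current normalization removes.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s_\-./:()+,'\[\]]")


def normalize_label(raw: str) -> str:
    """Drop non-ASCII and control characters but keep ordinary punctuation."""
    # Decompose accented characters, then discard anything outside ASCII.
    ascii_only = (
        unicodedata.normalize("NFKD", raw)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = _DISALLOWED.sub("", ascii_only)
    # Collapse runs of whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", cleaned).strip()


kept_slash = normalize_label("abc/def  Mmus")      # hypothetical label
stripped = normalize_label("caf\u00e9\u0007 1")    # accent and control char
```

With this whitelist the slash survives, so curators could type the label as it appears in the source database.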
We're using, e.g. http://identifiers.org/wormbase/WBGene00000001, however this doesn't resolve. Biolink model is using https://identifiers.org/wb/WBGene00000001, which does resolve. Changing this would probably require a batch replacement within Noctua models.
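The batch replacement mentioned above amounts to a prefix rewrite on every affected IRI. A minimal sketch, assuming the only change needed is the prefix swap shown in this ticket (the mapping table and function name are illustrative, and how IRIs are read from and written back to Noctua models is out of scope here):

```python
from typing import Dict

# Old prefix -> new prefix, taken from the two IRIs quoted in this ticket.
PREFIX_REWRITES: Dict[str, str] = {
    "http://identifiers.org/wormbase/": "https://identifiers.org/wb/",
}


def rewrite_iri(iri: str, rewrites: Dict[str, str] = PREFIX_REWRITES) -> str:
    """Return the IRI with a known old prefix replaced, or unchanged."""
    for old, new in rewrites.items():
        if iri.startswith(old):
            return new + iri[len(old):]
    return iri


updated = rewrite_iri("http://identifiers.org/wormbase/WBGene00000001")
```

IRIs that do not start with a listed prefix pass through untouched, so the function can be applied to every IRI in a model without a separate filter.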
0.5m entries, we are already at 0.75m
Messaging with @hdrabkin
It appears that MGI miRNA identifiers are not available in Noctua.
I've checked on noctua-amigo and can't find them there either.
Here's an example:
MGI MGI:3711324 Mir291a microRNA 291a mmu-mir-291a|Mirn291a gene taxon:10090
This will make it easier to link the UniProt data with the GO (A) data on RDF and OWL level.
Mostly, it will make it easier for us to introduce Noctua-compatible modelling for UniProt->GO term relations, with the benefit that users loading both datasets do not get duplicate triples just because we don't use the same IRIs.
@hdrabkin showed me a model this morning where the ShEx is not validating PRO identifiers as chemical entities, i.e. 'has input' PR:nnnnnnnnnnn for an MF gives a ShEx validation error.
How are PRO identifiers currently typed in neo?
From @hdrabkin
Newer PRO ids are still not getting into Noctua although they are in the GPI file.
Examples:
===
cc @lpalbou
after discussion with @thomaspd this is the way to go
Currently, the NEO ontology build fails with errors like:
11:34:02 [Fatal Error] :1:1: Content is not allowed in prolog.
11:34:02 2021-05-24 11:34:02,183 ERROR (CommandRunner:4815) could not parse:target/neo-wb.obo
11:34:02 org.semanticweb.owlapi.io.UnparsableOntologyException: Problem parsing file:/var/lib/jenkins/workspace/ology_pipeline_issue-35-neo-test/neo/target/neo-wb.obo
For examination, I've temporarily grabbed that neo-wb.obo file and made it available here: http://skyhook.berkeleybop.org/neo-wb.obo
It seems like there may be a related WormBase issue (an expansion to the WB GPI that happened in the right timeframe), but I've been unable to find it again; in my notes I have "WormBase/website/issues/8222", but this doesn't seem to correspond to anything. @vanaukenk, would you maybe know the correct public reference for this?
Tagging @balhoff @vanaukenk
E.g. in the MGI GAF
MGI MGI:1917015 1500004F05Rik GO:0008150 MGI:MGI:2156816|GO_REF:0000015 ND P RIKEN cDNA 1500004F05 gene gene taxon:10090 20120430 MGI
MGI MGI:1923755 1500009C09Rik GO:0003674 MGI:MGI:2156816|GO_REF:0000015 ND F RIKEN cDNA 1500009C09 gene protein taxon:10090 20100209 MGI VEGA:OTTMUSP00000045521
What makes the 2nd one a protein and the 1st a gene?
The page on JAX is kind of odd
http://www.informatics.jax.org/marker/MGI:1923755
"Feature Type protein coding gene"
Yet it's an ortholog of a lincRNA
I guess that's what conflict means
It looks like a conflict results in the type field in GO being 'protein' rather than 'gene'. But this is weird, as the conflict apparently arises from the fact that this is an ncRNA...?
Either way: don't trust the type field in the MGI GAF
In NEO, terms use OBO PURLs like http://purl.obolibrary.org/obo/MGI_MGI%3A1336172, but the autocomplete in Noctua produces terms like http://www.informatics.jax.org/accession/MGI:MGI:1336172 (which I assume comes from Golr). This is a problem for using NEO to get taxon metadata or to query models for instances of molecular entity.
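Until the two sides agree on one IRI scheme, a workaround is to compare entities by CURIE rather than by IRI. The sketch below covers only the two IRI shapes quoted in this ticket; the function name is hypothetical, and the OBO branch splits on the first underscore, so it would mis-handle prefixes that themselves contain an underscore (such as `GR_gene`).

```python
from typing import Optional
from urllib.parse import unquote

OBO_BASE = "http://purl.obolibrary.org/obo/"
MGI_ACCESSION_BASE = "http://www.informatics.jax.org/accession/"


def iri_to_curie(iri: str) -> Optional[str]:
    """Convert a known IRI shape to a CURIE, or return None if unrecognized."""
    if iri.startswith(MGI_ACCESSION_BASE):
        # The accession URL already ends in the CURIE, e.g. MGI:MGI:1336172.
        return iri[len(MGI_ACCESSION_BASE):]
    if iri.startswith(OBO_BASE):
        # Decode %3A back to ':' and turn the first '_' into the prefix separator.
        local = unquote(iri[len(OBO_BASE):])
        prefix, sep, rest = local.partition("_")
        if sep and rest:
            return f"{prefix}:{rest}"
    return None


neo_curie = iri_to_curie("http://purl.obolibrary.org/obo/MGI_MGI%3A1336172")
golr_curie = iri_to_curie("http://www.informatics.jax.org/accession/MGI:MGI:1336172")
```

Both calls give the same CURIE, so a lookup keyed on CURIEs would match the NEO term to the autocomplete result.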
NEO now fails on:
12:05:04 gzip -dc mirror/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz | ./gpi2obo.pl -s Cele -n wb > target/neo-wb.obo.tmp && mv target/neo-wb.obo.tmp target/neo-wb.obo
12:05:04 make: *** No rule to make target 'target/neo-gramene_oryza.obo', needed by 'all_obo'. Stop.
[Pipeline] }
12:05:04 ERROR: script returned exit code 2
cc @hdietze
@cmungall I mentioned during the hackathon that some GPs have several recommended names (rdfs:label), which should not be the case (at least given the same language), since we have synonyms (oboInOwl:hasExact/BroadSynonym) for that.
Example from RGD (NEO metadata generated during GAF conversion):
SELECT * WHERE { <http://identifiers.org/rgd/1304707> rdfs:label ?label }
-> has Lrfn1 Rnor and Lrfn1
Example from MGI (NEO metadata generated using GPI):
SELECT * WHERE { <http://identifiers.org/mgi/MGI:3588192> rdfs:label ?label }
-> has 3 rdfs:label (Rtl4 Mmus, Rtl4, zcchc16 Mmus)
In the case of this MGI, the GPI file indicates Rtl4 for the name, and other things are synonyms:
MGI MGI:3588192 Rtl4 retrotransposon Gag like 4 C230031A03Rik|Mar4|Zcchc16 gene taxon:10090 UniProtKB:Q3URY0
Fixing that will ensure that we retrieve a single (and correct) recommended name for each GP as for the moment it's not certain.
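One way to guarantee a single recommended name is to pick it deterministically and demote every other candidate to a synonym. The sketch below assumes the NEO convention of appending a species code to the symbol (as in `Rtl4 Mmus`); whether the recommended name should instead be the bare GPI symbol is not settled here, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def resolve_labels(
    symbol: str, species_code: str, candidates: Iterable[str]
) -> Tuple[str, List[str]]:
    """Return (single label, synonyms) for one gene product.

    The label is built from the symbol and species code; every other
    candidate string is kept once, in input order, as a synonym.
    """
    label = f"{symbol} {species_code}"
    synonyms: List[str] = []
    for candidate in candidates:
        if candidate != label and candidate not in synonyms:
            synonyms.append(candidate)
    return label, synonyms


# The three rdfs:label values reported above for MGI:3588192.
label, synonyms = resolve_labels("Rtl4", "Mmus", ["Rtl4 Mmus", "Rtl4", "zcchc16 Mmus"])
```

After this step each entity has exactly one label, so a query for `rdfs:label` returns a single row.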
SGD is currently only providing UniProt in the GPI in the metadata, taken directly from protein2go upstream as a "stub". At the time, SGD was not using Noctua beyond basic experimentation and it was decided that the stub was more information than none. Now that SGD is giving Noctua more use, the obvious identifier issue has come up and needs to be fixed as we proceed with more serious annotation.
As an example, in the Noctua Form, e.g. "STE3 Scer" pops up with the UniProt ID instead of the SGD ID, so the "search database" doesn't work.
The potential fix in this case: GO can derive a GPI from some other file.
Currently, NEO integrates metadata from Gene Product either through GPI files (when provided) or through GAF.
During the data ingestion stage, NEO should ensure that each Gene Product has a uniprot xref link for rapid access to additional meta data. This is especially useful for displaying tooltips on mouseover that can instantly fetch data from the uniprot REST API or SPARQL endpoint.
All PRO IDs should be available in Noctua (ids for human, pombe, etc.). Can these be loaded from PRO directly (could PRO supply a GPI file)?
For the ontology autocomplete fields in Noctua it would be good to just use is-a closure.
This would prevent potential confusion about what terms show up in the autocomplete menu and also possible annotation errors.
An example is what currently happens when typing in 'mechano' in the BP field of the form:
MF terms are also returned in this search due to the 'part of' relation between some MF and BP terms in the ontology.
See also:
geneontology/noctua-form#34
geneontology/noctua-form#19
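The difference between the two closures can be seen on a toy graph. The sketch below uses made-up term identifiers and a plain dictionary of edges; it is not how Golr or Noctua compute closures, only an illustration of why restricting traversal to `is_a` keeps MF terms out of a BP search.

```python
from typing import Dict, Iterable, List, Set, Tuple

# term -> list of (relation, parent); all identifiers here are hypothetical.
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "MF:mechano_activity": [("is_a", "MF:root"), ("part_of", "BP:mechano_process")],
    "BP:mechano_process": [("is_a", "BP:root")],
}


def ancestors(
    term: str,
    edges: Dict[str, List[Tuple[str, str]]],
    relations: Iterable[str] = ("is_a",),
) -> Set[str]:
    """Collect all ancestors reachable over the given relation types."""
    allowed = set(relations)
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in edges.get(current, []):
            if relation in allowed and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


isa_only = ancestors("MF:mechano_activity", EDGES)
with_part_of = ancestors("MF:mechano_activity", EDGES, ("is_a", "part_of"))
```

With `is_a` only, the MF term never reaches `BP:root`, so it would not be offered in a field pinned to BP; once `part_of` is followed, it does.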
Generally, we'd like to have some more objective measure of whether a build of NEO (and the autocomplete/total entity space) has what we think it has. This applies to both the ontology builds and the load into Solr.
As a starting point, we'd like to:
Ideally, this would slot into having some script to execute them against a product (owl? solr index?) in a pipeline so that failure can prevent publication
While not strictly NEO, we can start there and get a lot of work done. We can start with identifiers listed in #51 #52 #53.
Tagging @vanaukenk @goodb for feedback.
See also #9
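A first version of such a check could simply compare a list of expected identifiers against the ids present in a built OBO file and fail the pipeline step if any are missing. The sketch below is an assumption about what that script might look like: the function names are hypothetical and the OBO parsing is deliberately naive (it reads only `id:` lines).

```python
from typing import Iterable, List, Set


def obo_ids(lines: Iterable[str]) -> Set[str]:
    """Collect the values of all 'id:' tags in an OBO-format stream."""
    ids: Set[str] = set()
    for line in lines:
        if line.startswith("id: "):
            ids.add(line[len("id: "):].strip())
    return ids


def missing_ids(lines: Iterable[str], expected: Iterable[str]) -> List[str]:
    """Return the expected identifiers that are absent, sorted for stable output."""
    return sorted(set(expected) - obo_ids(lines))


sample_obo = [
    "[Term]\n",
    "id: PR:000037785\n",
    "name: mEPRSPhos1 Mmus\n",
    "\n",
    "[Term]\n",
    "id: MGI:MGI:3588192\n",
]
absent = missing_ids(sample_obo, ["PR:000037785", "MGI:MGI:3711324"])
```

In a pipeline, a non-empty result would be turned into a non-zero exit code so that publication is blocked; the same list of expected identifiers could later be run against the Solr index.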
Currently there is no way to autocomplete off of non-gene types in the enabled-by field in the annoton box in Noctua. This is because the field is (rightly) pinned to subclasses of CHEBI:23367 ! molecular entity
Currently neo does not import the connecting axioms from PRO:
/ BFO:0000040 ! material entity
is_a BFO:0000030 ! object
is_a CHEBI:23367 ! molecular entity
is_a CHEBI:50047 ! organic amino compound
is_a PR:000018263 ! amino acid chain
is_a PR:000000001 ! protein ***
The bridging axioms should be added to neo
until then, @ukemi should use the "Add individual" box (sorry)
Note: we currently use the ref gaf, which gives us the desired GCRP. We will need to merge the ref gaf plus RNAs from the complete GAF.
Didn't mean to open this without comment. Oops.
Anyway, I'm trying to use neo to create a combined owl file to load in noctua from 2 gafs, which have the following in their first 3 columns:
GR_gene GR:0101186 Zep1
GR_gene GR:0101186 ZEP1
The only difference is the case. If I manually change one of them, make successfully completes.
The error I get on a failure:
Exception in thread "main" org.semanticweb.owlapi.model.OWLOntologyStorageException: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(http://purl.obolibrary.org/obo/GR_gene_GR%3A0101186 id( http://purl.obolibrary.org/obo/GR_gene_GR%3A0101186{}[])relationship( in_taxon NCBITaxon:4530{}[])name( ZEP1 Oryz{}[])synonym( ZEP1 BROAD{}[NCBITaxon:4530 ])synonym( Os04g0448900 Oryz EXACT{}[])synonym( micro RNA 806a Oryz EXACT{}[])synonym( Zeaxanthin epoxidase 1 Oryz EXACT{}[])synonym( Zep1 BROAD{}[NCBITaxon:4530 ])name( Zep1 Oryz{}[])is_a( CHEBI:23367{}[]))
at org.coode.owlapi.oboformat.OBOFormatRenderer.render(OBOFormatRenderer.java:79)
at org.coode.owlapi.oboformat.OBOFormatStorer.storeOntology(OBOFormatStorer.java:74)
at org.semanticweb.owlapi.util.AbstractOWLOntologyStorer.storeOntology(AbstractOWLOntologyStorer.java:211)
at uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.saveOntology(OWLOntologyManagerImpl.java:1040)
at uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.saveOntology(OWLOntologyManagerImpl.java:1021)
at owltools.io.ParserWrapper.saveOWL(ParserWrapper.java:265)
at owltools.io.ParserWrapper.saveOWL(ParserWrapper.java:213)
at owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:2922)
at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:76)
at owltools.cli.CommandRunner.run(CommandRunner.java:237)
at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:68)
at owltools.cli.CommandRunner.run(CommandRunner.java:237)
at owltools.cli.CommandLineInterface.main(CommandLineInterface.java:12)
Caused by: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(http://purl.obolibrary.org/obo/GR_gene_GR%3A0101186 id( http://purl.obolibrary.org/obo/GR_gene_GR%3A0101186{}[])relationship( in_taxon NCBITaxon:4530{}[])name( ZEP1 Oryz{}[])synonym( ZEP1 BROAD{}[NCBITaxon:4530 ])synonym( Os04g0448900 Oryz EXACT{}[])synonym( micro RNA 806a Oryz EXACT{}[])synonym( Zeaxanthin epoxidase 1 Oryz EXACT{}[])synonym( Zep1 BROAD{}[NCBITaxon:4530 ])name( Zep1 Oryz{}[])is_a( CHEBI:23367{}[]))
at org.obolibrary.oboformat.model.Frame.checkMaxOneCardinality(Frame.java:383)
at org.obolibrary.oboformat.model.Frame.check(Frame.java:357)
at org.obolibrary.oboformat.model.OBODoc.check(OBODoc.java:344)
at org.obolibrary.oboformat.writer.OBOFormatWriter.write(OBOFormatWriter.java:205)
at org.coode.owlapi.oboformat.OBOFormatRenderer.render(OBOFormatRenderer.java:76)
... 12 more
Makefile:28: recipe for target 'neo.obo' failed
Let me know if I need to supply any more info.
We need to ensure that CURIEs are synced between noctua entities and other projects
cc @vanaukenk
We recurated the uniprot goa GPI for sars-cov-2 for another project, we should use this one in Neo
https://github.com/Knowledge-Graph-Hub/kg-covid-19/blob/master/curated/ORFs/uniprot_sars-cov-2.gpi
Background:
geneontology/go-site#1431
Update with new XAO ontology available in go-lego.
http://purl.obolibrary.org/obo/go/extensions/go-lego.owl
seems to break tools, links and my brain ;)
As discussed in Berkeley October 2019, define a new neo build that contains the upper-level classes required to support inferences in Minerva.
@cmungall fill in details from board...
Failing - switch to docker?
The NEO build depends on the remote asset datasets.json, pushed from the now defunct build.berkeleybop.org.
Problematic:
datasets.json: trigger
wget http://s3.amazonaws.com/go-public/metadata/datasets.json -O $@ && touch $@
noticed by @rachhuntley