pathwaycommons / grounding-search
A biological entity grounding search service
License: MIT License
Background: There are a few good reasons to re-examine indexing UniProt, but first there are some bugs to iron out...
Issue: Indexing fails because it is assumed that various name fields exist
Reproduce: npm run update:uniprot
The UniProt data is also too big and slow to index.
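A defensive accessor along these lines could avoid the hard assumption that name fields exist. This is only a sketch; the field names (`name`, `proteinName`, `geneNames`) are illustrative, not the actual parser output:

```javascript
// Sketch: collect name fields that may be absent in some UniProt records,
// instead of assuming they always exist. Field names are illustrative.
const getNames = entry => {
  const names = [];
  const push = v => {
    if (typeof v === 'string' && v.length > 0) { names.push(v); }
  };

  push(entry.name);
  push(entry.proteinName);
  (entry.geneNames || []).forEach(push); // tolerate a missing array

  return names;
};
```

With a helper like this, a record missing every name field simply yields an empty list instead of throwing during indexing.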
The datasource import tests are failing. See https://travis-ci.org/PathwayCommons/grounding-search/jobs/526196412
ID handling has been improved, so any hardcoded ChEBI IDs need to be updated in the tests. ChEBI IDs that are formatted like 'CHEBI:123' should just use `{ namespace: 'chebi', id: '123' }` from now on. The `_id` is now `CHEBI:123` or `NCBI:456`.
It looks like some other tests for the datasources are failing for reasons unknown to me. Maybe the tests are too strict / brittle? They shouldn't check for deep equality on the element JSON, for example.
For e. coli and yeast strains, merge all entities that have the exact same name.
For example, there are many entries for "CcdB". Each entry is basically the same except for the taxonomy ID. Here is a sample:
{
"namespace": "ncbi",
"type": "protein",
"id": "39521901",
"organism": "562",
"name": "ccdB",
"synonyms": [
"C7V14_00585",
"type II toxin-antitoxin system toxin CcdB"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "39524440",
"organism": "562",
"name": "ccdB",
"synonyms": [
"EJC48_00625",
"type II toxin-antitoxin system toxin CcdB"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "39529410",
"organism": "562",
"name": "ccdB",
"synonyms": [
"U14A_A00031",
"type II toxin-antitoxin system toxin CcdB"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "8877686",
"organism": "573",
"name": "ccdB",
"synonyms": [
"CcdB toxin protein"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "39650970",
"organism": "621",
"name": "ccdB",
"synonyms": [
"type II toxin-antitoxin system toxin CcdB"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "39651896",
"organism": "622",
"name": "ccdB",
"synonyms": [
"type II toxin-antitoxin system toxin CcdB"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "9538168",
"organism": "562",
"name": "ccdB",
"synonyms": [
"plasmid maintenance protein",
"toxin component",
"plasmid maintenance protein; toxin component"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
}
For each entry from NCBI that is associated with a strain taxon ID (e.g. 562):
- Collect the strain taxon IDs into `entry.organisms`, avoiding duplicates.
- Collect the merged entries' IDs into `entry.ids`, avoiding duplicates.
- Keep the `organism` field normalised.

Update the ranking algorithm: The ranking w.r.t. `organismOrdering` should consider the best match of `entry.organism` and `entry.organisms`.
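The per-entry merge could be sketched roughly like this; the entry shape follows the CcdB sample above, but the function and its grouping key are assumptions, not the service's actual merge code:

```javascript
// Sketch: group entries by (lowercased) name, keep the first entry of each
// group as the root, and collect the taxon IDs and entity IDs of the others
// into `organisms` / `ids`, without duplicates.
const mergeByName = entries => {
  const byName = new Map();

  for (const entry of entries) {
    const key = entry.name.toLowerCase();
    const root = byName.get(key);

    if (root == null) {
      byName.set(key, Object.assign({}, entry, {
        organisms: [entry.organism],
        ids: [entry.id]
      }));
    } else {
      if (!root.organisms.includes(entry.organism)) { root.organisms.push(entry.organism); }
      if (!root.ids.includes(entry.id)) { root.ids.push(entry.id); }
    }
  }

  return Array.from(byName.values());
};
```

Applied to the CcdB sample, the seven entries would collapse into one root entry carrying all the strain taxon IDs and entity IDs.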
Use the namespace registered with MIRIAM for each of the collections:
Name | Namespace |
---|---|
NCBI Gene | ncbigene |
NCBI Protein | ncbiprotein |
UniProt Knowledgebase | uniprot |
ChEBI | CHEBI |
This will reduce confusion and increase the 'linkability' of data across projects.
If a quality test case fails, that info could inform improvements, especially if there is a pattern. My suggestion of how to report:
I've re-indexed the updated master branch and noticed that most of the tests for NCBI top genes for E. coli (taxon id: 83333) are failing for 'search' and/or 'get'.
For instance, `null` is returned from a POST to `/get` for the E. coli "ssb" entry in NCBI:
{
"id": "1263584",
"namespace": "ncbi"
}
Possibly related? When I try to `/get` E. coli "Ccdb", something comes back, but the `organisms` values are a mix of types (number and string):
POST body:
{
"id": "1263593",
"namespace": "ncbi"
}
response:
{
"namespace": "ncbi",
"type": "protein",
"id": "1263593",
"organism": "83333",
"organismName": "Escherichia coli",
"name": "ccdB(letD)",
"synonyms": [
"Fpla045",
"hypothetical protein"
],
"organisms": [
562,
"83333"
],
"ids": [
"1263593"
]
}
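A normalisation pass like the following sketch would make the `organisms` values a consistent type; the function name is illustrative:

```javascript
// Sketch: normalise the `organisms` field so that every taxon ID is a
// string, since the response above mixes numbers (562) and strings ("83333").
const normaliseOrganisms = entry => Object.assign({}, entry, {
  organisms: (entry.organisms || []).map(String)
});
```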
To append to patches
namespace | name | id | synonyms |
---|---|---|---|
ncbi | NSP3 | 1802476807 | PLpro |
ncbi | NSP5 | 1802476809 | 3CLpro |
They're a bit out of date as compared to config.js.

- `search()` and `get()` will not need an index to differentiate datasources, but they will need a `namespace` filter.
- `clear()` and `update()` need to be aware of other data. `clear()` should only remove the data for the particular namespace rather than the whole index.
- In `uniprot.update()`, we should probably be inserting each entry into Elasticsearch, or entries in small batches, so that we don't maintain a large `entries` array.
- Use the `_score` result for the organism tie-breaking logic. Because the result size is relatively small (20-50, say), it's probably fine to do this outside of Elasticsearch. I don't see a straightforward way to incorporate our organism tie-breaking into an Elasticsearch query.

GitHub Actions provides for (semi-)automated workflows to accomplish tasks following events (e.g. tagging or pushing). We can leverage this to (semi-)automate our software deployments, namely through Docker images.
The following attempts to summarize the different cases and features we'd like:
Instance | Event | Git Reference | Jobs |
---|---|---|---|
development | push (i.e. merge PR) | master branch | npm: lint; Docker: build, push (Docker Hub), refresh host |
production | workflow_dispatch (i.e. manual) | tag | npm: lint; Docker: build, push (Docker Hub), refresh host |
See:
There is a license file, but no license is declared (e.g. in `package.json`).
Bump to 0.4.0
0.2.1
The test cases need to be reviewed. There are 898 test cases that evaluate the quality of the search results. @jvwong has done a great job on the first pass of this data, but it's a lot of data and it's easy to make small mistakes.
So, we should have each of the test cases reviewed to make sure that we have a solid basis for moving forward with improving the quality of the search. So let's divvy it up:
Steps:
- Review the test cases in test/util/data/molecular-cell.json.
- Check that the right `organismOrdering` is specified for the paper. If the paper is about one organism, say human, it should be `"organismOrdering": [9606]`. If the paper is about human mostly but also a little about mouse, then it should be `"organismOrdering": [9606, 10090]`.

Currently, we prefer to list uncharged entities rather than charged ones. This preference should be reversed for elements that are almost always ions in a user's paper:
Background: The xml-parser provides an `omitList` that is intended to skip certain tags by name.
Issue: The parser consults the `omitList` upon encountering an opening tag, but not the closing tag. This leads to skipping large sections of the tree.
Prerequisite for #88
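One possible shape for the fix, tracking depth so that the closing tag of an omitted subtree is also consumed; the SAX-style `onOpen`/`onClose` hooks are assumptions, not the actual xml-parser API:

```javascript
// Sketch of the fix: track how deep we are inside an omitted tag, and
// consult the omitList on BOTH the opening and the closing tag. The
// onOpen/onClose hooks here are illustrative, not the real parser API.
const makeOmitFilter = omitList => {
  const omitted = new Set(omitList);
  let depth = 0; // > 0 while inside an omitted subtree

  return {
    onOpen: name => {
      if (depth > 0 || omitted.has(name)) { depth++; }
      return depth === 0; // false => skip this node
    },
    onClose: name => {
      if (depth > 0) { depth--; } // the closing tag must also be counted
      return depth === 0;
    }
  };
};
```

The key point is that `onClose` decrements the depth, so the parser resumes emitting nodes immediately after the omitted subtree ends instead of staying in the skipping state.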
This was something that was supposed to be done in the context of #38, but I forgot to implement that last step in the related PR.
Update the ranking algorithm: The ranking w.r.t. organismOrdering should consider the best match of entry.organism and entry.organisms.
e.g. https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=562
Some observations lately:
`/search` with `{ "q": "ssb" }`
We have support for the `organismOrdering` preference, which allows the user to specify their preference w.r.t. close matches (currently only ties). It would be good to also allow for a `typeOrdering` preference, which would work in a similar way: a tie can be broken by the specified ordering of entity types.
Consider two matches, both with the name "X": one a protein and one a chemical.
They both match the search string "X" perfectly, so we can use the `typeOrdering: ['protein', 'chemical']` preference to break the tie by putting the protein first. Text mining systems like REACH are good at providing type hints like this, but they aren't good at providing the correct grounding. So we can use the information provided by REACH for `typeOrdering` on a per-entity basis, similar to how we use the information provided by REACH for `organismOrdering` on a per-paper basis.
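The proposed tie-break could be sketched as a comparator; this stand-alone version is illustrative and is not the service's actual ranking code:

```javascript
// Sketch of the proposed tie-break: among results with equal name distance,
// order by position in typeOrdering (types not listed sort last).
const byTypeOrdering = typeOrdering => (a, b) => {
  const rank = t => {
    const i = typeOrdering.indexOf(t);
    return i === -1 ? typeOrdering.length : i; // unlisted types go last
  };
  return rank(a.type) - rank(b.type);
};
```

In practice this comparator would only run within a group of otherwise-tied results, after the existing organism-based ordering.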
There are several entities that end in "ate" that a user would type when they really mean the corresponding acid. For example, a user might type "lactate" to mean "lactic acid", or "citrate" to mean "citric acid".
Since this seems to be a one-way issue, we should allow for the transformation only in the ate-to-acid direction. The acid-to-ate direction should not be allowed.
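A minimal sketch of the one-way expansion, assuming a naive suffix rule (real chemical nomenclature has exceptions, so a curated mapping may be safer):

```javascript
// Sketch: map "-ate" names to the corresponding "-ic acid" form, never the
// reverse. Covers cases like lactate/citrate with a simple suffix rule.
const ateToAcid = q =>
  q.endsWith('ate') ? q.slice(0, -3) + 'ic acid' : null;

const expandQuery = q => {
  const acid = ateToAcid(q);
  return acid == null ? [q] : [q, acid]; // search both forms
};
```

Since only `ateToAcid` exists and there is no inverse function, the acid-to-ate direction is structurally impossible here, matching the one-way requirement.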
This should probably be something like `type: 'ggp'` for "gene or gene product".
A similar change would have to be made in factoid.
Goal is to capture additional NCBI Gene metadata fields that can be useful for downstream data consumers including data export (BioPAX) and software (Factoid).
In particular, the NCBI `gene_info` file contains:
- `dbXrefs`: ID-mappings to external organism databases (e.g. HGNC - human, Araport - Arabidopsis, FLYBASE - Drosophila)
- `type_of_gene`: a description of the gene's products (e.g. tRNA, ncRNA, protein-coding, etc.)

Example: the non-coding RNA gene TP53COR1:
tax_id | GeneID | Symbol | dbXrefs | type_of_gene |
---|---|---|---|---|
9606 | 102800311 | TP53COR1 | MIM:616343|HGNC:HGNC:43652 | ncRNA |
Docs:
https://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.mod.dtd
<!ELEMENT Entrezgene_type (%INTEGER;)>
<!ATTLIST Entrezgene_type value (
unknown |
tRNA |
rRNA |
snRNA |
scRNA |
snoRNA |
protein-coding |
pseudo |
transposon |
miscRNA |
ncRNA |
biological-region |
other
) #IMPLIED >
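A possible mapping from `type_of_gene` values to a coarse entity type for the index might look like the following sketch; the target categories (`'ggp'`, `'rna'`) and the default are assumptions, not decisions already made in this project:

```javascript
// Sketch: collapse NCBI type_of_gene values into a coarse entity type.
// The keys follow the DTD value list above; the targets are illustrative.
const TYPE_OF_GENE_MAP = {
  'protein-coding': 'ggp',
  'tRNA': 'rna',
  'rRNA': 'rna',
  'snRNA': 'rna',
  'scRNA': 'rna',
  'snoRNA': 'rna',
  'ncRNA': 'rna',
  'miscRNA': 'rna'
};

// Unlisted values (unknown, pseudo, other, ...) fall back to a catch-all.
const mapTypeOfGene = t => TYPE_OF_GENE_MAP[t] || 'ggp';
```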
Lock down the latest version as release and trigger the DockerHub hook.
The requirement is to provide grounding support for Severe acute respiratory syndrome coronavirus 2 (NCBI:txid2697049).
In this case, it is much more valuable to include the mature protein products, rather than the genes/open reading frames (ORFs).
NCBI LINK: https://www.ncbi.nlm.nih.gov/datasets/coronavirus/proteins/
Factoid
Notes:
* what the root organism should be,
* what the display name of the organism family should be,
* what filters may be needed, and
* edge cases (e.g. does "S" work well when the organism is indexed, even though it's only one character).
The chebi.owl file contains a 'definition' property that essentially provides a short, human-readable description of the entry. The suggestion is to include this.
Example for 2'-3'-cGAMP:
...
<obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">A cyclic purine dinucleotide that consists of AMP and GMP units cyclised via 3',5'- and 2',5'-linkages respectively.</obo:IAO_0000115>
Refs @cannin feedback on chemical information.
There's a list somewhere that has the top N protein/gene names by popularity. We could use that list in our tests. Having good coverage for popular proteins is important.
@jvwong Do you remember where we can find this list?
The docs should be shown on the main index.html on the server.
ftp://ftp.ebi.ac.uk/pub/databases/Pfam/
Once this is ready to be integrated with Factoid: It would be great to have this set up on Dockerhub.
We are experiencing problems with the grounding search retrieving fresh data files from source (UniProt, NCBI, ChEBI).
Two options:
Refs:
I want to update the `aggregate-quality.js` code to make reporting a little easier.
In particular, I want to get rid of the 'loose' values, which tend to bury useful information (i.e. rank). Rather, all of this can be assessed at the reporting stage by outputting things like rank (somewhere).
We also need some sort of custom reporter, since most of the time we want the report in CSV form for a spreadsheet.
Issues related to Biofactoid support:
- Need to support genes from many viruses.
- Most of what is known relates to other, previously studied coronaviruses (MERS, SARS-CoV) and often other related and unrelated viruses, which may or may not extrapolate.
Currently, there are two sets of organism data: One exists in this project, the other in the factoid repo. Because this grounding service does not return organism names, factoid has to do its own lookup. Organisms not in the list of top model organisms are displayed as "Other".
To solve this, we should add a feature to have all the organism information stored only in this project. All organism information that factoid requires should be provided by the grounding service.
This feature is necessary for the following feature to work well: Merge entities that are associated with strains #38
NCBI has a taxon db we can use: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
We should import this data when we do an index. The organisms should be stored separately from the entity entries. Each organism entry should be like the following example:
{
  "id": "4932",
  "name": "Saccharomyces cerevisiae",
  "descendantIds": [ "1337652", "1158204", "765312" /*, ... */ ]
}
The following operations should be added in the db:
- `getOrganismById(idToMatch)`: resolves a promise to the organism whose `id` or `descendantIds` matches `idToMatch`.

The following operations should be added in the indexing procedure:
- `getOrganismById()` should be used to put the `organismName` field in the entry, from `organism.name`.
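A minimal in-memory sketch of the lookup; a real implementation would query Elasticsearch rather than an array, but the matching rule (match on `id` or any of `descendantIds`) is the same:

```javascript
// Sketch: find the organism whose id or descendantIds matches.
const findOrganism = (organisms, idToMatch) =>
  organisms.find(o =>
    o.id === idToMatch || o.descendantIds.includes(idToMatch)
  ) || null;

// The db operation itself would wrap this in a promise:
const getOrganismById = (organisms, idToMatch) =>
  Promise.resolve(findOrganism(organisms, idToMatch));
```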
Next steps:
Grounding service updates:
1. Reenable Uniprot within the grounding service. [JVW]
2. Add an option to the aggregate search (i.e. `/search`) s.t. namespaces can be blocked (i.e. uniprot namespaces aren't returned to Biofactoid UI). This can accept an option to set the block list manually, but we will have a default value s.t. uniprot is blocked by default. [MF]
3. In the uniprot importer, the xrefs need to be included in the returned grounding json. [JVW]
Convertor pipeline: [MCS]
1. Within the PC import, instead of doing a `/search` (today), do a `/get` (e.g. `/get` for `uniprot:P01234`).
2. Now, you have a uniprot grounding. So, within it, get the NCBI xref.
3. Query `/get` with the NCBI xref.
4. Use the NCBI grounding for the factoid entity in the `association`.
Once done: Run the import on the master.factoid.baderlab.org instance. Merge the data into
- Use `path.join()` to create paths. Do not concat strings yourself.
- Use `cross-env` to set environment variables in npm scripts, e.g. `NODE_ENV=test`.
- Make the `ci` script set the env var for the local uniprot test file. Maybe rename the `ci` script to `travis` if you want.
- `uniprot.get(id)`
Once the above are addressed, I think we'll be ready to move on to other uniprot functions and the aggregate service.
Error encountered in update from ChEBI when applied to download (ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz).
Reproduce:
1. Place the downloaded file at `/input/chebi.owl`.
2. `npm run update:chebi`
Console out:
info: Applying update on source 'chebi'...
info: Downloading ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz to chebi.owl
info: Processing chebi data from input/chebi.owl
TypeError: Cannot read property 'replace' of undefined
at processEntry (.../grounding-search/src/server/datasource/chebi.js:64:43)
at Array.map (<anonymous>)
at .../grounding-search/src/server/datasource/processing.js:11:34
...
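The crash comes from calling `replace` on a field that is undefined for some OWL entries. A guarded version of the failing access might look like the following sketch; the field name and URI shape are illustrative, not the actual chebi.js internals:

```javascript
// Sketch: guard the field before calling .replace, skipping (or reporting)
// malformed entries instead of throwing a TypeError mid-update.
const safeId = entry => {
  const raw = entry && entry.about; // e.g. ".../obo/CHEBI_15377"
  if (typeof raw !== 'string') { return null; } // skip malformed entries
  return raw.replace(/^.*CHEBI_/, 'CHEBI:');
};
```

Entries for which `safeId` returns `null` could be counted and logged, which would also help diagnose cases like the empty chebi.owl download mentioned below.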
Other:
Our current merging strategy pulls all the relevant entries into memory at once. We could improve our memory usage by using a chunking strategy similar to our update operations:
Basically, we always do our queries for finding entries that should be merged into another entry by reading a single page (w.r.t. query pagination). So, we only query for results 0..N for a chunk size of N.
As we process each chunk, we delete (or mark) the entries that we no longer want to come up in the next chunk. This includes deleting descendants in the case of descendant-to-root merging. It also includes marking alternative root entries in the case of root-to-alternative-root merging.
I've summarised the process in the following page. It includes pseudocode and an example graph.
@metincansiper Let me know whether this makes sense or whether I've missed any details. Thanks
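The chunked strategy above can be sketched as a loop that always reads page 0 and guarantees progress by removing (or marking) what it has processed; `queryPage` and `removeOrMark` stand in for the real Elasticsearch calls:

```javascript
// Sketch: always query results 0..N (page 0, chunk size N); deleting or
// marking the processed entries ensures the next page 0 query returns
// fresh entries, so the loop terminates.
const processInChunks = async (queryPage, removeOrMark, chunkSize) => {
  let processed = 0;

  while (true) {
    const chunk = await queryPage(0, chunkSize); // always read page 0

    if (chunk.length === 0) { break; } // nothing left to merge

    await removeOrMark(chunk); // guarantees progress on the next query
    processed += chunk.length;
  }

  return processed;
};
```

This keeps at most one chunk in memory at a time, in contrast to the current approach of pulling all relevant entries into memory at once.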
TGF-beta receptor
Factoid uses NCBI Gene uids for the purposes of grounding genes. There are a few reasons why it would be nice to have a helper to map NCBI Gene IDs to/from UniProt:
Design considerations
Proposed API
Request
{
"db": "uniprot",
"dbfrom": "ncbigene",
"id": [
"9158"
]
}
Response
[{
"dbfrom": "ncbigene",
"id": "9158",
"dbXrefs": [
{
"db": "uniprot",
"id": "Q99988"
}
]
}]
or even better:
[{
"dbfrom": "ncbigene",
"id": "9158",
"dbXrefs": [
{
"namespace": "uniprot",
"type": "protein",
"dbName": "UniProt Knowledgebase",
"dbPrefix": "uniprot",
"id": "Q99988",
"organism": "9606",
"name": "GDF-15",
"geneNames": [
"GDF15",
"MIC1",
"PDF",
"PLAB",
"PTGFB"
],
"proteinNames": [
"GDF-15",
"Growth/differentiation factor 15",
"MIC-1",
"NAG-1",
"NRG-1",
"Placental TGF-beta",
"Placental bone morphogenetic protein",
"Prostate differentiation factor"
],
"synonyms": [
"Growth/differentiation factor 15",
"MIC-1",
"NAG-1",
"NRG-1",
"Placental TGF-beta",
"Placental bone morphogenetic protein",
"Prostate differentiation factor",
"GDF15",
"MIC1",
"PDF",
"PLAB",
"PTGFB"
],
"dbXrefs": [...]
}
]
}]
Built and executed as in the readme, not populated with any data, I get the following error when connecting to the root (`localhost:3000`):
> [email protected] start /usr/src/app
> node ./src/server
info: GET / 302 10.926 ms - 62
events.js:183
throw er; // Unhandled 'error' event
^
Error: EACCES: permission denied, open 'out.log'
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] start: `node ./src/server`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm ERR! A complete log of this run can be found in:
npm ERR! /home/node/.npm/_logs/2019-04-08T16_52_50_021Z-debug.log
This is most likely the fault: https://github.com/PathwayCommons/grounding-search/blob/master/src/server/logger.js#L7
Background: When data sources are indexed, they may fail either partially or entirely. For instance, the source file may not be downloaded. In this case, the build continues to the next source regardless of the integrity of the data in the index.
Goal: It would be helpful to signal that a failure has occurred so that further action can be taken to correct the issue (i.e. rerun or inspect the problem).
Details: The issue can occur with builds locally, but more importantly, with Docker images, which 'fail silently'.
A miscellaneous list of entities that it would be nice to test and possibly boost manually.
A lot of this is motivated by the fact that ChEBI does in fact contain a bunch of useful 'generic' chemicals (e.g. messenger RNA) that come up in articles as participants. I wasn't aware of this earlier and may have dismissed articles for capture based on this presumption.
text | dbPrefix | dbId | comments |
---|---|---|---|
cap structure | ChEBI | CHEBI:10596 | 7-methylguanylate cap |
mRNA | ChEBI | CHEBI:33699 | messenger RNA |
18S rRNA | NCBI Gene | 100008588 | RNA, (18S) ribosomal |
Example for TP53 (human): 7157
In Uniprot XML: <dbReference type="GeneID" id="7157"/>
In NCBI tab-delimited lines (second field): 9606 7157 TP53 - BCC7|BMFS5|LFS1|P53|TRP53 MIM:191170|HGNC:HGNC:11998|Ensembl:ENSG00000141510 17 17p13.1 tumor protein p53 protein-coding TP53 tumor protein p53 O cellular tumor antigen p53|antigen NY-CO-13|mutant tumor protein 53|p53 tumor suppressor|phosphoprotein p53|transformation-related protein 53|tumor protein 53|tumor supressor p53 20190330 -
NCBI tab header:
#tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome map_location description type_of_gene Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority Nomenclature_status Other_designations Modification_date Feature_type
Reason: chebi.owl was empty. (Sep 9 2020)
For quick-and-dirty revisions to the database (e.g. add a missing synonym), use a patch file that specifies a diff to be applied to the data.
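Applying such a patch file could be sketched like this; the row shape mirrors the patch tables used in this project (namespace, id, synonyms), but the function itself is illustrative:

```javascript
// Sketch: each patch row adds synonyms to the entry with a matching
// (namespace, id), without duplicating synonyms that already exist.
const applyPatches = (entries, patches) =>
  entries.map(entry => {
    const patch = patches.find(p =>
      p.namespace === entry.namespace && p.id === entry.id
    );

    if (patch == null) { return entry; }

    const synonyms = entry.synonyms.slice();
    for (const syn of patch.synonyms) {
      if (!synonyms.includes(syn)) { synonyms.push(syn); }
    }

    return Object.assign({}, entry, { synonyms });
  });
```

A pass like this could run as the last step of each `update:*` script, so patch rows survive re-indexing from the upstream source files.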