pathwaycommons / grounding-search
A biological entity grounding search service
License: MIT License
Background: There are a few good reasons to re-examine indexing UniProt, but first there are some bugs to iron out...
Issue: Indexing fails because it is assumed that various name fields exist
Reproduce: npm run update:uniprot
The UniProt data is also too big and slow to index.
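A defensive accessor along these lines could avoid the hard assumption that name fields exist. This is only a sketch; the field names (`name`, `proteinName`, `geneNames`) are illustrative, not the actual parser output:

```javascript
// Sketch: collect name fields that may be absent in some UniProt records,
// instead of assuming they always exist. Field names are illustrative.
const getNames = entry => {
  const names = [];
  const push = v => {
    if (typeof v === 'string' && v.length > 0) { names.push(v); }
  };

  push(entry.name);
  push(entry.proteinName);
  (entry.geneNames || []).forEach(push); // tolerate a missing array

  return names;
};
```

With a helper like this, a record missing every name field simply yields an empty list instead of throwing during indexing.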
The datasource import tests are failing. See https://travis-ci.org/PathwayCommons/grounding-search/jobs/526196412
ID handling has been improved, so any hardcoded ChEBI IDs need to be updated in the tests. ChEBI IDs that are formatted like 'CHEBI:123' should just use `{ namespace: 'chebi', id: '123' }` from now on. The `_id` is now `CHEBI:123` or `NCBI:456`.
It looks like some other tests for the datasources are failing for reasons unknown to me. Maybe the tests are too strict / brittle? They shouldn't check for deep equality on the element JSON, for example.
For e. coli and yeast strains, merge all entities that have the exact same name.
For example, there are many entries for "CcdB". Each entry is basically the same except for the taxonomy ID. Here is a sample:
{
"namespace": "ncbi",
"type": "protein",
"id": "39521901",
"organism": "562",
"name": "ccdB",
"synonyms": [
"C7V14_00585",
"type II toxin-antitoxin system toxin CcdB"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "39524440",
"organism": "562",
"name": "ccdB",
"synonyms": [
"EJC48_00625",
"type II toxin-antitoxin system toxin CcdB"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "39529410",
"organism": "562",
"name": "ccdB",
"synonyms": [
"U14A_A00031",
"type II toxin-antitoxin system toxin CcdB"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "8877686",
"organism": "573",
"name": "ccdB",
"synonyms": [
"CcdB toxin protein"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "39650970",
"organism": "621",
"name": "ccdB",
"synonyms": [
"type II toxin-antitoxin system toxin CcdB"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "39651896",
"organism": "622",
"name": "ccdB",
"synonyms": [
"type II toxin-antitoxin system toxin CcdB"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
},
{
"namespace": "ncbi",
"type": "protein",
"id": "9538168",
"organism": "562",
"name": "ccdB",
"synonyms": [
"plasmid maintenance protein",
"toxin component",
"plasmid maintenance protein; toxin component"
],
"esScore": 12.311079,
"defaultOrganismIndex": 361,
"organismIndex": 361,
"combinedOrganismIndex": 361,
"distance": 0,
"nameDistance": 0,
"overallDistance": 36100000
}
For each entry from NCBI that is associated with a strain taxon ID (e.g. 562):
- Collect the strain taxon IDs into `entry.organisms`, avoiding duplicates.
- Collect the merged entries' IDs into `entry.ids`, avoiding duplicates.
- Keep the `organism` field normalised.

Update the ranking algorithm: The ranking w.r.t. `organismOrdering` should consider the best match of `entry.organism` and `entry.organisms`.
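The per-entry merge could be sketched roughly like this; the entry shape follows the CcdB sample above, but the function and its grouping key are assumptions, not the service's actual merge code:

```javascript
// Sketch: group entries by (lowercased) name, keep the first entry of each
// group as the root, and collect the taxon IDs and entity IDs of the others
// into `organisms` / `ids`, without duplicates.
const mergeByName = entries => {
  const byName = new Map();

  for (const entry of entries) {
    const key = entry.name.toLowerCase();
    const root = byName.get(key);

    if (root == null) {
      byName.set(key, Object.assign({}, entry, {
        organisms: [entry.organism],
        ids: [entry.id]
      }));
    } else {
      if (!root.organisms.includes(entry.organism)) { root.organisms.push(entry.organism); }
      if (!root.ids.includes(entry.id)) { root.ids.push(entry.id); }
    }
  }

  return Array.from(byName.values());
};
```

Applied to the CcdB sample, the seven entries would collapse into one root entry carrying all the strain taxon IDs and entity IDs.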
Use the namespace registered with MIRIAM for each of the collections:
Name | Namespace |
---|---|
NCBI Gene | ncbigene |
NCBI Protein | ncbiprotein |
UniProt Knowledgebase | uniprot |
ChEBI | CHEBI |
This will reduce confusion and increase the 'linkability' of data across projects.
If a quality test case fails, that info could inform improvements, especially if there is a pattern. My suggestion of how to report:
I've re-indexed the updated master branch and noticed that most of the tests for NCBI top genes for E. coli (taxon id: 83333) are failing for 'search' and/or 'get'.
For instance, `null` is returned from a POST to `/get` for the E. coli "ssb" entry in NCBI:
{
"id": "1263584",
"namespace": "ncbi"
}
Possibly related? When I try to `/get` E. coli "Ccdb", something comes back, but the `organisms` values are a mix of types (number and string):
POST body:
{
"id": "1263593",
"namespace": "ncbi"
}
response:
{
"namespace": "ncbi",
"type": "protein",
"id": "1263593",
"organism": "83333",
"organismName": "Escherichia coli",
"name": "ccdB(letD)",
"synonyms": [
"Fpla045",
"hypothetical protein"
],
"organisms": [
562,
"83333"
],
"ids": [
"1263593"
]
}
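A normalisation pass like the following sketch would make the `organisms` values a consistent type; the function name is illustrative:

```javascript
// Sketch: normalise the `organisms` field so that every taxon ID is a
// string, since the response above mixes numbers (562) and strings ("83333").
const normaliseOrganisms = entry => Object.assign({}, entry, {
  organisms: (entry.organisms || []).map(String)
});
```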
To append to patches
namespace | name | id | synonyms |
---|---|---|---|
ncbi | NSP3 | 1802476807 | PLpro |
ncbi | NSP5 | 1802476809 | 3CLpro |
They're a bit out of date as compared to config.js.

- `search()` and `get()` will not need an index to differentiate datasources, but they will need a `namespace` filter.
- `clear()` and `update()` need to be aware of other data. `clear()` should only remove the data for the particular namespace rather than the whole index.
- In `uniprot.update()`, we should probably be inserting each entry into Elasticsearch, or entries in small batches, so that we don't maintain a large `entries` array.
- Use the `_score` result for the organism tie-breaking logic. Because the result size is relatively small (20-50, say), it's probably fine to do this outside of Elasticsearch. I don't see a straightforward way to incorporate our organism tie-breaking into an Elasticsearch query.

GitHub Actions provides for (semi-)automated workflows to accomplish tasks following events (e.g. tagging or pushing). We can leverage this to (semi-)automate our software deployments, namely through Docker images.
The following attempts to summarize the different cases and features we'd like:
Instance | Event | Git Reference | Jobs |
---|---|---|---|
development | push (i.e. merge PR) | master branch | npm: lint; Docker: build, push (Docker Hub), refresh host |
production | workflow_dispatch (i.e. manual) | tag | npm: lint; Docker: build, push (Docker Hub), refresh host |
See:
There is a license file, but no license is declared (e.g. in `package.json`).
Bump to 0.4.0
0.2.1
The test cases need to be reviewed. There are 898 test cases that evaluate the quality of the search results. @jvwong has done a great job on the first pass of this data, but it's a lot of data and it's easy to make small mistakes.
So, we should have each of the test cases reviewed to make sure that we have a solid basis for moving forward with improving the quality of the search. So let's divvy it up:
Steps:
- Review the test cases in test/util/data/molecular-cell.json.
- Check that the right `organismOrdering` is specified for the paper. If the paper is about one organism, say human, it should be `"organismOrdering": [9606]`. If the paper is about human mostly but also a little about mouse, then it should be `"organismOrdering": [9606, 10090]`.

Currently, we prefer to list uncharged entities rather than charged ones. This preference should be reversed for elements that are almost always ions in a user's paper:
Background: The xml-parser provides an `omitList` that is intended to skip certain tags by name.
Issue: The parser consults the `omitList` upon encountering an opening tag, but not the closing tag. This leads to skipping large sections of the tree.
Prerequisite for #88
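One possible shape for the fix, tracking depth so that the closing tag of an omitted subtree is also consumed; the SAX-style `onOpen`/`onClose` hooks are assumptions, not the actual xml-parser API:

```javascript
// Sketch of the fix: track how deep we are inside an omitted tag, and
// consult the omitList on BOTH the opening and the closing tag. The
// onOpen/onClose hooks here are illustrative, not the real parser API.
const makeOmitFilter = omitList => {
  const omitted = new Set(omitList);
  let depth = 0; // > 0 while inside an omitted subtree

  return {
    onOpen: name => {
      if (depth > 0 || omitted.has(name)) { depth++; }
      return depth === 0; // false => skip this node
    },
    onClose: name => {
      if (depth > 0) { depth--; } // the closing tag must also be counted
      return depth === 0;
    }
  };
};
```

The key point is that `onClose` decrements the depth, so the parser resumes emitting nodes immediately after the omitted subtree ends instead of staying in the skipping state.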
This was something that was supposed to be done in the context of #38, but I forgot to implement that last step in the related PR.
Update the ranking algorithm: The ranking w.r.t. organismOrdering should consider the best match of entry.organism and entry.organisms.
e.g. https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=562
Some observations lately:
`/search` with `{ "q": "ssb" }`
We have support for the `organismOrdering` preference, which allows the user to specify their preference w.r.t. close matches (currently only ties). It would be good to also allow for a `typeOrdering` preference, which would work in a similar way: a tie can be broken by the specified ordering of entity types.
Consider two matches, both with the name "X": one a protein and one a chemical.
They both match the search string "X" perfectly, so we can use the `typeOrdering: ['protein', 'chemical']` preference to break the tie by putting the protein first. Text mining systems like REACH are good at providing type hints like this, but they aren't good at providing the correct grounding. So we can use the information provided by REACH for `typeOrdering` on a per-entity basis, similar to how we use the information provided by REACH for `organismOrdering` on a per-paper basis.
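The proposed tie-break could be sketched as a comparator; this stand-alone version is illustrative and is not the service's actual ranking code:

```javascript
// Sketch of the proposed tie-break: among results with equal name distance,
// order by position in typeOrdering (types not listed sort last).
const byTypeOrdering = typeOrdering => (a, b) => {
  const rank = t => {
    const i = typeOrdering.indexOf(t);
    return i === -1 ? typeOrdering.length : i; // unlisted types go last
  };
  return rank(a.type) - rank(b.type);
};
```

In practice this comparator would only run within a group of otherwise-tied results, after the existing organism-based ordering.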
There are several entities that end in "ate" that a user would type when they really mean the corresponding acid. For example, a user might type "lactate" to mean "lactic acid", or "citrate" to mean "citric acid".
Since this seems to be a one-way issue, we should allow for the transformation only in the ate-to-acid direction. The acid-to-ate direction should not be allowed.
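A minimal sketch of the one-way expansion, assuming a naive suffix rule (real chemical nomenclature has exceptions, so a curated mapping may be safer):

```javascript
// Sketch: map "-ate" names to the corresponding "-ic acid" form, never the
// reverse. Covers cases like lactate/citrate with a simple suffix rule.
const ateToAcid = q =>
  q.endsWith('ate') ? q.slice(0, -3) + 'ic acid' : null;

const expandQuery = q => {
  const acid = ateToAcid(q);
  return acid == null ? [q] : [q, acid]; // search both forms
};
```

Since only `ateToAcid` exists and there is no inverse function, the acid-to-ate direction is structurally impossible here, matching the one-way requirement.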
This should probably be something like `type: 'ggp'` for "gene or gene product".
A similar change would have to be made in factoid.
Goal is to capture additional NCBI Gene metadata fields that can be useful for downstream data consumers including data export (BioPAX) and software (Factoid).
In particular, the NCBI `gene_info` file contains:
- `dbXrefs`: ID-mappings to external organism databases (e.g. HGNC - human, Araport - Arabidopsis, FLYBASE - Drosophila)
- `type_of_gene`: a description of the gene's products (e.g. tRNA, ncRNA, protein-coding, etc.)

Example: the non-coding RNA gene TP53COR1:
tax_id | GeneID | Symbol | dbXrefs | type_of_gene |
---|---|---|---|---|
9606 | 102800311 | TP53COR1 | MIM:616343|HGNC:HGNC:43652 | ncRNA |
Docs:
https://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.mod.dtd
<!ELEMENT Entrezgene_type (%INTEGER;)>
<!ATTLIST Entrezgene_type value (
unknown |
tRNA |
rRNA |
snRNA |
scRNA |
snoRNA |
protein-coding |
pseudo |
transposon |
miscRNA |
ncRNA |
biological-region |
other
) #IMPLIED >
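A possible mapping from `type_of_gene` values to a coarse entity type for the index might look like the following sketch; the target categories (`'ggp'`, `'rna'`) and the default are assumptions, not decisions already made in this project:

```javascript
// Sketch: collapse NCBI type_of_gene values into a coarse entity type.
// The keys follow the DTD value list above; the targets are illustrative.
const TYPE_OF_GENE_MAP = {
  'protein-coding': 'ggp',
  'tRNA': 'rna',
  'rRNA': 'rna',
  'snRNA': 'rna',
  'scRNA': 'rna',
  'snoRNA': 'rna',
  'ncRNA': 'rna',
  'miscRNA': 'rna'
};

// Unlisted values (unknown, pseudo, other, ...) fall back to a catch-all.
const mapTypeOfGene = t => TYPE_OF_GENE_MAP[t] || 'ggp';
```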
Lock down the latest version as release and trigger the DockerHub hook.
The requirement is to provide grounding support for Severe acute respiratory syndrome coronavirus 2 (NCBI:txid2697049).
In this case, it is much more valuable to include the mature protein products, rather than the genes/open reading frames (ORFs).
NCBI LINK: https://www.ncbi.nlm.nih.gov/datasets/coronavirus/proteins/
Factoid
Notes:
* what the root organism should be,
* what the display name of the organism family should be,
* what filters may be needed, and
* edge cases (e.g. does "S" work well when the organism is indexed, even though it's only one character).
The chebi.owl file contains a 'definition' property that essentially provides a short, human-readable description of the entry. The suggestion is to include this.
Example for 2'-3'-cGAMP:
...
<obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">A cyclic purine dinucleotide that consists of AMP and GMP units cyclised via 3',5'- and 2',5'-linkages respectively.</obo:IAO_0000115>
Refs @cannin feedback on chemical information.
There's a list somewhere that has the top N protein/gene names by popularity. We could use that list in our tests. Having good coverage for popular proteins is important.
@jvwong Do you remember where we can find this list?
The docs should be shown on the main index.html on the server.
ftp://ftp.ebi.ac.uk/pub/databases/Pfam/
Once this is ready to be integrated with Factoid: It would be great to have this set up on Dockerhub.
We are experiencing problems with the grounding search retrieving fresh data files from source (UniProt, NCBI, ChEBI).
Two options:
Refs:
I want to update the `aggregate-quality.js` code to make reporting a little easier.
In particular, I want to get rid of the 'loose' values, which tend to bury useful information (i.e. rank). Rather, all of this can be assessed at the reporting stage by outputting things like rank (somewhere).
We also need some sort of custom reporter, since most of the time we want the report in CSV form for a spreadsheet.
Issues related to Biofactoid support:
- Need to support genes from many viruses.
- Most of what is known relates to other, previously studied coronaviruses (MERS, SARS-CoV) and often other related and unrelated viruses, which may or may not extrapolate.
Currently, there are two sets of organism data: One exists in this project, the other in the factoid repo. Because this grounding service does not return organism names, factoid has to do its own lookup. Organisms not in the list of top model organisms are displayed as "Other".
To solve this, we should add a feature to have all the organism information stored only in this project. All organism information that factoid requires should be provided by the grounding service.
This feature is necessary for the following feature to work well: Merge entities that are associated with strains #38
NCBI has a taxon db we can use: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
We should import this data when we do an index. The organisms should be stored separately from the entity entries. Each organism entry should be like the following example:
{
  "id": "4932",
  "name": "Saccharomyces cerevisiae",
  "descendantIds": [ "1337652", "1158204", "765312" /*, ... */ ]
}
The following operations should be added in the db:
- `getOrganismById(idToMatch)`: resolves a promise to the organism whose `id` or `descendantIds` matches `idToMatch`.

The following operations should be added in the indexing procedure:
- `getOrganismById()` should be used to put the `organismName` field in the entry, from `organism.name`.
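A minimal in-memory sketch of the lookup; a real implementation would query Elasticsearch rather than an array, but the matching rule (match on `id` or any of `descendantIds`) is the same:

```javascript
// Sketch: find the organism whose id or descendantIds matches.
const findOrganism = (organisms, idToMatch) =>
  organisms.find(o =>
    o.id === idToMatch || o.descendantIds.includes(idToMatch)
  ) || null;

// The db operation itself would wrap this in a promise:
const getOrganismById = (organisms, idToMatch) =>
  Promise.resolve(findOrganism(organisms, idToMatch));
```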
Next steps:
Grounding service updates:
1. Reenable Uniprot within the grounding service. [JVW]
2. Add an option to the aggregate search (i.e. `/search`) s.t. namespaces can be blocked (i.e. uniprot namespaces aren't returned to Biofactoid UI). This can accept an option to set the block list manually, but we will have a default value s.t. uniprot is blocked by default. [MF]
3. In the uniprot importer, the xrefs need to be included in the returned grounding json. [JVW]
Convertor pipeline: [MCS]
1. Within the PC import, instead of doing a `/search` (today), do a `/get` (e.g. `/get` for `uniprot:P01234`).
2. Now, you have a uniprot grounding. So, within it, get the NCBI xref.
3. Query `/get` with the NCBI xref.
4. Use the NCBI grounding for the factoid entity in the `association`.
Once done: Run the import on the master.factoid.baderlab.org instance. Merge the data into
- Use `path.join()` to create paths. Do not concat strings yourself.
- Use `cross-env` to set environment variables in npm scripts, e.g. `NODE_ENV=test`.
- Make the `ci` script set the env var for the local uniprot test file. Maybe rename the `ci` script to `travis` if you want.
- `uniprot.get(id)`
Once the above are addressed, I think we'll be ready to move on to other uniprot functions and the aggregate service.
Error encountered in update from ChEBI when applied to download (ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz).
Reproduce:
1. Place the downloaded file at `/input/chebi.owl`.
2. `npm run update:chebi`
Console out:
info: Applying update on source 'chebi'...
info: Downloading ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz to chebi.owl
info: Processing chebi data from input/chebi.owl
TypeError: Cannot read property 'replace' of undefined
at processEntry (.../grounding-search/src/server/datasource/chebi.js:64:43)
at Array.map (<anonymous>)
at .../grounding-search/src/server/datasource/processing.js:11:34
...
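The crash comes from calling `replace` on a field that is undefined for some OWL entries. A guarded version of the failing access might look like the following sketch; the field name and URI shape are illustrative, not the actual chebi.js internals:

```javascript
// Sketch: guard the field before calling .replace, skipping (or reporting)
// malformed entries instead of throwing a TypeError mid-update.
const safeId = entry => {
  const raw = entry && entry.about; // e.g. ".../obo/CHEBI_15377"
  if (typeof raw !== 'string') { return null; } // skip malformed entries
  return raw.replace(/^.*CHEBI_/, 'CHEBI:');
};
```

Entries for which `safeId` returns `null` could be counted and logged, which would also help diagnose cases like the empty chebi.owl download mentioned below.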
Other:
Our current merging strategy pulls all the relevant entries into memory at once. We could improve our memory usage by using a chunking strategy similar to our update operations:
Basically, we always do our queries for finding entries that should be merged into another entry by reading a single page (w.r.t. query pagination). So, we only query for results 0..N for a chunk size of N.
As we process each chunk, we delete (or mark) the entries that we no longer want to come up in the next chunk. This includes deleting descendants in the case of descendant-to-root merging. It also includes marking alternative root entries in the case of root-to-alternative-root merging.
I've summarised the process in the following page. It includes pseudocode and an example graph.
@metincansiper Let me know whether this makes sense or whether I've missed any details. Thanks
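The chunked strategy above can be sketched as a loop that always reads page 0 and guarantees progress by removing (or marking) what it has processed; `queryPage` and `removeOrMark` stand in for the real Elasticsearch calls:

```javascript
// Sketch: always query results 0..N (page 0, chunk size N); deleting or
// marking the processed entries ensures the next page 0 query returns
// fresh entries, so the loop terminates.
const processInChunks = async (queryPage, removeOrMark, chunkSize) => {
  let processed = 0;

  while (true) {
    const chunk = await queryPage(0, chunkSize); // always read page 0

    if (chunk.length === 0) { break; } // nothing left to merge

    await removeOrMark(chunk); // guarantees progress on the next query
    processed += chunk.length;
  }

  return processed;
};
```

This keeps at most one chunk in memory at a time, in contrast to the current approach of pulling all relevant entries into memory at once.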
TGF-beta receptor
Factoid uses NCBI Gene uids for the purposes of grounding genes. There are a few reasons why it would be nice to have a helper to map NCBI Gene IDs to/from UniProt:
Design considerations
Proposed API
Request
{
"db": "uniprot",
"dbfrom": "ncbigene",
"id": [
"9158"
]
}
Response
[{
"dbfrom": "ncbigene",
"id": "9158",
"dbXrefs": [
{
"db": "uniprot",
"id": "Q99988"
}
]
}]
or even better:
[{
"dbfrom": "ncbigene",
"id": "9158",
"dbXrefs": [
{
"namespace": "uniprot",
"type": "protein",
"dbName": "UniProt Knowledgebase",
"dbPrefix": "uniprot",
"id": "Q99988",
"organism": "9606",
"name": "GDF-15",
"geneNames": [
"GDF15",
"MIC1",
"PDF",
"PLAB",
"PTGFB"
],
"proteinNames": [
"GDF-15",
"Growth/differentiation factor 15",
"MIC-1",
"NAG-1",
"NRG-1",
"Placental TGF-beta",
"Placental bone morphogenetic protein",
"Prostate differentiation factor"
],
"synonyms": [
"Growth/differentiation factor 15",
"MIC-1",
"NAG-1",
"NRG-1",
"Placental TGF-beta",
"Placental bone morphogenetic protein",
"Prostate differentiation factor",
"GDF15",
"MIC1",
"PDF",
"PLAB",
"PTGFB"
],
"dbXrefs": [...]
}
]
}]
Built and executed as in the readme, not populated with any data, I get the following error when connecting to the root (`localhost:3000`):
> [email protected] start /usr/src/app
> node ./src/server
info: GET / 302 10.926 ms - 62
events.js:183
throw er; // Unhandled 'error' event
^
Error: EACCES: permission denied, open 'out.log'
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] start: `node ./src/server`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm ERR! A complete log of this run can be found in:
npm ERR! /home/node/.npm/_logs/2019-04-08T16_52_50_021Z-debug.log
This is most likely the fault: https://github.com/PathwayCommons/grounding-search/blob/master/src/server/logger.js#L7
Background: When data sources are indexed, they may fail either partially or entirely. For instance, the source file may not be downloaded. In this case, the build continues to the next source regardless of the integrity of the data in the index.
Goal: It would be helpful to signal that a failure has occurred so that further action can be taken to correct the issue (i.e. rerun or inspect the problem).
Details: The issue can occur with builds locally, but more importantly, with Docker images, which 'fail silently'.
A miscellaneous list of entities that it would be nice to test and possibly boost manually.
A lot of this is motivated by the fact that ChEBI does in fact contain a bunch of useful 'generic' chemicals (e.g. messenger RNA) that come up in articles as participants. I wasn't aware of this earlier and may have dismissed articles for capture based on this presumption.
text | dbPrefix | dbId | comments |
---|---|---|---|
cap structure | ChEBI | CHEBI:10596 | 7-methylguanylate cap |
mRNA | ChEBI | CHEBI:33699 | messenger RNA |
18S rRNA | NCBI Gene | 100008588 | RNA, (18S) ribosomal |
Example for TP53 (human): 7157
In Uniprot XML: <dbReference type="GeneID" id="7157"/>
In NCBI tab-delimited lines (second field): 9606 7157 TP53 - BCC7|BMFS5|LFS1|P53|TRP53 MIM:191170|HGNC:HGNC:11998|Ensembl:ENSG00000141510 17 17p13.1 tumor protein p53 protein-coding TP53 tumor protein p53 O cellular tumor antigen p53|antigen NY-CO-13|mutant tumor protein 53|p53 tumor suppressor|phosphoprotein p53|transformation-related protein 53|tumor protein 53|tumor supressor p53 20190330 -
NCBI tab header:
#tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome map_location description type_of_gene Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority Nomenclature_status Other_designations Modification_date Feature_type
Reason: chebi.owl was empty. (Sep 9 2020)
For quick-and-dirty revisions to the database (e.g. add a missing synonym), use a patch file that specifies a diff to be applied to the data.
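Applying such a patch file could be sketched like this; the row shape mirrors the patch tables used in this project (namespace, id, synonyms), but the function itself is illustrative:

```javascript
// Sketch: each patch row adds synonyms to the entry with a matching
// (namespace, id), without duplicating synonyms that already exist.
const applyPatches = (entries, patches) =>
  entries.map(entry => {
    const patch = patches.find(p =>
      p.namespace === entry.namespace && p.id === entry.id
    );

    if (patch == null) { return entry; }

    const synonyms = entry.synonyms.slice();
    for (const syn of patch.synonyms) {
      if (!synonyms.includes(syn)) { synonyms.push(syn); }
    }

    return Object.assign({}, entry, { synonyms });
  });
```

A pass like this could run as the last step of each `update:*` script, so patch rows survive re-indexing from the upstream source files.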