
grounding-search's Introduction

grounding-search


Description

The identification of sub-cellular biological entities is an important consideration in the use and creation of bioinformatics analysis tools and accessible biological research apps. When research information is uniquely and unambiguously identified, it enables data to be accurately retrieved, cross-referenced, and integrated. In practice, biological entities are “identified” when they are associated with a matching record from a knowledge base that specialises in collecting and organising information of that type (e.g. gene sequences). Our search service makes identifying biological entities faster and easier. This identification can power research apps and tools that accept common entity synonyms as input.

For instance, Biofactoid uses this grounding service to allow users to simply specify their preferred synonyms to identify biological entities (e.g. proteins):

(Demo video: biofactoid-grounding.mp4)

Citation

To cite the Pathway Commons Grounding Search Service in a paper, please cite the Journal of Open Source Software paper:

Franz et al., (2021). A flexible search system for high-accuracy identification of biological entities and molecules. Journal of Open Source Software, 6(67), 3756, https://doi.org/10.21105/joss.03756

View the paper at JOSS or view the PDF directly.

Maintenance

The Pathway Commons Grounding Search Service is an academic project built and maintained by: Bader Lab at the University of Toronto, Sander Lab at Harvard, and the Pathway and Omics Lab at the Oregon Health & Science University.

Funding

This project was funded by the US National Institutes of Health (NIH) [U41 HG006623, U41 HG003751, R01 HG009979 and P41 GM103504].

Quick start

Via Docker

Install Docker (>=20.10.0) and Docker Compose (>=1.29.0).

Clone this repository (or at least download the docker-compose.yml file), then run:

docker-compose up

Swagger documentation can be accessed at http://localhost:3000.

NB: Server startup will take some time while Elasticsearch initializes, the grounding data is retrieved, and the index is restored. If it takes more than 10 minutes, consider increasing the memory allocated to Docker (Preferences > Resources > Memory) and removing this line from docker-compose.yml: ES_JAVA_OPTS=-Xms2g -Xmx2g

Via source

With Node.js (>=8) and Elasticsearch (>=6.6.0, <7) installed with default options, run the following in a cloned copy of the repository:

  • npm install: Install npm dependencies
  • npm run update: Download and index the data
  • npm start: Start the server (by default on port 3000)

Documentation

Swagger documentation is available on a publicly-hosted instance of the service at https://grounding.baderlab.org. You can run queries to test the API on this instance.

Please do not use https://grounding.baderlab.org for your production apps or scripts.

Example usage

Here, we provide usage examples in common languages for the main search API. For more details, please refer to the Swagger documentation at https://grounding.baderlab.org, which is also accessible when running a local instance.

Example search in JS

const response = await fetch('http://hostname:port/search', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ // search options here
    q: 'p53'
  })
});

const responseJSON = await response.json();

Example search in Python

import requests

url = 'http://hostname:port/search'
body = {'q': 'p53'}

response = requests.post(url, json = body)  # send the options as a JSON body, matching the API's Content-Type

responseJSON = response.json()

Example in shell script via curl

curl -X POST "http://hostname:port/search" -H  "accept: application/json" -H  "Content-Type: application/json" -d "{  \"q\": \"p53\" }"
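
Example get-by-ID in JS

In addition to /search, the service supports looking up a specific grounding by ID via the /get endpoint (see the Swagger documentation for the full request schema). A minimal sketch, assuming the same request body shape used in the examples elsewhere in this document:

const response = await fetch('http://hostname:port/get', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    namespace: 'ncbi', // data source namespace
    id: '7157'         // e.g. the NCBI Gene ID for human TP53
  })
});

const grounding = await response.json();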

Tool comparison

Here, we summarise a set of tools that overlap to some degree with the main use case of the Pathway Commons Grounding Search Service, where a user searches for a biological entity grounding by providing only a commonly-used synonym. This table was last updated on 25 October 2021 (2021-10-25).

If you have developed a new tool in this space or your tool supports new features, let us know by making a pull request, and we'll add your revision to this table.

PC Grounding Search GProfiler GNormPlus (PubTator) Gilda BridgeDB
Allows for searching by synonym
Supports multiple organisms
Accepts organism ranking preference
Multiple organisms per query Partial support (only one organism returned)
Multiple results per query One per type (e.g. protein)
Multiple results are ranked based on relevance
Speed/Throughput < 100 ms < 100 ms < 100ms < 100 ms < 1000 ms
Allows querying for a particular grounding by ID

Grounding data

grounding-search uses data files provided by three public databases: NCBI Gene, ChEBI, and UniProt.

Build index from source database files

If you have followed the Quick Start ("Via source"), you can download and index the data provided by the source databases ncbi, chebi and uniprot by running:

npm run update

Restore index from Elasticsearch dump files

Downloading and building the index from source ensures that the latest information is indexed. Alternatively, to quickly retrieve and recreate the index, a dump of a previously indexed Elasticsearch instance has been published on Zenodo under the following DOI:

[Zenodo DOI badge]

This data is published under the Creative Commons Zero v1.0 Universal license.

To restore, create a running Elasticsearch instance and run:

npm run restore

To both restore and start the grounding-search server run:

npm run boot

NB: The index dump published on Zenodo is offered for demonstration purposes only. We do not guarantee that this data will be up to date or that releases of the grounding-search software will be compatible with any previously published version of the dump data. To ensure you are using the latest data compatible with grounding-search, follow the instructions in "Build index from source database files".

Issues & feedback

To let us know about an issue in the software or to provide feedback, please file an issue on GitHub.

Contributing

To make a contribution to this project, please start by filing an issue on GitHub that describes your proposal. Once your proposal is ready, you can make a pull request.

Configuration

The following environment variables can be used to configure the server:

  • NODE_ENV : the environment mode, either production or development (default)
  • LOG_LEVEL : the level for the log file (info, warn, error)
  • PORT : the port on which the server runs (default 3000)
  • ELASTICSEARCH_HOST : the host:port that points to elasticsearch
  • MAX_SEARCH_ES : the maximum number of results to return from elasticsearch
  • MAX_SEARCH_WS : the maximum number of results to return in json from the webservice
  • CHUNK_SIZE : how many grounding entries make up a chunk that gets bulk inserted into elasticsearch
  • MAX_SIMULT_CHUNKS : maximum number of chunks to insert simultaneously into elasticsearch
  • INPUT_PATH : the path to the input folder where data files are located
  • INDEX : the elasticsearch index name to store data from all data sources
  • UNIPROT_FILE_NAME : name of the file where uniprot data will be read from
  • UNIPROT_URL : url to download uniprot file from
  • CHEBI_FILE_NAME : name of the file where chebi data will be read from
  • CHEBI_URL : url to download chebi file from
  • NCBI_FILE_NAME : name of the file where ncbi data will be read from
  • NCBI_URL : url to download ncbi file from
  • NCBI_EUTILS_BASE_URL : url for NCBI EUTILS
  • NCBI_EUTILS_API_KEY : NCBI EUTILS API key
  • FAMPLEX_URL: url to download FamPlex remote from
  • FAMPLEX_FILE_NAME: name of the file where FamPlex data will be read from
  • FAMPLEX_TYPE_FILTER: entity type to include ('protein', 'complex', 'all' [default])
  • ESDUMP_LOCATION : The location (URL, file path) of elasticdump files (note: terminate with '/')
  • ZENODO_API_URL: base url for Zenodo
  • ZENODO_ACCESS_TOKEN: access token for Zenodo REST API (Scope: deposit:actions, deposit:write)
  • ZENODO_BUCKET_ID: id for Zenodo deposition 'bucket' (Files API)
  • ZENODO_DEPOSITION_ID: id for Zenodo deposition (for a published dataset)
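
For example, to start the server from source on a different port and point it at a non-default Elasticsearch host (values shown are illustrative only):

PORT=3001 ELASTICSEARCH_HOST=localhost:9200 npm start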

Run targets

  • npm start : start the server
  • npm stop : stop the server
  • npm run watch : watch mode (debug mode enabled, autoreload)
  • npm run refresh : run clear, update, then start
  • npm test : run tests for read-only methods (e.g. search and get), assuming the data already exists
  • npm run test:sample : run tests with sample data
  • npm run test:quality : run the search quality tests (expects full db)
  • npm run test:quality:csv : run the search quality tests and output a csv file
  • npm run lint : lint the project
  • npm run benchmark : run all benchmarking
  • npm run benchmark:source : run benchmarking for source (i.e. ncbi, chebi)
  • npm run clear : clear all data
  • npm run clear:source : clear data for source (i.e. ncbi, chebi)
  • npm run update : update all data (download then index)
  • npm run update:source : update data for source (i.e. ncbi, chebi) in elasticsearch
  • npm run download : download all data
  • npm run download:source : download data for source (i.e. ncbi, chebi)
  • npm run index : index all data
  • npm run index:source : index data for source (i.e. ncbi, chebi) in elasticsearch
  • npm run test:inputgen : generate input test files for each source (i.e. uniprot, ...)
  • npm run dump : dump the information for INDEX to ESDUMP_LOCATION
  • npm run restore : restore the information for INDEX from ESDUMP_LOCATION
  • npm run boot : run clear, restore then start; exit on errors

Using Zenodo to store index dumps

Zenodo lets you store and retrieve digital artefacts related to a scientific project or publication. Here, we use Zenodo to store the Elasticsearch index dump data used to quickly recreate the index used by grounding-search.

Create and publish a new record deposition

Briefly, using their RESTful web service API, you can create a 'Deposition' for a record that has a 'bucket', referenced by a ZENODO_BUCKET_ID, to which you can upload and download 'files' (i.e. <ZENODO_API_URL>api/files/<ZENODO_BUCKET_ID>/<filename>; list them with https://zenodo.org/api/deposit/depositions/<deposition id>/files). In particular, three files are required to recreate an index, corresponding to the elasticsearch types: data, mapping and analyzer.

To set up, follow these steps:

  1. Get a ZENODO_ACCESS_TOKEN by creating a 'Personal access token' (see docs for details). Be sure to add the deposit:actions and deposit:write scopes.
  2. Create a record 'Deposition' by POSTing to https://zenodo.org/api/deposit/depositions with at least the following information, keeping in mind to set the header Authorization: Bearer <ZENODO_ACCESS_TOKEN>:
{
	"metadata": {
		"title": "Elasticsearch data for biofactoid.org grounding-search service",
		"upload_type": "dataset",
		"description": "This deposition contains files with data describing an Elasticsearch index (https://github.com/PathwayCommons/grounding-search). The files were generated from the elasticdump npm package (https://www.npmjs.com/package/elasticdump). The data are the neccessary and sufficient information to populate an Elasticsearch index.",
		"creators": [
			{
				"name": "Biofactoid",
				"affiliation": "biofactoid.org"
			}
		],
		"access_right": "open",
		"license": "cc-zero"
	}
}
  3. The POST response should have a 'bucket' (e.g. "bucket": "https://zenodo.org/api/files/<uuid>") within the links object. The variable ZENODO_BUCKET_ID is the value <uuid> in the example URL.
  4. Publish. You'll want to dump the index and upload it to Zenodo (npm run dump). You can publish from the API by POSTing to https://zenodo.org/api/deposit/depositions/<deposition id>/actions/publish. Alternatively, log in to the Zenodo web page and click 'Publish' to make the deposition public.
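
As a rough sketch, steps 2 and 3 could be scripted from Node.js as follows (metadata is the object shown above; the links.bucket value is the one described in step 3):

const metadata = { /* the metadata object shown above */ };

const response = await fetch('https://zenodo.org/api/deposit/depositions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.ZENODO_ACCESS_TOKEN}`
  },
  body: JSON.stringify({ metadata })
});

const deposition = await response.json();

// The bucket UUID (ZENODO_BUCKET_ID) is the last path segment of links.bucket.
const bucketId = deposition.links.bucket.split('/').pop();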

Once published, a deposition cannot be updated or altered. However, you can create a new version of a record (below).

Create and publish a new version of a record

In this case, you already have a record which points to a published deposition (i.e. elasticsearch index files) and wish to create a new version for that record. Here, you'll create a new deposition under the same record:

  1. Make a POST request to https://zenodo.org/api/deposit/depositions/<deposition id>/actions/newversion to create a new version. Alternatively, visit https://zenodo.org/record/<deposition id> where deposition id is that of the latest published version (default).
  2. Fetch https://zenodo.org/api/deposit/depositions?all_versions to list all your depositions and identify the new deposition bucket id.
  3. Proceed to upload (i.e. dump) your new files as described in "Create and publish a new record deposition", Step 3.

Testing

All files under /test will be run by Mocha. You can run npm test to run all tests, or you can run npm test -- -g specific-test-name to run specific tests.

Chai is included to make the tests easier to read and write.
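
For reference, a minimal sketch of a Mocha/Chai test (the test body here is hypothetical, not taken from the repository):

const { expect } = require('chai');

describe('search', function(){
  it('returns an array of results', function(){
    const results = [ { name: 'TP53' } ]; // hypothetical result set

    expect( results ).to.be.an('array').that.is.not.empty;
  });
});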

Publishing a release

  1. Make sure the tests are passing: npm test
  2. Make sure the linting is passing: npm run lint
  3. Bump the version number with npm version, in accordance with semver. The version command in npm updates both package.json and git tags, but note that it uses a v prefix on the tags (e.g. v1.2.3).
  4. For a bug fix / patch release, run npm version patch.
  5. For a new feature release, run npm version minor.
  6. For a breaking API change, run npm version major.
  7. For a specific version number (e.g. 1.2.3), run npm version 1.2.3.
  8. Push the release: git push && git push --tags
  9. Publish a GitHub release so that Zenodo creates a DOI for this version.

grounding-search's People

Contributors

alexanderpico, galessiorob, jvwong, maxkfranz, metincansiper, sacdallago


grounding-search's Issues

Datasource tests failing

@metincansiper

The datasource import tests are failing. See https://travis-ci.org/PathwayCommons/grounding-search/jobs/526196412

ID handling has been improved so any hardcoded Chebi IDs need to be updated in the tests. Chebi IDs that are formatted like 'CHEBI:123' should just use { namespace: 'chebi', id: '123' } from now on. The _id is now CHEBI:123 or NCBI:456.

It looks like some other tests for the datasources are failing for reasons unknown to me. Maybe the tests are too strict / brittle? They shouldn't check for deep equality on the element JSON, for example.

Chunked merging

Our current merging strategy pulls all the relevant entries into memory at once. We could improve our memory usage by using a chunking strategy similar to our update operations:

Basically, we always do our queries for finding entries that should be merged into another entry by reading a single page (w.r.t. query pagination). So, we only query for results 0..N for a chunk size of N.

As we process each chunk, we delete (or mark) the entries that we no longer want to come up in the next chunk. This includes deleting descendants in the case of descendant-to-root merging. It also includes marking alternative root entries in the case of root-to-alternative-root merging.

I've summarised the process in the following page. It includes pseudocode and an example graph.

grounding-merge-2
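
A minimal sketch of the chunk loop described above (the helper names are placeholders, not functions that exist in this repo):

const CHUNK_SIZE = 1000; // illustrative

const mergeInChunks = async () => {
  while( true ){
    // Always read page 0..N; entries already processed have been deleted
    // or marked, so they do not come up again in the next chunk.
    const chunk = await queryMergeCandidates({ from: 0, size: CHUNK_SIZE }); // placeholder

    if( chunk.length === 0 ){ break; }

    for( const entry of chunk ){
      await mergeIntoTarget( entry );       // placeholder: descendant-to-root or root-to-alt-root merge
      await deleteOrMarkProcessed( entry ); // exclude the entry from subsequent chunks
    }
  }
};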

@metincansiper Let me know whether this makes sense or whether I've missed any details. Thanks

Swagger docs

The docs should be shown on the main index.html on the server.

Common elements should be ions by default

Currently, we prefer to list uncharged entities rather than charged ones. This preference should be reversed for elements that are almost always ions in a user's paper:

  • H 1+
  • Na 1+
  • K 1+
  • Mg 2+
  • Ca 2+
  • Fe 3+ then 2+
  • Cu 2+
  • Zn 2+
  • O 2-
  • S 2-
  • P 3-
  • Cl 1-

SARS-CoV support

Issues related to Biofactoid support:
  • Need to support genes from many viruses.
  • Most of what is known relates to other, previously studied coronaviruses (MERS, SARS-CoV) and often other related and unrelated viruses, which may or may not extrapolate.

Refs PathwayCommons/factoid#699 (comment)

Workflow: Redeploy Docker image to development & production instances

GitHub Actions provides (semi-)automated workflows to accomplish tasks following events (e.g. tagging or pushing). We can leverage this to (semi-)automate our software deployments, namely through Docker images.

The following attempts to summarize the different cases and features we'd like:

| Instance | Event | Git Reference | Jobs |
| development | push (i.e. merge PR) | master branch | npm: lint; Docker: build, push (Docker Hub), refresh host |
| production | workflow_dispatch (i.e. manual) | tag | npm: lint; Docker: build, push (Docker Hub), refresh host |

See:

Create release.

Lock down the latest version as release and trigger the DockerHub hook.

Set up on Dockerhub

Once this is ready to be integrated with Factoid: It would be great to have this set up on Dockerhub.

  • One image with the data already included (snapshot by date)
  • One image without the data but configured to download the data during image init

Index UniProt: Xrefs; namespaces

Next steps:

Grounding service updates:

1. Reenable Uniprot within the grounding service. [JVW]

2. Add an option to the aggregate search (i.e. `/search`) s.t. namespaces can be blocked (i.e. uniprot namespaces aren't returned to Biofactoid UI).  This can accept an option to set the block list manually, but we will have a default value s.t. uniprot is blocked by default. [MF]

3. In the uniprot importer, the xrefs need to be included in the returned grounding json.  [JVW]

Convertor pipeline: [MCS]

1. Within the PC import, instead of doing a `/search` (today), do a `/get` (e.g. `/get` for `uniprot:P01234`).

2. Now, you have a uniprot grounding.  So, within it, get the NCBI xref.

3. Query `/get` with the NCBI xref.

4. Use the NCBI grounding for the factoid entity in the `association`.
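
A rough sketch of steps 1-3 (the xref field names here are assumptions for illustration, not the finalised API):

const get = async body => {
  const res = await fetch('http://hostname:port/get', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify( body )
  });

  return res.json();
};

// 1. Get the UniProt grounding directly by ID instead of searching.
const uniprotGrounding = await get({ namespace: 'uniprot', id: 'P01234' });

// 2. Within it, pick out the NCBI xref (assumed field names).
const ncbiXref = uniprotGrounding.dbXrefs.find( x => x.db === 'ncbigene' );

// 3. Get the NCBI grounding to use for the factoid entity in the association.
const ncbiGrounding = await get({ namespace: 'ncbi', id: ncbiXref.id });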

Once done: Run the import on the master.factoid.baderlab.org instance. Merge the data into

Refs: PathwayCommons/factoid#881 (comment)

More SARS-CoV-2 synonyms

To append to patches

| namespace | name | id | synonyms |
| ncbi | NSP3 | 1802476807 | PLpro |
| ncbi | NSP5 | 1802476809 | 3CLpro |

Are we capturing all of the E. coli strains?

e.g. https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=562

Some observations lately:

  1. All the top NCBI genes for E. coli (83333) are failing because the ids seem to have been deprecated
  2. I would have expected all genes with the same name from E. coli to have one record, but this doesn't seem to be the case, e.g.

Rename `type:'protein'`

This should probably be something like type: 'ggp' for "gene or gene product".

A similar change would have to be made in factoid.

Quality test failures

If a quality test case fails, that info could inform improvements, especially if there is a pattern. My suggestion of how to report:

  • State data:
    • text (e.g. 'phospho-p53'), namespace ('ncbi'), id ('7157')
  • Provide a reason (not necessarily mutually exclusive):
    • text is a variation on synonyms
    • text has no match in synonyms
    • text is a common term
    • result is among the top results
    • organisms ordering
    • not in database
    • other: explain

ChEBI: Include 'definition' field

The chebi.owl file contains a 'definition' property that essentially provides a short, human-readable description of the entry. The suggestion is to include this.

Example for 2'-3'-cGAMP:

...
<obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">A cyclic purine dinucleotide that consists of AMP and GMP units cyclised via 3&apos;,5&apos;- and 2&apos;,5&apos;-linkages respectively.</obo:IAO_0000115>

Refs @cannin feedback on chemical information.

Update fail: New ChEBI source files

Error encountered in update from ChEBI when applied to download (ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz).

Reproduce:

  • delete chebi data file (/input/chebi.owl)
  • npm run update:chebi

Console out:

info:    Applying update on source 'chebi'...
info:    Downloading ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz to chebi.owl
info:    Processing chebi data from input/chebi.owl
TypeError: Cannot read property 'replace' of undefined
    at processEntry (.../grounding-search/src/server/datasource/chebi.js:64:43)
    at Array.map (<anonymous>)
    at .../grounding-search/src/server/datasource/processing.js:11:34    
  ...

Other:

  • The failure above left the index in a semi-complete state: some chemicals were indexed but nothing from ncbi

Entities of interest to highlight

A miscellaneous list of entities that it would be nice to test and possibly boost manually.

A lot of this is motivated by the fact that ChEBI does in fact contain a bunch of useful 'generic' chemicals (e.g. messenger RNA) that come up in articles as participants. I wasn't aware of this earlier and may have dismissed articles for capture based on this presumption.

| text | dbPrefix | dbId | comments |
| cap structure | ChEBI | CHEBI:10596 | 7-methylguanylate cap |
| mRNA | ChEBI | CHEBI:33699 | messenger RNA |
| 18S rRNA | NCBI Gene | 100008588 | RNA, (18S) ribosomal |

Merge entities that are associated with strains

For E. coli and yeast strains, merge all entities that have the exact same name.

For example, there are many entries for "CcdB". Each entry is basically the same except for the taxonomy ID. Here is a sample:

[
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39521901",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "C7V14_00585",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39524440",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "EJC48_00625",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39529410",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "U14A_A00031",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "8877686",
    "organism": "573",
    "name": "ccdB",
    "synonyms": [
      "CcdB toxin protein"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39650970",
    "organism": "621",
    "name": "ccdB",
    "synonyms": [
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39651896",
    "organism": "622",
    "name": "ccdB",
    "synonyms": [
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "9538168",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "plasmid maintenance protein",
      "toxin component",
      "plasmid maintenance protein; toxin component"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  }
]

For each entry from NCBI that is associated with a strain taxon ID (e.g. 562):

  1. Check if there is an ancestor entry for the top-level taxon ID (e.g. 83333) with the same name (e.g. ccdB).
  2. If an entry was found, then merge the descendant entry with the ancestor entry.
    1. Add the synonyms from the descendant entry into the ancestor entry, avoiding duplicates.
    2. Add the taxon IDs for the ancestor entry and the descendant entry into entry.organisms, avoiding duplicates.
    3. Add the grounding ID for the ancestor and the descendant to entry.ids, avoiding duplicates.
  3. If there is no ancestor entry, then replace the descendant entry taxon ID with the ancestor taxon ID to make the organism field normalised.

Update the ranking algorithm: The ranking w.r.t. organismOrdering should consider the best match of entry.organism and entry.organisms.
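
A minimal sketch of steps 1-3 above (the db helper names are placeholders; entry.organisms and entry.ids are the merged fields described above):

const mergeStrainEntry = async ( entry, rootTaxonId ) => {
  // 1. Look for an ancestor entry with the same name under the top-level taxon.
  const ancestor = await findEntryByNameAndOrganism( entry.name, rootTaxonId ); // placeholder

  if( ancestor ){
    // 2. Merge the descendant into the ancestor, avoiding duplicates.
    ancestor.synonyms = Array.from( new Set([ ...ancestor.synonyms, ...entry.synonyms ]) );
    ancestor.organisms = Array.from( new Set([ ...( ancestor.organisms || [ ancestor.organism ] ), entry.organism ]) );
    ancestor.ids = Array.from( new Set([ ...( ancestor.ids || [ ancestor.id ] ), entry.id ]) );

    await updateEntry( ancestor ); // placeholder
    await deleteEntry( entry );    // placeholder
  } else {
    // 3. No ancestor: normalise the descendant's organism field to the top-level taxon ID.
    entry.organism = rootTaxonId;
    await updateEntry( entry );    // placeholder
  }
};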

Mapping between UniProt and NCBI Gene namespaces

Factoid uses NCBI Gene uids for the purposes of grounding genes. There are a few reasons why it would be nice to have a helper to map NCBI Gene IDs to/from UniProt:

  1. Importing third-party data into Factoid: Sources like PhosphoSitePlus tag their entities with UniProt
  2. INDRA queries: We use the HGNC ID; for non-human entities, we end up using the node label (string). This leads to very messy, sometimes irrelevant results, since it will tend to return human hits unless the label is not used in humans
  3. REACH typically uses UniProt IDs for grounding
  4. Factoid model-enhancements: UniProt has a lot of desirable information that could be reused in the editor to allow authors to add detail in a constrained but fruitful manner including Sequence variants (aka mutations) and Modification features (aka sites)
  • Design considerations

    • UniProt has a web service
    • Roll our own
      • Use UniProt provided text mapping files
    • Try this using existing data with some clever ElasticSearch?
  • Proposed API

Request

 {
    "db": "uniprot",
    "dbfrom": "ncbigene",
    "id": [
      "9158"
    ]
}

Response

[{
    "dbfrom": "ncbigene",
    "id": "9158",
    "dbXrefs": [
      {
        "db": "uniprot",
        "id": "Q99988"
      }
    ]
}]

or even better:

[{
  "dbfrom": "ncbigene",
  "id": "9158",
  "dbXrefs": [
    {
      "namespace": "uniprot",
      "type": "protein",
      "dbName": "UniProt Knowledgebase",
      "dbPrefix": "uniprot",
      "id": "Q99988",
      "organism": "9606",
      "name": "GDF-15",
      "geneNames": [
        "GDF15",
        "MIC1",
        "PDF",
        "PLAB",
        "PTGFB"
      ],
      "proteinNames": [
        "GDF-15",
        "Growth/differentiation factor 15",
        "MIC-1",
        "NAG-1",
        "NRG-1",
        "Placental TGF-beta",
        "Placental bone morphogenetic protein",
        "Prostate differentiation factor"
      ],
      "synonyms": [
        "Growth/differentiation factor 15",
        "MIC-1",
        "NAG-1",
        "NRG-1",
        "Placental TGF-beta",
        "Placental bone morphogenetic protein",
        "Prostate differentiation factor",
        "GDF15",
        "MIC1",
        "PDF",
        "PLAB",
        "PTGFB"
      ],
      "dbXrefs": [...]
    }
  ]
}]

NCBI: Capture additional metadata fields

Goal is to capture additional NCBI Gene metadata fields that can be useful for downstream data consumers including data export (BioPAX) and software (Factoid).

In particular, the NCBI gene_info file contains

  1. dbXrefs: ID-mappings to external organism databases (e.g. HGNC - human, Araport - Arabidopsis, FLYBASE - Drosophila)
  2. type_of_gene: This is a description of the gene's products (e.g. tRNA, ncRNA, protein-coding etc.)

Example: The non-coding RNA gene

tax_id GeneID Symbol dbXrefs type_of_gene
9606 102800311 TP53COR1 MIM:616343|HGNC:HGNC:43652 ncRNA

Docs:

https://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.mod.dtd

<!ELEMENT Entrezgene_type (%INTEGER;)>
<!ATTLIST Entrezgene_type value (
        unknown |
        tRNA |
        rRNA |
        snRNA |
        scRNA |
        snoRNA |
        protein-coding |
        pseudo |
        transposon |
        miscRNA |
        ncRNA |
        biological-region |
        other
        ) #IMPLIED >

xml-parser: omitList considers only open tag and not closing

Background: The xml-parser provides an omitList that is intended to skip certain tags by name

Issue: The parser considers the omitList upon encountering an open tag, but not the closing tag. This leads to skipping large sections of the tree.

Prerequisite for #88

Search quality test cases

The test cases need to be reviewed. There are 898 test cases that evaluate the quality of the search results. @jvwong has done a great job on the first pass of this data, but it's a lot of data and it's easy to make small mistakes.

So, we should have each of the test cases reviewed to make sure that we have a solid basis for moving forward with improving the quality of the search. So let's divvy it up:

Steps:

  • Make a separate branch or fork for your work.
  • In your branch or fork, make commits to revise erroneous expected groundings in test/util/data/molecular-cell.json.
  • Make sure the organismOrdering is specified for the paper. If the paper is about one organism, say human, it should be "organismOrdering": [9606]. If the paper is about human mostly but also a little about mouse, then it should be "organismOrdering": [9606, 10090].
  • Make a PR when you're done.

Patch file

For quick-and-dirty revisions to the database (e.g. add a missing synonym), use a patch file that specifies a diff to be applied to the data.
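
For example, a single patch entry could look something like the following (the exact file format is left open here; the fields mirror the synonym patches listed elsewhere in this document):

{
  "namespace": "ncbi",
  "id": "1802476807",
  "synonyms": [ "PLpro" ]
}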

Quality tests: Remove 'loose'; make output reporting easier

I want to update the aggregate-quality.js code to make reporting a little easier.

In particular, I want to get rid of the 'loose' values which tend to bury useful information (i.e. rank). Rather, all of this can be assessed at the reporting stage by outputting things like rank (somewhere).

We also need some sort of custom reporter, since most of the time we want the report in CSV form for a spreadsheet.

Review 2

  • @maxkfranz Manually review the results of the grounding service
    • Are the results comparable to the raw Uniprot requests?
    • Determine to what degree the aggregate service functionality in Factoid should be done manually versus by Elasticsearch in this project.
    • Determine whether we should use a common index or one index per data source. (I suspect a single index should be used, but let's see if I change my mind during review of other parts of the system.)
  • Things to revise
    • @maxkfranz Add more basic logging to give feedback when updating the data. It takes a long time, so it would be nice to know when
      • downloading starts,
      • downloading ends,
      • processing starts,
      • processing ends,
      • indexing starts, and
      • indexing ends.
    • @maxkfranz If possible, it would be nice to have a scrollbar for each phase using cli-progress. -- Using plain log messages is enough now that things are faster.
    • @metincansiper The code should be more general to use a single index.
      • search() and get() will not need an index to differentiate datasources, but it will need a namespace filter.
      • clear() and update() need to be aware of other data. clear() should only remove the data for the particular namespace rather than the whole index.
      • For uniprot.update(), we should probably insert entries into Elasticsearch one at a time, or in small batches, so that we don't maintain a large entries array.
      • XML parsing is very slow. We should try a different lib (for example, saxes)
    • @maxkfranz We should use the _score result for the organism tie-breaking logic. Because the result size is relatively small (20-50, say), it's probably fine to do this outside of Elasticsearch. I don't see a straightforward way to incorporate our organism tie-breaking into an Elasticsearch query.

Review

@metincansiper

  • Use path.join() to create paths. Do not concat strings yourself.
  • Use cross-env to set environment variables in npm scripts.
    • Don't use NODE_ENV=test. Make the ci script set env var for the local uniprot test file. Maybe rename the ci script to travis if you want.
  • Use and test new download method. We should not be using bash commands within the js to download files. See latest commit.
    • Test forced/unforced downloads
    • Delete old curl code
  • Add uniprot.get(id)
  • Make sure linting is passing.
  • Enable Travis and add the badge.
  • Document env vars in readme

Once the above are addressed, I think we'll be ready to move on to other uniprot functions and the aggregate service.

`typeOrdering` preference

We have support for the organismOrdering preference, which allows the user to specify his preference w.r.t. close matches (currently only ties). It would be good to also allow for a typeOrdering preference, which would work in a similar way: A tie can be broken by the specified ordering of entity types.

Consider two matches, both with the name "X":

  • X (protein)
  • X (chemical)

They both match the search string "X" perfectly, so we can use the typeOrdering: ['protein', 'chemical'] preference to break the tie by putting the protein first. Text mining systems like REACH are good at providing type hints like this, but they aren't good at providing the correct grounding. So we can use the information provided by REACH for typeOrdering on a per-entity basis, similar to how we use the information provided by REACH for organismOrdering on a per-paper basis.
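
For example, a search request combining the existing organismOrdering preference with the proposed typeOrdering option might look like:

{
  "q": "X",
  "organismOrdering": [ 9606, 10090 ],
  "typeOrdering": [ "protein", "chemical" ]
}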

SARS-CoV-2 Support

The requirement is to provide grounding support for Severe acute respiratory syndrome coronavirus 2 (NCBI:txid2697049).

In this case, it is much more valuable to include the mature protein products, rather than the genes/open reading frames (ORFs).

Details

Notes:

  • what the root organism should be,
  • what the display name of the organism family should be,
  • what filters may be needed, and
  • edge cases (e.g. does "S" work well when the organism is indexed, even though it's only one character).

Refs PathwayCommons/factoid#699

ID mapping for NCBI Gene ID and Uniprot ID

Example for TP53 (human): 7157

In Uniprot XML: <dbReference type="GeneID" id="7157"/>

In NCBI tab-delimited lines (second field): 9606 7157 TP53 - BCC7|BMFS5|LFS1|P53|TRP53 MIM:191170|HGNC:HGNC:11998|Ensembl:ENSG00000141510 17 17p13.1 tumor protein p53 protein-coding TP53 tumor protein p53 O cellular tumor antigen p53|antigen NY-CO-13|mutant tumor protein 53|p53 tumor suppressor|phosphoprotein p53|transformation-related protein 53|tumor protein 53|tumor supressor p53 20190330 -

NCBI tab header:

#tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome map_location description type_of_gene Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority Nomenclature_status Other_designations Modification_date Feature_type

EACCES on out.log causes docker instance to exit

Built and executed as in the readme, not populated with any data, I get the following error when connecting to the root (localhost:3000):

> [email protected] start /usr/src/app
> node ./src/server

info:    GET / 302 10.926 ms - 62
events.js:183
      throw er; // Unhandled 'error' event
      ^

Error: EACCES: permission denied, open 'out.log'
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] start: `node ./src/server`
npm ERR! Exit status 1
npm ERR! 
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/node/.npm/_logs/2019-04-08T16_52_50_021Z-debug.log

This is most likely the fault: https://github.com/PathwayCommons/grounding-search/blob/master/src/server/logger.js#L7

Signal if datasource update failure has occurred

Background: When data sources are indexed, they may fail either partially or entirely. For instance, the source file may not be downloaded. In this case, the build continues to the next source regardless of the integrity of the data in the index.

Goal: It would be helpful to signal that a failure has occurred so that further action can be performed to correct the issue (i.e. rerun or inspect the problem)

Details: The issue can occur with local builds but, more importantly, with Docker images, which 'fail silently'.

Tests: Top NCBI genes for E. coli (taxon id: 83333) are failing

I've re-indexed the updated master branch and noticed that most of the tests for NCBI top genes for E. coli (taxon id: 83333) are failing for 'search' and/or 'get'.

For instance, null is returned from POST to /get the information for E. coli "ssb" in NCBI:

{
	"id": "1263584",
	"namespace": "ncbi"
}

Possibly related? When I try to /get E. coli "Ccdb", something comes back but the organisms values are a mix of types:

POST body:

{
	"id": "1263593",
	"namespace": "ncbi"
}

response:

{
  "namespace": "ncbi",
  "type": "protein",
  "id": "1263593",
  "organism": "83333",
  "organismName": "Escherichia coli",
  "name": "ccdB(letD)",
  "synonyms": [
    "Fpla045",
    "hypothetical protein"
  ],
  "organisms": [
    562,
    "83333"
  ],
  "ids": [
    "1263593"
  ]
}

Use NCBI taxonomy database for `entry.organismName` property

Currently, there are two sets of organism data: One exists in this project, the other in the factoid repo. Because this grounding service does not return organism names, factoid has to do its own lookup. Organisms not in the list of top model organisms are displayed as "Other".

To solve this, we should add a feature to have all the organism information stored only in this project. All organism information that factoid requires should be provided by the grounding service.

This feature is necessary for the following feature to work well: Merge entities that are associated with strains #38

NCBI has a taxon db we can use: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

We should import this data when we do an index. The organisms should be stored separately from the entity entries. Each organism entry should be like the following example:

{
  "id": "4932",
  "name": "Saccharomyces cervisiae",
  "descendantIds": [ '1337652', '1158204', '765312' /*, ... */ ]
}

The following operations should be added in the db:

  • getOrganismById(idToMatch) : resolves a promise to the organism whose id or descendantIds matches idToMatch

The following operations should be added in the indexing procedure:

  • When adding an entity entry into the index, getOrganismById() should be used to put the organismName field in the entry from organism.name.
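
A small sketch of that indexing step (getOrganismById is the operation proposed above; indexEntry is a placeholder for the existing insert step):

const addEntry = async entry => {
  const organism = await getOrganismById( entry.organism ); // matches id or descendantIds

  if( organism ){
    entry.organismName = organism.name;
  }

  await indexEntry( entry ); // placeholder for inserting the entry into Elasticsearch
};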

Allow entities with "-ate" suffix to match to "-ic acid"

There are several entities that end in "ate" that a user would type when he really means the corresponding acid. For example, a user might type "lactate" to mean "lactic acid". Or he might type "citrate" to mean "citric acid".

Since this seems to be a one-way issue, we should allow for the transformation only in the ate-to-acid direction. The acid-to-ate direction should not be allowed.
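
As an illustrative sketch only (not the actual implementation), the one-way expansion could generate an extra query variant:

// Expand "-ate" queries with an "-ic acid" variant, never the reverse.
const ateToAcidVariants = q => {
  if( q.toLowerCase().endsWith('ate') ){
    return [ q, q.slice(0, -3) + 'ic acid' ]; // e.g. "lactate" -> "lactic acid"
  }

  return [ q ];
};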

Update ranking algorithm for strains

This was something that was supposed to be done in the context of #38, but I forgot to implement that last step in the related PR.

Update the ranking algorithm: The ranking w.r.t. organismOrdering should consider the best match of entry.organism and entry.organisms.
