
grounding-search's Introduction

grounding-search


Description

The identification of sub-cellular biological entities is an important consideration in the use and creation of bioinformatics analysis tools and accessible biological research apps. When research information is uniquely and unambiguously identified, it enables data to be accurately retrieved, cross-referenced, and integrated. In practice, biological entities are “identified” when they are associated with a matching record from a knowledge base that specialises in collecting and organising information of that type (e.g. gene sequences). Our search service makes identifying biological entities faster and easier. This identification can power research apps and tools that accept common entity synonyms as input.

For instance, Biofactoid uses this grounding service to allow users to simply specify their preferred synonyms to identify biological entities (e.g. proteins):

(Demo video: biofactoid-grounding.mp4)

Citation

To cite the Pathway Commons Grounding Search Service in a paper, please cite the Journal of Open Source Software paper:

Franz et al., (2021). A flexible search system for high-accuracy identification of biological entities and molecules. Journal of Open Source Software, 6(67), 3756, https://doi.org/10.21105/joss.03756

View the paper at JOSS or view the PDF directly.

Maintenance

The Pathway Commons Grounding Search Service is an academic project built and maintained by: Bader Lab at the University of Toronto, Sander Lab at Harvard, and the Pathway and Omics Lab at the Oregon Health & Science University.

Funding

This project was funded by the US National Institutes of Health (NIH) [U41 HG006623, U41 HG003751, R01 HG009979 and P41 GM103504].

Quick start

Via Docker

Install Docker (>=20.10.0) and Docker Compose (>=1.29.0).

Clone this repository (or at least download the docker-compose.yml file), then run:

docker-compose up

Swagger documentation can be accessed at http://localhost:3000.

NB: Server startup will take some time while Elasticsearch initializes, the grounding data is retrieved, and the index is restored. If it takes more than 10 minutes, consider increasing the memory allocated to Docker (Preferences > Resources > Memory) and removing this line from docker-compose.yml: ES_JAVA_OPTS=-Xms2g -Xmx2g

Via source

With Node.js (>=8) and Elasticsearch (>=6.6.0, <7) installed with default options, run the following in a cloned copy of the repository:

  • npm install: Install npm dependencies
  • npm run update: Download and index the data
  • npm start: Start the server (by default on port 3000)

Documentation

Swagger documentation is available on a publicly-hosted instance of the service at https://grounding.baderlab.org. You can run queries to test the API on this instance.

Please do not use https://grounding.baderlab.org for your production apps or scripts.

Example usage

Here, we provide usage examples in common languages for the main search API. For more details, please refer to the Swagger documentation at https://grounding.baderlab.org, which is also accessible when running a local instance.

Example search in JS

const response = await fetch('http://hostname:port/search', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ // search options here
    q: 'p53'
  })
});

const responseJSON = await response.json();

Example search in Python

import requests

url = 'http://hostname:port/search'
body = {'q': 'p53'}

response = requests.post(url, json = body)  # send the options as a JSON body, matching the API's Content-Type

responseJSON = response.json()

Example in shell script via curl

curl -X POST "http://hostname:port/search" -H  "accept: application/json" -H  "Content-Type: application/json" -d "{  \"q\": \"p53\" }"
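
Example get-by-ID in JS

In addition to /search, the service supports looking up a specific grounding by ID via the /get endpoint (see the Swagger documentation for the full request schema). A minimal sketch, assuming the same request body shape used in the examples elsewhere in this document:

const response = await fetch('http://hostname:port/get', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    namespace: 'ncbi', // data source namespace
    id: '7157'         // e.g. the NCBI Gene ID for human TP53
  })
});

const grounding = await response.json();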

Tool comparison

Here, we summarise a set of tools that overlap to some degree with the main use case of the Pathway Commons Grounding Search Service, where a user searches for a biological entity grounding by providing only a commonly-used synonym. This table was last updated on 25 October 2021 (2021-10-25).

If you have developed a new tool in this space or your tool supports new features, let us know by making a pull request, and we'll add your revision to this table.

PC Grounding Search GProfiler GNormPlus (PubTator) Gilda BridgeDB
Allows for searching by synonym
Supports multiple organisms
Accepts organism ranking preference
Multiple organisms per query Partial support (only one organism returned)
Multiple results per query One per type (e.g. protein)
Multiple results are ranked based on relevance
Speed/Throughput < 100 ms < 100 ms < 100ms < 100 ms < 1000 ms
Allows querying for a particular grounding by ID

Grounding data

grounding-search uses data files provided by three public databases: NCBI Gene, ChEBI, and UniProt.

Build index from source database files

If you have followed the Quick Start ("Via source"), you can download and index the data provided by the source databases ncbi, chebi and uniprot by running:

npm run update

Restore index from Elasticsearch dump files

Downloading and building the index from source ensures that the latest information is indexed. Alternatively, to quickly retrieve and recreate the index, a dump of a previously indexed Elasticsearch instance has been published on Zenodo under the following DOI:

[Zenodo DOI badge]

This data is published under the Creative Commons Zero v1.0 Universal license.

To restore, create a running Elasticsearch instance and run:

npm run restore

To both restore and start the grounding-search server run:

npm run boot

NB: The index dump published on Zenodo is offered for demonstration purposes only. We do not guarantee that this data will be up to date or that releases of the grounding-search software will be compatible with any previously published version of the dump data. To ensure you are using the latest data compatible with grounding-search, follow the instructions in "Build index from source database files".

Issues & feedback

To let us know about an issue in the software or to provide feedback, please file an issue on GitHub.

Contributing

To make a contribution to this project, please start by filing an issue on GitHub that describes your proposal. Once your proposal is ready, you can make a pull request.

Configuration

The following environment variables can be used to configure the server:

  • NODE_ENV : the environment mode, either production or development (default)
  • LOG_LEVEL : the level for the log file (info, warn, error)
  • PORT : the port on which the server runs (default 3000)
  • ELASTICSEARCH_HOST : the host:port that points to elasticsearch
  • MAX_SEARCH_ES : the maximum number of results to return from elasticsearch
  • MAX_SEARCH_WS : the maximum number of results to return in json from the webservice
  • CHUNK_SIZE : how many grounding entries make up a chunk that gets bulk inserted into elasticsearch
  • MAX_SIMULT_CHUNKS : maximum number of chunks to insert simultaneously into elasticsearch
  • INPUT_PATH : the path to the input folder where data files are located
  • INDEX : the elasticsearch index name to store data from all data sources
  • UNIPROT_FILE_NAME : name of the file where uniprot data will be read from
  • UNIPROT_URL : url to download uniprot file from
  • CHEBI_FILE_NAME : name of the file where chebi data will be read from
  • CHEBI_URL : url to download chebi file from
  • NCBI_FILE_NAME : name of the file where ncbi data will be read from
  • NCBI_URL : url to download ncbi file from
  • NCBI_EUTILS_BASE_URL : url for NCBI EUTILS
  • NCBI_EUTILS_API_KEY : NCBI EUTILS API key
  • FAMPLEX_URL: url to download FamPlex remote from
  • FAMPLEX_FILE_NAME: name of the file where FamPlex data will be read from
  • FAMPLEX_TYPE_FILTER: entity type to include ('protein', 'complex', 'all' [default])
  • ESDUMP_LOCATION : The location (URL, file path) of elasticdump files (note: terminate with '/')
  • ZENODO_API_URL: base url for Zenodo
  • ZENODO_ACCESS_TOKEN: access token for Zenodo REST API (Scope: deposit:actions, deposit:write)
  • ZENODO_BUCKET_ID: id for Zenodo deposition 'bucket' (Files API)
  • ZENODO_DEPOSITION_ID: id for Zenodo deposition (for a published dataset)
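
For example, to start the server from source on a different port and point it at a non-default Elasticsearch host (values shown are illustrative only):

PORT=3001 ELASTICSEARCH_HOST=localhost:9200 npm start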

Run targets

  • npm start : start the server
  • npm stop : stop the server
  • npm run watch : watch mode (debug mode enabled, autoreload)
  • npm run refresh : run clear, update, then start
  • npm test : run tests for read-only methods (e.g. search and get), assuming the data already exists
  • npm run test:sample : run tests with sample data
  • npm run test:quality : run the search quality tests (expects full db)
  • npm run test:quality:csv : run the search quality tests and output a csv file
  • npm run lint : lint the project
  • npm run benchmark : run all benchmarking
  • npm run benchmark:source : run benchmarking for source (i.e. ncbi, chebi)
  • npm run clear : clear all data
  • npm run clear:source : clear data for source (i.e. ncbi, chebi)
  • npm run update : update all data (download then index)
  • npm run update:source : update data for source (i.e. ncbi, chebi) in elasticsearch
  • npm run download : download all data
  • npm run download:source : download data for source (i.e. ncbi, chebi)
  • npm run index : index all data
  • npm run index:source : index data for source (i.e. ncbi, chebi) in elasticsearch
  • npm run test:inputgen : generate input test files for each source (i.e. uniprot, ...)
  • npm run dump : dump the information for INDEX to ESDUMP_LOCATION
  • npm run restore : restore the information for INDEX from ESDUMP_LOCATION
  • npm run boot : run clear, restore then start; exit on errors

Using Zenodo to store index dumps

Zenodo lets you store and retrieve digital artefacts related to a scientific project or publication. Here, we use Zenodo to store the Elasticsearch index dump data used to quickly recreate the index used by grounding-search.

Create and publish a new record deposition

Briefly, using their RESTful web service API, you can create a 'Deposition' for a record that has a 'bucket', referenced by a ZENODO_BUCKET_ID, to which you can upload and download 'files' (i.e. <ZENODO_API_URL>api/files/<ZENODO_BUCKET_ID>/<filename>; list them with https://zenodo.org/api/deposit/depositions/<deposition id>/files). In particular, three files are required to recreate an index, corresponding to the elasticsearch types: data, mapping and analyzer.

To set up, follow these steps:

  1. Get a ZENODO_ACCESS_TOKEN by creating a 'Personal access token' (see docs for details). Be sure to add the deposit:actions and deposit:write scopes.
  2. Create a record 'Deposition' by POSTing to https://zenodo.org/api/deposit/depositions with at least the following information, keeping in mind to set the header Authorization: Bearer <ZENODO_ACCESS_TOKEN>:
{
	"metadata": {
		"title": "Elasticsearch data for biofactoid.org grounding-search service",
		"upload_type": "dataset",
		"description": "This deposition contains files with data describing an Elasticsearch index (https://github.com/PathwayCommons/grounding-search). The files were generated from the elasticdump npm package (https://www.npmjs.com/package/elasticdump). The data are the neccessary and sufficient information to populate an Elasticsearch index.",
		"creators": [
			{
				"name": "Biofactoid",
				"affiliation": "biofactoid.org"
			}
		],
		"access_right": "open",
		"license": "cc-zero"
	}
}
  3. The POST response should have a 'bucket' (e.g. "bucket": "https://zenodo.org/api/files/<uuid>") within the links object. The variable ZENODO_BUCKET_ID is the value <uuid> in the example URL.
  4. Publish. You'll want to dump the index and upload it to Zenodo (npm run dump). You can publish from the API by POSTing to https://zenodo.org/api/deposit/depositions/<deposition id>/actions/publish. Alternatively, log in to the Zenodo web page and click 'Publish' to make the deposition public.
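
As a rough sketch, steps 2 and 3 could be scripted from Node.js as follows (metadata is the object shown above; the links.bucket value is the one described in step 3):

const metadata = { /* the metadata object shown above */ };

const response = await fetch('https://zenodo.org/api/deposit/depositions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.ZENODO_ACCESS_TOKEN}`
  },
  body: JSON.stringify({ metadata })
});

const deposition = await response.json();

// The bucket UUID (ZENODO_BUCKET_ID) is the last path segment of links.bucket.
const bucketId = deposition.links.bucket.split('/').pop();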

Once published, a deposition cannot be updated or altered. However, you can create a new version of a record (below).

Create and publish a new version of a record

In this case, you already have a record which points to a published deposition (i.e. elasticsearch index files) and wish to create a new version for that record. Here, you'll create a new deposition under the same record:

  1. Make a POST request to https://zenodo.org/api/deposit/depositions/<deposition id>/actions/newversion to create a new version. Alternatively, visit https://zenodo.org/record/<deposition id> where deposition id is that of the latest published version (default).
  2. Fetch https://zenodo.org/api/deposit/depositions?all_versions to list all your depositions and identify the new deposition bucket id.
  3. Proceed to upload (i.e. dump) your new files as described in "Create and publish a new record deposition", Step 3.

Testing

All files under /test will be run by Mocha. You can run npm test to run all tests, or you can run npm test -- -g specific-test-name to run specific tests.

Chai is included to make the tests easier to read and write.
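
For reference, a minimal sketch of a Mocha/Chai test (the test body here is hypothetical, not taken from the repository):

const { expect } = require('chai');

describe('search', function(){
  it('returns an array of results', function(){
    const results = [ { name: 'TP53' } ]; // hypothetical result set

    expect( results ).to.be.an('array').that.is.not.empty;
  });
});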

Publishing a release

  1. Make sure the tests are passing: npm test
  2. Make sure the linting is passing: npm run lint
  3. Bump the version number with npm version, in accordance with semver. The version command in npm updates both package.json and git tags, but note that it uses a v prefix on the tags (e.g. v1.2.3).
  4. For a bug fix / patch release, run npm version patch.
  5. For a new feature release, run npm version minor.
  6. For a breaking API change, run npm version major.
  7. For a specific version number (e.g. 1.2.3), run npm version 1.2.3.
  8. Push the release: git push && git push --tags
  9. Publish a GitHub release so that Zenodo creates a DOI for this version.

grounding-search's People

Contributors

alexanderpico, galessiorob, jvwong, maxkfranz, metincansiper, sacdallago


grounding-search's Issues

Datasource tests failing

@metincansiper

The datasource import tests are failing. See https://travis-ci.org/PathwayCommons/grounding-search/jobs/526196412

ID handling has been improved so any hardcoded Chebi IDs need to be updated in the tests. Chebi IDs that are formatted like 'CHEBI:123' should just use { namespace: 'chebi', id: '123' } from now on. The _id is now CHEBI:123 or NCBI:456.

It looks like some other tests for the datasources are failing for reasons unknown to me. Maybe the tests are too strict / brittle? They shouldn't check for deep equality on the element JSON, for example.

Chunked merging

Our current merging strategy pulls all the relevant entries into memory at once. We could improve our memory usage by using a chunking strategy similar to our update operations:

Basically, we always do our queries for finding entries that should be merged into another entry by reading a single page (w.r.t. query pagination). So, we only query for results 0..N for a chunk size of N.

As we process each chunk, we delete (or mark) the entries that we no longer want to come up in the next chunk. This includes deleting descendants in the case of descendant-to-root merging. It also includes marking alternative root entries in the case of root-to-alternative-root merging.

I've summarised the process in the following page. It includes pseudocode and an example graph.

grounding-merge-2
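
A minimal sketch of the chunk loop described above (the helper names are placeholders, not functions that exist in this repo):

const CHUNK_SIZE = 1000; // illustrative

const mergeInChunks = async () => {
  while( true ){
    // Always read page 0..N; entries already processed have been deleted
    // or marked, so they do not come up again in the next chunk.
    const chunk = await queryMergeCandidates({ from: 0, size: CHUNK_SIZE }); // placeholder

    if( chunk.length === 0 ){ break; }

    for( const entry of chunk ){
      await mergeIntoTarget( entry );       // placeholder: descendant-to-root or root-to-alt-root merge
      await deleteOrMarkProcessed( entry ); // exclude the entry from subsequent chunks
    }
  }
};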

@metincansiper Let me know whether this makes sense or whether I've missed any details. Thanks

Swagger docs

The docs should be shown on the main index.html on the server.

Common elements should be ions by default

Currently, we prefer to list uncharged entities rather than charged ones. This preference should be reversed for elements that are almost always ions in a user's paper:

  • H 1+
  • Na 1+
  • K 1+
  • Mg 2+
  • Ca 2+
  • Fe 3+ then 2+
  • Cu 2+
  • Zn 2+
  • O 2-
  • S 2-
  • P 3-
  • Cl 1-

SARS-CoV support

Issues related to Biofactoid support:
  • Need to support genes from many viruses.
  • Most of what is known relates to other, previously studied coronaviruses (MERS, SARS-CoV) and often other related and unrelated viruses, which may or may not extrapolate.

Refs PathwayCommons/factoid#699 (comment)

Workflow: Redeploy Docker image to development & production instances

GitHub Actions provides (semi-)automated workflows to accomplish tasks following events (e.g. tagging or pushing). We can leverage this to (semi-)automate our software deployments, namely through Docker images.

The following attempts to summarize the different cases and features we'd like:

| Instance | Event | Git Reference | Jobs |
| development | push (i.e. merge PR) | master branch | npm: lint; Docker: build, push (Docker Hub), refresh host |
| production | workflow_dispatch (i.e. manual) | tag | npm: lint; Docker: build, push (Docker Hub), refresh host |

See:

Create release.

Lock down the latest version as release and trigger the DockerHub hook.

Set up on Dockerhub

Once this is ready to be integrated with Factoid: It would be great to have this set up on Dockerhub.

  • One image with the data already included (snapshot by date)
  • One image without the data but configured to download the data during image init

Index UniProt: Xrefs; namespaces

Next steps:

Grounding service updates:

1. Reenable Uniprot within the grounding service. [JVW]

2. Add an option to the aggregate search (i.e. `/search`) s.t. namespaces can be blocked (i.e. uniprot namespaces aren't returned to Biofactoid UI).  This can accept an option to set the block list manually, but we will have a default value s.t. uniprot is blocked by default. [MF]

3. In the uniprot importer, the xrefs need to be included in the returned grounding json.  [JVW]

Convertor pipeline: [MCS]

1. Within the PC import, instead of doing a `/search` (today), do a `/get` (e.g. `/get` for `uniprot:P01234`).

2. Now, you have a uniprot grounding.  So, within it, get the NCBI xref.

3. Query `/get` with the NCBI xref.

4. Use the NCBI grounding for the factoid entity in the `association`.
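
A rough sketch of steps 1-3 (the xref field names here are assumptions for illustration, not the finalised API):

const get = async body => {
  const res = await fetch('http://hostname:port/get', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify( body )
  });

  return res.json();
};

// 1. Get the UniProt grounding directly by ID instead of searching.
const uniprotGrounding = await get({ namespace: 'uniprot', id: 'P01234' });

// 2. Within it, pick out the NCBI xref (assumed field names).
const ncbiXref = uniprotGrounding.dbXrefs.find( x => x.db === 'ncbigene' );

// 3. Get the NCBI grounding to use for the factoid entity in the association.
const ncbiGrounding = await get({ namespace: 'ncbi', id: ncbiXref.id });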

Once done: Run the import on the master.factoid.baderlab.org instance. Merge the data into

Refs: PathwayCommons/factoid#881 (comment)

More SARS-CoV-2 synonyms

To append to patches

| namespace | name | id | synonyms |
| ncbi | NSP3 | 1802476807 | PLpro |
| ncbi | NSP5 | 1802476809 | 3CLpro |

Are we capturing all of the E. coli strains?

e.g. https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=562

Some observations lately:

  1. All the top NCBI genes for E. coli (83333) are failing because the ids seem to have been deprecated
  2. I would have expected all genes with the same name from E. coli to have one record, but this doesn't seem to be the case, e.g.

Rename `type:'protein'`

This should probably be something like type: 'ggp' for "gene or gene product".

A similar change would have to be made in factoid.

Quality test failures

If a quality test case fails, that info could inform improvements, especially if there is a pattern. My suggestion of how to report:

  • State data:
    • text (e.g. 'phospho-p53'), namespace ('ncbi'), id ('7157')
  • Provide a reason (not necessarily mutually exclusive):
    • text is a variation on synonyms
    • text has no match in synonyms
    • text is a common term
    • result is among the top results
    • organisms ordering
    • not in database
    • other: explain

ChEBI: Include 'definition' field

The chebi.owl file contains a 'definition' property that essentially provides a short, human-readable description of the entry. The suggestion is to include this.

Example for 2'-3'-cGAMP:

...
<obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">A cyclic purine dinucleotide that consists of AMP and GMP units cyclised via 3&apos;,5&apos;- and 2&apos;,5&apos;-linkages respectively.</obo:IAO_0000115>

Refs @cannin feedback on chemical information.

Update fail: New ChEBI source files

Error encountered in update from ChEBI when applied to download (ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz).

Reproduce:

  • delete chebi data file (/input/chebi.owl)
  • npm run update:chebi

Console out:

info:    Applying update on source 'chebi'...
info:    Downloading ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz to chebi.owl
info:    Processing chebi data from input/chebi.owl
TypeError: Cannot read property 'replace' of undefined
    at processEntry (.../grounding-search/src/server/datasource/chebi.js:64:43)
    at Array.map (<anonymous>)
    at .../grounding-search/src/server/datasource/processing.js:11:34    
  ...

Other:

  • The failure above left the index in a semi-complete state: some chemicals were indexed but nothing from ncbi

Entities of interest to highlight

A miscellaneous list of entities that it would be nice to test and possibly boost manually.

A lot of this is motivated by the fact that ChEBI does in fact contain a bunch of useful 'generic' chemicals (e.g. messenger RNA) that come up in articles as participants. I wasn't aware of this earlier and may have dismissed articles for capture based on this presumption.

| text | dbPrefix | dbId | comments |
| cap structure | ChEBI | CHEBI:10596 | 7-methylguanylate cap |
| mRNA | ChEBI | CHEBI:33699 | messenger RNA |
| 18S rRNA | NCBI Gene | 100008588 | RNA, (18S) ribosomal |

Merge entities that are associated with strains

For E. coli and yeast strains, merge all entities that have the exact same name.

For example, there are many entries for "CcdB". Each entry is basically the same except for the taxonomy ID. Here is a sample:

[
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39521901",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "C7V14_00585",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39524440",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "EJC48_00625",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39529410",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "U14A_A00031",
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "8877686",
    "organism": "573",
    "name": "ccdB",
    "synonyms": [
      "CcdB toxin protein"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39650970",
    "organism": "621",
    "name": "ccdB",
    "synonyms": [
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "39651896",
    "organism": "622",
    "name": "ccdB",
    "synonyms": [
      "type II toxin-antitoxin system toxin CcdB"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  },
  {
    "namespace": "ncbi",
    "type": "protein",
    "id": "9538168",
    "organism": "562",
    "name": "ccdB",
    "synonyms": [
      "plasmid maintenance protein",
      "toxin component",
      "plasmid maintenance protein; toxin component"
    ],
    "esScore": 12.311079,
    "defaultOrganismIndex": 361,
    "organismIndex": 361,
    "combinedOrganismIndex": 361,
    "distance": 0,
    "nameDistance": 0,
    "overallDistance": 36100000
  }
]

For each entry from NCBI that is associated with a strain taxon ID (e.g. 562):

  1. Check if there is an ancestor entry for the top-level taxon ID (e.g. 83333) with the same name (e.g. ccdB).
  2. If an entry was found, then merge the descendant entry with the ancestor entry.
    1. Add the synonyms from the descendant entry into the ancestor entry, avoiding duplicates.
    2. Add the taxon IDs for the ancestor entry and the descendant entry into entry.organisms, avoiding duplicates.
    3. Add the grounding ID for the ancestor and the descendant to entry.ids, avoiding duplicates.
  3. If there is no ancestor entry, then replace the descendant entry taxon ID with the ancestor taxon ID to make the organism field normalised.

Update the ranking algorithm: The ranking w.r.t. organismOrdering should consider the best match of entry.organism and entry.organisms.
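
A minimal sketch of steps 1-3 above (the db helper names are placeholders; entry.organisms and entry.ids are the merged fields described above):

const mergeStrainEntry = async ( entry, rootTaxonId ) => {
  // 1. Look for an ancestor entry with the same name under the top-level taxon.
  const ancestor = await findEntryByNameAndOrganism( entry.name, rootTaxonId ); // placeholder

  if( ancestor ){
    // 2. Merge the descendant into the ancestor, avoiding duplicates.
    ancestor.synonyms = Array.from( new Set([ ...ancestor.synonyms, ...entry.synonyms ]) );
    ancestor.organisms = Array.from( new Set([ ...( ancestor.organisms || [ ancestor.organism ] ), entry.organism ]) );
    ancestor.ids = Array.from( new Set([ ...( ancestor.ids || [ ancestor.id ] ), entry.id ]) );

    await updateEntry( ancestor ); // placeholder
    await deleteEntry( entry );    // placeholder
  } else {
    // 3. No ancestor: normalise the descendant's organism field to the top-level taxon ID.
    entry.organism = rootTaxonId;
    await updateEntry( entry );    // placeholder
  }
};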

Mapping between UniProt and NCBI Gene namespaces

Factoid uses NCBI Gene uids for the purposes of grounding genes. There are a few reasons why it would be nice to have a helper to map NCBI Gene IDs to/from UniProt:

  1. Importing third-party data into Factoid: Sources like PhosphoSitePlus tag their entities with UniProt
  2. INDRA queries: We use the HGNC ID; for non-human entities, we end up using the node label (string). This leads to very messy, sometimes irrelevant results, since it will tend to return human hits unless the label is not used in humans
  3. REACH typically uses UniProt IDs for grounding
  4. Factoid model-enhancements: UniProt has a lot of desirable information that could be reused in the editor to allow authors to add detail in a constrained but fruitful manner including Sequence variants (aka mutations) and Modification features (aka sites)
  • Design considerations

    • UniProt has a web service
    • Roll our own
      • Use UniProt provided text mapping files
    • Try this using existing data with some clever ElasticSearch?
  • Proposed API

Request

 {
    "db": "uniprot",
    "dbfrom": "ncbigene",
    "id": [
      "9158"
    ]
}

Response

[{
    "dbfrom": "ncbigene",
    "id": "9158",
    "dbXrefs": [
      {
        "db": "uniprot",
        "id": "Q99988"
      }
    ]
}]

or even better:

[{
  "dbfrom": "ncbigene",
  "id": "9158",
  "dbXrefs": [
    {
      "namespace": "uniprot",
      "type": "protein",
      "dbName": "UniProt Knowledgebase",
      "dbPrefix": "uniprot",
      "id": "Q99988",
      "organism": "9606",
      "name": "GDF-15",
      "geneNames": [
        "GDF15",
        "MIC1",
        "PDF",
        "PLAB",
        "PTGFB"
      ],
      "proteinNames": [
        "GDF-15",
        "Growth/differentiation factor 15",
        "MIC-1",
        "NAG-1",
        "NRG-1",
        "Placental TGF-beta",
        "Placental bone morphogenetic protein",
        "Prostate differentiation factor"
      ],
      "synonyms": [
        "Growth/differentiation factor 15",
        "MIC-1",
        "NAG-1",
        "NRG-1",
        "Placental TGF-beta",
        "Placental bone morphogenetic protein",
        "Prostate differentiation factor",
        "GDF15",
        "MIC1",
        "PDF",
        "PLAB",
        "PTGFB"
      ],
      "dbXrefs": [...]
    }
  ]
}]

NCBI: Capture additional metadata fields

Goal is to capture additional NCBI Gene metadata fields that can be useful for downstream data consumers including data export (BioPAX) and software (Factoid).

In particular, the NCBI gene_info file contains

  1. dbXrefs: ID-mappings to external organism databases (e.g. HGNC - human, Araport - Arabidopsis, FLYBASE - Drosophila)
  2. type_of_gene: This is a description of the gene's products (e.g. tRNA, ncRNA, protein-coding etc.)

Example: The non-coding RNA gene

tax_id GeneID Symbol dbXrefs type_of_gene
9606 102800311 TP53COR1 MIM:616343|HGNC:HGNC:43652 ncRNA

Docs:

https://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.mod.dtd

<!ELEMENT Entrezgene_type (%INTEGER;)>
<!ATTLIST Entrezgene_type value (
        unknown |
        tRNA |
        rRNA |
        snRNA |
        scRNA |
        snoRNA |
        protein-coding |
        pseudo |
        transposon |
        miscRNA |
        ncRNA |
        biological-region |
        other
        ) #IMPLIED >

xml-parser: omitList considers only open tag and not closing

Background: The xml-parser provides an omitList that is intended to skip certain tags by name

Issue: The parser considers the omitList upon encountering an open tag, but not the closing tag. This leads to skipping large sections of the tree.

Prerequisite for #88

Search quality test cases

The test cases need to be reviewed. There are 898 test cases that evaluate the quality of the search results. @jvwong has done a great job on the first pass of this data, but it's a lot of data and it's easy to make small mistakes.

So, we should have each of the test cases reviewed to make sure that we have a solid basis for moving forward with improving the quality of the search. So let's divvy it up:

Steps:

  • Make a separate branch or fork for your work.
  • In your branch or fork, make commits to revise erroneous expected groundings in test/util/data/molecular-cell.json.
  • Make sure the organismOrdering is specified for the paper. If the paper is about one organism, say human, it should be "organismOrdering": [9606]. If the paper is about human mostly but also a little about mouse, then it should be "organismOrdering": [9606, 10090].
  • Make a PR when you're done.

Patch file

For quick-and-dirty revisions to the database (e.g. add a missing synonym), use a patch file that specifies a diff to be applied to the data.
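
For example, a single patch entry could look something like the following (the exact file format is left open here; the fields mirror the synonym patches listed elsewhere in this document):

{
  "namespace": "ncbi",
  "id": "1802476807",
  "synonyms": [ "PLpro" ]
}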

Quality tests: Remove 'loose'; make output reporting easier

I want to update the aggregate-quality.js code to make reporting a little easier.

In particular, I want to get rid of the 'loose' values which tend to bury useful information (i.e. rank). Rather, all of this can be assessed at the reporting stage by outputting things like rank (somewhere).

We also need some sort of custom reporter, since most of the time we want the report in CSV form for a spreadsheet.

Review 2

  • @maxkfranz Manually review the results of the grounding service
    • Are the results comparable to the raw Uniprot requests?
    • Determine to what degree the aggregate service functionality in Factoid should be done manually versus by Elasticsearch in this project.
    • Determine whether we should use a common index or one index per data source. (I suspect a single index should be used, but let's see if I change my mind during review of other parts of the system.)
  • Things to revise
    • @maxkfranz Add more basic logging to give feedback when updating the data. It takes a long time, so it would be nice to know when
      • downloading starts,
      • downloading ends,
      • processing starts,
      • processing ends,
      • indexing starts, and
      • indexing ends.
    • @maxkfranz If possible, it would be nice to have a scrollbar for each phase using cli-progress. -- Using plain log messages is enough now that things are faster.
    • @metincansiper The code should be more general to use a single index.
      • search() and get() will not need an index to differentiate datasources, but it will need a namespace filter.
      • clear() and update() need to be aware of other data. clear() should only remove the data for the particular namespace rather than the whole index.
      • For uniprot.update(), we should probably insert entries into Elasticsearch one at a time, or in small batches, so that we don't maintain a large entries array.
      • XML parsing is very slow. We should try a different lib (for example, saxes)
    • @maxkfranz We should use the _score result for the organism tie-breaking logic. Because the result size is relatively small (20-50, say), it's probably fine to do this outside of Elasticsearch. I don't see a straightforward way to incorporate our organism tie-breaking into an Elasticsearch query.

Review

@metincansiper

  • Use path.join() to create paths. Do not concat strings yourself.
  • Use cross-env to set environment variables in npm scripts.
    • Don't use NODE_ENV=test. Make the ci script set env var for the local uniprot test file. Maybe rename the ci script to travis if you want.
  • Use and test new download method. We should not be using bash commands within the js to download files. See latest commit.
    • Test forced/unforced downloads
    • Delete old curl code
  • Add uniprot.get(id)
  • Make sure linting is passing.
  • Enable Travis and add the badge.
  • Document env vars in readme

Once the above are addressed, I think we'll be ready to move on to other uniprot functions and the aggregate service.

`typeOrdering` preference

We have support for the organismOrdering preference, which allows the user to specify his preference w.r.t. close matches (currently only ties). It would be good to also allow for a typeOrdering preference, which would work in a similar way: A tie can be broken by the specified ordering of entity types.

Consider two matches, both with the name "X":

  • X (protein)
  • X (chemical)

They both match the search string "X" perfectly, so we can use the typeOrdering: ['protein', 'chemical'] preference to break the tie by putting the protein first. Text mining systems like REACH are good at providing type hints like this, but they aren't good at providing the correct grounding. So we can use the information provided by REACH for typeOrdering on a per-entity basis, similar to how we use the information provided by REACH for organismOrdering on a per-paper basis.
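
For example, a search request combining the existing organismOrdering preference with the proposed typeOrdering option might look like:

{
  "q": "X",
  "organismOrdering": [ 9606, 10090 ],
  "typeOrdering": [ "protein", "chemical" ]
}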

SARS-CoV-2 Support

The requirement is to provide grounding support for Severe acute respiratory syndrome coronavirus 2 (NCBI:txid2697049).

In this case, it is much more valuable to include the mature protein products, rather than the genes/open reading frames (ORFs).

Details

Notes:

  • what the root organism should be,
  • what the display name of the organism family should be,
  • what filters may be needed, and
  • edge cases (e.g. does "S" work well when the organism is indexed, even though it's only one character).

Refs PathwayCommons/factoid#699

ID mapping for NCBI Gene ID and Uniprot ID

Example for TP53 (human): 7157

In Uniprot XML: <dbReference type="GeneID" id="7157"/>

In NCBI tab-delimited lines (second field): 9606 7157 TP53 - BCC7|BMFS5|LFS1|P53|TRP53 MIM:191170|HGNC:HGNC:11998|Ensembl:ENSG00000141510 17 17p13.1 tumor protein p53 protein-coding TP53 tumor protein p53 O cellular tumor antigen p53|antigen NY-CO-13|mutant tumor protein 53|p53 tumor suppressor|phosphoprotein p53|transformation-related protein 53|tumor protein 53|tumor supressor p53 20190330 -

NCBI tab header:

#tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome map_location description type_of_gene Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority Nomenclature_status Other_designations Modification_date Feature_type

EACCES on out.log causes docker instance to exit

Built and executed as in the readme, not populated with any data, I get the following error when connecting to the root (localhost:3000):

> [email protected] start /usr/src/app
> node ./src/server

info:    GET / 302 10.926 ms - 62
events.js:183
      throw er; // Unhandled 'error' event
      ^

Error: EACCES: permission denied, open 'out.log'
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] start: `node ./src/server`
npm ERR! Exit status 1
npm ERR! 
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/node/.npm/_logs/2019-04-08T16_52_50_021Z-debug.log

This is most likely the fault: https://github.com/PathwayCommons/grounding-search/blob/master/src/server/logger.js#L7

Signal if datasource update failure has occurred

Background: When data sources are indexed, they may fail either partially or entirely. For instance, the source file may not be downloaded. In this case, the build continues to the next source regardless of the integrity of the data in the index.

Goal: It would be helpful to signal that a failure has occurred so that further action can be performed to correct the issue (i.e. rerun or inspect the problem)

Details: The issue can occur with local builds but, more importantly, with Docker images, which 'fail silently'.

Tests: Top NCBI genes for E. coli (taxon id: 83333) are failing

I've re-indexed the updated master branch and noticed that most of the tests for NCBI top genes for E. coli (taxon id: 83333) are failing for 'search' and/or 'get'.

For instance, null is returned from POST to /get the information for E. coli "ssb" in NCBI:

{
	"id": "1263584",
	"namespace": "ncbi"
}

Possibly related? When I try to /get E. coli "Ccdb", something comes back but the organisms values are a mix of types:

POST body:

{
	"id": "1263593",
	"namespace": "ncbi"
}

response:

{
  "namespace": "ncbi",
  "type": "protein",
  "id": "1263593",
  "organism": "83333",
  "organismName": "Escherichia coli",
  "name": "ccdB(letD)",
  "synonyms": [
    "Fpla045",
    "hypothetical protein"
  ],
  "organisms": [
    562,
    "83333"
  ],
  "ids": [
    "1263593"
  ]
}

Use NCBI taxonomy database for `entry.organismName` property

Currently, there are two sets of organism data: One exists in this project, the other in the factoid repo. Because this grounding service does not return organism names, factoid has to do its own lookup. Organisms not in the list of top model organisms are displayed as "Other".

To solve this, we should add a feature to have all the organism information stored only in this project. All organism information that factoid requires should be provided by the grounding service.

This feature is necessary for the following feature to work well: Merge entities that are associated with strains #38

NCBI has a taxon db we can use: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

We should import this data when we do an index. The organisms should be stored separately from the entity entries. Each organism entry should be like the following example:

{
  "id": "4932",
  "name": "Saccharomyces cervisiae",
  "descendantIds": [ '1337652', '1158204', '765312' /*, ... */ ]
}

The following operations should be added in the db:

  • getOrganismById(idToMatch) : resolves a promise to the organism whose id or descendantIds matches idToMatch

The following operations should be added in the indexing procedure:

  • When adding an entity entry into the index, getOrganismById() should be used to put the organismName field in the entry from organism.name.
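
A small sketch of that indexing step (getOrganismById is the operation proposed above; indexEntry is a placeholder for the existing insert step):

const addEntry = async entry => {
  const organism = await getOrganismById( entry.organism ); // matches id or descendantIds

  if( organism ){
    entry.organismName = organism.name;
  }

  await indexEntry( entry ); // placeholder for inserting the entry into Elasticsearch
};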

Allow entities with "-ate" suffix to match to "-ic acid"

There are several entities that end in "ate" that a user would type when he really means the corresponding acid. For example, a user might type "lactate" to mean "lactic acid". Or he might type "citrate" to mean "citric acid".

Since this seems to be a one-way issue, we should allow for the transformation only in the ate-to-acid direction. The acid-to-ate direction should not be allowed.
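
As an illustrative sketch only (not the actual implementation), the one-way expansion could generate an extra query variant:

// Expand "-ate" queries with an "-ic acid" variant, never the reverse.
const ateToAcidVariants = q => {
  if( q.toLowerCase().endsWith('ate') ){
    return [ q, q.slice(0, -3) + 'ic acid' ]; // e.g. "lactate" -> "lactic acid"
  }

  return [ q ];
};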

Update ranking algorithm for strains

This was something that was supposed to be done in the context of #38, but I forgot to implement that last step in the related PR.

Update the ranking algorithm: The ranking w.r.t. organismOrdering should consider the best match of entry.organism and entry.organisms.
