G2P '/genotypephenotype/search' experiences
Summary
We extended the GA4GH Reference server to include the '/genotypephenotype/search' endpoint. This document describes the experience and makes some targeted suggestions for improvement, primarily for the request payload.
Approach
We based our work on the model captured in the ga4gh/schemas commit of Jul 30, 2015. This version of the schema predates the separation of the genotype-to-phenotype files from the baseline.
The code was based on a branch set up for this purpose by the server team.
No major refactoring of the server was needed; additional code was added to ga4gh/backend.py, ga4gh/frontend.py and test/unit/test_views.py.
Data
The cancer genome database Clinical Genomics Knowledge Base published by the Monarch project was the source of Evidence.
API
The GA4GH schemas define a single endpoint, /genotypephenotype/search, which accepts a POST of a request body containing one or more of Feature, PhenotypeInstance, EnvironmentalContext, and Evidence. These are combined as a logical AND to query the underlying datastore. Missing types are treated as wildcards, returning all data. Matching data is returned as a list of FeaturePhenotypeAssociation. All types rely heavily on OntologyTerm.
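The AND-with-wildcard semantics described above can be sketched as follows. This is a minimal illustration, not the reference server's code: the dictionary keys (feature, phenotype, environment, evidence), the function names and the sample records are all hypothetical, and associations are shown as flat dictionaries rather than FeaturePhenotypeAssociation records.

```python
# Hypothetical sketch: AND-combine the supplied criteria; a missing
# criterion acts as a wildcard. Key names are assumptions, not schema fields.
CRITERIA = ("feature", "phenotype", "environment", "evidence")


def matches(association, request):
    """Return True if the association satisfies every supplied criterion."""
    for key in CRITERIA:
        wanted = request.get(key)
        if wanted is None:
            continue  # criterion absent from the request: wildcard
        if association.get(key) != wanted:
            return False  # logical AND: one mismatch rejects the record
    return True


def search(associations, request):
    """Filter a list of association dicts by the request."""
    return [a for a in associations if matches(a, request)]


# Made-up sample data, for illustration only.
associations = [
    {"feature": "BRCA2", "phenotype": "breast cancer", "evidence": "e1"},
    {"feature": "BRCA2", "phenotype": "ovarian cancer", "evidence": "e2"},
    {"feature": "KIT", "phenotype": "breast cancer", "evidence": "e3"},
]
```

An empty request matches every record, while each added criterion can only narrow the result.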
Request
http://yuml.me/edit/bf06b90a
Response
http://yuml.me/edit/25343da1
Implementation
http://yuml.me/c97fada2
Issues
Query by example
There are four query datatypes for each entity: [string, external identifier, ontology identifier and 'entity'].
Currently the implementation handles queries of [string, external identifier and ontology identifier].
The 'entity' query, a type of query-by-example, has been deferred. Challenges that arose:
- schema constraints: there are several fields within the schemas that are defined as non-null. This may be fine when creating an entity from a data store, however, they are problematic when creating an entity to be used in a query.
- additional discussion is needed to determine which properties of an existing entity will be used for the query and which will be ignored. For example, a Feature has [id, parentIds, featureSetId, referenceName, start, end, strand, featureType, attributes]; we need to specify exactly what the query's expectations are.
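One possible reading of query-by-example is sketched below, assuming that unset (None) fields and an agreed list of ignored fields do not constrain the query. The field names come from the Feature listing above; the choice of ignored fields, the function name and the sample values are hypothetical and would need to be settled by the discussion mentioned above.

```python
# Hypothetical sketch of query-by-example over a Feature-like dict.
# Which fields are ignored is an assumption made only for illustration.
IGNORED_FIELDS = frozenset(["id", "attributes"])


def matches_example(example, candidate, ignored=IGNORED_FIELDS):
    """True if candidate agrees with every set, non-ignored example field."""
    for field, wanted in example.items():
        if field in ignored or wanted is None:
            continue  # unset or ignored fields do not constrain the query
        if candidate.get(field) != wanted:
            return False
    return True


# Placeholder values; not taken from any real dataset.
example = {"id": None, "referenceName": "chr13", "featureType": "gene", "start": None}
candidate = {
    "id": "f1", "parentIds": [], "featureSetId": "fs1",
    "referenceName": "chr13", "start": 100, "end": 200,
    "strand": "+", "featureType": "gene", "attributes": {},
}
```

Under this reading the non-null schema constraints remain a problem, since an example entity has to leave most fields unset.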
Ontology Queries
- The 'ontologySource' is assumed to be equivalent to an ontology's 'prefix'. However, no agreement or mechanism exists to align ontologySource to a specific ontology. We recommend collapsing ontologySource and identifier into a single URI.
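A minimal sketch of the recommended collapse, assuming a prefix-to-base-URI table exists. The table and function name are hypothetical; only the MeSH base URI is taken from this document, and a real deployment would need an agreed registry of prefixes.

```python
# Hypothetical prefix table: only the MeSH base URI appears in this
# document; other entries would have to be agreed on separately.
PREFIX_TO_BASE = {
    "mesh": "http://www.ncbi.nlm.nih.gov/mesh/",
}


def to_uri(ontology_source, identifier):
    """Collapse (ontologySource, identifier) into one URI string."""
    base = PREFIX_TO_BASE.get(ontology_source.lower())
    if base is None:
        # No agreed mapping for this prefix: fail loudly instead of guessing.
        raise KeyError("no base URI registered for %r" % ontology_source)
    return base + identifier
```

With a single URI the server no longer has to guess which ontology a prefix refers to.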
Name collision (SearchFeaturesResponse)
The schema contains two definitions of the class SearchFeaturesResponse. How are these handled in the generated code in _protocol_definitions.py? (Currently I only see one.)
The schema project the current server is based on is version = '0.6.be171b00'
Snippets from this commit follow
- One in the file genotypephenotypemethods.avdl, protocol GenotypePhenotypeMethods
    /** This is the response from `POST /genotypephenotype/search` expressed as JSON. */
    record SearchFeaturesResponse {
      /**
      The list of matching FeaturePhenotypeAssociation.
      */
      array<org.ga4gh.models.FeaturePhenotypeAssociation> associations = [];
      ...
- The second one is found in sequenceAnnotationmethods.avdl
    /** This is the response from `POST /features/search` expressed as JSON. */
    record SearchFeaturesResponse {
      /**
      The list of matching annotations, sorted by start position. Annotations which
      share a start position are returned in a deterministic order.
      */
      array<org.ga4gh.models.Feature> features = [];
      ...
- The generated code only has the class associated with sequenceAnnotationmethods.avdl
    def __init__(self):
        self.features = []
        self.nextPageToken = None
Both sequenceAnnotationmethods.avdl and genotypephenotypemethods.avdl share the same namespace, @namespace("org.ga4gh.methods"), and each file defines an enclosing protocol.
From the names section of the Avro spec:
A name only is specified, i.e., a name that contains no dots. In this case the namespace is taken from the most tightly enclosing schema or protocol. For example, if "name": "X" is specified, and this occurs within a field of the record definition of org.foo.Y, then the fullname is org.foo.X. If there is no enclosing namespace then the null namespace is used.
I'm assuming that the schemas pass validation...
A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.)
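The collision can be illustrated by pooling record definitions by fullname. This is a sketch under stated assumptions, not the real schema compiler: the (namespace, name, file) triples are hand-written stand-ins for parsed .avdl definitions, and the function names are hypothetical. Whether Avro itself reports an error depends on whether the two protocols are validated together or separately; the sketch only shows what happens once their definitions share one registry.

```python
from collections import defaultdict

# Hand-written stand-ins for the two record definitions quoted above.
definitions = [
    ("org.ga4gh.methods", "SearchFeaturesResponse", "genotypephenotypemethods.avdl"),
    ("org.ga4gh.methods", "SearchFeaturesResponse", "sequenceAnnotationmethods.avdl"),
]


def fullname(namespace, name):
    """A dotted name is already full; otherwise prepend the enclosing namespace."""
    if "." in name or not namespace:
        return name
    return namespace + "." + name


def find_collisions(defs):
    """Map each fullname defined more than once to the files defining it."""
    seen = defaultdict(list)
    for namespace, name, source in defs:
        seen[fullname(namespace, name)].append(source)
    return {full: files for full, files in seen.items() if len(files) > 1}


# A registry keyed by fullname keeps only the last definition. That would be
# consistent with a single generated class, but it is an assumption about
# the cause, not a confirmed diagnosis.
registry = {fullname(ns, name): source for ns, name, source in definitions}
```

Renaming one of the records, or giving each protocol its own namespace, would make the two fullnames distinct.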
TODO
Pull Request Prep
General clean up. Additional Tests.
MS Literome adapter
Create a facade to interact with MS:Literome. See http://literome.azurewebsites.net
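A first step toward such a facade could be a helper that builds the request URL. The sketch below only constructs the URL and makes no network call. The base URL and the parameter names snporgene and diseaseordrug are taken from the example requests in the Literome Feedback section; the function name is hypothetical.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in the example requests in this document.
LITEROME_GWAS_URL = "http://literome.azurewebsites.net/gwas/get"


def build_gwas_url(snp_or_gene, disease_or_drug):
    """Build the GET URL for a gene/SNP and disease/drug pair."""
    if not disease_or_drug:
        # The service currently rejects an empty diseaseOrDrug.
        raise ValueError("diseaseOrDrug cannot be empty")
    params = {"snporgene": snp_or_gene, "diseaseordrug": disease_or_drug}
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return LITEROME_GWAS_URL + "?" + urlencode(params, quote_via=quote)
```

Calling build_gwas_url("BRCA2", "Breast Diseases") reproduces the BRCA2 request URL shown in the feedback examples.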
CIViC Client
Angular UI and Node reverse proxy
Literome Feedback
Allow the API to accept an optional diseaseOrDrug and return the first 100 potential associations
http://literome.azurewebsites.net/gwas/get?snporgene=BRCA2
{"ClassName":"System.ArgumentException","Message":"'diseaseOrDrug' cannot be empty.",...}
Accept dbSNP ids on par with gene name
http://literome.azurewebsites.net/gwas/get?snporgene=rs80359550&diseaseordrug=Breast%20Diseases
{"Associations":[],"Abstracts":[]}
Allow disease name flexibility
http://literome.azurewebsites.net/gwas/get?snporgene=BRCA2&diseaseordrug=Breast%20Diseases
{"Associations":[{"SnpOrGeneType":....}
http://literome.azurewebsites.net/gwas/get?snporgene=BRCA2&diseaseordrug=Breast%20Disease
{"Associations":[],"Abstracts":[]}
Accept Entrez ID for gene
http://literome.azurewebsites.net/gwas/get?snporgene=675&diseaseordrug=Breast%20Diseases
{"Associations":[{"SnpOrGeneType":....}
Use Drug ontology ids
DiseaseOrDrugId: "PA443559"
PA443559 equivalent_to http://www.ncbi.nlm.nih.gov/mesh/D001941