cvisb / cvisb_data


Data portal and API for Center for Viral Systems Biology (CViSB) data

Home Page: https://data.cvisb.org/home

License: MIT License

Python 35.41% JavaScript 0.19% TypeScript 42.33% HTML 13.16% R 3.00% SCSS 5.92%
biology systems-biology systemsbiology viral-genomics viral-metagenomics viral-allele viral-diversity viral-ngs ebola ebola-outbreak


cvisb_data's Issues

Redo `&elisa` queries to search /experiment, not /patient

Currently, ELISA data is attached to /patient data; however, it makes more sense to define ELISA data as separate Experiments in /experiment. The cross-endpoint &elisa queries will need to be modified to reference the /experiment endpoint.

NOTE: ELISA queries are nested queries and require special parsing in order to execute properly.

https://data.cvisb.org/api/patient/query?q=__all__&elisa=[[elisa.virus.keyword:Lassa%20AND%20elisa.assayType.keyword:Ag%20AND%20elisa.ELISAresult.keyword:negative%20AND%20elisa.timepoint.keyword:%22patient%20admission%22]]
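For reference, a minimal sketch of how the [[...]] wrapper might be unpacked into an Elasticsearch nested query once the data lives in /experiment; the nested path ("elisa") and the parsing here are assumptions, not the portal's actual parser:

    import re

    def parse_elisa_param(raw: str) -> dict:
        # Strip the [[ ]] wrapper, then hand the inner Lucene string to
        # ES as a nested query so the elisa.* fields match per-object.
        inner = re.sub(r"^\[\[|\]\]$", "", raw)
        return {
            "nested": {
                "path": "elisa",
                "query": {"query_string": {"query": inner}},
            }
        }

    q = parse_elisa_param(
        '[[elisa.virus.keyword:Lassa AND elisa.assayType.keyword:Ag]]'
    )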

Data download broken in some circumstances

main-es2015.6ad4d7d983597843effc.js:1 ERROR TypeError: Cannot read properties of undefined (reading 'blurry_vision')

Steps to reproduce:

  1. Filter Lassa
  2. Filter 2019-2020
  3. Download

Fix cross-endpoint queries so they can be combined

Currently, if there are multiple cross-endpoint queries, only the second one gets executed. Requires rewriting the query parser to separate and combine queries. At least at first, queries will probably all be AND'd.

Returns 1832 results as of 2 December 2019:
https://data.cvisb.org/api/patient/query?q=__all__&elisa=[[elisa.virus.keyword:Lassa AND elisa.assayType.keyword:Ag AND elisa.ELISAresult.keyword:negative AND elisa.timepoint.keyword:"patient admission"]]

Returns 369 results as of 2 December 2019:
https://data.cvisb.org/api/patient/query?q=__all__&experimentQuery=includedInDataset:hla

ERROR: the combined query returns only the second query's results:
https://data.cvisb.org/api/patient/query?q=__all__&elisa=[[elisa.virus.keyword:Lassa%20AND%20elisa.assayType.keyword:Ag%20AND%20elisa.ELISAresult.keyword:negative%20AND%20elisa.timepoint.keyword:%22patient%20admission%22]]&experimentQuery=includedInDataset:hla
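A minimal sketch of the AND-combination step, assuming each cross-endpoint parameter has already been parsed into its own Elasticsearch sub-query (the parser itself is hypothetical):

    def combine_queries(parsed: list) -> dict:
        # bool/must is Elasticsearch's AND: every sub-query must match,
        # so neither cross-endpoint filter silently drops the other.
        return {"query": {"bool": {"must": parsed}}}

    combined = combine_queries([
        {"query_string": {"query": "elisa.virus.keyword:Lassa"}},
        {"query_string": {"query": "includedInDataset:hla"}},
    ])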

Issues with indexing by Google Dataset Search

Just starting a thread to track notes on whether CViSB datasets are being indexed on Google Dataset Search.

Currently, there are five datasets on data.cvisb.org (all listed in https://data.cvisb.org/assets/sitemap.xml):

Two datasets are indexed (SARS-CoV-2, HLA) (https://datasetsearch.research.google.com/search?query=site%3Adata.cvisb.org)


Google Search Console reports 1 error, 0 "valid with warning" and 0 "valid" (https://search.google.com/search-console/datasets?resource_id=https%3A%2F%2Fdata.cvisb.org%2F). Oddly, the one error is for the HLA dataset (one of the successfully-indexed datasets). The error relates to having an object of type Organization under Citation.
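For reference, schema.org's citation property on Dataset expects a CreativeWork or Text, which would explain why an Organization there is flagged. A hedged sketch of the shape Google accepts (values are placeholders), with the organization nested inside the citation rather than used as the citation itself:

    {
      "@type": "Dataset",
      "citation": {
        "@type": "ScholarlyArticle",
        "name": "...",
        "publisher": {"@type": "Organization", "name": "..."}
      }
    }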


Using the Rich Results Test tool, that error shows up for 3 datasets (Ebola, Lassa, HLA); of those three, only HLA is successfully indexed in Google Dataset Search. Two datasets (SARS-CoV-2 and systems serology) show up as "Page is eligible for rich results", but only systems serology is successfully indexed. The URL Inspection tool in Google Search Console confirms that the datasets are successfully detected; I just requested re-indexing in the hope that those datasets will show up in Google Dataset Search (though I seem to recall doing this before).


And one last note: at different times, I have seen all five datasets successfully indexed, and at other times only three. As far as I know, we have not changed anything on our end that would explain those changes. From now on, I'll try to track that here...

Data source updates -- spring 2019

Patients

  • survivor roster
  • Ebola survivor data
  • Lassa acute roster
  • Lassa acute data

waiting...

  • Lassa survivor data
  • Ebola acute roster
  • Ebola acute data

Samples

  • June/July 2018
  • January 2019
  • Tulane
  • TSRI-BS

waiting...

  • TSRI-distribution

Datasets

  • HLA
  • Serology
  • Viral Seq

Data

HLA

  • HLA

Serology

  • Serology

Viral Seq

  • AA alignment
  • SNPs

remove non-public pages from sitemap?

If we don't allow search engines to crawl these pages' content, I think there's no reason to include them in our sitemap.xml? Thinking specifically of these lines...

<url><loc>https://data.cvisb.org/sample</loc></url>

<url><loc>https://data.cvisb.org/upload</loc></url>
<url><loc>https://data.cvisb.org/upload/dataset</loc></url>
<url><loc>https://data.cvisb.org/upload/patient</loc></url>
<url><loc>https://data.cvisb.org/upload/sample</loc></url></urlset>

Easy change to make of course, but just want to be sure I'm not missing something...

Query parser: allow combinations of AND'd experimental queries

HLA modifications

  • Add #s to bar graphs
  • Dim on hover, like patient view
  • fix swoopy arrow?

Successfully validate Experiment:data data

Currently, schema_conversion.py cannot validate Experiment:data. Ideal behavior: Experiment:data will be of type HLAData OR ViralSeqData OR PiccoloData....

Right now, it's confused because it doesn't know which schema to validate against; it successfully validates against the generic Data schema and PiccoloData (since it has no required properties).

Solution: require within each data schema an @type property with a unique value per type, so each data instance effectively routes to the proper schema for validation. Remove the generic Data schema.
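A minimal sketch of the proposed routing, using jsonschema's oneOf with a required, constant @type per data schema (the schemas here are toy stand-ins for the real HLAData / ViralSeqData definitions):

    from jsonschema import ValidationError, validate

    HLA_DATA = {
        "type": "object",
        "properties": {"@type": {"const": "HLAData"}},
        "required": ["@type"],
    }
    VIRAL_SEQ_DATA = {
        "type": "object",
        "properties": {"@type": {"const": "ViralSeqData"}},
        "required": ["@type"],
    }

    # oneOf passes only when exactly one branch validates; a unique,
    # required @type per schema guarantees unambiguous routing.
    EXPERIMENT_DATA = {"oneOf": [HLA_DATA, VIRAL_SEQ_DATA]}

    validate({"@type": "HLAData"}, EXPERIMENT_DATA)  # passes
    try:
        validate({"@type": "Mystery"}, EXPERIMENT_DATA)
    except ValidationError:
        print("no data schema matched")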

create dev.cvisb.org/robots.txt

Google's crawler is apparently getting confused between our production site at data.cvisb.org and our dev site at dev.cvisb.org. Specifically, when it finds two copies of the same dataset, which dataset URL actually gets indexed is unpredictable (see screenshot below). The solution (hopefully) will be to create dev.cvisb.org/robots.txt and disallow all crawlers.
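The disallow-all robots.txt is just two lines:

    User-agent: *
    Disallow: /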

related to #67

[screenshot]

Add SARS-CoV-2 sequencing data

Create dataset, data download, experiment, and patient metadata records for SARS-CoV-2 sequences to upload into the CViSB data portal.

Include both public gIDs, sIDs in public form of `alternateIdentifier`

Right now, only the patientID gets transferred over to alternateIdentifier, so any connections to the other IDs will be lost.

  1. Need to store public identifiers separately from private ones within the data model.
  2. Need to alter the private --> public conversion function to transfer all the public IDs to alternateIdentifier (sketched below).
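A minimal sketch of step 2, with hypothetical field names (patientID, gID, sID) standing in for however the data model ends up storing them:

    def public_alternate_identifiers(private_record: dict) -> list:
        # Collect every public-facing ID, not just patientID, so the
        # links between gIDs/sIDs and the patient survive conversion.
        ids = [private_record.get("patientID")]
        ids += private_record.get("gID", [])
        ids += private_record.get("sID", [])
        return [i for i in ids if i]  # drop missing values

    record = {"patientID": "G-0001", "gID": ["g-12"], "sID": ["s-34"]}
    public = {"alternateIdentifier": public_alternate_identifiers(record)}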

Add SARS-CoV-2 systems serology data

Create dataset, data download, experiment, and patient metadata records for the SARS-CoV-2 systems serology data from the Alter lab to upload into the CViSB data portal.

facet_size=10000 limit

The facet size limit is set to 10,000; queries for all patient IDs, etc. may begin to approach it.
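If the limit does start to bite, one workaround is Elasticsearch's composite aggregation, which pages through all buckets instead of returning one capped terms facet. A sketch of the request body (the field name is an assumption):

    request_body = {
        "size": 0,
        "aggs": {
            "ids": {
                "composite": {
                    "size": 1000,
                    "sources": [
                        {"pid": {"terms": {"field": "patientID.keyword"}}}
                    ],
                    # follow-up requests add "after": <after_key from the
                    # previous response> to fetch the next page of buckets
                }
            }
        },
    }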

Add route parsing on route change for patient page

Currently, when the user navigates away from the patient page to another page, the filters persist. Either:

  • clear the filters on route change,
    or
  • rehighlight the selected filters on page change.

Also change/parse the URL when filters are applied.

Schedule public/private syncing of data

Right now, the public-ification script has to be called manually to synchronize the public data with the private data. This should be scheduled to run automatically.
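For example, a nightly cron entry could trigger the sync (the script path and schedule here are assumptions):

    # hypothetical: run the private -> public sync nightly at 02:00
    0 2 * * * python3 /opt/cvisb/sync_public.py >> /var/log/cvisb_sync.log 2>&1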

Redo schema_conversion.py to spit out config file

Rather than copying/pasting the public field names into config_cvisb_endpoints.py, save the output of schema_conversion.py to a config file that gets referenced in the endpoints config file (or something else that avoids manual copying/pasting).
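A minimal sketch of the idea, with a hypothetical get_public_fields() standing in for whatever schema_conversion.py computes today:

    import json

    def get_public_fields():
        # Hypothetical stand-in for schema_conversion.py's real output:
        # a mapping of endpoint -> public field names.
        return {"patient": ["patientID", "cohort"], "sample": ["sampleID"]}

    def write_field_config(path="public_fields.json"):
        # Dump the per-endpoint public field names once, so
        # config_cvisb_endpoints.py can load this file instead of
        # relying on hand-copied lists.
        with open(path, "w") as f:
            json.dump(get_public_fields(), f, indent=2, sort_keys=True)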

Increase capacity of Biothings facet function

Aggregation within Elasticsearch is much more powerful than what is currently available in the Biothings package; the first useful thing to port over would be COUNT DISTINCT functionality. Averages, medians, etc. might also be useful.
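For reference, the raw Elasticsearch aggregations behind those asks (field names are assumptions): cardinality is ES's approximate COUNT DISTINCT, and avg/percentiles cover averages and medians:

    count_distinct = {
        "size": 0,
        "aggs": {"n_patients": {"cardinality": {"field": "patientID.keyword"}}},
    }
    averages = {
        "size": 0,
        "aggs": {
            "mean_age": {"avg": {"field": "age"}},
            "median_age": {"percentiles": {"field": "age", "percents": [50]}},
        },
    }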

Streamline upload process on the backend

Uploading large chunks of data is a pain: there's no good way to queue the data to be uploaded, and due to the complexity of the .json validation before ES insertion, 300 records take ~5 min to upload.

There are at least a few limits to queuing large amounts of data:

  1. The front-end has a limit on how much data it can store in memory for uploading.
  2. The backend can only accept about 1 MB (I think) per request before it complains; as a result, the front-end currently parses the file into ~1 MB chunks to send to the backend.
  3. On the prod server, if there are too many simultaneous requests, the multiprocessing queue can get mixed up and the same record can be inserted into the index multiple times.

Ideally, we could queue a bunch of records and let the upload run overnight. This may involve moving away from the front-end interface, but we'd still have the problem of multiprocessing inserting duplicates.
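A minimal sketch of one direction this could take: a single server-side worker draining a queue, with the record's own identifier used as the document id so re-inserts overwrite instead of duplicating (all names here are hypothetical):

    import queue
    import threading

    upload_queue = queue.Queue()

    def index_record(doc_id, record):
        # Hypothetical stand-in for the ES insert; indexing with an
        # explicit _id makes repeated inserts idempotent (no duplicates).
        print(f"indexing {doc_id}")

    def worker():
        # One worker = one writer, sidestepping the multiprocessing
        # queue getting mixed up under simultaneous requests.
        while True:
            record = upload_queue.get()
            index_record(record["identifier"], record)
            upload_queue.task_done()

    threading.Thread(target=worker, daemon=True).start()
    upload_queue.put({"identifier": "sample-001"})
    upload_queue.join()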

Pre-calculate stats on HLA data on backend

Periodically (when the private --> public function is called?) generate static stats for the HLA data on the backend; store them and serve them to the front-end. The stats will be generated by an R script.

URL character limit for backend queries

When executing a query that limits patientIDs to a subset, in certain cases with a large number of IDs (>50) the query doesn't execute because the URL string is too long. At the moment this only affects a small number of queries, so it's not high priority.
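The usual workaround would be to move long ID lists out of the query string and into a POST body; a sketch, assuming the endpoint accepts POST (which it may not today):

    import requests

    ids = ["ID-%04d" % i for i in range(200)]
    resp = requests.post(
        "https://data.cvisb.org/api/patient/query",
        data={"q": " OR ".join('patientID:"%s"' % i for i in ids)},
    )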

Provide more graceful errors for upload errors with memory issues

Related to #38 (front-end problems with big uploads). Occasionally, when there is a large number of documents to be added to the backend, the upload will fail, initially returning a 520 error followed by some 503s. It becomes challenging to decipher which records were uploaded successfully and which failed, and to understand why the upload failed.

Store and display subnational data for patients

For SARS-CoV-2 sequencing patients, we need to be able to display state / city info.

Complications:

  1. It's often unclear whether a location is the patient's home location or their healthcare provider's.
  2. For privacy reasons, we need to be able to control which information gets exposed to the public. If the info is stored in the homeLocation object, we can't make the admin3 / city data available to the public. Requires rewriting either the schema or the private-to-public Python function (sketched below).
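A minimal sketch of the redaction step in the private-to-public function; the homeLocation / admin3 / city names follow the issue, but the allowed fields are assumptions:

    PUBLIC_LOCATION_FIELDS = {"country", "admin1"}  # state level and up

    def redact_home_location(home_location: dict) -> dict:
        # Keep only coarse geography; admin3 / city never reach the
        # public index.
        return {k: v for k, v in home_location.items()
                if k in PUBLIC_LOCATION_FIELDS}

    public_loc = redact_home_location(
        {"country": "USA", "admin1": "California", "city": "San Diego"}
    )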
