cvisb / cvisb_data


Data portal and API for Center for Viral Systems Biology (CViSB) data

Home Page: https://data.cvisb.org/home

License: MIT License

Python 35.41% JavaScript 0.19% TypeScript 42.33% HTML 13.16% R 3.00% SCSS 5.92%
biology systems-biology systemsbiology viral-genomics viral-metagenomics viral-allele viral-diversity viral-ngs ebola ebola-outbreak


cvisb_data's Issues

Redo `&elisa` queries to search /experiment, not /patient

Currently, ELISA data is attached to /patient data; however, it makes more sense to define ELISA data as separate Experiments in /experiment. The cross-endpoint &elisa queries will need to be modified to reference the /experiment endpoint.

NOTE: ELISA queries are nested queries and require special parsing in order to execute properly.

https://data.cvisb.org/api/patient/query?q=__all__&elisa=[[elisa.virus.keyword:Lassa%20AND%20elisa.assayType.keyword:Ag%20AND%20elisa.ELISAresult.keyword:negative%20AND%20elisa.timepoint.keyword:%22patient%20admission%22]]
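For reference, a minimal sketch of how the [[...]] wrapper might be unpacked into an Elasticsearch nested query once the data lives in /experiment; the nested path ("elisa") and the parsing here are assumptions, not the portal's actual parser:

    import re

    def parse_elisa_param(raw: str) -> dict:
        # Strip the [[ ]] wrapper, then hand the inner Lucene string to
        # ES as a nested query so the elisa.* fields match per-object.
        inner = re.sub(r"^\[\[|\]\]$", "", raw)
        return {
            "nested": {
                "path": "elisa",
                "query": {"query_string": {"query": inner}},
            }
        }

    q = parse_elisa_param(
        '[[elisa.virus.keyword:Lassa AND elisa.assayType.keyword:Ag]]'
    )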

Data download broken in some circumstances

main-es2015.6ad4d7d983597843effc.js:1 ERROR TypeError: Cannot read properties of undefined (reading 'blurry_vision')

Steps to reproduce:

  1. Filter Lassa
  2. Filter 2019-2020
  3. Download

Fix cross-endpoint queries so they can be combined

Currently, if there are multiple cross-endpoint queries, only the second one gets executed. Requires rewriting the query parser to separate and combine queries. At least at first, queries will probably all be AND'd.

Returns 1832 results as of 2 December 2019:
https://data.cvisb.org/api/patient/query?q=__all__&elisa=[[elisa.virus.keyword:Lassa AND elisa.assayType.keyword:Ag AND elisa.ELISAresult.keyword:negative AND elisa.timepoint.keyword:"patient admission"]]

Returns 369 results as of 2 December 2019:
https://data.cvisb.org/api/patient/query?q=__all__&experimentQuery=includedInDataset:hla

ERROR: the combined query returns only the second query's results:
https://data.cvisb.org/api/patient/query?q=__all__&elisa=[[elisa.virus.keyword:Lassa%20AND%20elisa.assayType.keyword:Ag%20AND%20elisa.ELISAresult.keyword:negative%20AND%20elisa.timepoint.keyword:%22patient%20admission%22]]&experimentQuery=includedInDataset:hla
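A minimal sketch of the AND-combination step, assuming each cross-endpoint parameter has already been parsed into its own Elasticsearch sub-query (the parser itself is hypothetical):

    def combine_queries(parsed: list) -> dict:
        # bool/must is Elasticsearch's AND: every sub-query must match,
        # so neither cross-endpoint filter silently drops the other.
        return {"query": {"bool": {"must": parsed}}}

    combined = combine_queries([
        {"query_string": {"query": "elisa.virus.keyword:Lassa"}},
        {"query_string": {"query": "includedInDataset:hla"}},
    ])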

Issues with indexing by Google Dataset Search

Just starting a thread to track notes on whether CViSB datasets are being indexed on Google Dataset Search.

Currently, there are five datasets on data.cvisb.org (all listed in https://data.cvisb.org/assets/sitemap.xml):

Two datasets are indexed (SARS-CoV-2, HLA) (https://datasetsearch.research.google.com/search?query=site%3Adata.cvisb.org)


Google Search Console reports 1 error, 0 "valid with warning" and 0 "valid" (https://search.google.com/search-console/datasets?resource_id=https%3A%2F%2Fdata.cvisb.org%2F). Oddly, the one error is for the HLA dataset (one of the successfully-indexed datasets). The error relates to having an object of type Organization under Citation.
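For reference, schema.org's citation property on Dataset expects a CreativeWork or Text, which would explain why an Organization there is flagged. A hedged sketch of the shape Google accepts (values are placeholders), with the organization nested inside the citation rather than used as the citation itself:

    {
      "@type": "Dataset",
      "citation": {
        "@type": "ScholarlyArticle",
        "name": "...",
        "publisher": {"@type": "Organization", "name": "..."}
      }
    }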


Using the Rich Results Test tool, that error shows up for 3 datasets (Ebola, Lassa, HLA); of those three, only HLA is successfully indexed in Google Dataset Search. Two datasets (SARS-CoV-2 and systems serology) show up as "Page is eligible for rich results", but only systems serology is successfully indexed. The URL Inspection tool in Google Search Console confirms that the datasets are successfully detected; I just requested re-indexing in the hope that those datasets will show up in Google Dataset Search (though I seem to recall doing this before).


And one last note: at different times, I have seen all five datasets successfully indexed, and at other times only three. As far as I know, we have not changed anything on our end that would explain those changes. From now on, I'll try to track that here...

Data source updates -- spring 2019

Patients

  • survivor roster
  • Ebola survivor data
  • Lassa acute roster
  • Lassa acute data

waiting...

  • Lassa survivor data
  • Ebola acute roster
  • Ebola acute data

Samples

  • June/July 2018
  • January 2019
  • Tulane
  • TSRI-BS

waiting...

  • TSRI-distribution

Datasets

  • HLA
  • Serology
  • Viral Seq

Data

HLA

  • HLA

Serology

  • Serology

Viral Seq

  • AA alignment
  • SNPs

remove non-public pages from sitemap?

If we don't allow search engines to crawl these pages' content, I think there's no reason to include them in our sitemap.xml? Thinking specifically of these lines...

<url><loc>https://data.cvisb.org/sample</loc></url>

<url><loc>https://data.cvisb.org/upload</loc></url>
<url><loc>https://data.cvisb.org/upload/dataset</loc></url>
<url><loc>https://data.cvisb.org/upload/patient</loc></url>
<url><loc>https://data.cvisb.org/upload/sample</loc></url></urlset>

Easy change to make of course, but just want to be sure I'm not missing something...

Query parser: allow combinations of AND'd experimental queries

HLA modifications

  • Add #s to bar graphs
  • Dim on hover, like patient view
  • fix swoopy arrow?

Successfully validate Experiment:data data

Currently, schema_conversion.py cannot validate Experiment:data. Ideal behavior: Experiment:data will be of type HLAData OR ViralSeqData OR PiccoloData....

Right now, it's confused because it doesn't know which schema to validate against; it successfully validates against the generic Data schema and PiccoloData (since it has no required properties).

Solution: require within each data schema an @type property with a unique value per type, so each data instance effectively routes to the proper schema for validation. Remove the generic Data schema.
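A minimal sketch of the proposed routing, using jsonschema's oneOf with a required, constant @type per data schema (the schemas here are toy stand-ins for the real HLAData / ViralSeqData definitions):

    from jsonschema import ValidationError, validate

    HLA_DATA = {
        "type": "object",
        "properties": {"@type": {"const": "HLAData"}},
        "required": ["@type"],
    }
    VIRAL_SEQ_DATA = {
        "type": "object",
        "properties": {"@type": {"const": "ViralSeqData"}},
        "required": ["@type"],
    }

    # oneOf passes only when exactly one branch validates; a unique,
    # required @type per schema guarantees unambiguous routing.
    EXPERIMENT_DATA = {"oneOf": [HLA_DATA, VIRAL_SEQ_DATA]}

    validate({"@type": "HLAData"}, EXPERIMENT_DATA)  # passes
    try:
        validate({"@type": "Mystery"}, EXPERIMENT_DATA)
    except ValidationError:
        print("no data schema matched")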

create dev.cvisb.org/robots.txt

Google's crawler is apparently getting confused between our production site at data.cvisb.org and our dev site at dev.cvisb.org. Specifically, when it finds two copies of the same dataset, which dataset URL actually gets indexed is unpredictable (see screenshot below). The solution (hopefully) will be to create dev.cvisb.org/robots.txt and disallow all crawlers.
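The disallow-all robots.txt is just two lines:

    User-agent: *
    Disallow: /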

related to #67

[screenshot]

Add SARS-CoV-2 sequencing data

Create dataset, data download, experiment, and patient metadata records for SARS-CoV-2 sequences to upload into the CViSB data portal.

Include both public gIDs, sIDs in public form of `alternateIdentifier`

Right now, only the patientID gets transferred over to alternateIdentifier, so any connections to the other IDs will be lost.

  1. Need to store public identifiers separately from private ones within the data model.
  2. Need to alter the private --> public conversion function to transfer all the public IDs to alternateIdentifier (sketched below).
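A minimal sketch of step 2, with hypothetical field names (patientID, gID, sID) standing in for however the data model ends up storing them:

    def public_alternate_identifiers(private_record: dict) -> list:
        # Collect every public-facing ID, not just patientID, so the
        # links between gIDs/sIDs and the patient survive conversion.
        ids = [private_record.get("patientID")]
        ids += private_record.get("gID", [])
        ids += private_record.get("sID", [])
        return [i for i in ids if i]  # drop missing values

    record = {"patientID": "G-0001", "gID": ["g-12"], "sID": ["s-34"]}
    public = {"alternateIdentifier": public_alternate_identifiers(record)}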

Add SARS-CoV-2 systems serology data

Create dataset, data download, experiment, and patient metadata records for the SARS-CoV-2 systems serology data from the Alter lab to upload into the CViSB data portal.

facet_size=10000 limit

The facet size limit is set to 10,000; queries for all patient IDs, etc. may begin to approach it.
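If the limit does start to bite, one workaround is Elasticsearch's composite aggregation, which pages through all buckets instead of returning one capped terms facet. A sketch of the request body (the field name is an assumption):

    request_body = {
        "size": 0,
        "aggs": {
            "ids": {
                "composite": {
                    "size": 1000,
                    "sources": [
                        {"pid": {"terms": {"field": "patientID.keyword"}}}
                    ],
                    # follow-up requests add "after": <after_key from the
                    # previous response> to fetch the next page of buckets
                }
            }
        },
    }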

Add route parsing on route change for patient page

Currently, when the user navigates away from the patient page to another page, the filters persist. Either:

  • clear the filters on route change,
    or
  • rehighlight the selected filters on page change.

Also change/parse the URL when filters are applied.

Schedule public/private syncing of data

Right now, the public-ification script has to be called manually to synchronize the public data with the private data. This should be scheduled to run automatically.
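For example, a nightly cron entry could trigger the sync (the script path and schedule here are assumptions):

    # hypothetical: run the private -> public sync nightly at 02:00
    0 2 * * * python3 /opt/cvisb/sync_public.py >> /var/log/cvisb_sync.log 2>&1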

Redo schema_conversion.py to spit out config file

Rather than copying/pasting the public field names into config_cvisb_endpoints.py, save the output of schema_conversion.py to a config file that gets referenced in the endpoints config file (or something else that avoids manual copying/pasting).
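A minimal sketch of the idea, with a hypothetical get_public_fields() standing in for whatever schema_conversion.py computes today:

    import json

    def get_public_fields():
        # Hypothetical stand-in for schema_conversion.py's real output:
        # a mapping of endpoint -> public field names.
        return {"patient": ["patientID", "cohort"], "sample": ["sampleID"]}

    def write_field_config(path="public_fields.json"):
        # Dump the per-endpoint public field names once, so
        # config_cvisb_endpoints.py can load this file instead of
        # relying on hand-copied lists.
        with open(path, "w") as f:
            json.dump(get_public_fields(), f, indent=2, sort_keys=True)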

Increase capacity of Biothings facet function

Aggregation within Elasticsearch is much more powerful than what is currently available in the Biothings package; the first useful thing to port over would be COUNT DISTINCT functionality. Averages, medians, etc. might also be useful.
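For reference, the raw Elasticsearch aggregations behind those asks (field names are assumptions): cardinality is ES's approximate COUNT DISTINCT, and avg/percentiles cover averages and medians:

    count_distinct = {
        "size": 0,
        "aggs": {"n_patients": {"cardinality": {"field": "patientID.keyword"}}},
    }
    averages = {
        "size": 0,
        "aggs": {
            "mean_age": {"avg": {"field": "age"}},
            "median_age": {"percentiles": {"field": "age", "percents": [50]}},
        },
    }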

Streamline upload process on the backend

Uploading large chunks of data is a pain: there's no good way to queue the data to be uploaded, and due to the complexity of the .json validation before ES insertion, 300 records take ~5 min to upload.

There are at least a few limits to queuing large amounts of data:

  1. The front-end has a limit on how much data it can store in memory for uploading.
  2. The backend can only accept about 1 MB (I think) per request before it complains; as a result, the front-end currently parses the file into ~1 MB chunks to send to the backend.
  3. On the prod server, if there are too many simultaneous requests, the multiprocessing queue can get mixed up and the same record can be inserted into the index multiple times.

Ideally, we could queue a bunch of records and let the upload run overnight. This may involve moving away from the front-end interface, but we'd still have the problem of multiprocessing inserting duplicates.
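A minimal sketch of one direction this could take: a single server-side worker draining a queue, with the record's own identifier used as the document id so re-inserts overwrite instead of duplicating (all names here are hypothetical):

    import queue
    import threading

    upload_queue = queue.Queue()

    def index_record(doc_id, record):
        # Hypothetical stand-in for the ES insert; indexing with an
        # explicit _id makes repeated inserts idempotent (no duplicates).
        print(f"indexing {doc_id}")

    def worker():
        # One worker = one writer, sidestepping the multiprocessing
        # queue getting mixed up under simultaneous requests.
        while True:
            record = upload_queue.get()
            index_record(record["identifier"], record)
            upload_queue.task_done()

    threading.Thread(target=worker, daemon=True).start()
    upload_queue.put({"identifier": "sample-001"})
    upload_queue.join()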

Pre-calculate stats on HLA data on backend

Periodically (when the private --> public function is called?) generate static stats for the HLA data on the backend; store them and serve them to the front-end. The stats will be generated by an R script.

URL character limit for backend queries

When executing a query that limits patientIDs to a subset, in certain cases with a large number of IDs (>50) the query doesn't execute because the URL string is too long. At the moment this only affects a small number of queries, so it's not high priority.
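The usual workaround would be to move long ID lists out of the query string and into a POST body; a sketch, assuming the endpoint accepts POST (which it may not today):

    import requests

    ids = ["ID-%04d" % i for i in range(200)]
    resp = requests.post(
        "https://data.cvisb.org/api/patient/query",
        data={"q": " OR ".join('patientID:"%s"' % i for i in ids)},
    )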

Provide more graceful errors for upload errors with memory issues

Related to #38 (front-end problems with big uploads). Occasionally, when there is a large number of documents to be added to the backend, the upload will fail, initially returning a 520 error followed by some 503s. It becomes challenging to decipher which records were uploaded successfully and which failed, and to understand why the upload failed.

Store and display subnational data for patients

For SARS-CoV-2 sequencing patients, we need to be able to display state / city info.

Complications:

  1. It's often unclear whether a location is the patient's home location or their healthcare provider's.
  2. For privacy reasons, we need to be able to control which information gets exposed to the public. If the info is stored in the homeLocation object, we can't make the admin3 / city data available to the public. Requires rewriting either the schema or the private-to-public Python function (sketched below).
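A minimal sketch of the redaction step in the private-to-public function; the homeLocation / admin3 / city names follow the issue, but the allowed fields are assumptions:

    PUBLIC_LOCATION_FIELDS = {"country", "admin1"}  # state level and up

    def redact_home_location(home_location: dict) -> dict:
        # Keep only coarse geography; admin3 / city never reach the
        # public index.
        return {k: v for k, v in home_location.items()
                if k in PUBLIC_LOCATION_FIELDS}

    public_loc = redact_home_location(
        {"country": "USA", "admin1": "California", "city": "San Diego"}
    )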
