biothings / myvariant.info

provide annotation based on ACMG guidelines
The ACMG guidelines are widely used to interpret variants.
We could provide variant classification results based on the ACMG guidelines.
https://www.acmg.net/docs/standards_guidelines_for_the_interpretation_of_sequence_variants.pdf

Which config file?

When installing myvariant and testing, it asks for a BioThings config file. Which file should we use or how should we configure it? Thanks.

Generate and store list of _id in s3

Output file: a list of all _id values for each MyVariant assembly. This feature was deactivated in 2cf6144 after switching to the cold/hot collection design.

With cold/hot, the full merged collection never exists in mongo, so the only efficient way to generate such a list is to take the cold and hot cache files and sort/uniq them (some hot _ids are already in cold) to create the output file.
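
A minimal sketch of that sort/uniq step, assuming each cache file is plain text with one _id per line (the file format and the function names here are assumptions, not the hub's actual code):

```python
import heapq
from itertools import groupby


def merge_id_lists(cold_ids, hot_ids):
    """Yield the sorted union of two iterables of _id strings, without duplicates."""
    # Sort each input, then merge the two sorted streams lazily.
    merged = heapq.merge(sorted(cold_ids), sorted(hot_ids))
    # groupby collapses adjacent duplicates (hot _ids that are already in cold).
    for _id, _ in groupby(merged):
        yield _id


def write_id_file(cold_path, hot_path, out_path):
    """Read one _id per line from both cache files and write the deduplicated list."""
    with open(cold_path) as cold, open(hot_path) as hot, open(out_path, "w") as out:
        cold_ids = (line.strip() for line in cold if line.strip())
        hot_ids = (line.strip() for line in hot if line.strip())
        for _id in merge_id_lists(cold_ids, hot_ids):
            out.write(_id + "\n")
```

Note that sorted() holds each list in memory; for the full set of _ids an external sort (e.g. sort -u on the concatenated files) is probably the more practical choice.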

Note: this file is used by the ClinGen team to generate CAIDs for myvariant

the counts on the data sources table (in the docs) not updating correctly

In this table: http://docs.myvariant.info/en/latest/doc/data.html#data-sources

CIViC auto upload

CIViC data is loaded through an API query. The upload should be triggered every month.

home page has logo with a left-handed helix

switch to right-handed helix

Update dbNSFP parser

dbNSFP parser needs to be updated for version 4.0b1a
Details can be found at: https://sites.google.com/site/jpopgen/dbNSFP

Cosmic mutation frequency information seems limited/arbitrary

Thank you for this amazing resource!

We are in the process of adding selected relevant information from myvariant.info to CIViC (civicdb.org).

While considering options, we hoped to add cosmic mutation frequency. But the mutation frequency available appears to be the frequency for a single tumor site, perhaps chosen arbitrarily from several possibilities?

Consider this example (which seems representative of other variants in myvariant.info):
http://myvariant.info/v1/variant/chr7:g.140453136A%3ET

This is BRAF V600E.
http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=476

The mutation frequency information returned for COSMIC is:
"mut_freq": 2.83,
"tumor_site": "biliary_tract"

See attached:

This seems odd. How is this being determined? Would it be possible to determine overall mutation frequency across all tumor_sites, and then for each tumor_site and perhaps return the top site(s) and their frequencies?

Our relevant CIViC github issues are:
griffithlab/civic-server#243
griffithlab/civic-server#38

For now we will move on without using the COSMIC info but it would be great to have more options to select from here.

variant normalization

Hi,
We are wondering how variant normalization is done in myvariant.info. When you import the variants from each database, do you do any sort of internal variant normalization, or just take the chr, pos, ref, alt directly from the source?

Thanks

snpeff annotation switch to use GRCh37 and GRCh38 reference genomes

Currently, we use the "hg19" and "hg38" reference genomes (from UCSC) to produce snpeff annotations. The result lacks a distinct "gene_id" field (its value is the same as "gene_name"). We can switch to the GRCh37 and GRCh38 reference genomes available here:

https://sourceforge.net/projects/snpeff/files/databases/v4_3/

We could also upgrade the snpeff version we use.

usage stats on front page not updating

minor note -- I just noticed that usage stats on the front page of mygene.info are updating, but not so for myvariant.info. (last stamp is 2016-11-15...)

indexing premerge/hot data sends notification "done" when only premerge collection is indexed

... it should send the notification only when both premerge and hot collections are indexed

snpeff ann field is sometimes a list, sometimes an object

The format of the field ann, nested in snpeff, is a list in variants like chr1:g.35367G>A, and an object in variants like chr7:g.140453136A>T. When parsing the output, this complicates the mapping of keys and values. Was this intended?

Thanks!
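
Until the format is settled, a client can treat both shapes uniformly. A minimal sketch, assuming the response has already been parsed into a Python dict (the helper names are hypothetical):

```python
def as_list(value):
    """Wrap a single object in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def snpeff_annotations(doc):
    """Return snpeff.ann as a list, whether the API returned a list or an object."""
    return as_list(doc.get("snpeff", {}).get("ann"))
```

With this, iterating over snpeff_annotations(doc) works the same way for chr1:g.35367G>A and chr7:g.140453136A>T.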

facet query on cadd fails (unittest MyVariantTest.test_query_facets)

http://myvariant.info/v1/query?q=cadd.gene.gene_id:ENSG00000113368&facets=cadd.polyphen.cat&size=0

gives

{
"success": false,
"error": "Could not execute query due to the following exception(s): ['illegal_argument_exception Fielddata is disabled on text fields by default. Set fielddata=true on [cadd.polyphen.cat] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.']"
}

Need to update the CADD mapping (and rebuild the pre-merge/cold collection)
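
Following the hint in the error message ("use a keyword field instead"), the mapping change could look roughly like the fragment below. This is a sketch assuming Elasticsearch 5+ mapping syntax; the variable and helper names are hypothetical, and the real CADD mapping contains many more fields:

```python
# Hypothetical fragment of the CADD mapping: declare cadd.polyphen.cat as a
# keyword field so it can be used for facets/aggregations without fielddata.
cadd_mapping_fragment = {
    "cadd": {
        "properties": {
            "polyphen": {
                "properties": {
                    "cat": {"type": "keyword"},
                }
            }
        }
    }
}


def field_type(mapping, dotted_path):
    """Look up the declared type of a dotted field path in a mapping dict."""
    node = {"properties": mapping}
    for part in dotted_path.split("."):
        node = node["properties"][part]
    return node["type"]
```

A small check such as field_type(cadd_mapping_fragment, "cadd.polyphen.cat") can be run before rebuilding to confirm that faceted fields are not declared as text.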

clingen.caid should be indexed

We now have the ClinGen CA id loaded for the hg38 index, so the clingen.caid field should be indexed (as "string_lowercase").

get_hgvs_from_vcf & get_pos_start_end doesn't handle cases where REF is None

Example:
CHROM: 22
POS: 18898839
REF: A
ALT: NONE

This example comes from dbSNP v151
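
One possible guard, sketched under the assumption that a missing allele can arrive either as Python None or as a placeholder string such as "NONE" or "." (the function name is hypothetical and is not part of hgvs.py):

```python
def has_usable_alleles(ref, alt):
    """Return True only if both REF and ALT carry an actual allele string."""
    for allele in (ref, alt):
        # Treat None and common placeholder strings as "no allele".
        if allele is None or str(allele).strip().upper() in {"", ".", "NONE"}:
            return False
    return True
```

A parser could call this before get_hgvs_from_vcf and skip (or log) records that fail the check.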

update field-specific notes for some data fields

Those notes can be added here:

https://github.com/biothings/myvariant.info/blob/master/src/web/context/myvariant_field_table_notes.json

They will then be rendered in the "Notes" column of the available-fields table in the docs:

http://docs.myvariant.info/en/latest/doc/data.html#available-fields

add incidence rates in TCGA data

evaluate denovo-db and add if appropriate

https://academic.oup.com/nar/article/45/D1/D804/2770653/denovo-db-a-compendium-of-human-de-novo-variants
http://denovo-db.gs.washington.edu/denovo-db/Download.jsp
http://denovo-db.gs.washington.edu/denovo-db/Usage.jsp

from https://github.com/monarch-initiative/MDAC/issues/2

Discrepancies in returned COSMIC ids

Hi!

We've noticed that there seem to be some inconsistencies with the COSMIC data being returned in the variant annotation service.

Here's an example query:

GET myvariant.info/v1/variant/chr4:g.55141036T>C?fields=cosmic,mutdb

And the response:

{
    "_id": "chr4:g.55141036T>C",
    "_version": 2,
    "cosmic": {
        "alt": "C",
        "chrom": "4",
        "cosmic_id": "COSM1430077",
        "hg19": {
            "end": 55141036,
            "start": 55141036
        },
        "mut_freq": 0.14,
        "mut_nt": "T>C",
        "ref": "T",
        "tumor_site": "large_intestine"
    },
    "mutdb": {
        "alt": "C",
        "chrom": "4",
        "cosmic_id": "85787",
        "hg19": {
            "end": 55141036,
            "start": 55141036
        },
        "mutpred_score": -1,
        "ref": "T",
        "rsid": null,
        "strand": "p"
    }
}

The cosmic id returned under the cosmic top-level key (body['cosmic']['cosmic_id']) doesn't match the cosmic id returned under the mutdb top-level key (body['mutdb']['cosmic_id']). Additionally, the cosmic id returned in the cosmic section isn't a valid cosmic id at all, while the one in the mutdb section appears to be the correct one for the variant in question.

I assume this is likely to come from discrepancies in the underlying data sources, but it was a little surprising to find a non-existent cosmic id in the cosmic section.
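
To check other variants for the same mismatch, a comparison along these lines may help. It is a sketch that assumes both sections are single objects (not lists) and that the only formatting difference is an optional "COSM" prefix; the function names are hypothetical:

```python
def cosmic_numeric_id(value):
    """Strip an optional 'COSM' prefix and return the remaining id as a string."""
    text = str(value).strip().upper()
    return text[4:] if text.startswith("COSM") else text


def cosmic_ids_agree(doc):
    """Return True/False if both ids are present, or None if either is missing."""
    cosmic = doc.get("cosmic", {}).get("cosmic_id")
    mutdb = doc.get("mutdb", {}).get("cosmic_id")
    if cosmic is None or mutdb is None:
        return None
    return cosmic_numeric_id(cosmic) == cosmic_numeric_id(mutdb)
```

For the response above, cosmic_ids_agree compares "1430077" with "85787" and returns False.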

change example query

In the "Query Examples" of the myvariant.info home page, we currently show http://myvariant.info/v1/variant/chr1:g.35367G>A for annotation retrieval. But, that specific variant has a rather limited set of annotation sources. I'd suggest choosing another variant that better highlights as many of the annotation resources as possible.

MyVariant.info release notes should have anchors for each release

MyVariant.info release notes are here:

http://docs.myvariant.info/en/latest/doc/release_changes.html

It would be handy to add the anchor (for the direct URL) to each release, something like this:

http://docs.myvariant.info/en/latest/doc/release_changes.html#release-20190226

and even deeper into each of hg19 and hg38 release notes:

http://docs.myvariant.info/en/latest/doc/release_changes.html#release-20190226-hg19
http://docs.myvariant.info/en/latest/doc/release_changes.html#release-20190226-hg38

When the hash exists, it should expand the specific release note content.

The rendering of the "anchor" can be made the same as the other anchors on this page, e.g. this one:

http://docs.myvariant.info/en/latest/doc/release_changes.html#myvariant-releases

(the anchor icon will show up when mouse-over)

The same changes can be applied to docs.mygene.info and docs.mychem.info as well.

Production stability

Hi,

Great work with variant info project.
In fact, I was part of the hackathon where you guys came up with this.

I am wondering how stable this is now and what your future plans are.
Any plans for integrating with mygene.info or making it a more stable service on its own?

Thanks,
Nikhil

Parser for new data source: FIRE

https://sites.google.com/site/fireregulatoryvariation/

add github link to front page of myvariant.info

also applies to mygene and mychem...

Adding computed VEP variant annotations to MyVariant.info

VEP: http://uswest.ensembl.org/info/docs/tools/vep/index.html

Similar to the SnpEff annotation we have already, VEP is a tool to compute variant impact.

The logic of get_pos_start_end and _normalize_vcf is conflicting

Use case: try to normalize vcf before using the get_pos_start_end function.

Problem:
In the case of deletion: REF -> TTTCTTTTTCTTTTTCTTTTTCTTTCTT, ALT -> TG
_normalize_vcf would trim the first T from both REF and ALT

However, get_pos_start_end asserts the first nucleotide in both REF and ALT is the same
see: https://github.com/biothings/myvariant.info/blob/master/src/utils/hgvs.py#L150

These two functions cannot be used together to handle deletion cases.
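
The conflict can be reproduced with simplified stand-ins for the two functions. The sketch below is not the code in hgvs.py: trim_common_prefix only mimics the trimming described above, shares_anchor_base mimics the assertion, and the position 100 is made up for illustration:

```python
def trim_common_prefix(pos, ref, alt):
    """Drop leading bases shared by REF and ALT, shifting POS accordingly."""
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt


def shares_anchor_base(ref, alt):
    """The condition the deletion branch asserts: same first nucleotide."""
    return ref[0] == alt[0]


ref, alt = "TTTCTTTTTCTTTTTCTTTTTCTTTCTT", "TG"
before = shares_anchor_base(ref, alt)            # True: both start with T
pos, ref_trimmed, alt_trimmed = trim_common_prefix(100, ref, alt)
after = shares_anchor_base(ref_trimmed, alt_trimmed)  # False: T vs G
```

After trimming, ALT is just "G", so the shared anchor base that the assertion relies on is gone.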

ExAC mapping

The mapping file for ExAC contains a small problem. The ac_hom field should be put in 'ac' rather than 'hom'.
Potential solutions:

change the mapping
add an additional field called 'ac_hom' under 'hom'

dbSNP download site change (Maybe?)

Currently, the newest release of dbSNP is v152. Our latest version in MyVariant.info is v151.

We download from: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/
The last update time for the file is: 4/22/2018 (v151)

v152 is stored in: ftp://ftp.ncbi.nih.gov/snp/latest_release/VCF

Also, from v152, dbSNP provides the JSON version of the data dump:
ftp://ftp.ncbi.nih.gov/snp/latest_release/JSON

Related post regarding the change from dbSNP: https://ncbiinsights.ncbi.nlm.nih.gov/2017/07/07/dbsnp-redesign-supports-future-data-expansion/

query variants with genename

Hi,
One task I'd like to run with myvariant.info is to return all variants in a gene. For example TP53, so I tried
http://myvariant.info/v1/query?q=TP53&fields=_id
which returns with count of 5918.
I also tried querying with the Ensembl ID
http://myvariant.info/v1/query?q=ENSG00000141510&fields=_id
which returns nothing.
Then I tried
http://myvariant.info/v1/query?q=dbnsfp.ensembl.geneid:ENSG00000141510&fields=_id
http://myvariant.info/v1/query?q=cadd.gene.gene_id:ENSG00000141510&fields=_id
which returns 3318 and 4539.

So my question is: when I just search for TP53, which fields are searched exactly? It seems the default query in Elasticsearch searches the _all field? And why can't I get any results back with just the Ensembl ID? Is a range query a better way to get all variants related to a gene? What is the best way to do this task with the myvariant.info API?

Thank you very much
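
For reference, the fielded queries above can be generated programmatically. This sketch only builds the URLs (no request is sent), and gene_query_url is a hypothetical helper; it does not answer which fields the default search covers:

```python
from urllib.parse import urlencode

BASE_URL = "http://myvariant.info/v1/query"


def gene_query_url(field, value, fields="_id"):
    """Build a query URL; pass field=None for an unfielded (default) search."""
    query = f"{field}:{value}" if field else value
    return BASE_URL + "?" + urlencode({"q": query, "fields": fields})


urls = [
    gene_query_url(None, "TP53"),
    gene_query_url("dbnsfp.ensembl.geneid", "ENSG00000141510"),
    gene_query_url("cadd.gene.gene_id", "ENSG00000141510"),
]
```

Comparing the counts returned for each fielded URL shows how much each source contributes to the gene-level result.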

load data from ClinGen VCI database

Matt Wright and Jimmy Zhen from the ClinGen team seemed interested in this idea at the CIViC hackathon. Need to reach out to them for more info on logistics...

Data source: gwas catalog

https://www.ebi.ac.uk/gwas/docs/file-downloads

load cytoband data from UCSC

Suggested by Beth Pitel (I think) at the CIViC Hackathon...

I think the data is in this file http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/cytoBand.txt.gz from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/

update wellderly data

Just noticed that the data from http://myvariant.info/v1/variant/chr12:g.1299226A%3EG?fields=wellderly differs from what's available from https://genomics.scripps.edu/browser/#. The allele/genotype frequencies are different, and they also separate out the illumina data from complete genomics.

Looks like the VCFs are here: https://genomics.scripps.edu/browser/files/wellderly/vcf/

Incorrect HGVS name creation from VCF file

Better format when using both always_list and allow_null options?

In the recent release, I noticed that there are some handy new features, including the always_list and allow_null options. But when they are used in combination, the result is probably not in the nicest format. Instead of returning an empty list [] when there's no data, it returns a list containing a single null, like so: [null].

This will cause some confusion on the client side, since usually you would check whether the returned list is empty, as opposed to checking whether each element in the list is null.

A sample request to reproduce this error would be:

https://myvariant.info/v1/query?q=rs12131234&fields=dbsnp&always_list=dbsnp.gene&allow_null=dbsnp.gene

I'm wondering if it's possible to change this behaviour? Thanks.
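
As a client-side workaround in the meantime, the nulls can be filtered out after parsing. A minimal sketch, assuming the hit has been decoded into a Python dict (get_list is a hypothetical helper):

```python
def get_list(doc, dotted_path):
    """Return the value at dotted_path as a list, with None entries removed."""
    node = doc
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return []
        node = node[part]
    if node is None:
        return []
    if isinstance(node, list):
        return [item for item in node if item is not None]
    return [node]
```

With this, get_list(hit, "dbsnp.gene") is empty both when the field is absent and when the API returns [null].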

Chembl data parser fixes

sebastienlelong [2:51 PM]
@ChunleiWu also I see a lot of CHEBI:None in chembl: http://mychem.info/v1/drug/GWNBDVRVUYBAGA-UHFFFAOYSA-N?fields=chembl.chebi_par_id

dbSNP parser missing variants

A user reports that variant "rs281865162" is missing in MyVariant.info. The problem comes from the dbSNP parser:
In https://github.com/biothings/myvariant.info/blob/master/src/hub/dataload/sources/dbsnp/dbsnp_vcf_parser.py#L60, we specifically remove all variants which are not single nucleotide deletions. Not sure if this is on purpose.
Need @newgene to confirm.

live query API does not work for some ClinVar RCVs

RCV000008604, RCV000008605, RCV000008606 and RCV000008607 share one variant (ClinVar variation 8131, also called measureSet id and variant id in their xml file). The API works for RCV000008604 only, but not for any of the others. The call used was mv.querymany(['RCV000008604'], scopes='clinvar.rcv_accession', fields='clinvar.clinvar_id')

data source: PheWAS catalog

https://phewascatalog.org/phewas

I think that should contain all the data in supp tables 3 and 8 in https://www.nature.com/articles/nbt.2749, but would be good to double check.

Also, this is from an older 2013 paper. After this is loaded, it would be good to check with the Vanderbilt team (e.g. Lisa Bastarache) whether any other relevant large-scale data are available...

how to query a position with a POST

I would like to query the following variants using POST (i.e. on http://myvariant.info/v1/query):

q="chr1:54844G>A,chr1:61987A>G,chr1:61989G>C,chr1:86018C>G,chr1:86303G>T"

I've tried the above parameters, but it returns the following:

[
  {
    "query": "chr1:54844G>A",
    "notfound": true
  }
]

I understand that I also need to input a scope in order to make it work but I'm not sure what the scope should be in this case...

Thanks
Ismail
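
One thing that stands out is that the queried strings lack the "g." that appears in the _id format used elsewhere (e.g. chr1:g.35367G>A). Whether that is the cause of the notfound result is an assumption; the sketch below only rewrites the shorthand into that format and does not send any request (to_hgvs_id is a hypothetical helper):

```python
import re

# Matches shorthand substitutions such as "chr1:54844G>A".
_SHORTHAND = re.compile(r"^(chr\w+):(\d+)([ACGT])>([ACGT])$")


def to_hgvs_id(variant):
    """Rewrite 'chr1:54844G>A' as 'chr1:g.54844G>A'; leave other strings unchanged."""
    match = _SHORTHAND.match(variant.strip())
    if not match:
        return variant
    chrom, pos, ref, alt = match.groups()
    return f"{chrom}:g.{pos}{ref}>{alt}"


raw = "chr1:54844G>A,chr1:61987A>G,chr1:61989G>C,chr1:86018C>G,chr1:86303G>T"
ids = ",".join(to_hgvs_id(v) for v in raw.split(","))
```

The resulting comma-separated string can then be tried as the value of the query parameter in the POST body.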

Unable to run clinvar_xml_parser dataloader

The clinvar_xml_parser.py data loader is referencing a clinvar or clinvar1 import that is not listed in the requirements:
https://github.com/SuLab/myvariant.info/blob/master/src/dataload/contrib/clinvar/clinvar_xml_parser.py#L5

It's changed from clinvar to clinvar1 - is the clinvar library that does the parseString() call available from you or is it a separate 3rd party lib to be installed?

https://github.com/SuLab/myvariant.info/blob/master/src/dataload/contrib/clinvar/clinvar_xml_parser.py#L315
record_parsed = clinvar1.parseString(record, silence=1)

Add version number for gnomAD in MyVariant new release

http://myvariant.info/v1/metadata
It looks like the version number and license info are missing for gnomAD.

load VICC-harmonized data

From Alex Wagner, this link https://s3-us-west-2.amazonaws.com/g2p-0.10/index.html has the current release of the VICC-harmonized data (described in https://www.biorxiv.org/content/early/2018/07/11/366856). It is subject to change as that manuscript goes through peer review. But once that's done and the data set is finalized, seems like a good source to import. (obviously we already have civic data directly, but this resource will provide access to several other sources as well in a standardized format.)

cc @ahwagner

VCF to HGVS conversion as datatransform edge

Wrap this function https://github.com/biothings/myvariant.info/blob/master/src/utils/hgvs.py#L88 as a new "compute" edge (compute a result from input data instead of looking up data from mongodb)
