
gnomad_methods's Introduction

Hail utilities for gnomAD


This repo contains a number of Hail utility functions and scripts for the gnomAD project and the Translational Genomics Group. As we continue to expand the size of our datasets, we are constantly seeking ways to reduce the complexity of our workflows and to make these functions more generic. As a result, the interface for many of these functions will change over time as we generalize their implementation for more flexible use within our scripts. We are also continuously adapting our code to regular changes in the Hail interface. These repos thus represent only a snapshot of the gnomAD code base and are shared without guarantees or warranties.

We therefore encourage users to browse through the API reference to identify modules and functions that will be useful in their own pipelines, and to edit and reconfigure relevant code to suit their particular analysis and QC needs.

gnomad_methods's People

Contributors

averywpx, berylc, ch-kr, chrisvittal, danking, dependabot[bot], gtiao, jkgoodrich, jmarshall, klaricch, koalaqin, konradjk, lfrancioli, macarthurlab, matren395, mattsolo1, mike-w-wilson, mkanai, nawatts, qingbowang, raynamharris, theferrit32, tpoterba, williamphu, wlu04


gnomad_methods's Issues

Remove init scripts?

Now that this package can be installed on a Dataproc cluster using hailctl dataproc start cluster --packages gnomad, are gnomad-init.sh and master-init.sh still necessary?

How to select subset sites

I have a list of sites identified by their genomic coordinates. How can I select these sites in your Hail table? I would like to know the syntax.
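One possible approach (a sketch; the table path is a placeholder and the reference genome should match your data):

import hail as hl

ht = hl.read_table("gs://path/to/gnomad.sites.ht")  # placeholder path

# Parse the coordinates of interest into loci (GRCh37 shown as an example).
sites = ["16:89803958", "16:89805000"]
loci = hl.literal(sites).map(lambda s: hl.parse_locus(s, reference_genome="GRCh37"))

# Keep only the rows whose locus is in the list.
subset_ht = ht.filter(loci.contains(ht.locus))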

Include changelog in docs site

The changelog should appear as a page on the documentation site.

Complicated by the fact that the changelog is written in Markdown instead of reStructuredText.
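One option (a sketch, assuming the docs site is built with Sphinx) is to register a Markdown parser such as myst-parser so the Markdown changelog can be built alongside the reStructuredText pages:

# Sketch of a Sphinx conf.py change (assumes the docs are built with Sphinx):
# register myst-parser so Markdown sources are built alongside reStructuredText.
extensions = [
    "myst_parser",
]

source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}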

BlockMatrixResource import ignores overwrite argument

BlockMatrixResource's import_resource method ignores the overwrite argument and always passes overwrite=False to BlockMatrix.write.

def import_resource(self, overwrite: bool = True, **kwargs) -> None:
    """
    Imports the BlockMatrixResource using its import_func and writes it in its path.

    :param overwrite: If ``True``, overwrite an existing file at the destination.
    :param kwargs: Any additional parameters to be passed to BlockMatrix.write
    :return: Nothing
    """
    # Bug: the `overwrite` argument is ignored and False is always passed to write.
    self.import_func(**self.import_args).write(self.path, overwrite=False, **kwargs)
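A one-line fix (a sketch, not a committed patch) would be to forward the argument:

# Sketch: forward the caller's overwrite flag instead of hard-coding False.
self.import_func(**self.import_args).write(self.path, overwrite=overwrite, **kwargs)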

StructExpression instance has no field 'gq_stats' in compute_stratified_sample_qc

Thank you so much for making gnomad_methods public and releasing a Python package! It's super helpful and will make it much easier to set up a QC pipeline for our own datasets.

I was wondering about the usage of the compute_stratified_sample_qc method. It starts by subsetting the MatrixTable entry fields to the GT field, leaving out the GQ and DP fields:

if gt_col is not None:
    mt = mt.select_entries(GT=mt[gt_col])
else:
    mt = mt.select_entries("GT")

However, a bit further down in that function, sample_qc is called:

for strat in strata:
    strat_sample_qc_ht = hl.sample_qc(mt.filter_rows(mt[strat])).cols()

This uses GQ and DP to calculate dp_stats and gq_stats respectively, which are required by merge_sample_qc_expr, called a few lines further down:

sample_qc_ht = sample_qc_ht.annotate(
    sample_qc=merge_sample_qc_expr(list(sample_qc_ht.row_value.values()))
)

Because GQ and DP are not kept, I'm getting this error at this line:

sample_qc_expr[metric].annotate(n=sample_qc_expr.n_called)

KeyError: "StructExpression instance has no field 'gq_stats'\n    Hint: use 'describe()' to show the names of all data fields."

Am I missing something? Should DP and GQ be added to mt.select_entries in compute_stratified_sample_qc, as well as to compute_sample_qc in gnomad_qc?
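For what it's worth, this is roughly the change I would have expected (a sketch only, assuming the entry fields are named GQ and DP):

# Sketch (not the library's current code): keep GQ and DP alongside GT so that
# hl.sample_qc can compute gq_stats and dp_stats downstream.
if gt_col is not None:
    mt = mt.select_entries(GT=mt[gt_col], GQ=mt.GQ, DP=mt.DP)
else:
    mt = mt.select_entries("GT", "GQ", "DP")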

VEP configuration is tied to US region

The VEP configuration files that get_vep_config points to are specific to the US region. There are other replicates of the VEP data in the Europe and Australia regions. get_vep_config should point to the same replicate that Hail uses, which can be found using /usr/share/google/get_metadata_value attributes/VEP_REPLICATE.
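For illustration, the replicate could be read along these lines (a sketch; the helper name is hypothetical, and the metadata path is the one mentioned above):

import subprocess

def get_vep_replicate() -> str:
    # Sketch: read the VEP replicate configured for this Dataproc cluster,
    # falling back to "us" if the metadata value is unavailable.
    try:
        return (
            subprocess.check_output(
                ["/usr/share/google/get_metadata_value", "attributes/VEP_REPLICATE"]
            )
            .decode()
            .strip()
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "us"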

Document working with freq array

The freq array is a frequent (😄) cause of confusion. We should document how to work with it using the global metadata fields.
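For example, documentation along these lines could help (a sketch; it assumes the release tables expose freq together with the freq_meta and freq_index_dict globals, and the table path is a placeholder):

import hail as hl

ht = hl.read_table("gs://path/to/gnomad.release.sites.ht")  # placeholder path

# freq_index_dict maps a label (e.g. a subset or population) to an index into
# the per-variant freq array; the exact keys vary by release.
freq_index_dict = hl.eval(ht.freq_index_dict)
print(freq_index_dict)

# Annotate each variant with the allele frequency for one label.
label = "adj"  # choose a label that appears in the printed dict
ht = ht.annotate(AF_adj=ht.freq[freq_index_dict[label]].AF)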

Use with specific Hail version on Dataproc

Because Hail is specified as a dependency in setup.py, hailctl dataproc start a-cluster --packages gnomad always installs the latest PyPI release of Hail. Most of the time this is not an issue, since the default is to use the latest release. However, it does override any specific version set with the --wheel argument.

Default Variant Filtering Browser vs. Hail Table

We are studying three gene families of interest, so we would like to examine the landscape of population genetic variation in members of these families. Our gene list has 50+ genes, so I wanted to use a programmatic approach to query gnomAD for the relevant variants of interest.
To that end, I am learning and exploring the data using the Hail tables and have downloaded the latest gnomAD release table (v2.1.1) from the downloads page.

A curious observation:
If I query the gene FANCA in the gnomAD browser, 4854 variants are returned. However, if I follow many of the filters suggested in the release post on the MacArthur Lab blog and filter the Hail tables as described below, then I arrive at a very different variant count compared to the direct browser download.

FANCA Variant set: n=4854 (Filters PASS)
Using the 'Source' column, I can see the variant breakdown is the following:
exomes only: 3331
genomes only: 572
shared: 951
FANCA Variant set: (Hail Table Filtering Approach)
interval list: chr, gene_start, gene_end
gnomAD browser: 16:89803958-89883066 (UCSC annotation interval)
my query interval: 16:89803957-89883065 (Ensembl)
Removed variants that did not pass filters (RF, AC0, InbreedingCoeff)
Filters that may be redundant, but here I also removed variants that:
failed hard filters (fail_hard_filters == True)
variant_type=='snv' & rf_probability >=0.1
variant_type=='indel' & rf_probability >=0.2

Final exome variant count: 5425, compared to an expectation of 4282 (exomes + shared).
In turn, when I apply the same filters to the genomes table (except using rf_probability >= 0.4), the discrepancy is even larger, with a final variant set of 9481 variants compared to an expected count of ~1523 (genomes 572 + shared 951; see above).

I would really appreciate your input on whether I am missing any particular filters. I suspect this might have something to do with allele frequencies or coverage depth. Should I be using the coverage table in some way, and is that documented anywhere?
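For reference, a minimal sketch of my interval and PASS filtering on the exomes table (the path is a placeholder for the downloaded 2.1.1 release table):

import hail as hl

ht = hl.read_table("gs://path/to/gnomad.exomes.r2.1.1.sites.ht")  # placeholder path

# Restrict to the FANCA interval (GRCh37, Ensembl coordinates).
ht = hl.filter_intervals(
    ht, [hl.parse_locus_interval("16:89803957-89883065", reference_genome="GRCh37")]
)

# Keep only PASS variants (empty filters set).
ht = ht.filter(hl.len(ht.filters) == 0)

print(ht.count())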

Allow customizing message for Slack notifications

slack_notifications uses the process name from sys.argv in notifications.

process = os.path.basename(sys.argv[0])
try:
    yield
    slack_client = SlackClient(token)
    slack_client.send_message(
        to, f":white_check_mark: Success! {process} finished!"
    )
except Exception as e:
    slack_client = SlackClient(token)
    slack_client.send_file(
        to,
        content=traceback.format_exc(),
        filename=f"error_{process}_{time.strftime('%Y-%m-%d_%H:%M')}.log",
        filetype="text",
        comment=f":x: Error in {process}",
    )

However, someone may want notifications about multiple blocks in the same script. There should be a way to distinguish those notifications.
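One possible shape (a sketch of a hypothetical interface, reusing the SlackClient class from the snippet above) is to let callers pass a label for each block:

import os
import sys
import time
import traceback
from contextlib import contextmanager

@contextmanager
def slack_notifications(token, to, name=None):
    # Sketch: `name` is a hypothetical argument that lets callers label each
    # notified block; it defaults to the current behavior (the script name).
    process = name or os.path.basename(sys.argv[0])
    slack_client = SlackClient(token)  # SlackClient as used in the snippet above
    try:
        yield
        slack_client.send_message(
            to, f":white_check_mark: Success! {process} finished!"
        )
    except Exception:
        slack_client.send_file(
            to,
            content=traceback.format_exc(),
            filename=f"error_{process}_{time.strftime('%Y-%m-%d_%H:%M')}.log",
            filetype="text",
            comment=f":x: Error in {process}",
        )
        raise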

Update ClinVar VCF to latest release

ClinVar releases a new VCF every month. gnomAD v3.1 used the same VCF/HT as v3, which dates to September 2019. The new ClinVar release should be added to the reference data module as a new version in the clinvar VersionedTableResource.
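Roughly, the registration might look like this (a sketch only: paths and version labels are placeholders, the import helper is hypothetical, and the TableResource/VersionedTableResource constructor arguments should be checked against resource_utils):

import hail as hl
from gnomad.resources.resource_utils import TableResource, VersionedTableResource

def _import_clinvar(vcf_path: str) -> hl.Table:
    # Hypothetical import helper: load the ClinVar VCF as a sites Table.
    return hl.import_vcf(
        vcf_path, reference_genome="GRCh38", force_bgz=True, skip_invalid_loci=True
    ).rows()

clinvar = VersionedTableResource(
    default_version="20210101",
    versions={
        "20190923": TableResource(path="gs://path/to/clinvar_20190923.ht"),
        "20210101": TableResource(
            path="gs://path/to/clinvar_20210101.ht",
            import_func=_import_clinvar,
            import_args={"vcf_path": "gs://path/to/clinvar_20210101.vcf.gz"},
        ),
    },
)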

Document requirements

This package relies on some dependencies installed by hailctl's init scripts. It would be nice to document the requirements here so that it's easier to use the package in other environments.

Consider behavior of vep_or_lookup_vep with non-default config

gnomad_hail.utils.generic.vep_or_lookup_vep accepts a vep_config argument to be passed to hl.vep.
https://github.com/macarthur-lab/gnomad_hail/blob/48fa64e8a0047e23d4cb51511257759c7277125f/gnomad_hail/utils/generic.py#L292-L301

However, if this configuration differs from the configuration that was used to generate the VEP reference table, then the result can be unexpected. At best, this could lead to differences between variants that are in vs not in the reference table. For example, if the passed configuration excludes upstream/downstream variant annotations, they would still be present for variants that were in the reference table. At worst, this could cause VEP annotations for variants not in the reference table to have a different structure than for those in the reference table, which would cause vep_or_lookup_vep to fail when it tries to union the two.

Perhaps we should remove the vep_config option from vep_or_lookup_vep and always use the same configuration that was used to generate the reference table? Or at least show a warning if a non-default vep_config is passed.
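If removing the option is too disruptive, a lightweight warning might be enough (a sketch; the check and names are illustrative, not the function's current behavior):

import logging

logger = logging.getLogger(__name__)

def warn_on_nondefault_vep_config(vep_config: str, reference_config: str) -> None:
    # Sketch: warn when the caller's VEP config differs from the one used to
    # generate the VEP reference table, since merged annotations may then
    # differ in content or even in structure.
    if vep_config != reference_config:
        logger.warning(
            "vep_or_lookup_vep was called with VEP config %s, but the reference "
            "table was generated with %s; annotations for looked-up variants may "
            "differ from newly VEPed variants.",
            vep_config,
            reference_config,
        )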

Add class to handle KeyErrors in VersionedResources

Currently, VersionedResources throw a KeyError when a requested version of a resource does not exist. We should handle this with a better error message alerting the user that the requested version does not exist.
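Something along these lines could work (a sketch; the class, function, and attribute names are illustrative rather than the repo's):

class ResourceVersionNotFoundError(Exception):
    """Raised when a requested resource version does not exist."""

def get_resource_version(versioned_resource, version: str):
    # Sketch: translate the bare KeyError into an error that lists the
    # versions that actually exist.
    try:
        return versioned_resource.versions[version]
    except KeyError:
        raise ResourceVersionNotFoundError(
            f"Version '{version}' not found. "
            f"Available versions: {sorted(versioned_resource.versions)}"
        ) from None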

Logging strategy

At the moment we use a single logger, and not much thought has been put into a good logging strategy for our messages throughout the library. Reviewing and developing a more principled logging strategy would be worthwhile.

Use local copy of VEP configuration

Currently, VEP configuration is downloaded from Hail's US bucket.

VEP_REFERENCE_DATA = {
    "GRCh37": {
        "vep_config": "gs://hail-us-vep/vep85-loftee-gcloud.json",
        "all_possible": "gs://gnomad-public/papers/2019-flagship-lof/v1.0/context/Homo_sapiens_assembly19.fasta.snps_only.vep_20181129.ht",
    },
    "GRCh38": {
        "vep_config": "gs://hail-us-vep/vep95-GRCh38-loftee-gcloud.json",
        "all_possible": "gs://gnomad-public/resources/context/grch38_context_vep_annotated.ht",
    },
}

This bucket is requester pays, so reading from it requires that either --requester-pays-allow-all or --requester-pays-allow-buckets hail-us-vep is specified when starting the cluster. If someone isn't aware of that, it's easy to start a cluster (which takes a fair amount of time with VEP) that can't run the gnomAD VEP utilities.

hailctl dataproc's VEP init scripts download these configuration files and link them to /vep_data/vep-gcloud.json. If the VEP configuration were loaded from the local copy downloaded by the init scripts instead of from the hail-us-vep bucket, the requester pays arguments would not be necessary.

https://github.com/hail-is/hail/blob/498f73704368ea548dfcd4acc469c6fbcd61f83a/hail/python/hailtop/hailctl/dataproc/resources/vep-GRCh37.sh#L26-L28

https://github.com/hail-is/hail/blob/498f73704368ea548dfcd4acc469c6fbcd61f83a/hail/python/hailtop/hailctl/dataproc/resources/vep-GRCh38.sh#L26-L28
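A sketch of the suggested behavior (the local path is the one linked by the init scripts above; the function name and fallback logic are illustrative):

import os

LOCAL_VEP_CONFIG = "/vep_data/vep-gcloud.json"

def resolve_vep_config(bucket_config_path: str) -> str:
    # Sketch: prefer the local copy downloaded by hailctl's VEP init scripts,
    # and only fall back to the requester-pays bucket path if it is missing.
    if os.path.exists(LOCAL_VEP_CONFIG):
        return LOCAL_VEP_CONFIG
    return bucket_config_path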

Split up utils.generic module

The generic module has accumulated many unrelated functions. These should be split up into more specific modules.

Generic liftover

Right now much of the code in utils.liftover is specific to gnomAD (or has gnomAD-specific options); the gnomAD-specific parts should be moved to gnomad_qc.v2, and everything that remains here should be generic.

I also think that checkpointing should be taken out of e.g. lift_data, since it may not be desirable in all situations (e.g. a very small table); see the sketch below.

Finally, the docstrings need work, e.g. :param rg: Reference genome --> is this the old or the new reference genome?
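On the checkpointing point, one option is to make it opt-in (a sketch; the helper and argument names are illustrative):

from typing import Optional, Union

import hail as hl

def maybe_checkpoint(
    t: Union[hl.Table, hl.MatrixTable], checkpoint_path: Optional[str] = None
) -> Union[hl.Table, hl.MatrixTable]:
    # Sketch: only checkpoint when a path is supplied, so that very small
    # tables are not forced through an extra write/read cycle.
    if checkpoint_path is not None:
        return t.checkpoint(checkpoint_path, overwrite=True)
    return t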
