
gnomad_methods's Introduction

Hail utilities for gnomAD


This repo contains a number of Hail utility functions and scripts for the gnomAD project and the Translational Genomics Group. As we continue to expand the size of our datasets, we are constantly seeking ways to reduce the complexity of our workflows and to make these functions more generic. As a result, the interface for many of these functions will change over time as we generalize their implementation for more flexible use within our scripts. We are also continuously adapting our code to regular changes in the Hail interface. These repos thus represent only a snapshot of the gnomAD code base and are shared without guarantees or warranties.

We therefore encourage users to browse through the API reference to identify modules and functions that will be useful in their own pipelines, and to edit and reconfigure relevant code to suit their particular analysis and QC needs.

gnomad_methods's People

Contributors

averywpx, berylc, ch-kr, chrisvittal, danking, dependabot[bot], gtiao, jkgoodrich, jmarshall, klaricch, koalaqin, konradjk, lfrancioli, macarthurlab, matren395, mattsolo1, mike-w-wilson, mkanai, nawatts, qingbowang, raynamharris, theferrit32, tpoterba, williamphu, wlu04


gnomad_methods's Issues

Remove init scripts?

Now that this package can be installed on a Dataproc cluster using hailctl dataproc start cluster --packages gnomad, are gnomad-init.sh and master-init.sh still necessary?

How to select subset sites

I have a list of sites identified by their genomic coordinates. How can I select these sites in your Hail table? I would like to know the syntax.
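One possible approach (a sketch; the table path is a placeholder and the reference genome should match your data):

import hail as hl

ht = hl.read_table("gs://path/to/gnomad.sites.ht")  # placeholder path

# Parse the coordinates of interest into loci (GRCh37 shown as an example).
sites = ["16:89803958", "16:89805000"]
loci = hl.literal(sites).map(lambda s: hl.parse_locus(s, reference_genome="GRCh37"))

# Keep only the rows whose locus is in the list.
subset_ht = ht.filter(loci.contains(ht.locus))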

Include changelog in docs site

The changelog should appear as a page on the documentation site.

Complicated by the fact that the changelog is written in Markdown instead of reStructuredText.
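One option (a sketch, assuming the docs site is built with Sphinx) is to register a Markdown parser such as myst-parser so the Markdown changelog can be built alongside the reStructuredText pages:

# Sketch of a Sphinx conf.py change (assumes the docs are built with Sphinx):
# register myst-parser so Markdown sources are built alongside reStructuredText.
extensions = [
    "myst_parser",
]

source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}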

BlockMatrixResource import ignores overwrite argument

BlockMatrixResource's import_resource method ignores the overwrite argument and always passes overwrite=False to BlockMatrix.write.

def import_resource(self, overwrite: bool = True, **kwargs) -> None:
    """
    Imports the BlockMatrixResource using its import_func and writes it in its path.

    :param overwrite: If ``True``, overwrite an existing file at the destination.
    :param kwargs: Any additional parameters to be passed to BlockMatrix.write
    :return: Nothing
    """
    # Bug: the `overwrite` argument is ignored and False is always passed to write.
    self.import_func(**self.import_args).write(self.path, overwrite=False, **kwargs)
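A one-line fix (a sketch, not a committed patch) would be to forward the argument:

# Sketch: forward the caller's overwrite flag instead of hard-coding False.
self.import_func(**self.import_args).write(self.path, overwrite=overwrite, **kwargs)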

StructExpression instance has no field 'gq_stats' in compute_stratified_sample_qc

Thank you so much for making gnomad_methods public and releasing a Python package! It's super helpful and will make it much easier to set up a QC pipeline for our own datasets.

I was wondering about the usage of the compute_stratified_sample_qc method. It starts by subsetting the MatrixTable entry fields to the GT field, leaving out the GQ and DP fields:

if gt_col is not None:
    mt = mt.select_entries(GT=mt[gt_col])
else:
    mt = mt.select_entries("GT")

However, a bit further down in that function, sample_qc is called:

for strat in strata:
    strat_sample_qc_ht = hl.sample_qc(mt.filter_rows(mt[strat])).cols()

This uses GQ and DP to calculate dp_stats and gq_stats respectively, which are required by merge_sample_qc_expr, called a few lines further down:

sample_qc_ht = sample_qc_ht.annotate(
    sample_qc=merge_sample_qc_expr(list(sample_qc_ht.row_value.values()))
)

Because GQ and DP are not kept, I'm getting this error at this line:

sample_qc_expr[metric].annotate(n=sample_qc_expr.n_called)

KeyError: "StructExpression instance has no field 'gq_stats'\n    Hint: use 'describe()' to show the names of all data fields."

Am I missing something? Should DP and GQ be added to mt.select_entries in compute_stratified_sample_qc, as well as to compute_sample_qc in gnomad_qc?
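For what it's worth, this is roughly the change I would have expected (a sketch only, assuming the entry fields are named GQ and DP):

# Sketch (not the library's current code): keep GQ and DP alongside GT so that
# hl.sample_qc can compute gq_stats and dp_stats downstream.
if gt_col is not None:
    mt = mt.select_entries(GT=mt[gt_col], GQ=mt.GQ, DP=mt.DP)
else:
    mt = mt.select_entries("GT", "GQ", "DP")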

VEP configuration is tied to US region

The VEP configuration files that get_vep_config points to are specific to the US region. There are other replicates of the VEP data in the Europe and Australia regions. get_vep_config should point to the same replicate that Hail uses, which can be found using /usr/share/google/get_metadata_value attributes/VEP_REPLICATE.
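For illustration, the replicate could be read along these lines (a sketch; the helper name is hypothetical, and the metadata path is the one mentioned above):

import subprocess

def get_vep_replicate() -> str:
    # Sketch: read the VEP replicate configured for this Dataproc cluster,
    # falling back to "us" if the metadata value is unavailable.
    try:
        return (
            subprocess.check_output(
                ["/usr/share/google/get_metadata_value", "attributes/VEP_REPLICATE"]
            )
            .decode()
            .strip()
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "us"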

Document working with freq array

The freq array is a frequent (😄) cause of confusion. We should document how to work with it using the global metadata fields.
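For example, documentation along these lines could help (a sketch; it assumes the release tables expose freq together with the freq_meta and freq_index_dict globals, and the table path is a placeholder):

import hail as hl

ht = hl.read_table("gs://path/to/gnomad.release.sites.ht")  # placeholder path

# freq_index_dict maps a label (e.g. a subset or population) to an index into
# the per-variant freq array; the exact keys vary by release.
freq_index_dict = hl.eval(ht.freq_index_dict)
print(freq_index_dict)

# Annotate each variant with the allele frequency for one label.
label = "adj"  # choose a label that appears in the printed dict
ht = ht.annotate(AF_adj=ht.freq[freq_index_dict[label]].AF)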

Use with specific Hail version on Dataproc

Because Hail is specified as a dependency in setup.py, hailctl dataproc start a-cluster --packages gnomad always installs the latest PyPI release of Hail. Most of the time this is not an issue, since the default is to use the latest release. However, it does override any specific version set with the --wheel argument.

Default Variant Filtering Browser vs. Hail Table

We are studying three gene families of interest, so we would like to examine the landscape of population genetic variation in members of these families. Our gene list has 50+ genes, so I wanted to use a programmatic approach to query gnomAD for the relevant variants of interest.
To that end, I am learning and exploring the data using the Hail tables and have downloaded the latest gnomAD release table (v2.1.1) from the downloads page.

A curious observation:
If I query the gene FANCA in the gnomAD browser, 4854 variants are returned. However, if I follow many of the filters suggested in the release post on the MacArthur Lab blog and filter the Hail tables as described below, then I arrive at a very different variant count compared to the direct browser download.

FANCA Variant set: n=4854 (Filters PASS)
Using the 'Source' column, I can see the variant breakdown is the following:
exomes only: 3331
genomes only: 572
shared: 951
FANCA Variant set: (Hail Table Filtering Approach)
interval list: chr, gene_start, gene_end
gnomAD browser: 16:89803958-89883066 (UCSC annotation interval)
my query interval: 16:89803957-89883065 (Ensembl)
Removed variants that did not pass filters (RF, AC0, InbreedingCoeff)
Filters that may be redundant, but here I also removed variants that:
failed hard filters (fail_hard_filters == True)
variant_type=='snv' & rf_probability >=0.1
variant_type=='indel' & rf_probability >=0.2

Final exome variant count: 5425, compared to an expectation of 4282 (exomes + shared).
In turn, when I apply the same filters to the genomes table (except using rf_probability >= 0.4), the discrepancy is even larger, with a final variant set of 9481 variants compared to an expected count of ~1523 (genomes 572 + shared 951; see above).

I would really appreciate your input on whether I am missing any particular filters. I suspect this might have something to do with allele frequencies or coverage depth. Should I be using the coverage table in some way, and is that documented anywhere?
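For reference, a minimal sketch of my interval and PASS filtering on the exomes table (the path is a placeholder for the downloaded 2.1.1 release table):

import hail as hl

ht = hl.read_table("gs://path/to/gnomad.exomes.r2.1.1.sites.ht")  # placeholder path

# Restrict to the FANCA interval (GRCh37, Ensembl coordinates).
ht = hl.filter_intervals(
    ht, [hl.parse_locus_interval("16:89803957-89883065", reference_genome="GRCh37")]
)

# Keep only PASS variants (empty filters set).
ht = ht.filter(hl.len(ht.filters) == 0)

print(ht.count())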

Allow customizing message for Slack notifications

slack_notifications uses the process name from sys.argv in notifications.

process = os.path.basename(sys.argv[0])
try:
    yield
    slack_client = SlackClient(token)
    slack_client.send_message(
        to, f":white_check_mark: Success! {process} finished!"
    )
except Exception as e:
    slack_client = SlackClient(token)
    slack_client.send_file(
        to,
        content=traceback.format_exc(),
        filename=f"error_{process}_{time.strftime('%Y-%m-%d_%H:%M')}.log",
        filetype="text",
        comment=f":x: Error in {process}",
    )

However, someone may want notifications about multiple blocks in the same script. There should be a way to distinguish those notifications.
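One possible shape (a sketch of a hypothetical interface, reusing the SlackClient class from the snippet above) is to let callers pass a label for each block:

import os
import sys
import time
import traceback
from contextlib import contextmanager

@contextmanager
def slack_notifications(token, to, name=None):
    # Sketch: `name` is a hypothetical argument that lets callers label each
    # notified block; it defaults to the current behavior (the script name).
    process = name or os.path.basename(sys.argv[0])
    slack_client = SlackClient(token)  # SlackClient as used in the snippet above
    try:
        yield
        slack_client.send_message(
            to, f":white_check_mark: Success! {process} finished!"
        )
    except Exception:
        slack_client.send_file(
            to,
            content=traceback.format_exc(),
            filename=f"error_{process}_{time.strftime('%Y-%m-%d_%H:%M')}.log",
            filetype="text",
            comment=f":x: Error in {process}",
        )
        raise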

Update ClinVar VCF to latest release

ClinVar releases a new VCF every month. gnomAD v3.1 used the same VCF/HT as v3, which dates to September 2019. The new ClinVar release should be added to the reference data module as a new version in the clinvar VersionedTableResource.
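Roughly, the registration might look like this (a sketch only: paths and version labels are placeholders, the import helper is hypothetical, and the TableResource/VersionedTableResource constructor arguments should be checked against resource_utils):

import hail as hl
from gnomad.resources.resource_utils import TableResource, VersionedTableResource

def _import_clinvar(vcf_path: str) -> hl.Table:
    # Hypothetical import helper: load the ClinVar VCF as a sites Table.
    return hl.import_vcf(
        vcf_path, reference_genome="GRCh38", force_bgz=True, skip_invalid_loci=True
    ).rows()

clinvar = VersionedTableResource(
    default_version="20210101",
    versions={
        "20190923": TableResource(path="gs://path/to/clinvar_20190923.ht"),
        "20210101": TableResource(
            path="gs://path/to/clinvar_20210101.ht",
            import_func=_import_clinvar,
            import_args={"vcf_path": "gs://path/to/clinvar_20210101.vcf.gz"},
        ),
    },
)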

Document requirements

This package relies on some dependencies installed by hailctl's init scripts. It would be nice to document the requirements here so that it's easier to use the package in other environments.

Consider behavior of vep_or_lookup_vep with non-default config

gnomad_hail.utils.generic.vep_or_lookup_vep accepts a vep_config argument to be passed to hl.vep.
https://github.com/macarthur-lab/gnomad_hail/blob/48fa64e8a0047e23d4cb51511257759c7277125f/gnomad_hail/utils/generic.py#L292-L301

However, if this configuration differs from the configuration that was used to generate the VEP reference table, then the result can be unexpected. At best, this could lead to differences between variants that are in vs not in the reference table. For example, if the passed configuration excludes upstream/downstream variant annotations, they would still be present for variants that were in the reference table. At worst, this could cause VEP annotations for variants not in the reference table to have a different structure than for those in the reference table, which would cause vep_or_lookup_vep to fail when it tries to union the two.

Perhaps we should remove the vep_config option from vep_or_lookup_vep and always use the same configuration that was used to generate the reference table? Or at least show a warning if a non-default vep_config is passed.
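If removing the option is too disruptive, a lightweight warning might be enough (a sketch; the check and names are illustrative, not the function's current behavior):

import logging

logger = logging.getLogger(__name__)

def warn_on_nondefault_vep_config(vep_config: str, reference_config: str) -> None:
    # Sketch: warn when the caller's VEP config differs from the one used to
    # generate the VEP reference table, since merged annotations may then
    # differ in content or even in structure.
    if vep_config != reference_config:
        logger.warning(
            "vep_or_lookup_vep was called with VEP config %s, but the reference "
            "table was generated with %s; annotations for looked-up variants may "
            "differ from newly VEPed variants.",
            vep_config,
            reference_config,
        )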

Add class to handle KeyErrors in VersionedResources

Currently, VersionedResources throw a KeyError when a requested version of a resource does not exist. We should handle this with a better error message alerting the user that the requested version does not exist.
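Something along these lines could work (a sketch; the class, function, and attribute names are illustrative rather than the repo's):

class ResourceVersionNotFoundError(Exception):
    """Raised when a requested resource version does not exist."""

def get_resource_version(versioned_resource, version: str):
    # Sketch: translate the bare KeyError into an error that lists the
    # versions that actually exist.
    try:
        return versioned_resource.versions[version]
    except KeyError:
        raise ResourceVersionNotFoundError(
            f"Version '{version}' not found. "
            f"Available versions: {sorted(versioned_resource.versions)}"
        ) from None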

Logging strategy

At the moment we use a single logger, and not much thought has been put into a good logging strategy for our messages throughout the library. Reviewing and developing a more principled logging strategy would be worthwhile.

Use local copy of VEP configuration

Currently, VEP configuration is downloaded from Hail's US bucket.

VEP_REFERENCE_DATA = {
    "GRCh37": {
        "vep_config": "gs://hail-us-vep/vep85-loftee-gcloud.json",
        "all_possible": "gs://gnomad-public/papers/2019-flagship-lof/v1.0/context/Homo_sapiens_assembly19.fasta.snps_only.vep_20181129.ht",
    },
    "GRCh38": {
        "vep_config": "gs://hail-us-vep/vep95-GRCh38-loftee-gcloud.json",
        "all_possible": "gs://gnomad-public/resources/context/grch38_context_vep_annotated.ht",
    },
}

This bucket is requester pays, so reading from it requires that either --requester-pays-allow-all or --requester-pays-allow-buckets hail-us-vep is specified when starting the cluster. If someone isn't aware of that, it's easy to start a cluster (which takes a fair amount of time with VEP) that can't run the gnomAD VEP utilities.

hailctl dataproc's VEP init scripts download these configuration files and link them to /vep_data/vep-gcloud.json. If the VEP configuration were loaded from the local copy downloaded by the init scripts instead of from the hail-us-vep bucket, the requester pays arguments would not be necessary.

https://github.com/hail-is/hail/blob/498f73704368ea548dfcd4acc469c6fbcd61f83a/hail/python/hailtop/hailctl/dataproc/resources/vep-GRCh37.sh#L26-L28

https://github.com/hail-is/hail/blob/498f73704368ea548dfcd4acc469c6fbcd61f83a/hail/python/hailtop/hailctl/dataproc/resources/vep-GRCh38.sh#L26-L28
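A sketch of the suggested behavior (the local path is the one linked by the init scripts above; the function name and fallback logic are illustrative):

import os

LOCAL_VEP_CONFIG = "/vep_data/vep-gcloud.json"

def resolve_vep_config(bucket_config_path: str) -> str:
    # Sketch: prefer the local copy downloaded by hailctl's VEP init scripts,
    # and only fall back to the requester-pays bucket path if it is missing.
    if os.path.exists(LOCAL_VEP_CONFIG):
        return LOCAL_VEP_CONFIG
    return bucket_config_path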

Split up utils.generic module

The generic module has accumulated many unrelated functions. These should be split up into more specific modules.

Generic liftover

Right now much of the code in utils.liftover is specific to gnomAD (or has gnomAD-specific options); the gnomAD-specific parts should be moved to gnomad_qc.v2, and everything that remains here should be generic.

I also think that checkpointing should be taken out of e.g. lift_data, since it may not be desirable in all situations (e.g. a very small table); see the sketch below.

Finally, the docstrings need work, e.g. :param rg: Reference genome --> is this the old or the new reference genome?
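On the checkpointing point, one option is to make it opt-in (a sketch; the helper and argument names are illustrative):

from typing import Optional, Union

import hail as hl

def maybe_checkpoint(
    t: Union[hl.Table, hl.MatrixTable], checkpoint_path: Optional[str] = None
) -> Union[hl.Table, hl.MatrixTable]:
    # Sketch: only checkpoint when a path is supplied, so that very small
    # tables are not forced through an extra write/read cycle.
    if checkpoint_path is not None:
        return t.checkpoint(checkpoint_path, overwrite=True)
    return t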
