Problem with getcoverage

CODEX2

Full-spectrum copy number variation detection by high-throughput DNA sequencing

Author

Yuchao Jiang, Nancy R. Zhang

Maintainer

Description

High-throughput DNA sequencing enables detection of copy number variations (CNVs) on the genome-wide scale with finer resolution compared to array-based methods, but suffers from biases and artifacts that lead to false discoveries and low sensitivity. We describe CODEX2, a statistical framework for full-spectrum CNV profiling that is sensitive for variants with both common and rare population frequencies and that is applicable to study designs with and without negative control samples. We demonstrate and evaluate CODEX2 on whole-exome and targeted sequencing data, where biases are the most prominent. CODEX2 outperforms existing methods and, in particular, significantly improves sensitivity for common CNVs.

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("CODEX")
BiocManager::install("WES.1KG.WUGSC")

install.packages('devtools')
devtools::install_github("yuchaojiang/CODEX2/package")

Questions?

If you have questions or encounter problems when using CODEX2, you can: (1) report them directly using the Issues tab on GitHub; (2) post in our Google user group https://groups.google.com/d/forum/codex2; (3) email us at [email protected].

Citation

Yuchao Jiang, Runjin Wang, Eugene Urrutia, Ioannis N. Anastopoulos, Katherine L. Nathanson, Nancy R. Zhang. CODEX2: full-spectrum copy number variation detection by high-throughput DNA sequencing. Genome Biology, 19 (1), 202, 2018. (link).

Running CODEX2

CODEX2 can be applied to two experimental designs: (i) a case-control design with a group of negative control samples, where the goal is to detect CNVs disproportionately present in the ‘cases’ versus the ‘controls’; and (ii) a design without negative control samples, where the goal is to detect all CNVs present in all samples, such as in the Exome Aggregation Consortium. The key innovation in CODEX2 is the use of negative control genome regions in a genome-wide latent factor model for sample- and position-specific background correction, and the use of negative control samples, under a case-control design, to further improve background bias estimation under this model. The negative control genome regions defined by CODEX2 are regions that do not harbor common CNVs but may still harbor rare CNVs; they can be constructed from existing studies or learned from the data.

R notebook with step-by-step demonstration is available here as html.

Demo code for CODEX2 is available here as Rmd.

IMPORTANT: CODEX2 for cancer genomics

In the segmentation step, use fractional mode for somatic CNA detection (cancer is heterogeneous) and integer mode for germline CNV detection (CNV calls in blood samples are germline).
For segmentation with a paired tumor-normal experimental design, a modified CBS (circular binary segmentation) algorithm that utilizes the pair information can be adopted. Refer to the paired_tumor_normal_segmentation folder for code (not actively updated/maintained). Note that, in our experience, using the pair information does not make much difference compared with the default segmentation by CODEX2. Normalization is the first-order effect in a WES study design.
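
To make the difference between the two modes concrete, here is a minimal base-R sketch. It is not CODEX2's own implementation. It assumes that the copy number of a segment is estimated as twice the ratio of raw to normalized coverage (a diploid baseline), that integer mode rounds this estimate, and that fractional mode keeps it unrounded. The helper name estimate_copy_number and its mode labels are hypothetical.

```r
# Hypothetical helper (not part of CODEX2): estimate the copy number of a
# segment from its raw (observed) and normalized (expected) coverage.
# Assumption: diploid baseline, so copy number = 2 * raw / normalized.
estimate_copy_number <- function(raw_cov, norm_cov,
                                 mode = c("integer", "fraction")) {
  mode <- match.arg(mode)
  stopifnot(all(norm_cov > 0))
  cn <- 2 * raw_cov / norm_cov
  # Integer mode rounds to the nearest whole copy number (germline);
  # fraction mode keeps the raw estimate (heterogeneous tumor samples).
  if (mode == "integer") round(cn) else cn
}

# Coverage values copied from two rows of the call tables in this document
raw_cov  <- c(474, 49)
norm_cov <- c(341, 104)
cn_int  <- estimate_copy_number(raw_cov, norm_cov, mode = "integer")
cn_frac <- estimate_copy_number(raw_cov, norm_cov, mode = "fraction")
```

Under these assumptions, a subclonal gain in a tumor sample would show up in fraction mode as a value between 2 and 3, whereas integer mode would force it to one of the two.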

CODEX2 for targeted sequencing

We've adapted CODEX2 for targeted sequencing. Instead of normalizing and segmenting each chromosome separately, for targeted sequencing we combine all targets across the genome to perform normalization, followed by segmentation within each gene. Refer to the code for details (segment_targeted.R needs to be sourced for gene-based segmentation).

Visualization by IGV

One can load CODEX2's CNV calling results into IGV for visualization by generating a tab-delimited seg file for each sample. Below is sample code that we use in our daily practice: for each sample, a *.seg.txt file is generated with six columns and the header 'Sample', 'Chromosome', 'Start', 'End', 'Num_Probes', 'Segment_Mean', which correspond to sample name, chromosome, CNV start bp, CNV end bp, number of exonic targets, and log ratio of raw (i.e., observed) depth of coverage versus normalized (i.e., expected) coverage. A deletion has a negative log ratio, a duplication has a positive log ratio, and a copy-neutral region has a log ratio around 0.

CODEX2_IGV.R
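
For readers without the script at hand, the following base-R sketch shows one way such seg files could be produced. It is not the CODEX2_IGV.R script itself. It assumes a data frame of calls with the column names used in the call tables in this document, takes the log ratio in base 2, and counts targets as ed_exon - st_exon + 1; all three are assumptions, and the helper name write_seg is hypothetical.

```r
# Hypothetical helper (not CODEX2_IGV.R): write one tab-delimited seg file
# per sample and return the paths of the files written.
write_seg <- function(calls, out_dir = tempdir()) {
  seg <- data.frame(
    Sample       = calls$sample_name,
    Chromosome   = calls$chr,
    Start        = calls$st_bp,
    End          = calls$ed_bp,
    # Assumption: st_exon and ed_exon are inclusive target indices
    Num_Probes   = calls$ed_exon - calls$st_exon + 1,
    # Assumption: base-2 log ratio of observed versus expected coverage
    Segment_Mean = round(log2(calls$raw_cov / calls$norm_cov), 3),
    stringsAsFactors = FALSE
  )
  paths <- character(0)
  for (s in unique(seg$Sample)) {
    path <- file.path(out_dir, paste0(gsub(" ", "_", s), ".seg.txt"))
    write.table(seg[seg$Sample == s, ], file = path, sep = "\t",
                quote = FALSE, row.names = FALSE)
    paths <- c(paths, path)
  }
  paths
}

# Three calls copied from the call tables in this document
calls <- data.frame(
  sample_name = c("Sample 1", "Sample 1", "Sample 3"),
  chr         = "chr17",
  st_bp       = c(7687395, 19348563, 3094629),
  ed_bp       = c(7689351, 19353709, 3121197),
  st_exon     = c(1307, 799, 1150),
  ed_exon     = c(1311, 800, 1156),
  raw_cov     = c(474, 49, 2117),
  norm_cov    = c(341, 104, 1548),
  stringsAsFactors = FALSE
)
seg_files <- write_seg(calls)
```

Each resulting file can then be loaded into IGV as a segmented data track.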

CODEX2 for hg38?

CODEX2 uses the hg19 reference by default. It can be adapted to hg38: only the calculations of GC content and mappability need to be changed, while the step that gets coverage for exons across samples stays the same (make sure that the exonic targets in the bed file are also in hg38 coordinates). To calculate GC content in hg38, you need to download the hg38 reference from Bioconductor. Then, after loading CODEX2, load the hg38 reference package and pass it as the genome argument of the getgc() function to get the corresponding GC content.

## biocLite() has been retired by Bioconductor; install with BiocManager instead
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")

library(CODEX2)
library(BSgenome.Hsapiens.UCSC.hg38)
# The following object is masked from ‘package:BSgenome.Hsapiens.UCSC.hg19’:  Hsapiens

gc <- getgc(ref, genome = BSgenome.Hsapiens.UCSC.hg38)

For mappability, we downloaded the 100mer mappability for hg19 from the ENCODE Project (link) and lifted it over from hg19 to hg38 (link). The mappability for each exon/target/bin is taken as the mean mappability across all overlapping ENCODE segments, weighted by the lengths of the segments.
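
The length-weighted mean can be sketched in base R as follows. This is an illustration rather than the code used by CODEX2. It assumes 1-based inclusive coordinates on a single chromosome and weights each segment by the length of its overlap with the target, which is one reading of the description above. The helper name weighted_mappability and the toy segments are hypothetical.

```r
# Hypothetical helper: mean mappability of one target, weighted by the length
# of its overlap with each mappability segment.
# Assumption: 1-based inclusive coordinates, all on the same chromosome.
weighted_mappability <- function(t_start, t_end, segs) {
  # Overlap length with every segment; non-overlapping segments get 0
  ov <- pmax(0, pmin(t_end, segs$end) - pmax(t_start, segs$start) + 1)
  if (sum(ov) == 0) return(NA_real_)  # target not covered by any segment
  sum(segs$score * ov) / sum(ov)
}

# Toy mappability segments (made-up values for illustration only)
segs <- data.frame(
  start = c(50, 150, 300),
  end   = c(149, 249, 400),
  score = c(1.0, 0.5, 0.2)
)

# The target 100-199 overlaps the first two segments by 50 bp each
mapp_target <- weighted_mappability(100, 199, segs)
```

Here the target is half covered by a segment with mappability 1.0 and half by one with 0.5, so the weighted mean is 0.75.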

Note that CODEX2 can also be adapted to the mouse genome, see below.

CODEX2 for mouse genome

CODEX2 can be applied to WES of the mouse genome. Only the calculation of GC content and mappability needs to be modified from the default (hg19). The BSgenome package for the mm10 mouse genome needs to be loaded: BSgenome.Mmusculus.UCSC.mm10.

GC content can be calculated with the correct mouse genome:

library(BSgenome.Mmusculus.UCSC.mm10)
gc <- getgc(ref, genome = BSgenome.Mmusculus.UCSC.mm10)

Mappability pre-calculation: This step can be computationally intensive, so parallel computing is recommended. There are two workarounds: 1) set all mappability to 1 using mapp=rep(1,length(gc)), since mappability is only used in the QC step to filter out exons with low mappability and thus should not affect the final output much; 2) adopt QC procedures based on annotation results, e.g., filter out all exons within segmental duplication regions, which generally have low mappability.
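
Both workarounds can be sketched in base R. The snippet below is only an illustration under stated assumptions: gc is a placeholder vector standing in for the output of getgc(), the target and segmental duplication coordinates are made up, all intervals are on one chromosome, and the helper name overlaps_any is hypothetical.

```r
# Workaround 1: set the mappability of every target to 1.
gc   <- c(45.2, 51.8, 38.9)        # placeholder for the GC content vector
mapp <- rep(1, length(gc))

# Workaround 2: flag targets that overlap annotated regions such as
# segmental duplications, so that they can be removed during QC.
# Hypothetical helper; assumes inclusive coordinates on one chromosome.
overlaps_any <- function(t_start, t_end, r_start, r_end) {
  vapply(seq_along(t_start), function(i) {
    any(t_start[i] <= r_end & t_end[i] >= r_start)
  }, logical(1))
}

# Made-up target and segmental duplication coordinates
target_start <- c(100, 500, 900)
target_end   <- c(200, 600, 1000)
segdup_start <- c(150, 950)
segdup_end   <- c(180, 2000)

in_segdup <- overlaps_any(target_start, target_end, segdup_start, segdup_end)
keep      <- !in_segdup            # targets retained after this QC filter
```

With real data, the same logical vector would be used to subset the coverage matrix and the target annotation together so that they stay aligned.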

Common questions

How many samples does CODEX2 need? Should I separately run samples from different batches?

We have applied CODEX2 to data sets with sample sizes ranging from 30 to 500. Yes, we highly recommend running samples from different batches separately. The Poisson latent factors can presumably capture batch effects, but if additional knowledge is available beforehand, it should be utilized. If batch information is not available, we sometimes refer to the headers within the bam files.

Error in glm.fit?

Yes, we are aware that the normalize() function sometimes leads to an error in glm.fit. CODEX2 adopts an iterative procedure to estimate the Poisson latent factors via Poisson glm, the exon-specific bias by taking the median across all samples, and the GC content bias by fitting a non-parametric smooth.spline. We did our best to make sure that the iteration/estimation runs properly, yet sometimes the Poisson glm function in R still fails to converge because: (1) the data are extremely heterogeneous (i.e., the data are too noisy or the samples are from multiple batches); (2) the number of Poisson latent factors to estimate is too large. See the question below.

What is the range of K? Which one is optimal?

Based on our experience, very rarely do we run CODEX2 with greater than 10 latent factors (i.e., K = 1:10 suffices for most, if not all, datasets we have). The larger the K is, the longer the estimation takes. Also, refer to the previous question regarding potential pitfalls with a large value of K.

CODEX2 includes three statistical metrics to help users choose the optimal K: AIC, BIC, and residual variance. A pdf plot is automatically generated by the choiceofK() function. Sometimes the optimal value based on the metrics is not clear. In this case, we recommend a sanity check by focusing on known positive/negative controls and visualizing the normalization/segmentation results. For normalize2(), in which the normal samples are specified, the choice among different values of K matters less, since only the normal samples are used to estimate the exon-wise biases and latent factors.

	sample_name	chr	cnv	st_bp	ed_bp	length_kb	st_exon	ed_exon	raw_cov	norm_cov	copy_no	lratio	mBIC
cnv1	Sample 1	chr17	dup	7687395	7689351	1.957	1307	1311	474	341	3	21.683	4.216
cnv2	Sample 1	chr17	del	19348563	19353709	5.147	799	800	49	104	1	18.038	4.88
cnv3	Sample 3	chr17	dup	3094629	3121197	26.569	1150	1156	2117	1548	3	84.334	66.263
cnv4	Sample 3	chr17	dup	28673501	28681954	8.454	874	875	206	129	3	19.022	68.109
cnv5	Sample 4	chr17	dup	73626838	73627110	0.273	532	533	604	446	3	21.891	4.485
cnv6	Sample 4	chr17	dup	19348563	19353709	5.147	799	800	389	279	3	18.219	5.588
cnv7	Sample 6	chr17	del	41209048	41215988	6.941	210	212	252	443	1	46.84	29.398

	sample_name	chr	cnv	st_bp	ed_bp	length_kb	st_exon	ed_exon	raw_cov	norm_cov	copy_no	lratio	mBIC
cnv1	Sample 1	chr13	dup	7687395	7689351	1.957	903	905	317	218	3	19.527	3.53
cnv2	Sample 1	chr17	dup	78403552	78410018	6.467	581	582	167	97	3	19.21	6.351
cnv3	Sample 4	chr13	dup	67647500	68331876	684.377	217	219	715	538	3	20.896	4.269
cnv4	Sample 4	chr17	del	10278985	10284601	5.617	924	925	122	208	1	19.442	8.308
cnv5	Sample 4	chr13	dup	75485511	75489741	4.231	228	229	439	319	3	18.492	11.985

	sample_name	chr	cnv	st_bp	ed_bp	length_kb	st_exon	ed_exon	raw_cov	norm_cov	copy_no	lratio	mBIC
cnv1	TS1	chr1	del	14862604	153993261	139130.658	791	836	6452	9534	1	295.137	279.455
cnv2	TS2	chr1	dup	14862604	153230197	138367.594	791	833	13222	10416	3	152.839	136.733
cnv3	Sample 2	chr1	del	14862604	154221443	139358.84	791	838	5468	10091	1	1255.645	1247.018
cnv4	Sample 4	chr1	dup	45609814	50117179	4507.366	558	559	269	182	3	18.066	1.993
cnv5	Sample 5	chr1	dup	151849769	151900169	50.401	353	354	820	601	3	31.968	15.849

	sample_name	chr	cnv	st_bp	ed_bp	length_kb	st_exon	ed_exon	raw_cov	norm_cov	copy_no	lratio	mBIC
cnv1	Sample 4	chr17	dup	5036717	5041065	4.349	127	131	980	755	3	19.839	2.346
cnv2	Sample 4	chr17	dup	8480533	8508320	27.788	305	307	301	205	3	19.54	4.606
cnv3	Sample 4	chr17	dup	36874077	36875856	1.78	590	591	751	574	3	17.492	5
cnv4	Sample 6	chr17	del	41209048	41215988	6.941	802	804	252	440	1	45.34	27.665
