
paper-ldpred2's Introduction

paper-ldpred2's People

Contributors

privefl, richelbilderbeek


paper-ldpred2's Issues

Lower than expected results

Dear Florian,

Our group is using LDpred2 to develop polygenic scores. After obtaining lower-than-expected results, we have tried to replicate the results of the paper using the summary statistics of Nikpay et al. for CAD.

The paper reports an AUC for LDpred2-auto (genome-wide) of 61.8 [61.3-62.3].

In our case, we obtain 0.54 [0.53-0.55].

We have checked the summary statistics, genotypes, and phenotypes, and the information is correct. We have also checked the code thoroughly but could not find a bug. Would it be possible to share with you the code we have used? If you have any other suggestions on how we could review our input data or results, we would also really appreciate them.

Thank you very much in advance!

Paloma

Matrix Math

Hi,

I was trying to check my matrix math and compare it to the output of big_prodMat in the bigsnpr package (about 20 lines of code and output attached). I expected to be able to multiply the beta values by the genotypes, sum them, and get the same answer as the function, but it is always slightly off. I expected it to be exact. Could this be due to different levels of rounding?

Thanks.
Liz

# These are the beta values for all 30 scenarios of h2 and p
beta_auto <- sapply(multi_auto, function(auto) auto$beta_est)

# Chr22 is the test chromosome; we do these one chromosome at a time
bedfile <- "TAGC_Chr22.bed"
bed <- snp_readBed(bedfile, backingfile = sub_bed(bedfile))
TAGC <- snp_attach(bed)
# str(TAGC)  # 7729 x 16419 for Chr22, 7729 x 97458 for Chr2
G <- TAGC$genotypes

# Map just to the chromosome we are calculating
map <- dplyr::transmute(TAGC$map,
                        chr = as.integer(chromosome), pos = physical.pos,
                        a0 = allele1, a1 = allele2)  # reversed somehow..
map_pgs <- df_beta[c(1:5, 14)]; map_pgs$beta <- 1
map_pgs$hg19 <- map_pgs$pos      # keeping this for comparison later
map_pgs$pos <- map_pgs$pos_hg38  # need hg38: TOPMed sequence genome assembly
map_pgs2 <- snp_match(map_pgs, map)

# _NUM_ID_ seems to index into the chromosome-sized items,
# while _NUM_ID_.ss seems to index into the full summary-statistics table
GFBM <- as_FBM(G[, map_pgs2[["_NUM_ID_"]]])
pred_auto <- big_prodMat(GFBM,
                         sweep(beta_auto[map_pgs2[["_NUM_ID_.ss"]], ], 1,
                               map_pgs2$beta, '*'),
                         ncores = NCORES)

# Do some matrix math to compare what we would expect (this is on Chr22)
ColID <- 7; PersonID <- 1013
Onebeta <- beta_auto[map_pgs2[["_NUM_ID_.ss"]], ColID]
Oneperson <- G[PersonID, map_pgs2[["_NUM_ID_"]]]
PRSPerson <- sum(Onebeta * (-1) * Oneperson)
print(PRSPerson)
print("Now pred_auto calculation")
print(pred_auto[PersonID, ])

# Output below
933,074 variants to be matched.
0 ambiguous SNPs have been removed.
13,358 variants have been matched; 0 were flipped and 13,175 were reversed.

[1] 0.05068206
[1] "Now pred_auto calculation"
[1]  0.0489877538  0.0466978807  0.0491356973  0.0510569618  0.0511219294
[6]  0.0429244722  0.0461996485  0.0480051599  0.0443773547  0.0466585931
[11]  0.0468342975  0.0448386896  0.0494604417  0.0482822561  0.0428231562
[16]  0.0477076403  0.0448193827  0.0495798391  0.0472493303  0.0158506961
[21] -0.0003978807  0.0002496557  0.0003152112  0.0009851252  0.0021061355
[26] -0.0008542866 -0.0009923572 -0.0008881025 -0.0017234227 -0.0015367257
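One thing worth checking in the snippet above: pred_auto multiplies each beta by map_pgs2$beta, which after snp_match is +1 or -1 per variant (13,175 of the 13,358 matched variants were reversed, not all of them), whereas the manual sum flips every beta with a global (-1). That alone can produce a small systematic offset. As a baseline, a pure-R toy example (no bigsnpr involved) shows that a matrix product is exactly a per-row sum of products, so the multiplication itself should not be "a little off":

```r
# Toy check: a matrix product equals the per-row sum of products exactly
# (up to floating-point accumulation order), so the product itself should
# not introduce a "rounding" offset. Toy data only, no bigsnpr involved.
set.seed(1)
G_toy    <- matrix(sample(0:2, 50 * 20, replace = TRUE), nrow = 50)  # genotypes
beta_toy <- matrix(rnorm(20 * 3), ncol = 3)                          # 3 chains
pred_toy <- G_toy %*% beta_toy
manual   <- sum(G_toy[7, ] * beta_toy[, 2])
all.equal(pred_toy[7, 2], manual)  # TRUE
```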

controlling for covariates

Dear Florian!
I have a question about controlling for covariates. For my analyses, I'm using GWAS summary statistics in which age, gender, and 10 principal components were included as covariates in the original analysis. After running LDpred2-auto on my independent population, I would like to calculate the R2 change as described here: https://choishingwan.github.io/PRS-Tutorial/ldpred/
My question is: is it correct to control for age, gender, and principal components calculated for my population when calculating the R2 change attributable to the PRS? Isn't it a problem that in the original GWAS the effect of the SNPs was already adjusted for age and gender?
Sorry for the question, but I didn’t find any reference for that.
Thank you in advance,
Dora
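As far as I understand, including covariates measured in the target sample is standard practice even when the GWAS itself was covariate-adjusted: the GWAS adjustment concerns the SNP effect estimates, while the target-sample regression concerns the phenotype there. A minimal sketch of the incremental-R2 computation from the linked tutorial (simulated data; variable names are illustrative only):

```r
# Incremental R2: R2 of (covariates + PRS) minus R2 of covariates alone.
# Simulated data; variable names are illustrative, not from the tutorial.
set.seed(2)
n   <- 500
dat <- data.frame(age = rnorm(n, 50, 10),
                  sex = rbinom(n, 1, 0.5),
                  pc1 = rnorm(n),
                  prs = rnorm(n))
dat$y <- 0.02 * dat$age + 0.5 * dat$prs + rnorm(n)
r2_null   <- summary(lm(y ~ age + sex + pc1, data = dat))$r.squared
r2_full   <- summary(lm(y ~ age + sex + pc1 + prs, data = dat))$r.squared
r2_change <- r2_full - r2_null  # variance explained by the PRS itself
```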

Calculating genome-wide correlation matrix for new LD reference

Dear Dr. Privé,
I'm trying to work with the newly provided LD reference data.

The example code from your tutorial works well when using the HapMap3 variants with LD blocks to compute corr_chr. How does this code need to be adjusted when using the new HapMap3+ variants, for which only a genome-wide .map file is available?

for (chr in 1:22) {

  cat(chr, ".. ", sep = "")

  ## indices in 'df_beta'
  ind.chr <- which(df_beta$chr == chr)
  ## indices in 'map_ldref'
  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]
  ## indices in 'corr_chr'
  ind.chr3 <- match(ind.chr2, which(map_ldref$chr == chr))

  corr_chr <- readRDS(paste0("ld-ref/LD_chr", chr, ".rds"))[ind.chr3, ind.chr3]

  if (chr == 1) {
    corr <- as_SFBM(corr_chr, tmp)
  } else {
    corr$add_columns(corr_chr, nrow(corr))
  }
}

Thanks for your help!

GWAS sumstats without physical position and chromosome numbers

Hi Florian,

Hope you are doing well!

I am currently working on a project that involves the calculation of polygenic scores of multiple traits by LDpred2 for prediction.

However, external GWAS summary statistics are often heterogeneous in format, and some do not provide the physical positions and chromosome numbers of the SNPs.
They do provide the rsIDs of the SNPs, though.

Therefore, I am just wondering if we can still run LDpred2 without the physical positions (given that RSIDs are available) of the SNPs. Many thanks!

best wishes,
Chris
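Not an authoritative answer, but two options seem plausible: snp_match() has a join_by_pos argument, so matching by rsID (join_by_pos = FALSE) may work if both tables carry an rsid column; alternatively, chromosome and position can be pulled from the LD-reference map via the rsID before matching. A base-R sketch of the latter (toy tables; real column names may differ):

```r
# Recover chr/pos from the LD-reference map by rsID, then match as usual.
# Toy tables; column names in the real reference map may differ.
map_ldref <- data.frame(rsid = c("rs1", "rs2", "rs3"),
                        chr  = c(1L, 1L, 2L),
                        pos  = c(100L, 200L, 50L))
sumstats  <- data.frame(rsid = c("rs3", "rs1"), beta = c(0.1, -0.2))
sumstats  <- merge(sumstats, map_ldref, by = "rsid")  # adds chr and pos
```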

Combining chromosomes from .bed files

Hi-

I am in the second stage of LDpred2, where I am bringing in genotypes from WGS data, and I would like to combine the genotypes from all 22 chromosomes into one big matrix to generate the predicted PRS. I am using snp_readBed to read the individual per-chromosome .bed files, but is there an R function to combine all the chromosomes so that I can do one big matrix multiplication using big_prodMat?

Thanks.
Liz
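One workaround that avoids merging altogether: a PRS is additive over variants, so the per-chromosome products can simply be summed, with no need for one genome-wide genotype matrix. A pure-R illustration with toy matrices:

```r
# A PRS is additive across chromosomes, so per-chromosome matrix products
# can be summed instead of merging genotypes into one big matrix.
set.seed(3)
G1 <- matrix(rnorm(10 * 4), nrow = 10); b1 <- rnorm(4)  # "chr 1"
G2 <- matrix(rnorm(10 * 6), nrow = 10); b2 <- rnorm(6)  # "chr 2"
pred_merged <- cbind(G1, G2) %*% c(b1, b2)  # one big multiply
pred_summed <- G1 %*% b1 + G2 %*% b2        # chromosome-wise sum
all.equal(pred_merged, pred_summed)  # TRUE
```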

Cholmod Error

Hi there,

I am using .bgen files (around 22 GB) with the LDpred2 software, and I have a large number of variants matched with my summary statistics, as seen below:

#5,868,901 variants to be matched.
#892,483 ambiguous SNPs have been removed.
#4,965,936 variants have been matched; 0 were flipped and 2,689,301 were reversed.

When creating the on-disk sparse genome-wide correlation matrix on-the-fly I get the following error:

Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 89
Calls: ... -> .local -> as_gCsimpl -> as -> asMethod

Is this because I do not have enough memory to run the correlation step? I seem to be stuck here and can't find a solution.

Thanks in advance!

Regards,
Dom

big_prodVec or big_prodMat?

Hi Florian,

when running big_prodVec(G_imp, beta_auto, ind.col = df_beta[["_NUM_ID_"]]), I get the error message:
error in pMatVec4(X, y.col, ind.row, ind.col, ncores = ncores) :
Tested 501101 < 501100. Subscript out of bounds.
501,100 is the number of variants in G.
G_imp comes from snp_fastImputeSimple(G).

when running pred_auto <- big_prodMat(G_imp, beta_auto, ind.col = df_beta[["_NUM_ID_"]]), I get a matrix of NAs only.

Do you have any advice for me?

Thanks in advance!
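A guess at the cause, in case it helps others: the message says one index in ind.col (501,101) exceeds the 501,100 columns of G_imp, so at least one `_NUM_ID_` value points outside the matrix, which typically means df_beta was built against a different (larger) variant panel than G. A defensive pure-R check before calling big_prodVec/big_prodMat (toy matrix standing in for G_imp):

```r
# Check that all column indices fall inside the genotype matrix before
# calling big_prodVec()/big_prodMat(). Toy matrix standing in for G_imp.
G_toy   <- matrix(0, nrow = 5, ncol = 8)
ind_col <- c(2L, 9L)  # 9 exceeds ncol(G_toy) = 8, as in the error above
bad     <- ind_col[ind_col < 1L | ind_col > ncol(G_toy)]
if (length(bad) > 0) message("out-of-range indices: ", toString(bad))
```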

List of used SNPs

Hi!

I was wondering if, when calculating PRS using LDpred2, it was possible to obtain a list of all the used SNPs in the PRS calculation.

Kind regards,
Bram
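If it helps: the variants entering the score are exactly the rows of the matched table returned by snp_match() (called df_beta in the tutorial), so writing that table to disk gives the list of used SNPs. A minimal sketch with a made-up stand-in table:

```r
# The matched table (df_beta in the tutorial) lists every variant used in
# the PRS; writing it out yields the list. Toy stand-in table here.
df_beta <- data.frame(rsid = c("rs1", "rs9"),
                      chr  = c(1L, 2L),
                      pos  = c(100L, 555L),
                      beta = c(0.1, -0.3))
out <- file.path(tempdir(), "used_snps.txt")
write.table(df_beta, out, row.names = FALSE, quote = FALSE)
```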

Issue with example-with-provided-ldref.R

Hi,

First of all, thank you for this tool. I was trying to work with the example-with-provided-ldref.R script, but I keep getting an error when I run lines 50-70.

Code:
tmp <- tempfile(tmpdir = "tmp-data")

for (chr in 1:22) {

  cat(chr, ".. ", sep = "")

  ind.chr <- which(df_beta$chr == chr)
  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]
  ind.chr3 <- match(ind.chr2, which(map_ldref$chr == chr))

  corr_chr <- readRDS(paste0("ld_chr", chr, ".rds"))[ind.chr3, ind.chr3]

  if (chr == 1) {
    corr <- as_SFBM(corr_chr, tmp, compact = TRUE)
  } else {
    corr$add_columns(corr_chr, nrow(corr))
  }
}

Error: 1.. Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file 'ld_chr1.rds', probable reason 'No such file or directory'

I am new to this so any help would be greatly appreciated. I can get ind.chr, ind.chr2 and ind.chr3 commands to run but I'm not sure how to save those so I can use them for the rest of the code.

Thank you in advance!
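The error itself is a path problem: readRDS() cannot find ld_chr1.rds in the working directory, so the downloaded LD files are either missing or named/located differently (elsewhere in this repo the provided files appear as, e.g., ldref/LD_with_blocks_chr1.rds). A small up-front check gives a clearer failure (the path below is illustrative and deliberately absent):

```r
# Fail early with a clear message if the LD reference file is missing.
# Path is illustrative; adjust to wherever the .rds files were downloaded.
ld_file <- file.path(tempdir(), "ld_chr1.rds")  # deliberately absent here
if (!file.exists(ld_file)) {
  message("LD reference not found: ", ld_file,
          " (check the download location and file names)")
}
```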

Correlation ldref example vs main tutorial/vignette

Hi Dr. Privé,

I just wanted to make sure the correlation here from the example with provided ldref:

tmp <- tempfile(tmpdir = "tmp-data")

for (chr in 1:22) {

  cat(chr, ".. ", sep = "")

  ## indices in 'df_beta'
  ind.chr <- which(df_beta$chr == chr)
  ## indices in 'map_ldref'
  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]
  ## indices in 'corr_chr'
  ind.chr3 <- match(ind.chr2, which(map_ldref$chr == chr))

  corr_chr <- readRDS(paste0("ldref/LD_with_blocks_chr", chr, ".rds"))[ind.chr3, ind.chr3]

  if (chr == 1) {
    corr <- as_SFBM(corr_chr, tmp)
  } else {
    corr$add_columns(corr_chr, nrow(corr))
  }
}

can be used in place of the correlation code below from the tutorial (when my dataset has fewer than 2000 individuals) to calculate ldsc and all the subsequent steps required to get pred_auto.

I use code similar to the ldref example code until the h2_est <- ldsc[["h2"]] step and then compute LDpred2-auto using the code from the polygenic scoring vignette/tutorial.

POS2 <- obj.bigSNP$map$genetic.dist

tmp <- tempfile(tmpdir = "tmp-data")

for (chr in 1:22) {

  # print(chr)

  ## indices in 'df_beta'
  ind.chr <- which(df_beta$chr == chr)
  ## indices in 'G'
  ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]

  corr0 <- snp_cor(G, ind.col = ind.chr2, size = 3 / 1000,
                   infos.pos = POS2[ind.chr2], ncores = NCORES)

  if (chr == 1) {
    ld <- Matrix::colSums(corr0^2)
    corr <- as_SFBM(corr0, tmp, compact = TRUE)
  } else {
    ld <- c(ld, Matrix::colSums(corr0^2))
    corr$add_columns(corr0, nrow(corr))
  }
}

If a dataset is barely over 2000 individuals is it harmful to still use the provided ldref example code pipeline until the h2_est <- ldsc[["h2"]] step?

Thank you in advance,
Kate
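For context (my understanding, not an authoritative answer): both loops build the same kind of on-disk SFBM, so they should be interchangeable as input to the LDpred2 functions; the tutorial loop additionally accumulates the per-variant LD scores (ld) needed for snp_ldsc, and those can be recovered from a precomputed correlation matrix as the column sums of its squared entries. A tiny sparse-matrix illustration:

```r
# LD scores from a (pre)computed sparse correlation matrix: the column
# sums of the element-wise squared correlations. Toy 2x2 matrix here.
library(Matrix)
corr0 <- Matrix(c(1, 0.2, 0.2, 1), nrow = 2, ncol = 2, sparse = TRUE)
ld    <- Matrix::colSums(corr0^2)  # c(1.04, 1.04) for this toy matrix
```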
