Giter VIP home page Giter VIP logo

meffil's Introduction

perishky (Matthew Suderman)

  • I am a Senior Lecturer in Epigenetic Epidemiology at the University of Bristol, UK.

  • Epigenetics is the study of mechanisms that determine biological charactistics independently of DNA sequence. Epigenetic mechanisms have critical roles in development and response to the environment.

  • Epidemiology is the study of health and disease in populations.

  • I create R packages and scripts for analysing DNA methylation datasets generated using the Illumina Infinium Beadchips, but have also created scripts for analysing data generated using other types of microarrays and high-througput sequencing.

    • The EWAS catalog is an online database of summary statistics of published epigenome-wide association studies.
    • meffil is for the quality assessment, normalization and analysis of genome-wide DNA methylation datasets.
    • ewaff is for testing associations in genome-wide DNA methylation datasets.
    • dmrff is for identifying differentially methylated regions.
    • meffonym is for calculating scores for a long list of DNA methylation models for estimating things like age and smoking history.
    • alspac is for querying the ALSPAC study data which is stored in SPSS and STATA formats.
  • I regularly produce an update of the Epigenetic Epidemiology literature 1-2 times per month.

  • I aim to write code that is as beautiful as perishky is tasty.

meffil's People

Contributors

epzjlm avatar explodecomputer avatar perishky avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

meffil's Issues

meffil.ewas.summary coerce error

When I run ewas analysis using meffil.ewas.summary, it came up with an error message saying that "(list) object cannot be coerced to type 'double'". I think it is due to the format of ewas.object$covariates (which is a tibble), so in the meffil.ewas.sample.characteristics function, summarize.variable doesn't work when calculating the mean and sd. I would suggest converting the object into a vector or a dataframe. Hope that helps.

meffil.normalize.samples function needs raw values

Dear parishky,
I am trying to run the function meffil.normalize.samples in the server because on my PC I do not have enough memory to run it. To this end, I uploaded the qc.objects generated on my computer to the server and continued with the standard pipeline.
The issue error when I tried to run the mentioned function:

norm.beta <- meffil.normalize.samples(norm.objects, cpglist.remove=badcpgs)
Error in read.idat(paste(basename, "_Grn.idat", sep = ""), verbose = verbose) :
Filename does not exist:data//DELCODE/Epigenetics/idat/205707890015_R05C01_Grn.idat.gzdata//DELCODE/Epigenetics/idat/205707890015_R05C01_Grn.idat

why does this function need access to the raw idat values? Is there a way to avoid this?
thank you very much,
Rafael

load-epic-manifest

In load-epic-manifest.r, semicolon separated SNPs results in NAs at the following line:

manifest$snp.exclude <- manifest$Name %in%
with(manifest,Name[which(as.numeric(as.character(SNP_MinorAlleleFrequency))>0.01)])

new cell type reference

The documentation says that one can create a new cell type reference by using 'meffil.create.cell.type.reference()' function, but in R I get an error (could not find function "meffil.create.cell.type.reference"). I saw in the code here, on github, that there is no such function, but instead two other similar functions: 'meffil.add.cell.type.reference' and 'create.cell.type.reference'. Why aren't those linked correctly? Which one should I use?

EPIC v2 multiple `IlmnID` per CpG `Name` behavior

Dear Dr. Suderman,

Thank you so much for adding EPIC v2 support to {meffil} so quickly.

It looks like there are now multiple IlmnID for a cpg Name in the manifest. Because of this, the CpG name in the beta matrix returned by {meffil} is no longer unique as in EPIC v1. For example, cg00002033 is replicated twice. In the manifest, the IlmnID would be
cg00002033_TC11 and cg00002033_TC12 and the Name column would be 2 values of cg00002033.

I was just wondering how this duplication is handled by {meffil}? QCing my data with the current version of {meffil}, it looks like the beta values for a duplicated cpg have the same exact value. For example, cg00002033 is duplicated twice. cg00002033(1) and cg00002033(2) are the same for each individual in my data.
screen1

I'm not sure if this is an intended behavior since I would expect the values would be close but not the same. Even if we get back different values for each repeat, I'm not sure if averaging across the repeats is valid either. Do you have a recommendation?

Sincerely,
Hung

Normalize a subset of samples

Good afternoon,

I am planning to use meffil to normalize my data. I do not have any problems to write the code, but I have doubts with the design.

When the methylation data was generated, samples from different cohorts were included in the same array. However, currently I am only interested in analyzing one cohort, so I would only like to analyze around a 10% of the total samples.

In the manual, I have read that meffil uses functional normalization. What approach should be better in my case:

  1. Normalize all the samples together and then select the samples I am interested in.
  2. Select the samples I am interested in the sample sheet and only normalize my subset.

I have not been able to find information about this topic, either in the manual or in any article. Let me know if we can move this discussion to a more suitable place.

Thank you very much,

Start with methlyation level table

Hi, thank you for the amazing tool! I found that all of the tutorial requires to start with .idat file. However, some GEO dataset doesn't provide the original .idat file. For example, the Hannum dataset GSE142536, which only provides the methylation ratio at each CpG site in each sample.

I'm wondering if it is possible to start with such kind of table (i.e. skip the QC step) and apply the rest of the normalization procedures to such kind of data?

Thank you!

How to make sure meffil matches the newly added EPIC b5 manifest

Hi Perishky,

I am trying to add the EPIC b5 manifest (released in Mar 2020) to meffil and did the following:

In the load-epic-manifest.r script

  1. update of filename <- "ftp://webdata2:[email protected]/downloads/productfiles/methylationEPIC/infinium-methylationepic-v-1-0-b5-manifest-file-csv.zip")
  2. add snp exclusions (re issue #17 )

    manifest$snp.exclude <- manifest$Name %in% with(manifest, Name[which(as.numeric(as.character(SNP_MinorAlleleFrequency)) > 0.01)])

Then add EPIC annotation/updated manifest (ref. meffil/data-raw/globals.r) by sourcing the above script and use load.epic.manifest(), the following warning comes up (Is it okay to proceed?):

In which(as.numeric(as.character(SNP_MinorAlleleFrequency)) > 0.01) :
  NAs introduced by coercion

After meffil.add.chip("epic", manifest.epic), what else should be done to ensure meffil now is matching the meffil objects to the newly added manifest? Is it automatic? or any other steps to enforce it to work on meffil objects (something like guess.chip())?

Thank you very much, as I couldn't find tutorial materials for this bit (if I overlook it, sorry for opening the question here; reference to the tutorial would be very much appreciated).

Thanks

DMPs vs DMRs

Looks like EWAS function gives DMPs as an outcome. Is it possible to do DMR analysis? I've read in some studies that DMRs is a better way to compare methylation pattern.

Thanks!

dimnames error in meffil.normalize.samples()

Hi,

I'm trying to run meffil.normalize.samples() on my data, but it keeps returning me this error:

Error in dimnames(ret) <- list(sites, names(norm.objects)) :
  length of 'dimnames' [1] not equal to array extent
In addition: Warning message:
In mclapply(X, FUN, ...) :
  all scheduled cores encountered errors in user code

I then tried your steps in readme.md, but this gave the same error as above. After downloading and extracting the test data, I used this code:

samplesheet <- meffil.create.samplesheet(path)
qc.objects <- meffil.qc(samplesheet, verbose = TRUE)

qc.summary <- meffil.qc.summary(qc.objects)
meffil.qc.report(qc.summary, output.file="qc_test/report.html")

qc.objects <- meffil.remove.samples(qc.objects, qc.summary$bad.samples$sample.name)
norm.objects <- meffil.normalize.quantiles(qc.objects, number.pcs=10)

norm.beta <- meffil.normalize.samples(norm.objects, cpglist.remove=qc.summary$bad.cpgs$name, verbose = TRUE)

Any idea what's going wrong?
Best regards, Tom

R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Stream 8

Matrix products: default
BLAS/LAPACK: /tbkuipers/miniconda3/envs/R/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] meffil_1.1.2          preprocessCore_1.52.1 SmartSVA_0.1.3
 [4] RSpectra_0.16-0       isva_1.9              JADE_2.0-3
 [7] qvalue_2.22.0         gdsfmt_1.26.1         statmod_1.4.35
[10] quadprog_1.5-8        DNAcopy_1.64.0        fastICA_1.2-2
[13] lme4_1.1-26           Matrix_1.3-2          multcomp_1.4-17
[16] TH.data_1.0-10        survival_3.2-7        mvtnorm_1.1-2
[19] matrixStats_0.57.0    markdown_1.1          gridExtra_2.3
[22] Cairo_1.5-12.2        knitr_1.30            reshape2_1.4.4
[25] plyr_1.8.6            ggplot2_3.3.3         sva_3.38.0
[28] BiocParallel_1.24.1   genefilter_1.72.1     mgcv_1.8-33
[31] nlme_3.1-151          limma_3.46.0          MASS_7.3-53
[34] illuminaio_0.32.0

loaded via a namespace (and not attached):
 [1] bit64_4.0.5          httr_1.4.2           tools_4.0.3
 [4] R6_2.5.0             DBI_1.1.1            BiocGenerics_0.36.0
 [7] colorspace_2.0-0     withr_2.4.1          tidyselect_1.1.0
[10] base64_2.0           bit_4.0.4            compiler_4.0.3
[13] Biobase_2.50.0       sandwich_3.0-1       labeling_0.4.2
[16] scales_1.1.1         askpass_1.1          stringr_1.4.0
[19] digest_0.6.27        minqa_1.2.4          pkgconfig_2.0.3
[22] highr_0.8            fastmap_1.0.1        rlang_0.4.10
[25] RSQLite_2.2.2        generics_0.1.0       farver_2.0.3
[28] zoo_1.8-8            dplyr_1.0.3          magrittr_2.0.1
[31] Rcpp_1.0.6           munsell_0.5.0        S4Vectors_0.28.1
[34] lifecycle_1.0.0      stringi_1.5.3        edgeR_3.32.1
[37] grid_4.0.3           blob_1.2.1           crayon_1.3.4
[40] lattice_0.20-41      splines_4.0.3        annotate_1.68.0
[43] locfit_1.5-9.4       pillar_1.4.7         boot_1.3-25
[46] codetools_0.2-18     stats4_4.0.3         XML_3.99-0.5
[49] glue_1.4.2           evaluate_0.14        vctrs_0.3.6
[52] nloptr_1.2.2.2       gtable_0.3.0         openssl_1.4.3
[55] purrr_0.3.4          clue_0.3-59          assertthat_0.2.1
[58] cachem_1.0.4         xfun_0.20            mime_0.9
[61] xtable_1.8-4         tibble_3.0.5         AnnotationDbi_1.52.0
[64] memoise_2.0.0        IRanges_2.24.1       cluster_2.1.0
[67] ellipsis_0.3.1

normalized control probe values

Hi,

This is a really nice package. I am following your quantile normalization tutorials and wanted to see how the normalization algorithm also adjusted the control probe values (i.e. extension, hybe, stain, etc...) Is there anyway I can view these values. I've only been able to access the raw values.

Cheers!

Guidance on plink step in QC

First of all, thank you @perishky for open sourcing this very well-documented methylation data analysis package. For those of us new to biostatistics, could you briefly explain or elaborate on the QC step that involves plink? Having gone through the tutorial, I am uncertain about where to obtain the binary .ped file dataset which is a required input to the --bfile argument (in addition to snp-names.txt, which we get from qc objects featureset). I am working with methylation data from two different brain regions and am therefore using the guintivano dlpfc cell reference type. So far, I have found this tool for converting GEO samples to .ped format using GSM accession ID, but am not sure which sample to use. Any guidance about how/where to obtain the matrix of genotypes for comparison, or where to find out more about this process, would be helpful. Thank you!

Testing more than one phenotype

Hi There,

I'm wondering if there's a way to test association with more than one phenotype (outcome of interest) at once?
I know this isn't an "issue", per se, but I didn't know how else to reach out.

Thanks!

Version conflicts with CopyNumber450k

Hey,

I have trouble when trying to extract structural variants (as explained here).

Appearently the CopyNumber450k package which is used in your pipeline was removed with Bioconductor 3.5 (see https://www.bioconductor.org/checkResults/3.5/bioc-20171018/).
Yet, all releases below 3.5 are not compatible with R >3.5.3 anymore.
Additionally there are some dependencies of meffil which need R >3.5.3 to function properly.

This basically means, that there are some unadressed version conflicts and I currently don't know how to fix them. I don't know if using multiple R versions on my device will fix these issues, since there would be the need of switching back and forth in the same session which is (afaik) not possible.

Do you have any ideas how to fix this? Or do you know any alternatives to the CopyNumber450k package which can be used with meffil? I'm not sure if I understood the role of the CopyNumber450k package in your script throughly, so maybe I'm missing an easy solution.

issue with sva

ewas.ret <- meffil.ewas(norm.beta, variable=smok, covariates=data2,winsorize.pct=NA,isva=T,sva=T)

Error in La.svd(x, nu, nv) : error code 1 from Lapack routine 'dgesdd'
Calls: meffil.ewas -> sva -> num.sv -> svd -> La.svd
Execution halted

Problem generating normalized beta value for EPIC array data

I am analysing EPIC array data and when I pass the norm.objects to the following meffil.normalize.samples function,
norm.beta <- meffil.normalize.samples(norm.objects, cpglist.remove=qc.summary$bad.cpgs$name)
the error message is as follows:

Error in dimnames(ret) <- list(sites, names(norm.objects)) :
  length of 'dimnames' [1] not equal to array extent

I have tried the following but with no luck:

  1. Checked the length of norm.objects which is the same as my sample size
  2. Try using the make.names function to turn column names into 'syntactically valid names', from comment on https://stackoverflow.com/questions/12985653/what-does-length-of-dimnames-1-not-equal-to-array-extent-mean by names(norm.objects)<-make.names(names(norm.objects)) , returning the same error message when using the meffil.normalize.samples function
  3. As the column names cannot be inspected by colnames(norm.objects), it is difficult to tell if it is a problem with column naming in the norm.objects.

I wondered if it is a problem with EPIC array data? as the tutorial is 450K data and seems fine.

The sessionInfo() is as follows:

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: .../software/OpenBLAS/0.3.5-GCC-8.2.0-2.31.1/lib/libopenblas_skylakexp-r0.3.5.so

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] meffil_1.1.1          preprocessCore_1.48.0 SmartSVA_0.1.3
 [4] RSpectra_0.16-0       isva_1.9              JADE_2.0-3
 [7] qvalue_2.16.0         gdsfmt_1.20.0         statmod_1.4.34
[10] quadprog_1.5-8        DNAcopy_1.60.0        fastICA_1.2-2
[13] lme4_1.1-23           Matrix_1.2-17         multcomp_1.4-13
[16] TH.data_1.0-10        survival_2.44-1.1     mvtnorm_1.1-1
[19] matrixStats_0.56.0    markdown_1.1          gridExtra_2.3
[22] Cairo_1.5-12          knitr_1.28            reshape2_1.4.4
[25] plyr_1.8.6            ggplot2_3.3.1         sva_3.32.1
[28] BiocParallel_1.18.1   genefilter_1.66.0     mgcv_1.8-28
[31] nlme_3.1-140          limma_3.42.2          MASS_7.3-51.4
[34] illuminaio_0.28.0

loaded via a namespace (and not attached):
 [1] Biobase_2.44.0       bit64_0.9-7          splines_3.6.0
 [4] assertthat_0.2.1     askpass_1.1          highr_0.8
 [7] stats4_3.6.0         blob_1.2.1           pillar_1.4.4
[10] RSQLite_2.2.0        lattice_0.20-38      glue_1.4.1
[13] digest_0.6.25        minqa_1.2.4          colorspace_1.4-1
[16] sandwich_2.5-1       XML_3.99-0.3         pkgconfig_2.0.3
[19] purrr_0.3.2          xtable_1.8-4         scales_1.1.1
[22] tibble_3.0.1         openssl_1.4.1        annotate_1.62.0
[25] farver_2.0.3         IRanges_2.18.1       ellipsis_0.3.1
[28] withr_2.2.0          BiocGenerics_0.30.0  mime_0.9
[31] magrittr_1.5         crayon_1.3.4         evaluate_0.14
[34] memoise_1.1.0        tools_3.6.0          lifecycle_0.2.0
[37] stringr_1.4.0        S4Vectors_0.22.0     munsell_0.5.0
[40] cluster_2.0.9        AnnotationDbi_1.46.1 base64_2.0
[43] compiler_3.6.0       rlang_0.4.6          grid_3.6.0
[46] RCurl_1.98-1.2       nloptr_1.2.2.1       labeling_0.3
[49] bitops_1.0-6         boot_1.3-22          gtable_0.3.0
[52] codetools_0.2-16     DBI_1.1.0            R6_2.4.1
[55] zoo_1.8-8            dplyr_0.8.1          bit_1.1-15.2
[58] clue_0.3-57          stringi_1.4.6        Rcpp_1.0.4.6
[61] vctrs_0.3.1          tidyselect_0.2.5     xfun_0.14

Thanks for helping!

meffil.ewas.summary with GDS

Hi!

Using a GDS file for meffil.ewas.summary does not seem to work:

Error in meffil::meffil.ewas.summary(ewas.object = res, beta = gds.filename,  : 
  is.matrix(beta) is not TRUE

All other functions that used this GDS file worked, so any clue, if this is on my end?

Thanks!

450k, epicv1, and epicv2

I have tried to combine all three types of chips: 450k, EPICv1, EPICv2 into one dataset with meffil.qc

meffil.qc(anno, featureset = "common", cell.type.reference = NA, verbose = T);
...
Error in meffil.probe.info(chip, featureset) :
featureset common is not compatible with microarray epic2

Is there is any solution yet? Thank you.
Rust

ewas.parameters <- meffil.ewas.parameters(sig.threshold=1e-20, max.plots=5)

To whom it may concern,

I have a question about this sig.threshold, why here set up to 1e-20? And I read this function, and it will use the bonferroni correction significance threshold if using the default setting. Wondering if we can set up FDR correction method instead of bonferroni?

Thank you,
MAO

27k Support

Hi there, does meffil offer support for the 27k array? Ideally, I'm looking to conduct a study that would combine 450k, 850k and 27k info.

Thanks!

Normalization function creates many NAs

Dear Mathew,
The meffil.normalize.samples function is producing many NAs in more than one of my data sets. If I wish to continue only with the CpGs that have normalized values for all the samples I end up with only 477,376 probes.
Is there a way to solve that? I do not remember finding this problem when using the package previously.
Thanks a lot!
Rafael

Cell composition

First of all, thank you for this handy tool.

I have one understanding issue : when I correct my data using meffil.qc(). i do cell.type.reference="blood gse35069", as I have whole blood samples. When later in the pipeline, I would perform an EWAS, do I have to set cell.counts = 0 (as we already normalized) or how do I provide the cell conuts, in case it is needed?

Thank you in advance?

Length > 1 error on R 4.2

Dear Dr. Suderman,

Thank you for your hard work on {meffil}. I'm using R 4.2, Windows build, {meffil} version 1.3.2 and I have the following error

library(meffil)
options(mc.cores = 6L)
my_scree_plot = meffil.plot.pc.fit(qc.object)

Which returned the error

Error in if (class(pca.ret) != "matrix") pca.ret <- pca.ret$x : 
  the condition has length > 1

I think this has to do with R 4.2 making length > 1 in if statement an error. I tried to look for the function that made this call but I couldn't find it. Can you help me?

Thank you so much

More flexibility for Sex column

Hey there. I have been trying to import my csv file, but I keep getting this error:

Error in check.samplesheet(df) :
Sex column must only contain 'M', 'F' or NA values

I believe that this lies in how I set up my sex column. I feel that anything not explicitly 'M' or 'F' should be converted to NA before proceeding. I believe it's just having trouble recognizing blank values when reading in the csv.

If you could add na.strings=c("NA","NaN", " ") to read.csv line that would be great, or modify any(! samplesheet$Sex %in% c("M", "F", NA)) to include c("NA","NaN", " ").

Thanks!

error in meffil.ewas

Hey,
I'm getting an error running the following command:
set.seed(5)
ewas.ret <- meffil.ewas(norm.beta, variable=phen[,2], covariates=covs, isva=F)

Error message:
Error in sove.default(t( %*% mod) :
System is computationally singular: reciprocal condition number = 1.9905e-43
Calls: meffil.ewas -> sva -> num.sv -> solve -> solve.default
Execution halted

Anyone who what this error message means?

meffil.normalize.dataset() threw me an error

To whom it may concern,
I was trying to use this function below, however, it threw me an error. Could you please give me some suggestions?

meffil.normalize.dataset(samplesheet, qc.file="qc/report.html", author="mao", study="IlluminaEpic", number.pcs=10)

Error in preprocessCore::normalize.quantiles.use.target(matrix(orig.signal), :
ERROR; return code from pthread_create() is 22
Calls: meffil.normalize.dataset ... lapply -> FUN -> meffil.normalize.sample ->
In addition: Warning messages:
1: In readChar(con, nchars = n) : truncating string with embedded nuls
2: In readChar(con, nchars = n) : truncating string with embedded nuls
3: The <scale> argument of guides() cannot be FALSE. Use "none" instead as
of ggplot2 3.3.4.
The deprecated feature was likely used in the meffil package.
Please report the issue to the authors.
4: In readChar(con, nchars = n) : truncating string with embedded nuls
5: In readChar(con, nchars = n) : truncating string with embedded nuls
Execution halted

Thank you,

MAO

Meffil - coding of unknown phenotypes

Hi there. We are currently using meffil to run an EWAS in some Illumina methylation data from ~200 subjects who are part of a longitudinal study, and are trying to run an analysis in a subset of participants who have subsequently converted to disease status versus participants who have definitely remained disease free (the remainder having no data or are ambiguous as they have developed some associated symtoms). We are running into issues in relation to getting the program to run in the reduced dataset (n=58 with longitudinal outcomes of interest). I imagine this is because there are not sufficient samples across each sentrixID from the whole experiment (and other user defined covariates) to allow computation in just 58 subjects. We have tried to include all 212 subjects, coding those individuals with ambiguous phenotypes as "-9" to indicate missing data so that all sentrixIDs are included in input, but the software seems to think this "-9" value is an actual category and is computing differences between the three coded groups. Similarly it fails when "NA" is included as the missing value. Is there a way we can include all data from 212 subjects as input, but only phenotype coding for the 58 participants with known disease outcomes? We have tried including 212 subjects in covariates file, and only 58 subjects in pheno file, but this also fails to run.
Any advice you can provide would be much appreciated.

Not sure what happened?

For some reason I'm getting this error:

Error in meffil.qc.summary(qc.objects, parameters = qc.parameters, verbose = verbose) :
sapply(qc.objects, is.qc.object) are not all TRUE

Can you explain what has happened and a possible debug?

cord blood gse68456

Dear developer
Thank you for your great work. I'm having problems using cell type reference of "cord blood gse68456". Cell type "nRBC" is not defined, but "CD3T" is included, instead. I appreciate if you could kindly make an amendment.
Thank you in advance.

QC Reports

Any chance QC reports could be formatted like ENmix and minfi?

Or can you recommend how to integrate functions from ENmix and minfi into meffil?

I'm not sure that I am satisfied with the current QC report being generated by meffil. Some parts of it can be hard to interpret.

Thanks!

Displaying plots such as these:
freqpolygon_beta_afterqc
target_removal
staining
specificity_i
hybridization
bisulfite_conversion_ii

Imputation Issue

I've noticed that the pipeline performs mean imputation during normalization. Is there anyway to disable this feature? Or replace it with another function?

I'd like to use k-nn or mice for imputation.

It would also be nice for meffil to output a "missingness report".

EWAS: fixed vs random and multiple batch effects

So, when running meffil.ewas(), variable of interest (variable) is a vector, fixed effects (covariates) is a data.frame and random effects (batch) is a vector.

I am using Sex as covariate. I have multiple batch effects (Sample_Plate, Slide and Array), How would I include that? Should I just include them all as covariates then? What is practically the downstream consequence of including some variables as fixed and other as random?

Report variance of normalized betas principal components

I am using meffil to normalize my dataset. In order to assess the potential effects of different batch variables on the final normalized beta values, the package offers an automated report which shows the associations between the user batch variables and the principal components of the normalized beta values.

Although this approach is great and has helped me a lot to identify potential issues, I am missing information about the variability explained by these principal components. If I am not wrong, this information is only provided for the control probes, but not for the normalized beta.

I am aware that PCs are typically computed using the most variable CpGs and not the full dataset. However, even if the variability estimate does not fully reflect the total dataset variability, they can be helpful to determine which PCs are more relevant when trying to remove batch effects.

Thanks,

Workflow validation

Hi, I am testing the meffil workflow with a publicly available dataset to see if I get the same results. But, I am unable to find any DMPs. I was able to reproduce the results using minfi but not using meffil.

So I am running this minfi workflow and the data is in this R package which needs to be installed from source.

Here is the code I use. Perhaps, I am doing something wrong or there is some explanation. One sample is discarded due to high detection p-value, cell type info is not used, norm is quantile. Also, I am unsure how to model contrasts with variables that have more than two levels.

library(meffil)
options(mc.cores=10)

# read data --------------------------------------------------------------------
dataDirectory <- system.file("extdata", package = "methylationArrayAnalysis")
ss <- meffil.read.samplesheet(dataDirectory,pattern="sampleSheet.csv")

m_qc <- meffil.qc(ss,cell.type.reference="andrews and bakulski cord blood",verbose=T)
qc_summary <- meffil.qc.summary(m_qc,parameters=meffil.qc.parameters(detectionp.samples.threshold=0.1))
meffil.qc.report(qc_summary,output.file="meffil-qc-report.html")

# remove bad samples -----------------------------------------------------------
outlier <- qc_summary$bad.samples
m_qc <- meffil.remove.samples(m_qc,outlier$sample.name)
ss <- ss[ss$Sample_Name!="11",]

# normalisation ----------------------------------------------------------------
pc <- meffil.plot.pc.fit(m_qc)
m_norm <- meffil.normalize.quantiles(m_qc,number.pcs=2)
m_beta <- meffil.normalize.samples(m_norm)

for (i in 1:length(m_norm)) {
  m_norm[[i]]$samplesheet$Sample_Source <- as.factor(m_norm[[i]]$samplesheet$Sample_Source)
  m_norm[[i]]$samplesheet$Sample_Group <- as.factor(m_norm[[i]]$samplesheet$Sample_Group)
}

norm_params <- meffil.normalization.parameters(m_norm,control.pcs=1:5,
                                               batch.pcs=1:5,batch.threshold=0.01)

mpcs <- meffil.methylation.pcs(m_beta,probe.range=20000)
norm_summary <- meffil.normalization.summary(m_norm, pcs=mpcs,parameters=norm_params)
meffil.normalization.report(norm_summary, output.file="meffil-norm-report.html")

# ewas -------------------------------------------------------------------------

# select two levels to compare
ss1 <- filter(ss,Sample_Group=="act_naive" | Sample_Group=="act_rTreg")
m_beta1 <- m_beta[,ss1$Sample_Name]

cvar <- ss1[,"Sample_Source",drop=F]
cvar$Sample_Source <- factor(cvar$Sample_Source)

# sva off due to error
m_ewas <- meffil.ewas(m_beta1,variable=factor(ss1$Sample_Group),covariates=cvar,sva=F,isva=F,smartsva=F)
ewas_summary <- meffil.ewas.summary(m_ewas,m_beta1)
meffil.ewas.report(ewas_summary, output.file="meffil-ewas-report.html")

# number of probes with fdr<0.05
pv <- m_ewas$analyses$all$table[,"fdr",drop=F]
nrow(pv[pv$fdr<0.05,])

Checks for sample swap detection

Hi!
I detected some SNP discordances between my methylation and genotype arrays. I am trying to rematch those swaps. Using the below function, I obtained the genotype corresponding to the 65 probes used to test the discordance. I get the data per SNP (0, 1,2), but I don't know which allele they correspond to. How could I get this info?

snp.betas <- meffil.snp.betas(qc.objects)
genotypes <- meffil:::calculate.beta.genotypes(snp.betas)

Thanks in advance!

qc.objects - annotation

Greetings. First of all, thank you for developing this useful package!
I'm new to using Meffil and I had been following the guidelines on the readme.md for step-by-step normalization when I ran into an issue.

I got to this part of the guidance:
#save the SNP names in R
annotation <- qc.objects[[1]]$annotation
writeLines(meffil.snp.names(annotation), con="snp-names.txt")

I couldn't get this code to run because my qc.objects dataset doesn't have a variable named annotation. Do you know why it might not have generated the annotation variable?

mysterious error trying to estimate cell counts from betas - please help?

I have installed all prerequisites as per instruction manual.
I am running R 3.6.2. on Ubuntu 19.10 and I get this error -
is this still the openbals issue/workaround?

cellcounts<-meffil.estimate.cell.counts.from.betas(beta = B_t,cell.type.reference="blood gse35069 complete",verbose=T)
Error in preprocessCore::normalize.quantiles.use.target(beta[subset, , :
ERROR; return code from pthread_create() is 22

Negative estimated cell counts?

Hello,
Thanks for the great package.
I only have pre-processed betas (from cord - see distribution below) so I must use meffil.estimate.cell.counts.from.betas.
There are a lot of negative values, especially for NK.
Of note, I did remove SNP-related probes, XY chromosome probes and low variability (<5%) probes and batch corrected for plate.

cellcountref = "andrews and bakulski cord blood"

cellcounts3<-meffil.estimate.cell.counts.from.betas(ARIEScordbetapoint0stICA,cell.type.reference=cellcountref,verbose=T)
head(cellcounts3)
Bcell CD4T CD8T Gran Mono NK nRBC
35 0.000000e+00 0.19415426 1.436839e-01 0.4361359 2.176463e-18 2.699185e-19 -6.938894e-18
42 -5.335846e-18 0.19536009 0.000000e+00 0.5927273 0.000000e+00 -8.517630e-18 2.775558e-17
45 -4.336809e-19 0.19231770 5.951174e-20 0.5538809 6.938894e-18 -9.660659e-18 0.000000e+00
50 -5.403754e-18 0.08274238 9.387805e-02 0.5950767 0.000000e+00 -9.282177e-18 0.000000e+00
73 -2.185562e-18 0.00000000 4.426202e-03 0.8130283 -1.199445e-18 0.000000e+00 0.000000e+00
118 1.168658e-01 0.03662122 6.565257e-02 0.5457070 1.602137e-18 -1.042258e-18 0.000000e+00

cellcountref2 ="cord blood gse68456"
cellcounts4<-meffil.estimate.cell.counts.from.betas(as.matrix(ARIEScordbetapoint0stICA),cell.type.reference=cellcountref2,verbose=T)
head(cellcounts4)
Bcell CD4T CD8T Gran Mono NK RBC
35 0.000000e+00 1.093173e-01 1.074017e-01 0.5587779 -3.088042e-19 3.544531e-03 0.03901592
42 6.938894e-18 8.018818e-02 6.655908e-02 0.6905427 0.000000e+00 0.000000e+00 0.00000000
45 0.000000e+00 6.050381e-02 9.786610e-02 0.6523306 0.000000e+00 0.000000e+00 0.00000000
50 0.000000e+00 6.646457e-02 5.677843e-02 0.6550116 2.477728e-02 0.000000e+00 0.00000000
73 -1.526199e-18 1.565778e-02 9.471646e-19 0.8668614 0.000000e+00 0.000000e+00 0.00000000
118 2.728014e-02 9.222220e-19 1.876269e-19 0.6439960 2.631825e-02 8.673617e-19 0.12577491

There are fewer negative values with the second reference but the distribution of estimated values between the two are very similar (see below)

Thanks for any thoughts,
J

Below is my session info:

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252 LC_NUMERIC=C LC_TIME=English_Canada.1252

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] FlowSorted.CordBloodNorway.450k_1.4.0 minfi_1.24.0 bumphunter_1.20.0 locfit_1.5-9.1
[5] iterators_1.0.10 foreach_1.4.4 Biostrings_2.46.0 XVector_0.18.0
[9] SummarizedExperiment_1.8.1 DelayedArray_0.4.1 Biobase_2.38.0 GenomicRanges_1.30.3
[13] GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0
[17] meffil_1.0.0 statmod_1.4.30 quadprog_1.5-5 DNAcopy_1.52.0
[21] fastICA_1.2-1 lme4_1.1-19 Matrix_1.2-15 multcomp_1.4-8
[25] TH.data_1.0-9 survival_2.43-3 mvtnorm_1.0-8 matrixStats_0.54.0
[29] markdown_0.9 gridExtra_2.3 Cairo_1.5-9 knitr_1.21
[33] reshape2_1.4.3 plyr_1.8.4 ggplot2_3.1.0 sva_3.26.0
[37] BiocParallel_1.12.0 genefilter_1.60.0 mgcv_1.8-26 nlme_3.1-137
[41] limma_3.34.9 MASS_7.3-51.1 illuminaio_0.20.0

loaded via a namespace (and not attached):
[1] minqa_1.2.4 colorspace_1.4-0 siggenes_1.52.0 mclust_5.4.2 base64_2.0 rstudioapi_0.9.0 bit64_0.9-7
[8] AnnotationDbi_1.40.0 xml2_1.2.0 codetools_0.2-16 splines_3.4.3 nloptr_1.2.1 Rsamtools_1.30.0 annotate_1.56.2
[15] readr_1.3.1 compiler_3.4.3 httr_1.4.0 assertthat_0.2.0 lazyeval_0.2.1 prettyunits_1.0.2 tools_3.4.3
[22] bindrcpp_0.2.2 gtable_0.2.0 glue_1.3.0 GenomeInfoDbData_1.0.0 dplyr_0.7.8 doRNG_1.7.1 Rcpp_1.0.0
[29] multtest_2.34.0 preprocessCore_1.40.0 rtracklayer_1.38.3 xfun_0.4 stringr_1.3.1 rngtools_1.3.1 XML_3.98-1.16
[36] beanplot_1.2 zlibbioc_1.24.0 zoo_1.8-4 scales_1.0.0 hms_0.4.2 sandwich_2.5-0 GEOquery_2.46.15
[43] RColorBrewer_1.1-2 yaml_2.2.0 memoise_1.1.0 pkgmaker_0.27 biomaRt_2.34.2 reshape_0.8.8 stringi_1.2.4
[50] RSQLite_2.1.1 RMySQL_0.10.16 GenomicFeatures_1.30.3 bibtex_0.4.2 rlang_0.3.1 pkgconfig_2.0.2 bitops_1.0-6
[57] nor1mix_1.2-3 lattice_0.20-38 purrr_0.2.5 bindr_0.1.1 GenomicAlignments_1.14.2 bit_1.1-14 tidyselect_0.2.5
[64] magrittr_1.5 R6_2.3.0 DBI_1.0.0 pillar_1.3.1 withr_2.1.2 RCurl_1.95-4.11 tibble_2.0.1
[71] crayon_1.3.4 progress_1.2.0 grid_3.4.3 data.table_1.12.0 blob_1.1.1 digest_0.6.18 xtable_1.8-3
[78] tidyr_0.8.2 openssl_1.1 munsell_0.5.0 registry_0.5

Below is the histogram of my betas (to show it's bound by zero and 1.)
image

Histogram of estimated cell counts4
image

Histogram of estimated cell counts3
image

ewas / champ output comparison

I have used both the meffil package, and a combination of minfi and ChAMP to determine differentially methylated positions in Schizophrenic vs normal brain tissue samples. IDAT files were obtained from GSE61107 and the following sample sheet.

I am getting completely different results from both packages - with meffil, I obtained 20 significant CpG sites whereas with minfi + ChAMP, I obtain 250,000 significant sites. Out of the 20 sites I get from meffil, only 17 match the other output. I've double-checked the code to make sure it aligns with the meffil tutorial, and followed publicly available tutorials (link) for the other packages. Here, you can find the code I used to make this comparison for your reference.

What could be causing such large discrepancies in the exact same dataset? The dorsolateral prefrontal cortex (DLPFC) cell reference is used in both meffil and minfi+ChAMP, functional normalization (minfi: preprocessFunnorm() ) was used in both as well. In minfi+ChAMP, SVA was used to estimate surrogate variables which were then regressed out. One difference is that p-value adjustmemt in ChAMP defaults to the Benjamini-Hochburg method, whereas meffil uses the Bonferroni correction, however, this alone cannot account for such a vast difference in output. Any guidance about the differences between meffil.ewas and champ.DMP() would be greatly appreciated. Thank you!

Figure sub-folder in the EWAS report

Meffil works nicely, and in the EWAS report folder, it used to create a sub-folder named "figure".
However, in my last EWAS report, which worked ok and generated the report folder; but the "figure" sub-folder is missing.
Any thought or solution.

Creating chunks for parallelization using `meffil.normalize.samples()` error

Dear Dr. Suderman,

I found a bug while running meffil.normalize.samples() using multiple cores.

Error in read.idat(paste(basename, "_Grn.idat", sep = ""), verbose = verbose) : 
  Filename does not exist:./data/idats/<Sentrid_id>_Grn.idat.gz./data/idats/<Sentrid_id>_Grn.idat

So I dug a bit deeper and I think I might have found the problem.

I think the problem is because in ./R/mclapply.r the function mclapply.safe() is chunking the work into smaller chunks based on the argument "max.bytes". However, this means that for a sample, one _Grn.idat might end up in a chunk while the _Red.idat might end up in another chunk and this will raise an error in ./R/rg.r read.rg() function which calls read.idat(). Updated with EDIT in the below comment.

I haven't dug deep enough into the codes to understand fully how the chunking works. So if you were to implement a fix, what would you suggest is the strategy? I can try to implement it in my fork and test it out.

Because of this issue, and another issue - many academics and students use Windows machines, may I suggest that this is a good opportunity to also replace the use of parallel::mclapply() with the use of the {future} package future.apply::future_lapply? This would allow academics to use the PSOCK cluster scheme to run meffil.qc() and meffil.normalize.samples() on their Windows machines. I tried a quick solution in my fork by replacing just 2 mclapply() calls in ./R/mclapply.r with future.apply::future_lapply and everything worked on my Windows machine.

Thank you so much for your work and looking forward to hearing back

Extracting genotypes from Methylation data

Thanks for having such a clear package and workflow.
I was wondering is it also possible to just extract genotypes from the methylation data without having to input genotypes derived from another source?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.