Giter VIP home page Giter VIP logo

eggla's Introduction

EGG Longitudinal Analysis

GitHub tag DOI
codecov R-CMD-check Build Docker

Tools for longitudinal analysis within the EGG (Early Growth Genetics) Consortium.

Installation

  • Install the development version from GitHub:
# install.packages("remotes")
remotes::install_github("mcanouil/eggla")
  • Install a particular version:
# install.packages("remotes")
remotes::install_github("mcanouil/[email protected]")
# or the latest stable version
remotes::install_github("mcanouil/eggla@latest")

Docker Images

  • docker pull ghcr.io/mcanouil/eggla:devel.
  • docker pull ghcr.io/mcanouil/eggla:latest.

License

MIT © Mickaël Canouil, Nicole Warrington, Kimberley Burrows, and Anni Heiskala.

Code of Conduct

Please note that the eggla project is released with a Contributor Code of Conduct.
By contributing to this project, you agree to abide by its terms.

eggla's People

Contributors

dependabot[bot] avatar mcanouil avatar ning-l avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

ning-l

eggla's Issues

`performance::model_performance()` exibits a log-likelihood warning for AIC/BIC

Bug description

Initially reported by @burrowsk

Warning messages:
1: Log-likelihood is corrected for models with transformed response. However, this ignores 'REML=TRUE'. Log-likelihood value is probably inaccurate.

eggla version output

eggla v0.12.3
performance v0.9.1

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

parameter needed for knots run_eggla_lmm.R

Bug description

Hello!

Could a parameter for knots be included in the run_eggla_lmm.R function please? Currently, the knots are set at 2, 8, and 12 (line 82 of function). I believe the final knot placements are to be 1, 8 and 12?

Thanks!

eggla version output

‘0.11.2.9000’

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

New version of run_eggla_gwas not working

Bug description

Hi Mickael, I have been trying to run GWASes again with the new version and I am running out o ideas what to do. When I try the gwas with all four types of phenotypre, the analysis stops at chromosome 4:

run_eggla_gwas(
  data = pcs,
  results = res,
  id_column ="ID",
  traits = c("slope_.*", "auc_.*", "^AP_.*", "^AR_.*"),
  covariates = c("sex","PC1", "PC2", "PC3", "PC4", "PC5"),
  vcfs = list.files(
      path = "/.../genetic_files/",
      pattern = "\\.vcf$|\\.vcf.gz$",
      full.names = TRUE
    ),
  working_directory = "/wrk/anni/",
  use_info = TRUE,
  bin_path = list(bcftools = "/usr/local/bin/bcftools",plink2 = "/usr/local/bin/plink2"),
  bcftools_view_options = NULL,
  build = "37",
  strand = "+",
  info_type = "minimac3 R2",
  threads = 1,
  quiet = FALSE,
  clean = TRUE
)
...
...
...
[chr3.dose.vcf.gz] Performing PLINK2 regression ...
[chr3.dose.vcf.gz] Combining results files ...
[chr3.dose.vcf.gz] Extracting INFO ...
[chr3.dose.vcf.gz] Computing PLINK2 missing rate ...
[chr3.dose.vcf.gz] Computing PLINK2 Hardy-Weinberg test ...
[chr3.dose.vcf.gz] Results written in "/.../gwas_plink2/da9ed3b2a26386f9bbc14160926d590a__chr3.dose.results.gz"
[chr4.dose.vcf.gz] Performing PLINK2 regression ...
[chr4.dose.vcf.gz] Combining results files ...
Error in data.table::setnames(x = results, old = c("CHROM", "POS", "ID",  :
  Items of 'old' not found in column names: [CHROM, POS, ID, A1, AX, A1_FREQ, MACH_R2, OBS_CT, BETA, SE, ...]. Consider skip_absent=TRUE.
In addition: Warning message:
In `[.data.table`(data.table::setnames(x = data.table::rbindlist(l = lapply(X = (function(x) `names<-`(x,  :
  column(s) not removed because not found: [TEST, T_STAT, REF, ALT]

I tried running the analysis only with chromosome 4 and that got completed without problems.

I also tried the GWAS with only AUC and slope as didn't have any problems with the same data sets in the previous version of eggla that I used (I have only been running the analyses with slope and auc until now), and with this got an error at chromosome 6:

run_eggla_gwas(
  data = pcs,
  results = res,
  id_column ="ID",
  traits = c("slope_.*", "auc_.*"),
  covariates = c("sex","PC1", "PC2", "PC3", "PC4", "PC5"),
  vcfs = list.files(
      path = "/.../genetic_files/",
      pattern = "\\.vcf$|\\.vcf.gz$",
      full.names = TRUE
    ),
  working_directory = "/wrk/anni/",
  use_info = TRUE,
  bin_path = list(bcftools = "/usr/local/bin/bcftools",plink2 = "/usr/local/bin/plink2"),
  bcftools_view_options = NULL,
  build = "37",
  strand = "+",
  info_type = "minimac3 R2",
  threads = 1,
  quiet = FALSE,
  clean = TRUE
)
[chr5.dose.vcf.gz] Results written in "/.../gwas_plink2/afdae49888058df51fdaf6dcb37c4c28__chr5.dose.results.gz"
[chr6.dose.vcf.gz] Performing PLINK2 regression ...
[chr6.dose.vcf.gz] Combining results files ...
[chr6.dose.vcf.gz] Extracting INFO ...
[chr6.dose.vcf.gz] Computing PLINK2 missing rate ...
[chr6.dose.vcf.gz] Computing PLINK2 Hardy-Weinberg test ...
Error in data.table::fread(sprintf("%s.vmiss", results_file)) :
  File '/.../gwas_plink2/afdae49888058df51fdaf6dcb37c4c28__chr6.dose.vmiss' does not exist or is non-readable. getwd()=='/wrk/anni/'
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  Discarded single-line footer: <<6     87575956        6:87575956:A:T  A       T       T       A       1827.>>
2: In FUN(X[[i]], ...) :
  Discarded single-line footer: <<6     87576217        6:87576217:G:T  G       T       T       G       420.028 7226    0.0581274       0.98993 PC3     3613    -0.0119956      0.00827245      -1.45007        0.14712>>
3: In FUN(X[[i]], ...) :
  Discarded single-line footer: <<6     87574674        6:87574674:A:G  A       G       G       A       0.0418091       7226    5.785>>
4: In FUN(X[[i]], ...) : Discarded single-line footer: <<6      87571928        6:8>>
5: In FUN(X[[i]], ...) :
  Discarded single-line footer: <<6     87575584        6:87575584:G:A  G       A       A       G       0.0310059       7226    4.29087e-06     0.0309974       PC2     3>>
6: In FUN(X[[i]], ...) :
  Discarded single-line footer: <<6     87576714        6:87576714:G:A  G       A       A       G       0.00292969      7226    4.05437e-07     0.000975752     PC2     3613    0.00012>>
>

What do you think of this?

eggla version output

> packageVersion('eggla')
[1] ‘0.19.0’

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Ensure GWAS output is consistent with analysis plan

Column Header Description Format Examples
SNPID SNP/INDEL ID label from the imputation server, or from directly genotyped data Text 4:10229 / rs693
STRAND Orientation of the site to the human genome strand used + (Should be aligned to the forward strand) +
BUILD Build of the genome on which the SNP is orientated Numeric 37
CHR Chromosome on which the SNP resides Numeric 1
POS Position of SNP on Base pairs on human genome build used 34000345
EFFECT_ALLELE Allele for which the effect (BETA) is reported SNP: Capital letter (A,C,G,T) / INDEL: Allele output by imputation server A
NON_EFFECT_ALLELE Other allele at this site SNP: Capital letter (A,C,G,T) / INDEL: Allele output by imputation server G
N Total number of samples analysed at this site Numeric, integer 1234
N0 Number of homozygous samples with zero copies of the EFFECT_ALLELE Numeric, integer or float with 4 digits to the right of the decimal (imputed) 623 / 745.2345
N1 Number of heterozygous samples with one copy of the EFFECT_ALLELE Numeric, integer or float with 4 digits to the right of the decimal (imputed) 623 / 745.2345
N2 Number of homozygous samples with two copies of the EFFECT_ALLELE Numeric, integer or float with 4 digits to the right of the decimal (imputed) 623 / 745.2345
EAF Allele frequency of the EFFECT_ALLELE Frequency with 4 digits to the right of the decimal 0.3546
HWE_P Exact HWE p-value for samples analysed, only if genotyped data is used in analysis Scientific E notation with 4 digits to the right of the decimal (set to missing for imputed SNPs)
CALL_RATE Genotyping call rate for the SNP (N_successful/N_attempted) Frequency 4 digits to the right of the decimal. Set to 1.000 if imputed 0.9936
BETA Estimate of the effect size Numeric float with 4 digits to the right of the decimal 0.2036
SE Estimated standard error on the estimate of the effect size, uncorrected for genomic control Numeric float with 4 digits to the right of the decimal 0.5611
PVAL P value of the variant association Scientific E notation with 4 digits to the right of the decimal 3.244E-10
IMPUTED Is the SNP/INDEL imputed? 0 = Genotyped / 1 = Imputed 1
INFO_TYPE Type of information provided in the INFO column Text, (set to missing if genotyped) impute_info
INFO Measure of information content for the imputed SNP result Numeric float with 4 digits to the right of the decimal (set to missing if genotyped) 0.4832 / .

INFO_TYPE description given to genotyped SNPs

Bug description

This is related to the previous issue #80 I reported, turned out I didn't check the results carefully enough. So I was able to run run_eggla_gwas with R2 in the INFO, but the description given in the function argument info_type went only to genotyped SNPs.

P.S. Sorry for taking time with the analysis plan, I am trying to learn how to respond to the review request!

eggla version output

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Locale:
  LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
  LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
  LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
  LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

Package version:
  backports_1.4.1      bayestestR_0.13.0    beeswarm_0.4.0      
  BH_1.75.0.0          bit_4.0.4            bit64_4.0.5         
  broom_1.0.1          broom.mixed_0.2.9.4  cli_3.4.1           
  clipr_0.8.0          coda_0.19.4          codetools_0.2.18    
  colorspace_2.0.3     compiler_4.2.1       cpp11_0.4.3         
  crayon_1.5.2         curl_4.3.3           data.table_1.14.4   
  datawizard_0.6.3     digest_0.6.30        distributional_0.3.1
  doParallel_1.0.17    dplyr_1.0.10         eggla_0.17.4        
  ellipsis_0.3.2       fansi_1.0.3          farver_2.1.1        
  forcats_0.5.2        foreach_1.5.2        furrr_0.3.1         
  future_1.28.0        generics_0.1.3       ggbeeswarm_0.6.0    
  ggdist_3.2.0         ggplot2_3.3.6        ggtext_0.1.2        
  globals_0.16.1       glue_1.6.2           graphics_4.2.1      
  grDevices_4.2.1      grid_4.2.1           gridtext_0.1.5      
  growthcleanr_2.0.0   gtable_0.3.1         haven_2.5.1         
  HDInterval_0.2.2     hms_1.1.2            insight_0.18.6      
  isoband_0.2.6        iterators_1.0.14     jpeg_0.1.9          
  labeling_0.4.2       labelled_2.10.0      lattice_0.20.44     
  lifecycle_1.0.3      listenv_0.8.0        magrittr_2.0.3      
  markdown_1.2         MASS_7.3.53          Matrix_1.3.3        
  methods_4.2.1        mgcv_1.8.35          mime_0.12           
  munsell_0.5.0        nlme_3.1.152         numDeriv_2016.8.1.1 
  parallel_4.2.1       parallelly_1.32.1    patchwork_1.1.2     
  performance_0.10.0   pillar_1.8.1         pkgconfig_2.0.3     
  plyr_1.8.7           png_0.1.7            prettyunits_1.1.1   
  progress_1.2.2       purrr_0.3.5          R6_2.5.1            
  RColorBrewer_1.1.3   Rcpp_1.0.9           readr_2.1.3         
  rlang_1.0.6          scales_1.2.1         splines_4.2.1       
  stats_4.2.1          stringi_1.7.8        stringr_1.4.1       
  tibble_3.1.8         tidyr_1.2.1          tidyselect_1.2.0    
  tools_4.2.1          tzdb_0.3.0           utf8_1.2.2          
  utils_4.2.1          vctrs_0.5.0          vipor_0.4.5         
  viridisLite_0.4.1    vroom_1.6.0          withr_2.5.0         
  xfun_0.34            xml2_1.3.3          

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

run_eggla_gwas: Error in colnames

Bug description

I created lmm results with

results_archives <- run_eggla_lmm(
  data = fread("selected_cleaned_data.csv"),
  id_variable = "ID",
  age_days_variable = NULL,
  age_years_variable = "age",
  weight_kilograms_variable = "weight",
  height_centimetres_variable = "height",
  sex_variable = "sex",
  covariates = NULL,
  male_coded_zero = FALSE,
  random_complexity = 2,
  knots=c(1,8,12),
  parallel = FALSE,
  parallel_n_chunks = 1
)

and this worked fine. Next, tried the GWAS with the following error:

run_eggla_gwas(
  data = "selected_cleaned_data.csv",
  results = results_archives,
  id_column = "ID",
  traits = c("slope_.*"),
  covariates = c("sex"),
  vcfs = list.files(
    path = file.path("/genetic_files", "vcf"),
    pattern = "\\.vcf$|\\.vcf.gz$",
    full.names = TRUE
  ),
  path = output_path,
  vep = NULL,
  bin_path = list(
    bcftools = "/usr/local/bin/bcftools",
    plink2 = "/usr/local/bin/plink2"
  ),
  threads = 1
)
#> Formatting VCFs ...
#> Writing trait results ...
#> Error in colnamesInt(x, neworder, check_dups = FALSE) :
#>   argument specifying columns specify non existing column(s): cols[1]='trait_model'
#> In addition: Warning message:
#> In run_eggla_gwas(data = "selected_cleaned_data.csv", results = results_archives,  :
#>   Sex must be coded: '1'/'M'/'m' = male, '2'/'F'/'f' = female, 'NA'/'0' = missing! '0' have been recoded as '2', i.e., female.

Looking at the code I don't see if trait_model was defined after setting it to NULL?

Regarding the warning: could the default coding for sex be the same as in the other functions: 0=female, 1=male?

xfun::session_info()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Locale:
#>   LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
#>   LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
#>   LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
#>   LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#> 
#> Package version:
#>   assertthat_0.2.1    backports_1.4.1     bayestestR_0.12.1
#>   BH_1.75.0.0         bit_4.0.4           bit64_4.0.5
#>   broom_1.0.0         broom.mixed_0.2.9.4 cli_3.3.0
#>   clipr_0.8.0         coda_0.19.4         codetools_0.2-18
#>   colorspace_2.0-3    compiler_4.2.1      cpp11_0.4.2
#>   crayon_1.5.1        data.table_1.14.2   datawizard_0.5.1
#>   DBI_1.1.1           digest_0.6.29       doParallel_1.0.17
#>   dplyr_1.0.9         eggla_0.12.3        ellipsis_0.3.2
#>   fansi_1.0.3         farver_2.1.1        forcats_0.5.2
#>   foreach_1.5.2       furrr_0.3.1         future_1.27.0
#>   future.apply_1.9.0  generics_0.1.3      ggplot2_3.3.6
#>   globals_0.16.0      glue_1.6.2          graphics_4.2.1
#>   grDevices_4.2.1     grid_4.2.1          growthcleanr_2.0.0
#>   gtable_0.3.0        haven_2.5.1         hms_1.1.2
#>   insight_0.18.2      isoband_0.2.5       iterators_1.0.14
#>   labeling_0.4.2      labelled_2.9.1      lattice_0.20-44
#>   lifecycle_1.0.1     listenv_0.8.0       magrittr_2.0.3
#>   MASS_7.3.53         Matrix_1.3-3        methods_4.2.1
#>   mgcv_1.8-35         munsell_0.5.0       nlme_3.1-152
#>   parallel_4.2.1      parallelly_1.32.1   patchwork_1.1.2
#>   performance_0.9.2   pillar_1.8.1        pkgconfig_2.0.3
#>   plyr_1.8.7          prettyunits_1.1.1   progress_1.2.2
#>   purrr_0.3.4         R6_2.5.1            RColorBrewer_1.1.3
#>   Rcpp_1.0.9          readr_2.1.2         rlang_1.0.4
#>   rstudioapi_0.13     scales_1.2.1        splines_4.2.1
#>   stats_4.2.1         stringi_1.7.8       stringr_1.4.1
#>   tibble_3.1.8        tidyr_1.2.0         tidyselect_1.1.2
#>   tools_4.2.1         tzdb_0.3.0          utf8_1.2.2
#>   utils_4.2.1         vctrs_0.4.1         viridisLite_0.4.1
#>   vroom_1.5.7         withr_2.5.0         xfun_0.20

eggla version output

packageVersion('eggla')
#> [1] ‘0.12.3’

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

GWAS function output file to tab separated

Bug description

Could we change the run_eggla_gwas function's summary stats output into tab separated .txt.gz file? I think this was discussed before and is like that in the analysis plan but at the moment the output comes as comma separated. The csv files are difficult in the QC process I have to currently convert them into tab separated before the QC.

eggla version output

> packageVersion('eggla')
[1] ‘0.19.5’

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Error in "vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__"

Issue while merging results/data in GWAS function in one of the following chunk:

  • eggla/R/do_eggla_gwas.R

    Lines 137 to 166 in 3d86570

    dt <- data.table::merge.data.table(
    x = data.table::fread(data),
    y = data.table::setnames(
    x = data.table::rbindlist(lapply(
    X = results_zip,
    path = path,
    FUN = function(izip, path) {
    utils::unzip(
    zipfile = izip,
    files = "derived-slopes.csv",
    exdir = path
    )
    utils::unzip(
    zipfile = izip,
    files = "derived-aucs.csv",
    exdir = path
    )
    on.exit(unlink(file.path(path, c("derived-slopes.csv", "derived-aucs.csv"))))
    data.table::merge.data.table(
    x = data.table::fread(file.path(path, "derived-slopes.csv")),
    y = data.table::fread(file.path(path, "derived-aucs.csv")),
    by = "egg_id"
    )
    }
    )),
    old = "egg_id",
    new = id_column
    ),
    by = id_column
    )
  • eggla/R/do_eggla_gwas.R

    Lines 428 to 436 in 3d86570

    data.table::fwrite(
    x = data.table::merge.data.table(
    x = results,
    y = annot,
    by = c("CHROM", "POS", "ID", "REF", "ALT"), # intersect(names(results), names(annot))
    all.x = TRUE
    ),
    file = output_results_file
    )
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 11768 rows; more than 7079 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Adding in knots as a parm in egg_slopes()

egg_slopes() requires knots, but is fixed within the function. Could knots be placed as a param to allow for use with different knots models please?
e.g.

egg_slopes <- function(
fit,
period = c(0, 0.5, 1.5, 3.5, 6.5, 10, 12, 17),
knots = c(2, 8, 12)
) {
stopifnot(inherits(fit, "lme"))
#knots <- c(2, 8, 12)
id_var <- names(fit[["groups"]])

slopes <- matrix(
data = NA_real_,
nrow = length(unique(fit$data[[id_var]])),
ncol = length(period) / 2
)
colnames(slopes) <- paste0(
"slope_",
sapply(split(
x = period,
f = rep(
x = seq(1, length(period), length(period) %/% 4),
each = length(period) %/% 4
)
), paste, collapse = "--")
)
pred <- matrix(
data = NA_real_,
nrow = length(unique(fit$data[[id_var]])),
ncol = length(period)
)
colnames(pred) <- paste0("pred_period_", round(period, digits = 1))

fxef <- nlme::fixef(fit)
fxef <- unname(fxef[
grep("\(Intercept\)|gsp\(.\)|poly\(.\)", names(fxef))
])
rnef <- nlme::ranef(fit)
rnef <- rnef[, grep("\(Intercept\)|gsp\(.\)|poly\(.\)", names(rnef))]

rnef <- cbind.data.frame(
as.matrix(rnef),
matrix(
data = rep(0, (length(fxef) - ncol(rnef)) * nrow(rnef)),
nrow = nrow(rnef),
ncol = length(fxef) - ncol(rnef)
)
)

for (i in seq_along(unique(fit$data[[id_var]]))) {
coeff <- fxef + as.numeric(rnef[i, ])
for (j in 1:(length(period) / 2)) {
x1 <- period[j * 2 - 1]
y1_tmp <- coeff *
c(x1^0, x1^1, x1^2, x1^3, (x1 - knots)^3) /
c(1, 1, 2, rep(6, 4))
y1 <- sum(y1_tmp[1:(4 + findInterval(x1, knots, left.open = TRUE))])

  x2 <- period[j * 2]
  y2_tmp <- coeff *
    c(x2^0, x2^1, x2^2, x2^3, (x2 - knots)^3) /
    c(1, 1, 2, rep(6, 4))
  y2 <- sum(y2_tmp[1:(4 + findInterval(x2, knots, left.open = TRUE))])
  
  pred[i, j * 2 - 1] <- y1
  pred[i, j * 2] <- y2
  slopes[i, j] <- (y2 - y1) / (x2 - x1)
}

}
out <- cbind.data.frame(ID = unique(fit$data[[id_var]]), pred, slopes)
names(out)[1] <- id_var
out
}

Thanks!

Missing covariates in `stats::predict()` in `ggplot2` code

Reported by @burrowsk: code for plotting predicted trajectories does not account for covariates in the models.

  • In vignettes/articles/models-diagnostics.Rmd
    ```{r model-predict-plot}
    ggplot(
    data = good_res_models[
    j = rbindlist(list(
    (function(x) {
    data.table(age = seq(0, 16, 0.1))[
    j = bmi := if (inherits(x[[1]], "lme")) {
    exp(predict(x[[1]], .SD, level = 0))
    } else NA_real_,
    .SDcols = "age"
    ]
    })(lme)
    )),
    by = list(sex, dataset, models)
    ]
    ) +
    aes(
    x = age,
    y = bmi,
    colour = c("0" = "Male", "1" = "Female")[as.character(sex)]
    ) +
    geom_path() +
    labs(
    title = "Cubic Splines (Fixed Effects = Random Effects)",
    x = "AGE (years)",
    y = "BMI (kg/m\u00B2)",
    colour = NULL
    )
    ```
  • In vignettes/articles/run-cubic-splines.Rmd
    ```{r}
    ggplot() +
    aes(x = age, y = bmi) +
    stat_smooth(
    data = pheno_dt_female,
    mapping = aes(colour = "Loess"),
    method = "loess", formula = y ~ x, linetype = 2, se = FALSE
    ) +
    geom_path(
    data = data.table(
    age = seq(min(pheno_dt_female[["age"]]), max(pheno_dt_female[["age"]]), 0.1)
    )[
    j = bmi := exp(predict(res, .SD, level = 0)),
    .SDcols = "age"
    ],
    mapping = aes(colour = "Cubic Splines (Random Linear Splines)"),
    ) +
    labs(x = "AGE (years)", y = "BMI (kg/m\u00B2)", colour = "Model")
    ```

Error in run_eggla_gwas when use_info=TRUE

Bug description

I'm trying run_eggla_gwas again, and managed to run it with argument use_info=FALSE. With the following command

run_eggla_gwas(
  data = "selected_cleaned_PCs_NFBC1966.csv",
  results = results_archives,
  id_column = "ID",
  traits = c("slope_.*", "auc_.*"),
  covariates = c("sex", "PC1", "PC2", "PC3", "PC4", "PC5"),
  vcfs = list.files(
    path = gen_file_path,
    pattern = "\\.vcf$|\\.vcf.gz$",
    full.names = TRUE
  ),
  working_directory = work_dir_path,
  use_info = TRUE,
  build = "37",
  strand = "+",
  vep = NULL,
  bin_path = list(
    bcftools = "/usr/local/bin/bcftools",
    plink2 = "/usr/local/bin/plink2"
  ),
  threads = 1
)

I get error and message:
Formatting VCFs ...
[chr1.dose.vcf.gz] Performing PLINK2 regression ...
[chr1.dose.vcf.gz] Extracting INFO ...
[chr1.dose.vcf.gz] Combining results files ...
[chr1.dose.vcf.gz] Writing annotated results file ...
Error in colnamesInt(x, neworder, check_dups = FALSE) :
argument specifying columns specify non existing column(s): cols[21]='INFO'
Calls: run_eggla_gwas ... resolve.list -> signalConditionsASAP -> signalConditions
Execution halted
srun: error: localhost: task 0: Exited with exit code 1

Is there something that is missing in the arguments, or in my vcf files? I don't catch where it's coming from.

$ bcftools view -h "genetic_files/chr20.dose.vcf.gz"
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=GENOTYPED,Description="Site was genotyped">
##FILTER=<ID=GENOTYPED_ONLY,Description="Site was genotyped only">
##filedate=2016.10.6
##source=Minimac3
##contig=<ID=20>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
##bcftools_viewVersion=1.4+htslib-1.4
##bcftools_viewCommand=view -h chr20.dose.vcf.gz; Date=Thu May 17 10:11:34 2018
##bcftools_viewVersion=1.10.2+htslib-1.10.2
##bcftools_viewCommand=view -h genetic_files/chr20.dose.vcf.gz; Date=Mon Oct 24 10:39:04 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ...

eggla version output

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Locale:
  LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
  LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
  LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
  LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

Package version:
  backports_1.4.1      bayestestR_0.13.0    beeswarm_0.4.0      
  BH_1.75.0.0          bit_4.0.4            bit64_4.0.5         
  broom_1.0.1          broom.mixed_0.2.9.4  cli_3.4.1           
  clipr_0.8.0          coda_0.19.4          codetools_0.2.18    
  colorspace_2.0.3     compiler_4.2.1       cpp11_0.4.3         
  crayon_1.5.2         curl_4.3.3           data.table_1.14.4   
  datawizard_0.6.3     digest_0.6.30        distributional_0.3.1
  doParallel_1.0.17    dplyr_1.0.10         eggla_0.17.3        
  ellipsis_0.3.2       fansi_1.0.3          farver_2.1.1        
  forcats_0.5.2        foreach_1.5.2        furrr_0.3.1         
  future_1.28.0        generics_0.1.3       ggbeeswarm_0.6.0    
  ggdist_3.2.0         ggplot2_3.3.6        ggtext_0.1.2        
  globals_0.16.1       glue_1.6.2           graphics_4.2.1      
  grDevices_4.2.1      grid_4.2.1           gridtext_0.1.5      
  growthcleanr_2.0.0   gtable_0.3.1         haven_2.5.1         
  HDInterval_0.2.2     hms_1.1.2            insight_0.18.6      
  isoband_0.2.6        iterators_1.0.14     jpeg_0.1.9          
  labeling_0.4.2       labelled_2.10.0      lattice_0.20.44     
  lifecycle_1.0.3      listenv_0.8.0        magrittr_2.0.3      
  markdown_1.2         MASS_7.3.53          Matrix_1.3.3        
  methods_4.2.1        mgcv_1.8.35          mime_0.12           
  munsell_0.5.0        nlme_3.1.152         numDeriv_2016.8.1.1 
  parallel_4.2.1       parallelly_1.32.1    patchwork_1.1.2     
  performance_0.10.0   pillar_1.8.1         pkgconfig_2.0.3     
  plyr_1.8.7           png_0.1.7            prettyunits_1.1.1   
  progress_1.2.2       purrr_0.3.5          R6_2.5.1            
  RColorBrewer_1.1.3   Rcpp_1.0.9           readr_2.1.3         
  rlang_1.0.6          scales_1.2.1         splines_4.2.1       
  stats_4.2.1          stringi_1.7.8        stringr_1.4.1       
  tibble_3.1.8         tidyr_1.2.1          tidyselect_1.2.0    
  tools_4.2.1          tzdb_0.3.0           utf8_1.2.2          
  utils_4.2.1          vctrs_0.5.0          vipor_0.4.5         
  viridisLite_0.4.1    vroom_1.6.0          withr_2.5.0         
  xfun_0.34            xml2_1.3.3          

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

HWE_P in run_eggla_gwas() results inconsistent with analysis plan

Bug description

These are from the older eggla version as I haven't got updated results yet, but in case you haven't noticed this (I didn't find in change log) the result formatting in run_eggla_gwas leaves values for HWE_P in every SNP when the analysis plan stated to set HWE_P missing for imputed SNPs.

I didn'pay attention to this when I checked the NFBC results and corrected it, but now found the same in OBE results:

> summary(OBE_AR$HWE_P) # there are no missing values
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  1.0000  1.0000  0.8859  1.0000  1.0000

eggla version output

eggla_0.17.4 and older?

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Outlier plot legend getting mixed

Bug description

I got that really nice outlier plot as an output from run_eggla_lmm(). The legend explaining colors seems to get mixed when I only have two different colors in my plot:
derived-outliers

Looking at the "derived-ouliers.csv" from the same lot of output I have altogether 701 "Zscore outliers" and 1657 "IQR outliers", and actually all Zscore outliers are also IQR-outliers.

> table(outliers$Outlier_Zscore)

    0     1 
74658   701

> table(outliers$Outlier_IQR)

    0     1 
73702  1657 

> table(outliers$Outlier_IQR, outliers$Outlier_Zscore)
   
        0     1
  0 73702     0
  1   956   701

eggla version output

> packageVersion("eggla")
[1] ‘0.16.0’
> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Finnish_Finland.utf8  LC_CTYPE=Finnish_Finland.utf8    LC_MONETARY=Finnish_Finland.utf8
[4] LC_NUMERIC=C                     LC_TIME=Finnish_Finland.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.2     stringr_1.4.1     dplyr_1.0.10      purrr_0.3.5       readr_2.1.3       tidyr_1.2.1      
 [7] tibble_3.1.8      ggplot2_3.3.6     tidyverse_1.3.2   eggla_0.16.0      data.table_1.14.2

loaded via a namespace (and not attached):
 [1] cellranger_1.1.0    pillar_1.8.1        compiler_4.2.1      dbplyr_2.2.1        tools_4.2.1         lubridate_1.8.0    
 [7] jsonlite_1.8.0      googledrive_2.0.0   lifecycle_1.0.3     gargle_1.2.0        gtable_0.3.1        nlme_3.1-157       
[13] lattice_0.20-45     pkgconfig_2.0.3     rlang_1.0.6         reprex_2.0.2        DBI_1.1.3           cli_3.4.1          
[19] rstudioapi_0.14     patchwork_1.1.2     haven_2.5.1         xml2_1.3.3          withr_2.5.0         httr_1.4.4         
[25] hms_1.1.2           fs_1.5.2            generics_0.1.3      vctrs_0.4.2         googlesheets4_1.0.1 grid_4.2.1         
[31] tidyselect_1.2.0    glue_1.6.2          R6_2.5.1            fansi_1.0.3         readxl_1.4.1        tzdb_0.3.0         
[37] modelr_0.1.9        magrittr_2.0.3      ellipsis_0.3.2      scales_1.2.1        backports_1.4.1     rvest_1.0.3        
[43] assertthat_0.2.1    colorspace_2.0-3    utf8_1.2.2          stringi_1.7.8       munsell_0.5.0       broom_1.0.1        
[49] crayon_1.5.2       

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Update default value for compute_apar function

Bug description

When predicting the age at adiposity peak and rebound in the cohorts, we found that the age at adiposity peak only had 4 or 5 values (e.g. 0.65, 0.7, 0.75, 0.8, 0.85years). This was causing issues when we then excluded outliers in some cohorts as all but one category was flagged as an outlier. The current default 'step' in the compute_apar() function is 0.05. Could we please change this to 0.01? This extra resolution seems to resolve the issues within the cohorts we have tested.

The code that needs changing is as follows (version 0.18.5):

Current code: function (fit, from = c("predicted", "observed"), start = 0.25, end = 10, step = 0.05, filter = NULL)
Updated code: function (fit, from = c("predicted", "observed"), start = 0.25, end = 10, step = 0.01, filter = NULL)

eggla version output

packageVersion('eggla')
[1] ‘0.18.5’

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Increase of number of iterations for the EM algorithm `niterEM`

Hi!
We have been troubleshooting one cohort where the chosen model was not running without warnings. We managed to solve this by increasing the number of iterations for the EM algorithm used to refine the initial estimates of the random effects variance-covariance coefficients from the default of 25 to 100.
The nlme control variable is niterEM

Please could the increase in iterations be incorporated into the egg_model() function in order to run_eggla_lmm() to run for the cohort?

Thanks!

Error in id.vars for covariates in dt_long <- melt() for run_eggla_lmm()

Bug description

I have a covariate to include in the run_eggla_lmm() function. The following error occurs:

Error in melt.data.table(data = data.table::as.data.table(data)[j = list(egg_id = as.character(get(id_variable)),  : 
  One or more values in 'id.vars' is invalid.

This relates to line 105: id.vars

May require addition (and expansion of multiple) covariate names in preceding j = list() (line 29).

Thanks!

eggla version output

‘0.11.2.9000’

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Archives not created `run_eggla_lmm()`

Bug description

I can see the various tables and images created from the archives function (line 152 run_eggla_lmm()) appear in the male and female directories that are created (line 157). however, once the run_eggla_lmm() function is completed for each sex the files are automatically removed. I believed this was due to line 158 r on.exit(unlink(results_directory, recursive = TRUE)) .
Commenting this line (158) resulted in the files being retained.

I'm working on RStudio version 1.4.1717 on Windows 10.

related code block:

 archives <- sapply(
    X = c(0, 1),
    FUN = function(isex) {
      sex_literal <- c("0" = "male", "1" = "female")[as.character(isex)]
      results_directory <- file.path(working_directory, sex_literal)
      dir.create(results_directory, recursive = TRUE)
      on.exit(unlink(results_directory, recursive = TRUE))
      results <- egg_model(
        formula = base_model,
        data = dt_clean[egg_sex %in% isex],
        id_var = "egg_id",
        random_complexity = random_complexity,
        use_car1 = use_car1,
        knots = knots,
        quiet = quiet
      )

      saveRDS(
        object = results,
        file = file.path(
          working_directory,
          sprintf("%s-%s-model-object.rds", Sys.Date(), sex_literal)
        )
      )

      writeLines(
        text = deparse1(results$call),
        con = file.path(results_directory, "model-call.txt")
      )

      data.table::fwrite(
        x = broom.mixed::tidy(results),
        file = file.path(results_directory, "model-coefficients.csv")
      )

      grDevices::png(
        filename = file.path(results_directory, "model-residuals.png"),
        width = 4 * 2.5,
        height = 3 * 2.5,
        units = "in",
        res = 120
      )
      print(
        plot_residuals(
          x = x_variable,
          y = y_variable,
          fit = results
        ) +
          patchwork::plot_annotation(
            title = sprintf(
              "Cubic Splines (Random Linear Splines) - BMI - %s",
              c("0" = "Male", "1" = "Female")[as.character(isex)]
            ),
            tag_levels = "A"
          )
      )
      invisible(grDevices::dev.off())

      data.table::fwrite(
        x = egg_slopes(
          fit = results,
          period = period,
          knots = knots
        ),
        file = file.path(results_directory, "derived-slopes.csv")
      )

      data.table::fwrite(
        x = egg_aucs(
          fit = results,
          period = period,
          knots = knots
        ),
        file = file.path(results_directory, "derived-aucs.csv")
      )

      data.table::fwrite(
        x = egg_outliers(
          fit = results,
          period = period,
          knots = knots
        ),
        file = file.path(results_directory, "derived-outliers.csv")
      )

      eggc <- egg_correlations(
        fit = results,
        period = period,
        knots = knots
      )

      data.table::fwrite(
        x = eggc[["AUC"]],
        file = file.path(results_directory, "derived-aucs-correlations.csv")
      )
      data.table::fwrite(
        x = eggc[["SLOPE"]],
        file = file.path(results_directory, "derived-slopes-correlations.csv")
      )

      owd <- getwd()
      on.exit(setwd(owd), add = TRUE)
      setwd(results_directory)
      archive_filename <- file.path(
        working_directory,
        sprintf("%s.zip", sex_literal)
      )
      utils::zip(
        zipfile = archive_filename,
        files = list.files()
      )
      archive_filename
    }
  )

  if (!quiet) {
    message("Results available at:")
    message(paste(sprintf("+ '%s'", archives), collapse = "\n"))
  }
  archives
}

eggla version output

‘0.11.2.9000’

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Covariates and slopes

Hi!
I'm having trouble marrying up the slopes to the average curve. The lines from one predicted period to another don't fall along the curve as expected.

curve_slopes

The data for the predicted periods and the model fit are the same. I ended up playing with the prediction for the curve by setting my covariate to 0 (my covariate is coded 1/2) and the slopes matched much more closely:

curve_slopes2

Am I being paranoid here, doesn't the predicted periods used for deriving the slopes include covariates in the predictions? I know we specify "filter" in run_eggla_lmm() but I'm having trouble seeing how this is incorporated into predicting the slopes?

Thanks!

Error in renv::init(base = TRUE)

In run_eggla.sh script (and interactive code) the line (14):

-e 'renv::init(base = TRUE)' \

results in error:

renv::init(base = TRUE)
Error in renv::init(base = TRUE) : unused argument (base = TRUE)
Traceback (most recent calls last):
3: renv::init(base = TRUE)
2: renv_dots_check(...)
1: stop(err)

should the line instead be?:

-e 'renv::init(bare = TRUE)' \

Title of residual plots in run_eggla_lmm()

Bug description

The title of the residual plots output from run_eggla_lmm() is fixed at "Cubic Splines (Random Linear Splines) - BMI - [sex]"; which I understand is the default model complexity for the parameter complexity. However, if running more complex random effects the title is no longer correct.
Could the model title be removed? perhaps just the sex should remain?

[lines 245-256]

   p_model_residuals <- plot_residuals(
        x = x_variable,
        y = y_variable,
        fit = results
      ) +
        patchwork::plot_annotation(
          title = sprintf(
            "Cubic Splines (Random Linear Splines) - BMI - %s",
            c("0" = "Male", "1" = "Female")[as.character(isex)]
          ),
          tag_levels = "A"
        )

eggla version output

eggla 0.18.5

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

N and EAF definition

Bug description

I noticed a while ago that run_eggla_gwas output gives N, N0, N1 and N2 results using the total genotyped sample instead of the actual analysed sample which has both genotype + phenotype. We've had also the OBS_CT field from PLINK in the results, so this was not critical as the actual sample size was also in the summary stats. However, related to the meta-analysis we use the minor allele count for filtering the results, and related to this I did some check ups on how well the MAC and MAF from genotyped sample agrees with the values from the actual analysed sample. I found some really large differences between these in some SNPs, which made me go to the vcf files to see what was going on as I couldn't see from the run_eggla_gwas or my own codes an apparent explanation to this. I did my calculations on SNPtest because I couldn't work out how to select the subject IDs on bcftools.

Seems like in run_eggla_gwas that part of the summary stats that come from bcftools are calculated based on the total genotyped sample, and EAF is counted using the GT field in vcf files whereas in the regression analyses we use the dosage DS. Could we have all of the summary stats, including Ns, EAF and HWE_P to be based on the actual analysed sample, and EAF using the dosage? Already thinking of the QC for the meta-analysis, also the extra fields in results that are not stated in the analysis plan, i.e. T_STAT,OBS_CT,ERRCODE,FDR,BONFERRONI could be also dropped out.

eggla version output

eggla_0.18.2

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Error in `as.data.frame.ranef.lme`

Weird new error in GitHub Action

-- RMarkdown error -------------------------------------------------------------
Quitting from lines 182-187 (03-run-cubic-splines.Rmd) 
Error in as.data.frame.ranef.lme(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  'optional' argument not implemented

Replace slopes_dt with aucs_dt in outlier removal [run_eggla_lmm()]

Bug description

In the code that removes outliers in run_eggla_lmm() [lines 461 to 469], should slopes_dt[["egg_id"]] [line 467] be aucs_dt[["egg_id"]] instead. It doesn't make a difference in my dataset, but may in others?

      if (outlier_exclude && sum(outliers_dt[["Outlier"]]) > 0) {
        outliers_to_exclude <- outliers_dt[Outlier == 1]
        for (icol in unique(outliers_to_exclude[["parameter"]])) {
          if (any(grepl("^auc_", icol))) {
            aucs_dt <- data.table::setDF(aucs_dt)
            aucs_dt[
              slopes_dt[["egg_id"]] %in% outliers_to_exclude[parameter %in% icol, ID],
              icol
            ] <- NA_real_

Thanks!

eggla version output

eggla 0.18.5

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Missing covariates in `stats::predict()` and compute AP/AR

As reported by @burrowsk compute_apar() does not keep covariates to predict values (using stats::predict()) in compute_apar() leading to an error.

11: eval(predvars, data, env)
10: eval(predvars, data, env)
9: model.frame.default(formula = asOneFormula(formula(reSt), fixed), 
       data = newdata, na.action = na.action, drop.unused.levels = TRUE, 
       xlev = lapply(contr, rownames))
8: model.frame(formula = asOneFormula(formula(reSt), fixed), data = newdata, 
       na.action = na.action, drop.unused.levels = TRUE, xlev = lapply(contr, 
           rownames))
7: predict.lme(object = fit, newdata = .SD, interval = "prediction")
6: stats::predict(object = fit, newdata = .SD, interval = "prediction")
5: eval(jsub, SDenv, parent.frame())
4: eval(jsub, SDenv, parent.frame())
3: `[.data.table`(data.table::setnames(x = data.table::data.table(egg_id = unique(fit[["groups"]][[id_var]]), 
       egg_ageyears = list(seq(from = start, to = end, by = step))), 
       old = c("egg_id", "egg_ageyears"), new = c(id_var, age_var))[j = `names<-`(list(unlist(.SD)), 
       age_var), .SDcols = c(age_var), by = c(id_var)], j = `:=`(bmi, 
       f(stats::predict(object = fit, newdata = .SD, interval = "prediction"))))
2: data.table::setnames(x = data.table::data.table(egg_id = unique(fit[["groups"]][[id_var]]), 
       egg_ageyears = list(seq(from = start, to = end, by = step))), 
       old = c("egg_id", "egg_ageyears"), new = c(id_var, age_var))[j = `names<-`(list(unlist(.SD)), 
       age_var), .SDcols = c(age_var), by = c(id_var)][j = `:=`(bmi, 
       f(stats::predict(object = fit, newdata = .SD, interval = "prediction")))]
1: compute_apar(fit = res_f_1, from = "predicted")

  • In R/compute_apar.R

    eggla/R/compute_apar.R

    Lines 84 to 111 in 2fce6b1

    out <- switch(EXPR = from,
    "observed" = {
    data.table::as.data.table(fit[["data"]])[
    j = .SD,
    .SDcols = c(id_var, age_var, bmi_var)
    ]
    },
    "predicted" = {
    data.table::setnames(
    x = data.table::data.table(
    egg_id = unique(fit[["groups"]][[id_var]]),
    egg_ageyears = list(seq(from = start, to = end, by = step))
    ),
    old = c("egg_id", "egg_ageyears"),
    new = c(id_var, age_var)
    )[
    j = `names<-`(list(unlist(.SD)), age_var),
    .SDcols = c(age_var),
    by = c(id_var)
    ][
    j = bmi := f(stats::predict(
    object = fit,
    newdata = .SD,
    interval = "prediction"
    ))
    ]
    }
    )

AP/AR BMI predicted for all values of covariate within individuals

Bug description

When predicting data using compute_apar() on LMM data including a cohort specific covariate, the predictions of BMI are performed on both values of the binary variable for example:

Source covariate is coded as 1 = clinic, 2 = questionnaire

Individual 1 is coded as source=1
Individual 2 is coded as source=2

Output from predicted_data object (https://m.canouil.fr/eggla/articles/adiposity-peak-rebound.html)

sex egg_id egg_source egg_ageyears egg_bmi AP AR
0 "1" 1 0.25 16.42 FALSE FALSE
0 "1" 1 0.3 16.72 FALSE FALSE
0 "1" 1 0.35 16.98 FALSE FALSE
0 "1" 1 0.4 17.20 FALSE FALSE
0 "1" 1 0.45 17.37 FALSE FALSE
0 "1" 1 0.5 17.51 FALSE FALSE
0 "1" 1 0.55 17.61 FALSE FALSE
0 "1" 1 0.6 17.68 FALSE FALSE
0 "1" 1 0.65 17.71 FALSE FALSE
0 "1" 1 0.7 17.73 TRUE FALSE
sex egg_id egg_source egg_ageyears egg_bmi AP AR
1 "2" 1 0.25 15.3911 FALSE TRUE
1 "2" 2 0.25 15.0745 FALSE FALSE
1 "2" 1 0.3 15.7286 FALSE FALSE
1 "2" 2 0.3 15.4051 FALSE FALSE
1 "2" 1 0.35 16.0235 FALSE FALSE
1 "2" 2 0.35 15.6939 FALSE FALSE
1 "2" 1 0.4 16.2767 FALSE FALSE
1 "2" 2 0.4 15.9419 FALSE FALSE
1 "2" 1 0.45 16.4898 FALSE FALSE
1 "2" 2 0.45 16.1506 FALSE FALSE

Note that AP and AR appear to be predicted for each age increment, holding source =1. So, there are no duplicate AP or ARs which is good. For the predictions, would we not want to predict based on the individuals value for source, i.e those with 1 predicted with 1 and those with 2 predicted with 2 rather than 1?

Thanks!

eggla version output

[1] ‘0.12.3

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Release `eggla` `v1.0.0`

eggla v1.0.01 should be released either upon manuscript submission (or when the analytical plan will be made available to the whole consortium).

Footnotes

  1. v0.*.* versions are for "beta" or "work in progress" software/tool, whilst v1.*.* or later major versions means the software has all the main features it should have for a stable/production use. See https://semver.org/ for more details on semantic versioning.

Incorrect parameter correlations

Bug description

I noticed that my correlations between the parameters were quite different from previous analyses. For instance using the derived-slopes.csv from run_eggla_lmm() for bmigrowth females, the correlations (using stats::cor) are:

term slope_0--0.5 slope_1.5--3.5 slope_6.5--10 slope_12--17
slope_0--0.5 1 0.991665022459947 0.308438879534697 -0.685332215016685
slope_1.5--3.5 0.991665022459947 1 0.428429069334579 -0.585792675893659
slope_6.5--10 0.308438879534697 0.428429069334579 1 0.481341973221797
slope_12--17 -0.685332215016685 -0.585792675893659 0.481341973221797 1

The correlations for the same data given in derived-parameters-correlations.csv are:

term slope_0--0.5 slope_1.5--3.5 slope_6.5--10 slope_12--17
slope_0--0.5 1 0.96025 0.02945 -0.48586
slope_1.5--3.5 0.96025 1 0.307301 -0.22257
slope_6.5--10 0.02945 0.307301 1 0.859348
slope_12--17 -0.48586 -0.22257 0.859348 1

I think the issue becomes apparent in the following code block of run_eggla_lmm()

      eggc_dt <- Reduce(
        f = function(x, y) merge(x, y, by = names(results$groups)),
        x = lapply(
          X = list(
            slopes_dt,
            aucs_dt,
            apar_dt
          ),
          FUN = function(data) {
            data <- data.table::setnames(data, "egg_id", names(results$groups), skip_absent = TRUE)
            data[grep(paste0("^auc_|^slope_|^AP_|^AR_|^", names(results$groups), "$"), names(data))]
          }
        )
      )

Object eggc_dt results in a reduced dataframe of the expected 13 variables but for only 5 individuals. This should be for all ~50 females right?

Example code run

data("bmigrowth")
fwrite(x = bmigrowth, file = "bmigrowth.csv")
results <- run_eggla_lmm(
     data = fread("bmigrowth.csv"),
     id_variable = "ID",
     age_days_variable = NULL,
     age_years_variable = "age",
     weight_kilograms_variable = "weight",
     height_centimetres_variable = "height",
     sex_variable = "sex",
     covariates = NULL,
     random_complexity = 2
  )

slopes <- read.csv("derived-slopes.csv")
eggc <- data.table::as.data.table(
  x = stats::cor(slopes[grep("^slope_", names(slopes))]),
  keep.rownames = "term"
)

eggla version output

packageVersion('eggla')
[1] ‘0.18.0’

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

`v0.4.0`

  • Implement selected model: "cubic splines with random linear splines". (#2)
  • Add functionality to install R dependencies with specific version. (#3)
  • Check derived parameters
    • AUCs
    • Slopes
    • Adiposity Peak
    • Adiposity rebond

Inconsistent use of vep and vep_file in run_eggla_gwas()

Bug description

The parameter to add the path to a VEP annotation file is vep within run_eggla_gwas() [line 95]. the parameter vep is called in line 443 vep_file = vep, however, all other instances specify vep_file which causes the function to exit upon the first instance [lines 408-423].
Only applicable should a user want to specify a vep file.

if (nzchar(system.file(package = "future.apply"))) {
    eggla_lapply <- function(X, basename_file, vep_file, bin_path, bcftools_view_options, build, strand, info_type, FUN) {
      future.apply::future_lapply(
        X = X,
        basename_file = basename_file,
        vep_file = vep_file,
        bin_path = bin_path,
        bcftools_view_options = bcftools_view_options,
        build = build,
        strand = strand,
        info_type = info_type,
        future.globals = FALSE,
        future.packages = "data.table",
        FUN = FUN
      )
    }

eggla version output

eggla 0.18.5

Checklist

  • formatted your issue so it is easier for us to read?
  • included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
  • documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Concordance of dataframe name for Daymont and non-Daymont datasets

Hi,
There is a very minor issue of data frame naming when following the pipeline according to the README. Could the resulting data frame from Daymont cleaning (visits_clean) be instead named pheno_dt?
The pipeline calls on pheno_dt and therefore does not use visits_clean. The result is no filtering of observations and missing data in BMI giving errors when plotting residuals.

Thanks!

# insert reprex here

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.