mcanouil / eggla Goto Github PK

View Code? Open in Web Editor NEW

2.0 3.0 1.0 49.58 MB

Early Growth Genetics Longitudinal Analysis

Home Page: https://m.canouil.dev/eggla/

License: Other

R 99.78% Shell 0.22%

r-stats r-package growth-curves longitudinal-analysis r genetics mixed-effects-models spline-regression infancy

eggla's Introduction

EGG Longitudinal Analysis

Tools for longitudinal analysis within the EGG (Early Growth Genetics) Consortium.

Installation

Install the development version from GitHub:

# install.packages("remotes")
remotes::install_github("mcanouil/eggla")

Install a particular version:

# install.packages("remotes")
remotes::install_github("mcanouil/[email protected]")
# or the latest stable version
remotes::install_github("mcanouil/eggla@latest")

Docker Images

docker pull ghcr.io/mcanouil/eggla:devel.
docker pull ghcr.io/mcanouil/eggla:latest.

License

MIT © Mickaël Canouil, Nicole Warrington, Kimberley Burrows, and Anni Heiskala.

Code of Conduct

Please note that the eggla project is released with a Contributor Code of Conduct.
By contributing to this project, you agree to abide by its terms.

eggla's People

Contributors

Stargazers

Watchers

Forkers

ning-l

eggla's Issues

Switch from `nlme` to `lme4`

For computation time reasons and since AR1 models do not fit/compute well, replace nlme with lme4.

`performance::model_performance()` exibits a log-likelihood warning for AIC/BIC

Bug description

Initially reported by @burrowsk

Warning messages:
1: Log-likelihood is corrected for models with transformed response. However, this ignores 'REML=TRUE'. Log-likelihood value is probably inaccurate.

`eggla` version output

eggla v0.12.3
performance v0.9.1

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Add functionality to install R dependencies with specific version.

Use renv for renv.lock
Define a function to locally install/load packages listed in renv.lock file.

parameter needed for knots run_eggla_lmm.R

Bug description

Hello!

Could a parameter for knots be included in the run_eggla_lmm.R function please? Currently, the knots are set at 2, 8, and 12 (line 82 of function). I believe the final knot placements are to be 1, 8 and 12?

Thanks!

`eggla` version output

‘0.11.2.9000’

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Check that all diagnostics figures/tables to the output of `run_eggla_lmm()`

Check that all diagnostics figures/tables to the output of run_eggla_lmm():

Anything else? 🤔

New version of run_eggla_gwas not working

Bug description

Hi Mickael, I have been trying to run GWASes again with the new version and I am running out o ideas what to do. When I try the gwas with all four types of phenotypre, the analysis stops at chromosome 4:

run_eggla_gwas(
  data = pcs,
  results = res,
  id_column ="ID",
  traits = c("slope_.*", "auc_.*", "^AP_.*", "^AR_.*"),
  covariates = c("sex","PC1", "PC2", "PC3", "PC4", "PC5"),
  vcfs = list.files(
      path = "/.../genetic_files/",
      pattern = "\\.vcf$|\\.vcf.gz$",
      full.names = TRUE
    ),
  working_directory = "/wrk/anni/",
  use_info = TRUE,
  bin_path = list(bcftools = "/usr/local/bin/bcftools",plink2 = "/usr/local/bin/plink2"),
  bcftools_view_options = NULL,
  build = "37",
  strand = "+",
  info_type = "minimac3 R2",
  threads = 1,
  quiet = FALSE,
  clean = TRUE
)
...
...
...
[chr3.dose.vcf.gz] Performing PLINK2 regression ...
[chr3.dose.vcf.gz] Combining results files ...
[chr3.dose.vcf.gz] Extracting INFO ...
[chr3.dose.vcf.gz] Computing PLINK2 missing rate ...
[chr3.dose.vcf.gz] Computing PLINK2 Hardy-Weinberg test ...
[chr3.dose.vcf.gz] Results written in "/.../gwas_plink2/da9ed3b2a26386f9bbc14160926d590a__chr3.dose.results.gz"
[chr4.dose.vcf.gz] Performing PLINK2 regression ...
[chr4.dose.vcf.gz] Combining results files ...
Error in data.table::setnames(x = results, old = c("CHROM", "POS", "ID",  :
  Items of 'old' not found in column names: [CHROM, POS, ID, A1, AX, A1_FREQ, MACH_R2, OBS_CT, BETA, SE, ...]. Consider skip_absent=TRUE.
In addition: Warning message:
In `[.data.table`(data.table::setnames(x = data.table::rbindlist(l = lapply(X = (function(x) `names<-`(x,  :
  column(s) not removed because not found: [TEST, T_STAT, REF, ALT]

I tried running the analysis only with chromosome 4 and that got completed without problems.

I also tried the GWAS with only AUC and slope as didn't have any problems with the same data sets in the previous version of eggla that I used (I have only been running the analyses with slope and auc until now), and with this got an error at chromosome 6:

run_eggla_gwas(
  data = pcs,
  results = res,
  id_column ="ID",
  traits = c("slope_.*", "auc_.*"),
  covariates = c("sex","PC1", "PC2", "PC3", "PC4", "PC5"),
  vcfs = list.files(
      path = "/.../genetic_files/",
      pattern = "\\.vcf$|\\.vcf.gz$",
      full.names = TRUE
    ),
  working_directory = "/wrk/anni/",
  use_info = TRUE,
  bin_path = list(bcftools = "/usr/local/bin/bcftools",plink2 = "/usr/local/bin/plink2"),
  bcftools_view_options = NULL,
  build = "37",
  strand = "+",
  info_type = "minimac3 R2",
  threads = 1,
  quiet = FALSE,
  clean = TRUE
)
[chr5.dose.vcf.gz] Results written in "/.../gwas_plink2/afdae49888058df51fdaf6dcb37c4c28__chr5.dose.results.gz"
[chr6.dose.vcf.gz] Performing PLINK2 regression ...
[chr6.dose.vcf.gz] Combining results files ...
[chr6.dose.vcf.gz] Extracting INFO ...
[chr6.dose.vcf.gz] Computing PLINK2 missing rate ...
[chr6.dose.vcf.gz] Computing PLINK2 Hardy-Weinberg test ...
Error in data.table::fread(sprintf("%s.vmiss", results_file)) :
  File '/.../gwas_plink2/afdae49888058df51fdaf6dcb37c4c28__chr6.dose.vmiss' does not exist or is non-readable. getwd()=='/wrk/anni/'
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  Discarded single-line footer: <<6     87575956        6:87575956:A:T  A       T       T       A       1827.>>
2: In FUN(X[[i]], ...) :
  Discarded single-line footer: <<6     87576217        6:87576217:G:T  G       T       T       G       420.028 7226    0.0581274       0.98993 PC3     3613    -0.0119956      0.00827245      -1.45007        0.14712>>
3: In FUN(X[[i]], ...) :
  Discarded single-line footer: <<6     87574674        6:87574674:A:G  A       G       G       A       0.0418091       7226    5.785>>
4: In FUN(X[[i]], ...) : Discarded single-line footer: <<6      87571928        6:8>>
5: In FUN(X[[i]], ...) :
  Discarded single-line footer: <<6     87575584        6:87575584:G:A  G       A       A       G       0.0310059       7226    4.29087e-06     0.0309974       PC2     3>>
6: In FUN(X[[i]], ...) :
  Discarded single-line footer: <<6     87576714        6:87576714:G:A  G       A       A       G       0.00292969      7226    4.05437e-07     0.000975752     PC2     3613    0.00012>>
>

What do you think of this?

`eggla` version output

> packageVersion('eggla')
[1] ‘0.19.0’

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Genotypes / Phenotypes with no matching IDs

Add a test to check if IDs match between genotypes and phenotypes.
If not, generate an error message with suggestion to use:

bcftools reheader –samples oldid_newid_space_separated_columns.txt

See https://samtools.github.io/bcftools/bcftools.html#reheader

Ensure GWAS output is consistent with analysis plan

Column Header	Description	Format	Examples
SNPID	SNP/INDEL ID label from the imputation server, or from directly genotyped data	Text	4:10229 / rs693
STRAND	Orientation of the site to the human genome strand used	+ (Should be aligned to the forward strand)	+
BUILD	Build of the genome on which the SNP is orientated	Numeric	37
CHR	Chromosome on which the SNP resides	Numeric	1
POS	Position of SNP on	Base pairs on human genome build used	34000345
EFFECT_ALLELE	Allele for which the effect (BETA) is reported	SNP: Capital letter (A,C,G,T) / INDEL: Allele output by imputation server	A
NON_EFFECT_ALLELE	Other allele at this site	SNP: Capital letter (A,C,G,T) / INDEL: Allele output by imputation server	G
N	Total number of samples analysed at this site	Numeric, integer	1234
N0	Number of homozygous samples with zero copies of the EFFECT_ALLELE	Numeric, integer or float with 4 digits to the right of the decimal (imputed)	623 / 745.2345
N1	Number of heterozygous samples with one copy of the EFFECT_ALLELE	Numeric, integer or float with 4 digits to the right of the decimal (imputed)	623 / 745.2345
N2	Number of homozygous samples with two copies of the EFFECT_ALLELE	Numeric, integer or float with 4 digits to the right of the decimal (imputed)	623 / 745.2345
EAF	Allele frequency of the EFFECT_ALLELE	Frequency with 4 digits to the right of the decimal	0.3546
HWE_P	Exact HWE p-value for samples analysed, only if genotyped data is used in analysis	Scientific E notation with 4 digits to the right of the decimal (set to missing for imputed	SNPs)
CALL_RATE	Genotyping call rate for the SNP (N_successful/N_attempted)	Frequency 4 digits to the right of the decimal. Set to 1.000 if imputed	0.9936
BETA	Estimate of the effect size	Numeric float with 4 digits to the right of the decimal	0.2036
SE	Estimated standard error on the estimate of the effect size, uncorrected for genomic control	Numeric float with 4 digits to the right of the decimal	0.5611
PVAL	P value of the variant association	Scientific E notation with 4 digits to the right of the decimal	3.244E-10
IMPUTED	Is the SNP/INDEL imputed?	0 = Genotyped / 1 = Imputed	1
INFO_TYPE	Type of information provided in the INFO column	Text, (set to missing if genotyped)	impute_info
INFO	Measure of information content for the imputed SNP result	Numeric float with 4 digits to the right of the decimal (set to missing if genotyped)	0.4832 / .

INFO_TYPE description given to genotyped SNPs

Bug description

This is related to the previous issue #80 I reported, turned out I didn't check the results carefully enough. So I was able to run run_eggla_gwas with R2 in the INFO, but the description given in the function argument info_type went only to genotyped SNPs.

P.S. Sorry for taking time with the analysis plan, I am trying to learn how to respond to the review request!

`eggla` version output

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Locale:
  LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
  LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
  LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
  LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

Package version:
  backports_1.4.1      bayestestR_0.13.0    beeswarm_0.4.0      
  BH_1.75.0.0          bit_4.0.4            bit64_4.0.5         
  broom_1.0.1          broom.mixed_0.2.9.4  cli_3.4.1           
  clipr_0.8.0          coda_0.19.4          codetools_0.2.18    
  colorspace_2.0.3     compiler_4.2.1       cpp11_0.4.3         
  crayon_1.5.2         curl_4.3.3           data.table_1.14.4   
  datawizard_0.6.3     digest_0.6.30        distributional_0.3.1
  doParallel_1.0.17    dplyr_1.0.10         eggla_0.17.4        
  ellipsis_0.3.2       fansi_1.0.3          farver_2.1.1        
  forcats_0.5.2        foreach_1.5.2        furrr_0.3.1         
  future_1.28.0        generics_0.1.3       ggbeeswarm_0.6.0    
  ggdist_3.2.0         ggplot2_3.3.6        ggtext_0.1.2        
  globals_0.16.1       glue_1.6.2           graphics_4.2.1      
  grDevices_4.2.1      grid_4.2.1           gridtext_0.1.5      
  growthcleanr_2.0.0   gtable_0.3.1         haven_2.5.1         
  HDInterval_0.2.2     hms_1.1.2            insight_0.18.6      
  isoband_0.2.6        iterators_1.0.14     jpeg_0.1.9          
  labeling_0.4.2       labelled_2.10.0      lattice_0.20.44     
  lifecycle_1.0.3      listenv_0.8.0        magrittr_2.0.3      
  markdown_1.2         MASS_7.3.53          Matrix_1.3.3        
  methods_4.2.1        mgcv_1.8.35          mime_0.12           
  munsell_0.5.0        nlme_3.1.152         numDeriv_2016.8.1.1 
  parallel_4.2.1       parallelly_1.32.1    patchwork_1.1.2     
  performance_0.10.0   pillar_1.8.1         pkgconfig_2.0.3     
  plyr_1.8.7           png_0.1.7            prettyunits_1.1.1   
  progress_1.2.2       purrr_0.3.5          R6_2.5.1            
  RColorBrewer_1.1.3   Rcpp_1.0.9           readr_2.1.3         
  rlang_1.0.6          scales_1.2.1         splines_4.2.1       
  stats_4.2.1          stringi_1.7.8        stringr_1.4.1       
  tibble_3.1.8         tidyr_1.2.1          tidyselect_1.2.0    
  tools_4.2.1          tzdb_0.3.0           utf8_1.2.2          
  utils_4.2.1          vctrs_0.5.0          vipor_0.4.5         
  viridisLite_0.4.1    vroom_1.6.0          withr_2.5.0         
  xfun_0.34            xml2_1.3.3

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Implement selected model

Implement selected model: "cubic splines with random linear splines".

run_eggla_gwas: Error in colnames

Bug description

I created lmm results with

results_archives <- run_eggla_lmm(
  data = fread("selected_cleaned_data.csv"),
  id_variable = "ID",
  age_days_variable = NULL,
  age_years_variable = "age",
  weight_kilograms_variable = "weight",
  height_centimetres_variable = "height",
  sex_variable = "sex",
  covariates = NULL,
  male_coded_zero = FALSE,
  random_complexity = 2,
  knots=c(1,8,12),
  parallel = FALSE,
  parallel_n_chunks = 1
)

and this worked fine. Next, tried the GWAS with the following error:

run_eggla_gwas(
  data = "selected_cleaned_data.csv",
  results = results_archives,
  id_column = "ID",
  traits = c("slope_.*"),
  covariates = c("sex"),
  vcfs = list.files(
    path = file.path("/genetic_files", "vcf"),
    pattern = "\\.vcf$|\\.vcf.gz$",
    full.names = TRUE
  ),
  path = output_path,
  vep = NULL,
  bin_path = list(
    bcftools = "/usr/local/bin/bcftools",
    plink2 = "/usr/local/bin/plink2"
  ),
  threads = 1
)
#> Formatting VCFs ...
#> Writing trait results ...
#> Error in colnamesInt(x, neworder, check_dups = FALSE) :
#>   argument specifying columns specify non existing column(s): cols[1]='trait_model'
#> In addition: Warning message:
#> In run_eggla_gwas(data = "selected_cleaned_data.csv", results = results_archives,  :
#>   Sex must be coded: '1'/'M'/'m' = male, '2'/'F'/'f' = female, 'NA'/'0' = missing! '0' have been recoded as '2', i.e., female.

Looking at the code I don't see if trait_model was defined after setting it to NULL?

Regarding the warning: could the default coding for sex be the same as in the other functions: 0=female, 1=male?

xfun::session_info()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Locale:
#>   LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
#>   LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
#>   LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
#>   LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#> 
#> Package version:
#>   assertthat_0.2.1    backports_1.4.1     bayestestR_0.12.1
#>   BH_1.75.0.0         bit_4.0.4           bit64_4.0.5
#>   broom_1.0.0         broom.mixed_0.2.9.4 cli_3.3.0
#>   clipr_0.8.0         coda_0.19.4         codetools_0.2-18
#>   colorspace_2.0-3    compiler_4.2.1      cpp11_0.4.2
#>   crayon_1.5.1        data.table_1.14.2   datawizard_0.5.1
#>   DBI_1.1.1           digest_0.6.29       doParallel_1.0.17
#>   dplyr_1.0.9         eggla_0.12.3        ellipsis_0.3.2
#>   fansi_1.0.3         farver_2.1.1        forcats_0.5.2
#>   foreach_1.5.2       furrr_0.3.1         future_1.27.0
#>   future.apply_1.9.0  generics_0.1.3      ggplot2_3.3.6
#>   globals_0.16.0      glue_1.6.2          graphics_4.2.1
#>   grDevices_4.2.1     grid_4.2.1          growthcleanr_2.0.0
#>   gtable_0.3.0        haven_2.5.1         hms_1.1.2
#>   insight_0.18.2      isoband_0.2.5       iterators_1.0.14
#>   labeling_0.4.2      labelled_2.9.1      lattice_0.20-44
#>   lifecycle_1.0.1     listenv_0.8.0       magrittr_2.0.3
#>   MASS_7.3.53         Matrix_1.3-3        methods_4.2.1
#>   mgcv_1.8-35         munsell_0.5.0       nlme_3.1-152
#>   parallel_4.2.1      parallelly_1.32.1   patchwork_1.1.2
#>   performance_0.9.2   pillar_1.8.1        pkgconfig_2.0.3
#>   plyr_1.8.7          prettyunits_1.1.1   progress_1.2.2
#>   purrr_0.3.4         R6_2.5.1            RColorBrewer_1.1.3
#>   Rcpp_1.0.9          readr_2.1.2         rlang_1.0.4
#>   rstudioapi_0.13     scales_1.2.1        splines_4.2.1
#>   stats_4.2.1         stringi_1.7.8       stringr_1.4.1
#>   tibble_3.1.8        tidyr_1.2.0         tidyselect_1.1.2
#>   tools_4.2.1         tzdb_0.3.0          utf8_1.2.2
#>   utils_4.2.1         vctrs_0.4.1         viridisLite_0.4.1
#>   vroom_1.5.7         withr_2.5.0         xfun_0.20

`eggla` version output

packageVersion('eggla')
#> [1] ‘0.12.3’

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

GWAS function output file to tab separated

Bug description

Could we change the run_eggla_gwas function's summary stats output into tab separated .txt.gz file? I think this was discussed before and is like that in the analysis plan but at the moment the output comes as comma separated. The csv files are difficult in the QC process I have to currently convert them into tab separated before the QC.

`eggla` version output

> packageVersion('eggla')
[1] ‘0.19.5’

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Error in "vecseq(f, len, if (allow.cartesian || notjoin || !anyDuplicated(f__"

Issue while merging results/data in GWAS function in one of the following chunk:

eggla/R/do_eggla_gwas.R

Lines 137 to 166 in 3d86570

 dt <- data.table::merge.data.table( 

 x = data.table::fread(data), 

 y = data.table::setnames( 

 x = data.table::rbindlist(lapply( 

 X = results_zip, 

 path = path, 

 FUN = function(izip, path) { 

 utils::unzip( 

 zipfile = izip, 

 files = "derived-slopes.csv", 

 exdir = path 

 ) 

 utils::unzip( 

 zipfile = izip, 

 files = "derived-aucs.csv", 

 exdir = path 

 ) 

 on.exit(unlink(file.path(path, c("derived-slopes.csv", "derived-aucs.csv")))) 

 data.table::merge.data.table( 

 x = data.table::fread(file.path(path, "derived-slopes.csv")), 

 y = data.table::fread(file.path(path, "derived-aucs.csv")), 

 by = "egg_id" 

 ) 

 } 

 )), 

 old = "egg_id", 

 new = id_column 

 ), 

 by = id_column 

 )

eggla/R/do_eggla_gwas.R

Lines 428 to 436 in 3d86570

 data.table::fwrite( 

 x = data.table::merge.data.table( 

 x = results, 

 y = annot, 

 by = c("CHROM", "POS", "ID", "REF", "ALT"), # intersect(names(results), names(annot)) 

 all.x = TRUE 

 ), 

 file = output_results_file 

 )

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 11768 rows; more than 7079 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

`growthcleanr` has been archived from CRAN

https://cran.r-project.org/package=growthcleanr

eggla should rely on the GitHub repository?

Adding in knots as a parm in egg_slopes()

egg_slopes() requires knots, but is fixed within the function. Could knots be placed as a param to allow for use with different knots models please?
e.g.

egg_slopes <- function(
fit,
period = c(0, 0.5, 1.5, 3.5, 6.5, 10, 12, 17),
knots = c(2, 8, 12)
) {
stopifnot(inherits(fit, "lme"))
#knots <- c(2, 8, 12)
id_var <- names(fit[["groups"]])

slopes <- matrix(
data = NA_real_,
nrow = length(unique(fit$data[[id_var]])),
ncol = length(period) / 2
)
colnames(slopes) <- paste0(
"slope_",
sapply(split(
x = period,
f = rep(
x = seq(1, length(period), length(period) %/% 4),
each = length(period) %/% 4
)
), paste, collapse = "--")
)
pred <- matrix(
data = NA_real_,
nrow = length(unique(fit$data[[id_var]])),
ncol = length(period)
)
colnames(pred) <- paste0("pred_period_", round(period, digits = 1))

fxef <- nlme::fixef(fit)
fxef <- unname(fxef[
grep("$Intercept$|gsp$.$|poly$.$", names(fxef))
])
rnef <- nlme::ranef(fit)
rnef <- rnef[, grep("$Intercept$|gsp$.$|poly$.$", names(rnef))]

rnef <- cbind.data.frame(
as.matrix(rnef),
matrix(
data = rep(0, (length(fxef) - ncol(rnef)) * nrow(rnef)),
nrow = nrow(rnef),
ncol = length(fxef) - ncol(rnef)
)
)

for (i in seq_along(unique(fit$data[[id_var]]))) {
coeff <- fxef + as.numeric(rnef[i, ])
for (j in 1:(length(period) / 2)) {
x1 <- period[j * 2 - 1]
y1_tmp <- coeff *
c(x1^0, x1^1, x1^2, x1^3, (x1 - knots)^3) /
c(1, 1, 2, rep(6, 4))
y1 <- sum(y1_tmp[1:(4 + findInterval(x1, knots, left.open = TRUE))])

  x2 <- period[j * 2]
  y2_tmp <- coeff *
    c(x2^0, x2^1, x2^2, x2^3, (x2 - knots)^3) /
    c(1, 1, 2, rep(6, 4))
  y2 <- sum(y2_tmp[1:(4 + findInterval(x2, knots, left.open = TRUE))])
  
  pred[i, j * 2 - 1] <- y1
  pred[i, j * 2] <- y2
  slopes[i, j] <- (y2 - y1) / (x2 - x1)
}

}
out <- cbind.data.frame(ID = unique(fit$data[[id_var]]), pred, slopes)
names(out)[1] <- id_var
out
}

Thanks!

Currently in `egg_model` (and related), warnings are not necessarily captured

As discussed in #112 we probably want to keep a record of warnings that could appear in the linear mixed model.
Currently, only the first try will account for warnings and errors before increasing "iterations", then only reports are captured and reported, i.e., warnings are left to the users to disclose.

Missing covariates in `stats::predict()` in `ggplot2` code

Reported by @burrowsk: code for plotting predicted trajectories does not account for covariates in the models.

In vignettes/articles/models-diagnostics.Rmd

eggla/vignettes/articles/models-diagnostics.Rmd

Lines 322 to 350 in 2fce6b1

 ```{r model-predict-plot} 

 ggplot( 

  data = good_res_models[ 

  j = rbindlist(list( 

  (function(x) { 

  data.table(age = seq(0, 16, 0.1))[ 

  j = bmi := if (inherits(x[[1]], "lme")) { 

  exp(predict(x[[1]], .SD, level = 0)) 

  } else NA_real_, 

  .SDcols = "age" 

  ] 

  })(lme) 

  )), 

  by = list(sex, dataset, models) 

  ] 

 ) + 

  aes( 

  x = age, 

  y = bmi, 

  colour = c("0" = "Male", "1" = "Female")[as.character(sex)] 

  ) + 

  geom_path() + 

  labs( 

  title = "Cubic Splines (Fixed Effects = Random Effects)", 

  x = "AGE (years)", 

  y = "BMI (kg/m\u00B2)", 

  colour = NULL 

  ) 

 ```

In vignettes/articles/run-cubic-splines.Rmd

eggla/vignettes/articles/run-cubic-splines.Rmd

Lines 146 to 164 in 2fce6b1

 ```{r} 

 ggplot() + 

  aes(x = age, y = bmi) + 

  stat_smooth( 

  data = pheno_dt_female, 

  mapping = aes(colour = "Loess"), 

  method = "loess", formula = y ~ x, linetype = 2, se = FALSE 

  ) + 

  geom_path( 

  data = data.table( 

  age = seq(min(pheno_dt_female[["age"]]), max(pheno_dt_female[["age"]]), 0.1) 

  )[ 

  j = bmi := exp(predict(res, .SD, level = 0)), 

  .SDcols = "age" 

  ], 

  mapping = aes(colour = "Cubic Splines (Random Linear Splines)"), 

  ) + 

  labs(x = "AGE (years)", y = "BMI (kg/m\u00B2)", colour = "Model") 

 ```

Keep GWAS results separate

Error in run_eggla_gwas when use_info=TRUE

Bug description

I'm trying run_eggla_gwas again, and managed to run it with argument use_info=FALSE. With the following command

run_eggla_gwas(
  data = "selected_cleaned_PCs_NFBC1966.csv",
  results = results_archives,
  id_column = "ID",
  traits = c("slope_.*", "auc_.*"),
  covariates = c("sex", "PC1", "PC2", "PC3", "PC4", "PC5"),
  vcfs = list.files(
    path = gen_file_path,
    pattern = "\\.vcf$|\\.vcf.gz$",
    full.names = TRUE
  ),
  working_directory = work_dir_path,
  use_info = TRUE,
  build = "37",
  strand = "+",
  vep = NULL,
  bin_path = list(
    bcftools = "/usr/local/bin/bcftools",
    plink2 = "/usr/local/bin/plink2"
  ),
  threads = 1
)

I get error and message:
Formatting VCFs ...
[chr1.dose.vcf.gz] Performing PLINK2 regression ...
[chr1.dose.vcf.gz] Extracting INFO ...
[chr1.dose.vcf.gz] Combining results files ...
[chr1.dose.vcf.gz] Writing annotated results file ...
Error in colnamesInt(x, neworder, check_dups = FALSE) :
argument specifying columns specify non existing column(s): cols[21]='INFO'
Calls: run_eggla_gwas ... resolve.list -> signalConditionsASAP -> signalConditions
Execution halted
srun: error: localhost: task 0: Exited with exit code 1

Is there something that is missing in the arguments, or in my vcf files? I don't catch where it's coming from.

$ bcftools view -h "genetic_files/chr20.dose.vcf.gz"
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=GENOTYPED,Description="Site was genotyped">
##FILTER=<ID=GENOTYPED_ONLY,Description="Site was genotyped only">
##filedate=2016.10.6
##source=Minimac3
##contig=<ID=20>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
##bcftools_viewVersion=1.4+htslib-1.4
##bcftools_viewCommand=view -h chr20.dose.vcf.gz; Date=Thu May 17 10:11:34 2018
##bcftools_viewVersion=1.10.2+htslib-1.10.2
##bcftools_viewCommand=view -h genetic_files/chr20.dose.vcf.gz; Date=Mon Oct 24 10:39:04 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ...

`eggla` version output

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Locale:
  LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
  LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
  LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
  LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

Package version:
  backports_1.4.1      bayestestR_0.13.0    beeswarm_0.4.0      
  BH_1.75.0.0          bit_4.0.4            bit64_4.0.5         
  broom_1.0.1          broom.mixed_0.2.9.4  cli_3.4.1           
  clipr_0.8.0          coda_0.19.4          codetools_0.2.18    
  colorspace_2.0.3     compiler_4.2.1       cpp11_0.4.3         
  crayon_1.5.2         curl_4.3.3           data.table_1.14.4   
  datawizard_0.6.3     digest_0.6.30        distributional_0.3.1
  doParallel_1.0.17    dplyr_1.0.10         eggla_0.17.3        
  ellipsis_0.3.2       fansi_1.0.3          farver_2.1.1        
  forcats_0.5.2        foreach_1.5.2        furrr_0.3.1         
  future_1.28.0        generics_0.1.3       ggbeeswarm_0.6.0    
  ggdist_3.2.0         ggplot2_3.3.6        ggtext_0.1.2        
  globals_0.16.1       glue_1.6.2           graphics_4.2.1      
  grDevices_4.2.1      grid_4.2.1           gridtext_0.1.5      
  growthcleanr_2.0.0   gtable_0.3.1         haven_2.5.1         
  HDInterval_0.2.2     hms_1.1.2            insight_0.18.6      
  isoband_0.2.6        iterators_1.0.14     jpeg_0.1.9          
  labeling_0.4.2       labelled_2.10.0      lattice_0.20.44     
  lifecycle_1.0.3      listenv_0.8.0        magrittr_2.0.3      
  markdown_1.2         MASS_7.3.53          Matrix_1.3.3        
  methods_4.2.1        mgcv_1.8.35          mime_0.12           
  munsell_0.5.0        nlme_3.1.152         numDeriv_2016.8.1.1 
  parallel_4.2.1       parallelly_1.32.1    patchwork_1.1.2     
  performance_0.10.0   pillar_1.8.1         pkgconfig_2.0.3     
  plyr_1.8.7           png_0.1.7            prettyunits_1.1.1   
  progress_1.2.2       purrr_0.3.5          R6_2.5.1            
  RColorBrewer_1.1.3   Rcpp_1.0.9           readr_2.1.3         
  rlang_1.0.6          scales_1.2.1         splines_4.2.1       
  stats_4.2.1          stringi_1.7.8        stringr_1.4.1       
  tibble_3.1.8         tidyr_1.2.1          tidyselect_1.2.0    
  tools_4.2.1          tzdb_0.3.0           utf8_1.2.2          
  utils_4.2.1          vctrs_0.5.0          vipor_0.4.5         
  viridisLite_0.4.1    vroom_1.6.0          withr_2.5.0         
  xfun_0.34            xml2_1.3.3

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

HWE_P in run_eggla_gwas() results inconsistent with analysis plan

Bug description

These are from the older eggla version as I haven't got updated results yet, but in case you haven't noticed this (I didn't find in change log) the result formatting in run_eggla_gwas leaves values for HWE_P in every SNP when the analysis plan stated to set HWE_P missing for imputed SNPs.

I didn'pay attention to this when I checked the NFBC results and corrected it, but now found the same in OBE results:

> summary(OBE_AR$HWE_P) # there are no missing values
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  1.0000  1.0000  0.8859  1.0000  1.0000

`eggla` version output

eggla_0.17.4 and older?

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

All parameters correlations

Outlier plot legend getting mixed

Bug description

I got that really nice outlier plot as an output from run_eggla_lmm(). The legend explaining colors seems to get mixed when I only have two different colors in my plot:

Looking at the "derived-ouliers.csv" from the same lot of output I have altogether 701 "Zscore outliers" and 1657 "IQR outliers", and actually all Zscore outliers are also IQR-outliers.

> table(outliers$Outlier_Zscore)

    0     1 
74658   701

> table(outliers$Outlier_IQR)

    0     1 
73702  1657 

> table(outliers$Outlier_IQR, outliers$Outlier_Zscore)
   
        0     1
  0 73702     0
  1   956   701

`eggla` version output

> packageVersion("eggla")
[1] ‘0.16.0’
> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Finnish_Finland.utf8  LC_CTYPE=Finnish_Finland.utf8    LC_MONETARY=Finnish_Finland.utf8
[4] LC_NUMERIC=C                     LC_TIME=Finnish_Finland.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.2     stringr_1.4.1     dplyr_1.0.10      purrr_0.3.5       readr_2.1.3       tidyr_1.2.1      
 [7] tibble_3.1.8      ggplot2_3.3.6     tidyverse_1.3.2   eggla_0.16.0      data.table_1.14.2

loaded via a namespace (and not attached):
 [1] cellranger_1.1.0    pillar_1.8.1        compiler_4.2.1      dbplyr_2.2.1        tools_4.2.1         lubridate_1.8.0    
 [7] jsonlite_1.8.0      googledrive_2.0.0   lifecycle_1.0.3     gargle_1.2.0        gtable_0.3.1        nlme_3.1-157       
[13] lattice_0.20-45     pkgconfig_2.0.3     rlang_1.0.6         reprex_2.0.2        DBI_1.1.3           cli_3.4.1          
[19] rstudioapi_0.14     patchwork_1.1.2     haven_2.5.1         xml2_1.3.3          withr_2.5.0         httr_1.4.4         
[25] hms_1.1.2           fs_1.5.2            generics_0.1.3      vctrs_0.4.2         googlesheets4_1.0.1 grid_4.2.1         
[31] tidyselect_1.2.0    glue_1.6.2          R6_2.5.1            fansi_1.0.3         readxl_1.4.1        tzdb_0.3.0         
[37] modelr_0.1.9        magrittr_2.0.3      ellipsis_0.3.2      scales_1.2.1        backports_1.4.1     rvest_1.0.3        
[43] assertthat_0.2.1    colorspace_2.0-3    utf8_1.2.2          stringi_1.7.8       munsell_0.5.0       broom_1.0.1        
[49] crayon_1.5.2

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Update default value for compute_apar function

Bug description

When predicting the age at adiposity peak and rebound in the cohorts, we found that the age at adiposity peak only had 4 or 5 values (e.g. 0.65, 0.7, 0.75, 0.8, 0.85years). This was causing issues when we then excluded outliers in some cohorts as all but one category was flagged as an outlier. The current default 'step' in the compute_apar() function is 0.05. Could we please change this to 0.01? This extra resolution seems to resolve the issues within the cohorts we have tested.

The code that needs changing is as follows (version 0.18.5):

Current code: function (fit, from = c("predicted", "observed"), start = 0.25, end = 10, step = 0.05, filter = NULL)
Updated code: function (fit, from = c("predicted", "observed"), start = 0.25, end = 10, step = 0.01, filter = NULL)

`eggla` version output

packageVersion('eggla')
[1] ‘0.18.5’

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Increase of number of iterations for the EM algorithm `niterEM`

Hi!
We have been troubleshooting one cohort where the chosen model was not running without warnings. We managed to solve this by increasing the number of iterations for the EM algorithm used to refine the initial estimates of the random effects variance-covariance coefficients from the default of 25 to 100.
The nlme control variable is niterEM

Please could the increase in iterations be incorporated into the egg_model() function in order to run_eggla_lmm() to run for the cohort?

Thanks!

Error in id.vars for covariates in dt_long <- melt() for run_eggla_lmm()

Bug description

I have a covariate to include in the run_eggla_lmm() function. The following error occurs:

Error in melt.data.table(data = data.table::as.data.table(data)[j = list(egg_id = as.character(get(id_variable)),  : 
  One or more values in 'id.vars' is invalid.

This relates to line 105: id.vars

May require addition (and expansion of multiple) covariate names in preceding j = list() (line 29).

Thanks!

`eggla` version output

‘0.11.2.9000’

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Archives not created `run_eggla_lmm()`

Bug description

I can see the various tables and images created from the archives function (line 152 run_eggla_lmm()) appear in the male and female directories that are created (line 157). however, once the run_eggla_lmm() function is completed for each sex the files are automatically removed. I believed this was due to line 158 r on.exit(unlink(results_directory, recursive = TRUE)) .
Commenting this line (158) resulted in the files being retained.

I'm working on RStudio version 1.4.1717 on Windows 10.

related code block:

 archives <- sapply(
    X = c(0, 1),
    FUN = function(isex) {
      sex_literal <- c("0" = "male", "1" = "female")[as.character(isex)]
      results_directory <- file.path(working_directory, sex_literal)
      dir.create(results_directory, recursive = TRUE)
      on.exit(unlink(results_directory, recursive = TRUE))
      results <- egg_model(
        formula = base_model,
        data = dt_clean[egg_sex %in% isex],
        id_var = "egg_id",
        random_complexity = random_complexity,
        use_car1 = use_car1,
        knots = knots,
        quiet = quiet
      )

      saveRDS(
        object = results,
        file = file.path(
          working_directory,
          sprintf("%s-%s-model-object.rds", Sys.Date(), sex_literal)
        )
      )

      writeLines(
        text = deparse1(results$call),
        con = file.path(results_directory, "model-call.txt")
      )

      data.table::fwrite(
        x = broom.mixed::tidy(results),
        file = file.path(results_directory, "model-coefficients.csv")
      )

      grDevices::png(
        filename = file.path(results_directory, "model-residuals.png"),
        width = 4 * 2.5,
        height = 3 * 2.5,
        units = "in",
        res = 120
      )
      print(
        plot_residuals(
          x = x_variable,
          y = y_variable,
          fit = results
        ) +
          patchwork::plot_annotation(
            title = sprintf(
              "Cubic Splines (Random Linear Splines) - BMI - %s",
              c("0" = "Male", "1" = "Female")[as.character(isex)]
            ),
            tag_levels = "A"
          )
      )
      invisible(grDevices::dev.off())

      data.table::fwrite(
        x = egg_slopes(
          fit = results,
          period = period,
          knots = knots
        ),
        file = file.path(results_directory, "derived-slopes.csv")
      )

      data.table::fwrite(
        x = egg_aucs(
          fit = results,
          period = period,
          knots = knots
        ),
        file = file.path(results_directory, "derived-aucs.csv")
      )

      data.table::fwrite(
        x = egg_outliers(
          fit = results,
          period = period,
          knots = knots
        ),
        file = file.path(results_directory, "derived-outliers.csv")
      )

      eggc <- egg_correlations(
        fit = results,
        period = period,
        knots = knots
      )

      data.table::fwrite(
        x = eggc[["AUC"]],
        file = file.path(results_directory, "derived-aucs-correlations.csv")
      )
      data.table::fwrite(
        x = eggc[["SLOPE"]],
        file = file.path(results_directory, "derived-slopes-correlations.csv")
      )

      owd <- getwd()
      on.exit(setwd(owd), add = TRUE)
      setwd(results_directory)
      archive_filename <- file.path(
        working_directory,
        sprintf("%s.zip", sex_literal)
      )
      utils::zip(
        zipfile = archive_filename,
        files = list.files()
      )
      archive_filename
    }
  )

  if (!quiet) {
    message("Results available at:")
    message(paste(sprintf("+ '%s'", archives), collapse = "\n"))
  }
  archives
}

`eggla` version output

‘0.11.2.9000’

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Covariates and slopes

Hi!
I'm having trouble marrying up the slopes to the average curve. The lines from one predicted period to another don't fall along the curve as expected.

The data for the predicted periods and the model fit are the same. I ended up playing with the prediction for the curve by setting my covariate to 0 (my covariate is coded 1/2) and the slopes matched much more closely:

Am I being paranoid here, doesn't the predicted periods used for deriving the slopes include covariates in the predictions? I know we specify "filter" in run_eggla_lmm() but I'm having trouble seeing how this is incorporated into predicting the slopes?

Thanks!

Update documentation from AR1 to CAR1

The code is currently using corCAR1, but the documentation and parameter's name, do not reflect that.

Error in renv::init(base = TRUE)

In run_eggla.sh script (and interactive code) the line (14):

-e 'renv::init(base = TRUE)' \

results in error:

renv::init(base = TRUE)
Error in renv::init(base = TRUE) : unused argument (base = TRUE)
Traceback (most recent calls last):
3: renv::init(base = TRUE)
2: renv_dots_check(...)
1: stop(err)

should the line instead be?:

-e 'renv::init(bare = TRUE)' \

Update knots from 2,8,12 to 1,8,12

Update default knots setting to 1, 8, 12.

Title of residual plots in run_eggla_lmm()

Bug description

The title of the residual plots output from run_eggla_lmm() is fixed at "Cubic Splines (Random Linear Splines) - BMI - [sex]"; which I understand is the default model complexity for the parameter complexity. However, if running more complex random effects the title is no longer correct.
Could the model title be removed? perhaps just the sex should remain?

[lines 245-256]

   p_model_residuals <- plot_residuals(
        x = x_variable,
        y = y_variable,
        fit = results
      ) +
        patchwork::plot_annotation(
          title = sprintf(
            "Cubic Splines (Random Linear Splines) - BMI - %s",
            c("0" = "Male", "1" = "Female")[as.character(isex)]
          ),
          tag_levels = "A"
        )

`eggla` version output

eggla 0.18.5

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

N and EAF definition

Bug description

I noticed a while ago that run_eggla_gwas output gives N, N0, N1 and N2 results using the total genotyped sample instead of the actual analysed sample which has both genotype + phenotype. We've had also the OBS_CT field from PLINK in the results, so this was not critical as the actual sample size was also in the summary stats. However, related to the meta-analysis we use the minor allele count for filtering the results, and related to this I did some check ups on how well the MAC and MAF from genotyped sample agrees with the values from the actual analysed sample. I found some really large differences between these in some SNPs, which made me go to the vcf files to see what was going on as I couldn't see from the run_eggla_gwas or my own codes an apparent explanation to this. I did my calculations on SNPtest because I couldn't work out how to select the subject IDs on bcftools.

Seems like in run_eggla_gwas that part of the summary stats that come from bcftools are calculated based on the total genotyped sample, and EAF is counted using the GT field in vcf files whereas in the regression analyses we use the dosage DS. Could we have all of the summary stats, including Ns, EAF and HWE_P to be based on the actual analysed sample, and EAF using the dosage? Already thinking of the QC for the meta-analysis, also the extra fields in results that are not stated in the analysis plan, i.e. T_STAT,OBS_CT,ERRCODE,FDR,BONFERRONI could be also dropped out.

`eggla` version output

eggla_0.18.2

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Error in `as.data.frame.ranef.lme`

Weird new error in GitHub Action

-- RMarkdown error -------------------------------------------------------------
Quitting from lines 182-187 (03-run-cubic-splines.Rmd) 
Error in as.data.frame.ranef.lme(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : 
  'optional' argument not implemented

Replace slopes_dt with aucs_dt in outlier removal [run_eggla_lmm()]

Bug description

In the code that removes outliers in run_eggla_lmm() [lines 461 to 469], should slopes_dt[["egg_id"]] [line 467] be aucs_dt[["egg_id"]] instead. It doesn't make a difference in my dataset, but may in others?

      if (outlier_exclude && sum(outliers_dt[["Outlier"]]) > 0) {
        outliers_to_exclude <- outliers_dt[Outlier == 1]
        for (icol in unique(outliers_to_exclude[["parameter"]])) {
          if (any(grepl("^auc_", icol))) {
            aucs_dt <- data.table::setDF(aucs_dt)
            aucs_dt[
              slopes_dt[["egg_id"]] %in% outliers_to_exclude[parameter %in% icol, ID],
              icol
            ] <- NA_real_

Thanks!

`eggla` version output

eggla 0.18.5

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Missing covariates in `stats::predict()` and compute AP/AR

As reported by @burrowsk compute_apar() does not keep covariates to predict values (using stats::predict()) in compute_apar() leading to an error.

11: eval(predvars, data, env)
10: eval(predvars, data, env)
9: model.frame.default(formula = asOneFormula(formula(reSt), fixed), 
       data = newdata, na.action = na.action, drop.unused.levels = TRUE, 
       xlev = lapply(contr, rownames))
8: model.frame(formula = asOneFormula(formula(reSt), fixed), data = newdata, 
       na.action = na.action, drop.unused.levels = TRUE, xlev = lapply(contr, 
           rownames))
7: predict.lme(object = fit, newdata = .SD, interval = "prediction")
6: stats::predict(object = fit, newdata = .SD, interval = "prediction")
5: eval(jsub, SDenv, parent.frame())
4: eval(jsub, SDenv, parent.frame())
3: `[.data.table`(data.table::setnames(x = data.table::data.table(egg_id = unique(fit[["groups"]][[id_var]]), 
       egg_ageyears = list(seq(from = start, to = end, by = step))), 
       old = c("egg_id", "egg_ageyears"), new = c(id_var, age_var))[j = `names<-`(list(unlist(.SD)), 
       age_var), .SDcols = c(age_var), by = c(id_var)], j = `:=`(bmi, 
       f(stats::predict(object = fit, newdata = .SD, interval = "prediction"))))
2: data.table::setnames(x = data.table::data.table(egg_id = unique(fit[["groups"]][[id_var]]), 
       egg_ageyears = list(seq(from = start, to = end, by = step))), 
       old = c("egg_id", "egg_ageyears"), new = c(id_var, age_var))[j = `names<-`(list(unlist(.SD)), 
       age_var), .SDcols = c(age_var), by = c(id_var)][j = `:=`(bmi, 
       f(stats::predict(object = fit, newdata = .SD, interval = "prediction")))]
1: compute_apar(fit = res_f_1, from = "predicted")

In R/compute_apar.R

eggla/R/compute_apar.R

Lines 84 to 111 in 2fce6b1

 out <- switch(EXPR = from, 

 "observed" = { 

 data.table::as.data.table(fit[["data"]])[ 

 j = .SD, 

 .SDcols = c(id_var, age_var, bmi_var) 

 ] 

 }, 

 "predicted" = { 

 data.table::setnames( 

 x = data.table::data.table( 

 egg_id = unique(fit[["groups"]][[id_var]]), 

 egg_ageyears = list(seq(from = start, to = end, by = step)) 

 ), 

 old = c("egg_id", "egg_ageyears"), 

 new = c(id_var, age_var) 

 )[ 

 j = `names<-`(list(unlist(.SD)), age_var), 

 .SDcols = c(age_var), 

 by = c(id_var) 

 ][ 

 j = bmi := f(stats::predict( 

 object = fit, 

 newdata = .SD, 

 interval = "prediction" 

 )) 

 ] 

 } 

 )

Add correlation and outlier detection for derived parameter

function to compute the correlation matrix of the derived parameters and possibly to plot that matrix (6839dcb)
function to check outliers, using Tukey's method (or something else) (78ab946)

AP/AR BMI predicted for all values of covariate within individuals

Bug description

When predicting data using compute_apar() on LMM data including a cohort specific covariate, the predictions of BMI are performed on both values of the binary variable for example:

Source covariate is coded as 1 = clinic, 2 = questionnaire

Individual 1 is coded as source=1
Individual 2 is coded as source=2

Output from predicted_data object (https://m.canouil.fr/eggla/articles/adiposity-peak-rebound.html)

egg_id	egg_source	egg_ageyears	egg_bmi	AP	AR
"1"	1	0.25	16.42	FALSE	FALSE
"1"	1	0.3	16.72	FALSE	FALSE
"1"	1	0.35	16.98	FALSE	FALSE
"1"	1	0.4	17.20	FALSE	FALSE
"1"	1	0.45	17.37	FALSE	FALSE
"1"	1	0.5	17.51	FALSE	FALSE
"1"	1	0.55	17.61	FALSE	FALSE
"1"	1	0.6	17.68	FALSE	FALSE
"1"	1	0.65	17.71	FALSE	FALSE
"1"	1	0.7	17.73	TRUE	FALSE

sex	egg_id	egg_source	egg_ageyears	egg_bmi	AP	AR
1	"2"	1	0.25	15.3911	FALSE	TRUE
1	"2"	2	0.25	15.0745	FALSE	FALSE
1	"2"	1	0.3	15.7286	FALSE	FALSE
1	"2"	2	0.3	15.4051	FALSE	FALSE
1	"2"	1	0.35	16.0235	FALSE	FALSE
1	"2"	2	0.35	15.6939	FALSE	FALSE
1	"2"	1	0.4	16.2767	FALSE	FALSE
1	"2"	2	0.4	15.9419	FALSE	FALSE
1	"2"	1	0.45	16.4898	FALSE	FALSE
1	"2"	2	0.45	16.1506	FALSE	FALSE

Note that AP and AR appear to be predicted for each age increment, holding source =1. So, there are no duplicate AP or ARs which is good. For the predictions, would we not want to predict based on the individuals value for source, i.e those with 1 predicted with 1 and those with 2 predicted with 2 rather than 1?

Thanks!

`eggla` version output

[1] ‘0.12.3’

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

do_eggla_gwas: no visible binding for global variable ‘trait_model’ (NOTE)

do_eggla_gwas: no visible binding for global variable ‘trait_model’
  (/home/runner/work/eggla/eggla/check/eggla.Rcheck/00_pkg_src/eggla/R/do_eggla_gwas.R:449-469)
Undefined global functions or variables:
  trait_model

Release `eggla` `v1.0.0`

eggla v1.0.0¹ should be released either upon manuscript submission (or when the analytical plan will be made available to the whole consortium).

Update release date with v1.0.0 release date (https://github.com/mcanouil/eggla/blob/main/CITATION.cff#L11)
```
cffr::cffr::cff_write()
```

v0.*.* versions are for "beta" or "work in progress" software/tool, whilst v1.*.* or later major versions means the software has all the main features it should have for a stable/production use. See https://semver.org/ for more details on semantic versioning. ↩

Incorrect parameter correlations

Bug description

I noticed that my correlations between the parameters were quite different from previous analyses. For instance using the derived-slopes.csv from run_eggla_lmm() for bmigrowth females, the correlations (using stats::cor) are:

term	slope_0--0.5	slope_1.5--3.5	slope_6.5--10	slope_12--17
slope_0--0.5	1	0.991665022459947	0.308438879534697	-0.685332215016685
slope_1.5--3.5	0.991665022459947	1	0.428429069334579	-0.585792675893659
slope_6.5--10	0.308438879534697	0.428429069334579	1	0.481341973221797
slope_12--17	-0.685332215016685	-0.585792675893659	0.481341973221797	1

The correlations for the same data given in derived-parameters-correlations.csv are:

term	slope_0--0.5	slope_1.5--3.5	slope_6.5--10	slope_12--17
slope_0--0.5	1	0.96025	0.02945	-0.48586
slope_1.5--3.5	0.96025	1	0.307301	-0.22257
slope_6.5--10	0.02945	0.307301	1	0.859348
slope_12--17	-0.48586	-0.22257	0.859348	1

I think the issue becomes apparent in the following code block of run_eggla_lmm()

      eggc_dt <- Reduce(
        f = function(x, y) merge(x, y, by = names(results$groups)),
        x = lapply(
          X = list(
            slopes_dt,
            aucs_dt,
            apar_dt
          ),
          FUN = function(data) {
            data <- data.table::setnames(data, "egg_id", names(results$groups), skip_absent = TRUE)
            data[grep(paste0("^auc_|^slope_|^AP_|^AR_|^", names(results$groups), "$"), names(data))]
          }
        )
      )

Object eggc_dt results in a reduced dataframe of the expected 13 variables but for only 5 individuals. This should be for all ~50 females right?

Example code run

data("bmigrowth")
fwrite(x = bmigrowth, file = "bmigrowth.csv")
results <- run_eggla_lmm(
     data = fread("bmigrowth.csv"),
     id_variable = "ID",
     age_days_variable = NULL,
     age_years_variable = "age",
     weight_kilograms_variable = "weight",
     height_centimetres_variable = "height",
     sex_variable = "sex",
     covariates = NULL,
     random_complexity = 2
  )

slopes <- read.csv("derived-slopes.csv")
eggc <- data.table::as.data.table(
  x = stats::cor(slopes[grep("^slope_", names(slopes))]),
  keep.rownames = "term"
)

`eggla` version output

packageVersion('eggla')
[1] ‘0.18.0’

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Add AP and AR in addition to slopes and AUCs for GWAS

AP and AR should be part of the GWAS.

plot_egg_slopes() to reflect egg_slope() knots args

Thanks for egg_slopes()! plot_egg_slopes() will also need to reflect knots args.

`v0.4.0`

Implement selected model: "cubic splines with random linear splines". (#2)
Add functionality to install R dependencies with specific version. (#3)
Check derived parameters
- AUCs
- Slopes
- Adiposity Peak
- Adiposity rebond

Inconsistent use of vep and vep_file in run_eggla_gwas()

Bug description

The parameter to add the path to a VEP annotation file is vep within run_eggla_gwas() [line 95]. the parameter vep is called in line 443 vep_file = vep, however, all other instances specify vep_file which causes the function to exit upon the first instance [lines 408-423].
Only applicable should a user want to specify a vep file.

if (nzchar(system.file(package = "future.apply"))) {
    eggla_lapply <- function(X, basename_file, vep_file, bin_path, bcftools_view_options, build, strand, info_type, FUN) {
      future.apply::future_lapply(
        X = X,
        basename_file = basename_file,
        vep_file = vep_file,
        bin_path = bin_path,
        bcftools_view_options = bcftools_view_options,
        build = build,
        strand = strand,
        info_type = info_type,
        future.globals = FALSE,
        future.packages = "data.table",
        FUN = FUN
      )
    }

`eggla` version output

eggla 0.18.5

Checklist

formatted your issue so it is easier for us to read?
included a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
documented the version of eggla you're running, by pasting the output from running packageVersion('eggla') in the "eggla version output" text area?
documented which operating system you're running? If on Linux, please provide the specific distribution as well.

Concordance of dataframe name for Daymont and non-Daymont datasets

Hi,
There is a very minor issue of data frame naming when following the pipeline according to the README. Could the resulting data frame from Daymont cleaning (visits_clean) be instead named pheno_dt?
The pipeline calls on pheno_dt and therefore does not use visits_clean. The result is no filtering of observations and missing data in BMI giving errors when plotting residuals.

Thanks!

# insert reprex here

	dt <- data.table::merge.data.table(
	x = data.table::fread(data),
	y = data.table::setnames(
	x = data.table::rbindlist(lapply(
	X = results_zip,
	path = path,
	FUN = function(izip, path) {
	utils::unzip(
	zipfile = izip,
	files = "derived-slopes.csv",
	exdir = path
	)
	utils::unzip(
	zipfile = izip,
	files = "derived-aucs.csv",
	exdir = path
	)
	on.exit(unlink(file.path(path, c("derived-slopes.csv", "derived-aucs.csv"))))
	data.table::merge.data.table(
	x = data.table::fread(file.path(path, "derived-slopes.csv")),
	y = data.table::fread(file.path(path, "derived-aucs.csv")),
	by = "egg_id"
	)
	}
	)),
	old = "egg_id",
	new = id_column
	),
	by = id_column
	)

	data.table::fwrite(
	x = data.table::merge.data.table(
	x = results,
	y = annot,
	by = c("CHROM", "POS", "ID", "REF", "ALT"), # intersect(names(results), names(annot))
	all.x = TRUE
	),
	file = output_results_file
	)

	```{r model-predict-plot}
	ggplot(
	data = good_res_models[
	j = rbindlist(list(
	(function(x) {
	data.table(age = seq(0, 16, 0.1))[
	j = bmi := if (inherits(x[[1]], "lme")) {
	exp(predict(x[[1]], .SD, level = 0))
	} else NA_real_,
	.SDcols = "age"
	]
	})(lme)
	)),
	by = list(sex, dataset, models)
	]
	) +
	aes(
	x = age,
	y = bmi,
	colour = c("0" = "Male", "1" = "Female")[as.character(sex)]
	) +
	geom_path() +
	labs(
	title = "Cubic Splines (Fixed Effects = Random Effects)",
	x = "AGE (years)",
	y = "BMI (kg/m\u00B2)",
	colour = NULL
	)
	```

	```{r}
	ggplot() +
	aes(x = age, y = bmi) +
	stat_smooth(
	data = pheno_dt_female,
	mapping = aes(colour = "Loess"),
	method = "loess", formula = y ~ x, linetype = 2, se = FALSE
	) +
	geom_path(
	data = data.table(
	age = seq(min(pheno_dt_female[["age"]]), max(pheno_dt_female[["age"]]), 0.1)
	)[
	j = bmi := exp(predict(res, .SD, level = 0)),
	.SDcols = "age"
	],
	mapping = aes(colour = "Cubic Splines (Random Linear Splines)"),
	) +
	labs(x = "AGE (years)", y = "BMI (kg/m\u00B2)", colour = "Model")
	```

	out <- switch(EXPR = from,
	"observed" = {
	data.table::as.data.table(fit[["data"]])[
	j = .SD,
	.SDcols = c(id_var, age_var, bmi_var)
	]
	},
	"predicted" = {
	data.table::setnames(
	x = data.table::data.table(
	egg_id = unique(fit[["groups"]][[id_var]]),
	egg_ageyears = list(seq(from = start, to = end, by = step))
	),
	old = c("egg_id", "egg_ageyears"),
	new = c(id_var, age_var)
	)[
	j = `names<-`(list(unlist(.SD)), age_var),
	.SDcols = c(age_var),
	by = c(id_var)
	][
	j = bmi := f(stats::predict(
	object = fit,
	newdata = .SD,
	interval = "prediction"
	))
	]
	}
	)

mcanouil / eggla Goto Github PK

eggla's Introduction

EGG Longitudinal Analysis

Installation

Docker Images

License

Code of Conduct

eggla's People

Contributors

Stargazers

Watchers

Forkers

eggla's Issues

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Footnotes

Bug description

eggla version output

Checklist

Bug description

eggla version output

Checklist

Recommend Projects

Recommend Topics

Recommend Org

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output

`eggla` version output