markvanderloo / simputation

Making imputation easy

License: GNU General Public License v3.0

R 97.51% C 1.46% Makefile 1.03%

imputation rstats r data-science officialstatistics

simputation's Issues

imputing new data

It would be great to be able to build imputation models on data set X₁ and apply them to other data sets (X₁, X₂, ...). From what I can tell, you can only impute the same data that you start with.
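A minimal base-R sketch of what this request amounts to, assuming a linear model serves as the imputation model. The data sets `X1` and `X2` and the helper `impute_with()` are hypothetical names for illustration and are not part of simputation.

```r
# Hypothetical example data: X1 is used for training, X2 is the data to impute.
X1 <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 8.1, 9.8))
X2 <- data.frame(x = c(6, 7, 8), y = c(NA, 14.2, NA))

# Fit the imputation model once, on the training data only.
model <- lm(y ~ x, data = X1)

# Hypothetical helper: fill missing values of `var` in `dat` from a fitted model.
impute_with <- function(dat, model, var) {
  miss <- is.na(dat[[var]])
  dat[[var]][miss] <- predict(model, newdata = dat[miss, , drop = FALSE])
  dat
}

# Apply the trained model to a different data set; observed values are kept.
X2_imputed <- impute_with(X2, model, "y")
```

The same fitted `model` could be reused for any further data set that has the predictor columns.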

impute_rhd: pool = "complete": If a record has multiple missings, imputations are taken from different donors

The package documentation states that for impute_rhd with pool = "complete", when "a record has multiple missings, all imputations are taken from a single donor". In practice this does not hold.

Example:
library(magrittr)    # provides %>%
library(simputation)

test <- data.frame(num1 = seq(1, 15), num2 = seq(1, 15))
test$num1[2:6] <- NA
test$num2[4:8] <- NA
set.seed(1000)
head(test %>% impute_rhd(num1 + num2 ~ 1, pool = "complete"), 10)
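For comparison, here is a base-R sketch of the documented behaviour (one donor per incomplete record). The helper `single_donor_hotdeck()` is hypothetical and only illustrates the expected result; it is not how simputation implements `impute_rhd`.

```r
# Same test data as in the example above.
test <- data.frame(num1 = seq(1, 15), num2 = seq(1, 15))
test$num1[2:6] <- NA
test$num2[4:8] <- NA

# Hypothetical helper, not simputation code: one donor per incomplete record.
single_donor_hotdeck <- function(dat) {
  donors <- which(complete.cases(dat))
  out <- dat
  for (i in which(!complete.cases(dat))) {
    d <- donors[sample.int(length(donors), 1)]   # draw one donor row
    for (v in names(dat)[is.na(unlist(dat[i, ]))]) {
      out[i, v] <- dat[d, v]                     # copy every missing field from it
    }
  }
  out
}

set.seed(1000)
imputed <- single_donor_hotdeck(test)

# Rows 4 to 6 miss both variables. Every donor has num1 == num2, so with a
# single donor per record the two imputed values must agree.
same_donor <- all(imputed$num1[4:6] == imputed$num2[4:6])
```

In the reported output, a record with differing imputed values for num1 and num2 would indicate that two donors were used.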

Error when interacting with tibble 1.x

From: https://blog.rstudio.com/2016/03/24/tibble-1-0-0/

Interacting with legacy code
A handful of functions don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame.

With this correction, impute_shd and impute_rhd functions are working as expected.

library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(simputation)

dat <- as_tibble(iris)
# empty a few fields
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA

dat %>% impute_shd(Sepal.Length ~ Sepal.Width + Species) %>% head(3)
#> Error: Can't use matrix or array for column indexing
dat %>% impute_rhd(Sepal.Length ~ Sepal.Width + Species) %>% head(3)
#> Error: Can't use matrix or array for column indexing

Issue with accessing the repo to install the package

@markvanderloo when I use this command to install the package from GitHub

library(devtools)
install_github("markvanderloo/simputation")

I get the following error:

Error: Failed to install 'unknown package' from GitHub:
  HTTP error 404.
  No commit found for the ref MultipleTimeSeries

  Did you spell the repo owner (`markvanderloo`) and repo name (`simputation`) correctly?
  - If spelling is correct, check that you have the required permissions to access the repo.

What is the alternative for using the updated version with all the fixes? The installed version (library(simputation)) still doesn't include the current fixes, so I tried to install it directly from GitHub.

Provide interface to mlr

Regarding our brief conversation after your tutorial: https://mlr.mlr-org.com/reference/impute.html

Grouping fails in case of median imputation

Simputation-Grouping fails in median imputation.pdf

rearrange documentation

Everything is in one huge doc now. Should split things up as in the list in the README file.

Add quantile regression methods

This generalizes the (grouped) median approach. Needs dependency on the quantreg package.

add support for glm

yep. should add it.

missForest `...` arguments

The ... argument is not passed to missForest.

Choose imputation backend

There are several excellent packages offering imputation methodology, for example VIM which scales well and offers very detailed control over imputation methods.

It would be good to make the impute_xxx functions flexible so one can choose an imputation backend and use the ellipsis to pass extra arguments to the backend. Possible options are at least:

kNN
hotdeck methods

Back tick `variable names` fail in imputation formulae

Here's a reprex that highlights the issue:

library(tibble)
library(magrittr)
library(simputation)
tibble::tribble(
  ~`a col`, ~`b col`,
         1, NA,
         3, 3,
         5, 5,
         7, 7
) %>% 
  as.data.frame() %>%
  impute_knn(`b col` ~ `a col`, k = 3)  
#> Error in `[.data.frame`(dat, imp_vars): undefined columns selected

tibble::tribble(
  ~a.col, ~b.col,
  1, NA,
  3, 3,
  5, 5,
  7, 7
) %>% 
  as.data.frame() %>%
  impute_knn(b.col ~ a.col, k = 3)  
#>   a.col b.col
#> 1     1     3
#> 2     3     3
#> 3     5     5
#> 4     7     7

I think the issue is caused by the extraction of formula components in get_imputed and get_predictors . I see 'get_imputed(dat, `b col` ~ `a col`)' returns "`b col`" rather than "b col" which is required to index the column.
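A small base-R sketch of the distinction, assuming the column names are extracted from the formula as text. It does not reproduce the internals of `get_imputed` or `get_predictors`, and it ignores the `.` shorthand and grouping terms; it only shows that `all.vars()` yields names that can index the data directly.

```r
f <- `b col` ~ `a col`

# Deparsing the whole formula keeps the backticks in the text ...
formula_text <- deparse(f)

# ... while all.vars() returns the bare names, usable for column indexing.
imputed    <- all.vars(f[[2]])  # left-hand side
predictors <- all.vars(f[[3]])  # right-hand side

dat <- data.frame(`a col` = c(1, 3, 5, 7), `b col` = c(NA, 3, 5, 7),
                  check.names = FALSE)

# Indexing with the bare names works without error.
selected <- dat[c(imputed, predictors)]
```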

RandomForest imputation

add a where argument

impute_lm(y~x|z, where x > 0)

would first select records where x>0, fit the model, impute, and return the whole dataset.
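A rough base-R sketch of the proposed behaviour, assuming `where` is passed as a logical vector. The helper `impute_lm_where()` is hypothetical; the actual interface and argument syntax are not decided in this issue.

```r
# Hypothetical data with a missing y inside and outside the selection x > 0.
dat <- data.frame(
  x = c(-2, -1, 1, 2, 3, 4, 5),
  y = c(NA, 0.5, 2.0, NA, 6.1, 7.9, 10.2)
)

# Hypothetical helper: `where` is a logical vector selecting the records to use.
impute_lm_where <- function(dat, formula, where) {
  var <- all.vars(formula[[2]])                          # single imputed variable
  sel <- !is.na(where) & where
  model <- lm(formula, data = dat[sel, , drop = FALSE])  # fit on the selection
  miss <- sel & is.na(dat[[var]])                        # impute only inside it
  dat[[var]][miss] <- predict(model, newdata = dat[miss, , drop = FALSE])
  dat                                                    # return all records
}

out <- impute_lm_where(dat, y ~ x, where = dat$x > 0)
```

Records outside the selection are returned untouched, so the missing y at x = -2 stays missing.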

Problem with impute_rhd (random hot deck) for multiple values

Hello,
I have a problem with the random hot deck in the simputation package.
I want to impute several linked variables (they sum to 100) with the random hot deck, using three criteria and the option "complete":
capit_imput=impute_rhd(capit_mod,CAPITAL_W+CAPITAL_PERSMOR+CAPITAL_HORSW_FAM+CAPITAL_HORW_HORSFAM~ SIEGE_DEP+CDEX_COEF2017+OTE64_COEF2017,pool="complete")

But for some units, it seems that only one value is imputed.

The first unit with missing values (with ind=TRUE) has to be imputed from two potential donors (SIEGE_DEP=10, CDEX_COEF2017=11, OTE64_COEF2017=1600).

It seems that the second unit is the donor, with CAPITAL_W=10, but CAPITAL_PERSMOR differs from 90.

Can you explain why?
And when I specify the model as
capit_imput=impute_rhd(capit_mod,CAPITAL_PERSMOR+CAPITAL_W+CAPITAL_HORSW_FAM+CAPITAL_HORW_HORSFAM~ SIEGE_DEP+CDEX_COEF2017+OTE64_COEF2017,pool="complete")

the record is imputed correctly, as if the order of the variables matters with the "complete" option.

Thanks for your help

impute_rhd fails with "incorrect number of probabilities" when using pool = "multivariate"

I am trying to use impute_rhd with distinct donor sets per missingness pattern, but I am stuck on this error message that I don't understand. I hope the reprex below helps, and I would appreciate any advice on how to work around this.

library(simputation)
iris_na <- mice::ampute(iris)
iris_na$amp |> impute_rhd(Sepal.Length + Sepal.Width ~ 1, pool = "multivariate")
#> Error in sample.int(length(x), size, replace, prob): incorrect number of probabilities
x <- iris_na$amp |> impute_rhd(Sepal.Length + Sepal.Width ~ 1)

Created on 2024-02-07 with reprex v2.0.2

providing a variable name by a variable containing the string

Hi,
I wonder if this package supports a !!as.name-like way of specifying a variable name through a variable that contains the name as a string.

Ex.
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]

instead of directly specifying the column name:
abundance ~ mean(abundance) | group

I'd like to do something like:
abundance_column_name <- "abundance"
!!as.name(abundance_column_name) ~ mean(!!as.name(abundance_column_name)) | group

Is this possible with the current implementation?

best,
hee jong kim
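One possible workaround, sketched in base R: build the formula from text. This assumes the imputation function accepts any formula object, and that the column name is syntactic (a name with spaces would need backticks in the text).

```r
abundance_column_name <- "abundance"

# Build the formula text from the string, then convert it to a formula.
f <- as.formula(sprintf("%s ~ mean(%s) | group",
                        abundance_column_name, abundance_column_name))

# The result is an ordinary formula object that can be passed wherever a
# formula is expected, for example as the formula argument of an impute_* call.
vars_used <- all.vars(f)
```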

make imputation methods aware of dplyr::group_by

'nuff said

Installation problem in R

When I try to install the package, the console shows that the package isn't available for R version 3.6.3:

install.packages("simputation", dependencies=TRUE)
Installing package into ‘/home/username/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘simputation’ is not available (for R version 3.6.3)

dead links in readme

The readme links to a blogpost and slides, but the links are currently dead.

The links in question:
http://www.markvanderloo.eu/yaRb/2016/09/13/announcing-the-simputation-package-make-imputation-simple/
http://www.markvanderloo.eu/files/statistics/user2017markvanderloo.pdf

allow custom aggregation method for randomForest

By predicting with predict.all=TRUE we get the matrix of predictions. Row-wise aggregation can then be customized (but it may cost a lot of memory, as the matrix of predictions has dimensions nrow(newdata) x ntree).
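A sketch of the row-wise aggregation step. To keep it self-contained, a made-up matrix stands in for the `individual` component that `predict()` on a randomForest object returns when `predict.all = TRUE`; the values are simulated, not real forest output.

```r
# Stand-in for predict(fit, newdata, predict.all = TRUE)$individual:
# one row per record in newdata, one column per tree (values are made up).
set.seed(1)
individual <- matrix(rnorm(5 * 100, mean = 10), nrow = 5, ncol = 100)

# For regression forests the usual aggregate is the mean over trees.
agg_mean <- rowMeans(individual)

# A custom row-wise aggregation, here the median over trees.
agg_median <- apply(individual, 1, median)
```

Any function that maps a numeric vector to a single value could replace `median` here.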

Using impute_mf with a tibble

It looks like the missForest package isn't tibble friendly. If you want to pass a tibble into the simputation function impute_mf, the following error pops up.

library(tidyverse)
library(simputation)
library(missForest)
#> Loading required package: randomForest
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
#> 
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:simputation':
#> 
#>     na.roughfix
#> The following object is masked from 'package:dplyr':
#> 
#>     combine
#> The following object is masked from 'package:ggplot2':
#> 
#>     margin
#> Loading required package: foreach
#> 
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#> 
#>     accumulate, when
#> Loading required package: itertools
#> Loading required package: iterators

df=tibble::tibble(y=rnorm(100),x1=rnorm(100),x2=rnorm(100))
for(i in seq_along(names(df))) df[sample(1:nrow(df),10),i]=NA
df
#> # A tibble: 100 x 3
#>           y      x1      x2
#>       <dbl>   <dbl>   <dbl>
#>  1   0.252    0.418  1.08  
#>  2   0.665    0.565 -1.73  
#>  3   0.592   -0.264 -0.356 
#>  4  NA       -1.36   0.884 
#>  5   0.0819  -1.42   0.481 
#>  6   0.407   NA      0.732 
#>  7  -0.750    0.862  0.409 
#>  8   1.21    -1.33  -0.0501
#>  9  NA        0.644  0.206 
#> 10   1.38    -2.77  -0.519 
#> # ... with 90 more rows

# This will throw an error because df is a tibble and missForest
# expects a data frame as an input for its xmis argument

df %>% impute_mf(formula=y~.)  
#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA

#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA

#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA
#>   missForest iteration 1 in progress...
#> Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
#> mtry, : The response has five or fewer unique values. Are you sure you want
#> to do regression?
#> Warning: Could not execute missForest::missForest: length of response must be the same as predictors
#>  Returning original data
#> # A tibble: 100 x 3
#>           y      x1      x2
#>       <dbl>   <dbl>   <dbl>
#>  1   0.252    0.418  1.08  
#>  2   0.665    0.565 -1.73  
#>  3   0.592   -0.264 -0.356 
#>  4  NA       -1.36   0.884 
#>  5   0.0819  -1.42   0.481 
#>  6   0.407   NA      0.732 
#>  7  -0.750    0.862  0.409 
#>  8   1.21    -1.33  -0.0501
#>  9  NA        0.644  0.206 
#> 10   1.38    -2.77  -0.519 
#> # ... with 90 more rows

# This runs okay because the tibble is converted to a data.frame
# before it is passed to the missForest function.

df_okay %>% as.data.frame() %>% impute_mf(y~.)
#> Error in eval(lhs, parent, parent): object 'df_okay' not found

head(df_okay)
#> Error in head(df_okay): object 'df_okay' not found

# This adjustment to the impute_mf function makes the initial code
# run without throwing an error. 

impute_mf_bcj<-function (dat, formula, ...) 
{
  stopifnot(inherits(formula, "formula"))
  if (simputation:::not_installed("missForest")) 
    return(dat)
  imputed <- simputation:::get_imputed(formula, dat)
  predictors <- simputation:::get_predictors(formula, dat, ...)
  vars <- unique(c(imputed, predictors))
  imp <- tryCatch(missForest::missForest(as.data.frame(dat[vars]), ...)[[1]], 
    error = function(e) {
      simputation:::warnf("Could not execute missForest::missForest: %s\n Returning original data", 
        e$message)
      dat
    })
  if (length(imputed) == 0) {
    dat[vars] <- imp[vars]
  }
  else {
    dat[imputed] <- imp[imputed]
  }
  dat
}

df %>% impute_mf_bcj(formula=y~.)
#>   missForest iteration 1 in progress...done!
#>   missForest iteration 2 in progress...done!
#>   missForest iteration 3 in progress...done!
#>   missForest iteration 4 in progress...done!
#> # A tibble: 100 x 3
#>          y      x1      x2
#>      <dbl>   <dbl>   <dbl>
#>  1  0.252    0.418  1.08  
#>  2  0.665    0.565 -1.73  
#>  3  0.592   -0.264 -0.356 
#>  4  0.775   -1.36   0.884 
#>  5  0.0819  -1.42   0.481 
#>  6  0.407   NA      0.732 
#>  7 -0.750    0.862  0.409 
#>  8  1.21    -1.33  -0.0501
#>  9  1.07     0.644  0.206 
#> 10  1.38    -2.77  -0.519 
#> # ... with 90 more rows

Created on 2019-01-20 by the reprex package (v0.2.1)

design decision for EMB/missForest etc

Interpretation of formula objects for methods that impute every variable simultaneously requires care since there is no difference between predictors and predicted. I see two options.

1. Every variable in the formula gets imputed.
2. We allow this:

x + y ~ z + w + q

so that (x,y,z,w,q) are used in the model, but only (x,y) are copied into the output dataset. Doing

 ~ x + y

would impute x and y but without taking z, w and q into account when modeling the imputations.
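The second option could be sketched in base R as follows. The helper `formula_roles()` is hypothetical and only shows how the two formula shapes might be told apart; it is not the package's implementation.

```r
# Hypothetical helper: split a formula into variables to impute and to model.
formula_roles <- function(f) {
  stopifnot(inherits(f, "formula"))
  if (length(f) == 2) {
    # One-sided formula (~ x + y): every variable is both modeled and imputed.
    vars <- all.vars(f[[2]])
    list(imputed = vars, model = vars)
  } else {
    # Two-sided formula: impute the left-hand side, model with both sides.
    imputed <- all.vars(f[[2]])
    list(imputed = imputed, model = unique(c(imputed, all.vars(f[[3]]))))
  }
}

two_sided <- formula_roles(x + y ~ z + w + q)
one_sided <- formula_roles(~ x + y)
```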

EM imputation

hot-deck methods do not preserve attributes

So information from e.g. dplyr::group_by is lost on imputation.

better error messages for data types pretending to be data.frame but really aren't

See #17

add random component

either parametric e ~ N(mu, sigma) or nonparametric e ~ residuals(m)
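A base-R sketch of both variants for a linear model. It assumes mu = 0 and takes sigma to be the residual standard error of the fit; the data are simulated for illustration only.

```r
set.seed(42)
# Hypothetical data with missing y.
dat <- data.frame(x = 1:20)
dat$y <- 2 * dat$x + rnorm(20)
dat$y[c(3, 11, 17)] <- NA

m    <- lm(y ~ x, data = dat)   # lm drops incomplete records by default
miss <- is.na(dat$y)
pred <- predict(m, newdata = dat[miss, , drop = FALSE])

# Parametric: add e ~ N(0, sigma), with sigma the residual standard error.
imp_parametric <- pred + rnorm(sum(miss), mean = 0, sd = summary(m)$sigma)

# Nonparametric: add residuals resampled from the fitted model.
imp_resampled <- pred + sample(residuals(m), sum(miss), replace = TRUE)
```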

Native groupwise imputation by `|` syntax in formula object

'nuff said

Imputation fails with long formulas

Thanks for the really nicely designed package! In trying it out, I noticed the parsing in impute_lm or impute_rf will fail if the formula is long. Here's an example where I've made the iris dataset have long variable names.

I assume this is happening because the formula is hitting the width.cutoff of deparse, though I haven't explored in detail.

library(simputation)
# Demo from the docs:
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA

# But with long variable names:
names(dat) <- c(
  "this_is_very_long_name_for_sepal_length",
  "this_is_very_long_name_for_sepal_width",
  "this_is_very_long_name_for_petal_length",
  "this_is_very_long_name_for_petal_width",
  "this_is_very_long_name_for_species"
  )

# The basic lm() call is fine:
form <- this_is_very_long_name_for_sepal_length ~ 
  this_is_very_long_name_for_sepal_width + 
  this_is_very_long_name_for_species + 
  this_is_very_long_name_for_petal_width

lm(form, data=dat)
#> 
#> Call:
#> lm(formula = form, data = dat)
#> 
#> Coefficients:
#>                                  (Intercept)  
#>                                       2.5564  
#>       this_is_very_long_name_for_sepal_width  
#>                                       0.6912  
#> this_is_very_long_name_for_speciesversicolor  
#>                                       0.9590  
#>  this_is_very_long_name_for_speciesvirginica  
#>                                       1.2029  
#>       this_is_very_long_name_for_petal_width  
#>                                       0.3816

# But this fails:
da1 <- impute_lm(dat, form)
#> Error in parse(text = x, keep.source = FALSE): <text>:2:0: unexpected end of input
#> 1: this_is_very_long_name_for_sepal_length ~ this_is_very_long_name_for_sepal_width + this_is_very_long_name_for_species + 
#>    ^

Created on 2019-12-10 by the reprex package (v0.3.0)
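The suspected cause can be checked in base R without simputation. This sketch only shows the `deparse()` behaviour and one common way around it (collapsing the pieces before parsing); whether the package parses deparsed formulas in exactly this way is an assumption.

```r
form <- this_is_very_long_name_for_sepal_length ~
  this_is_very_long_name_for_sepal_width +
  this_is_very_long_name_for_species +
  this_is_very_long_name_for_petal_width

# deparse() splits long expressions into several strings.
parts <- deparse(form)
n_parts <- length(parts)

# Collapsing the pieces first gives text that parses back to a formula
# with the same variables.
txt   <- paste(parts, collapse = " ")
form2 <- eval(parse(text = txt))
```

Code that passes only one element of `parts` to `parse()` would see an incomplete expression, which matches the "unexpected end of input" error above.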

fault in predictive mean matching

The impute_pmm method uses the distance between the observed donor values and the imputed (predicted) values, while it should use the distance between the predicted donor values and the imputed (predicted) values.

Example provided by Susie Jentoft by e-mail:

dat <- iris[1:15,]
dat[8,1] <- NA
impute_pmm(dat, Sepal.Length ~ Sepal.Width)

the value 5 is imputed, while 4.6 is expected.
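A base-R sketch of predictive mean matching as described above, applied to the same data. It is a plain illustration of the technique with ties resolved by taking the first nearest donor, not the package's code.

```r
dat <- iris[1:15, ]
dat[8, 1] <- NA

obs  <- !is.na(dat$Sepal.Length)   # records that can act as donors
m    <- lm(Sepal.Length ~ Sepal.Width, data = dat)
beta <- coef(m)

# Predicted values for the donors and for the record to impute.
pred_donor <- beta[1] + beta[2] * dat$Sepal.Width[obs]
pred_recip <- beta[1] + beta[2] * dat$Sepal.Width[!obs]

# Match on predicted values, then impute the donor's observed value.
nearest <- which.min(abs(pred_donor - pred_recip))
imputed <- dat$Sepal.Length[obs][nearest]
```

With these data the nearest donor is record 7, which has the same Sepal.Width as record 8, so `imputed` equals 4.6.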

dplyr groups are ignored

The compatibility with dplyr groups broke due to a change in the implementation of dplyr.

Error: Subscript `ina` is a matrix

Hi,

I’m getting the following error, and I wonder if you could help me understand its cause:

library(dplyr)
library(simputation)

kdata <- tribble(
    ~age, ~ct, ~pfratio, ~bmi,
    56,   86,   130,   30,
    58,   NA,   110,   NA,
    78,   NA,   NA,    28,
    54,   NA,   NA,    NA,
    45,   45,   230,   28,
    54,   45,   NA,    29
)

impute_knn(
    kdata,
        bmi ~ .,
        pool = "univariate"
    )
#> Warning: Requested k = 5 while 4 donors present. Using k = 4.
#> Error: Subscript `ina` is a matrix, the data `donors[ina]` must have size 1.

Created on 2021-05-30 by the reprex package (v2.0.0)

The same happens if I add more variables to the formula’s left-hand side (e.g., bmi + ct + pfratio ~ .).

I understand the warning that appears in this reprex. However, my actual data is in the hundreds of observations, yet it does have its fair share of NAs, and occasionally there can be up to 3 NAs per row. Is the error related to NAs in predictor variables?

Thanks!

configure trainingsets with na.omit argument where relevant

error with using weight variable

When I try the following command

imp_simpuatation<-impute_rhd(
  data,
  GI~age+sex,
  pool = "univariate",
  prob = data$SamplingWeight
)

I get the following error:

Error in impute_rhd(data, earnings ~ AG + sex, pool = "univariate", prob = data$SamplingWeight) : 
  length(prob) != nrow(dat) is not TRUE

I stepped through the function with a debugger and I think there is a bug in the package, which I believe is here:

I believe the highlighted line should be
stopifnot(length(prob) == nrow(dat))
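The reported error message is consistent with an inverted comparison in the check. The following base-R sketch reproduces such a message under that assumption; the highlighted source line is not shown in this report, so the inverted check here is a guess at what it looks like.

```r
dat  <- data.frame(a = 1:4)
prob <- c(0.1, 0.2, 0.3, 0.4)   # one weight per record, so the input is valid

# Assumed inverted check: it fails exactly when the lengths DO match.
msg <- tryCatch(
  {
    stopifnot(length(prob) != nrow(dat))
    "check passed"
  },
  error = function(e) conditionMessage(e)
)

# Proposed check: passes for valid input.
stopifnot(length(prob) == nrow(dat))
```

Here `msg` holds the error text produced by the inverted check for valid input.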

add proper ratio-imputation

Ratio imputation is based on a weighted single regression Y = bX with weights 1/X. It would be nice to have an impute_ratio function. Also, lm makes a different choice in selecting data under missingness than

b <- mean(Y,na.rm=TRUE)/mean(X,na.rm=TRUE)
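A base-R sketch of the difference, using made-up data with strictly positive X. It shows that the weighted regression through the origin reproduces the ratio estimate on jointly observed records, while the ratio of separately computed means uses different records for X and Y.

```r
# Hypothetical data, X strictly positive, with missing values in both columns.
X <- c(10, 20, 30, 40, NA, 60)
Y <- c(12, 19, NA, 41, 50, 63)

# Ratio estimate on records where both X and Y are observed.
cc <- !is.na(X) & !is.na(Y)
b_ratio <- sum(Y[cc]) / sum(X[cc])

# Same estimate from a weighted regression through the origin, weights 1/X.
# lm() drops incomplete records by default, so it uses the same records.
b_lm <- unname(coef(lm(Y ~ 0 + X, weights = 1 / X)))

# The ratio of separately computed means uses different records per variable.
b_means <- mean(Y, na.rm = TRUE) / mean(X, na.rm = TRUE)

# Ratio imputation: fill missing Y where X is observed.
Y_imputed <- ifelse(is.na(Y) & !is.na(X), b_ratio * X, Y)
```

The equivalence follows from the weighted least squares solution b = sum(w X Y) / sum(w X^2), which reduces to sum(Y) / sum(X) for w = 1/X.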

Use on an image?

I need to teach imputation soon and am happy to have found this package!

I've been trying to think up a very simple, compelling, and visual example to work through and someone suggested working with an image, where certain pixels are missing.

Have you ever done that (or seen a nice example somewhere)? Even better, using this package?

If not, do you have a hunch whether this will work out nicely? Is it clear to you in advance that this is either a great or terrible idea? Thanks for any wisdom.

crash in impute_proxy with character variable

d <- data.frame(x=c(NA,'a','b'),stringsAsFactors = FALSE)
impute_proxy(d,x ~ "w")