simputation's People

Contributors

edwindj, karldw, markvanderloo, sfallahpour

simputation's Issues

imputing new data

It would be great to be able to build imputation models on data set X1 and apply them to other data sets (X1, X2, ...). From what I can tell, you can only impute the same data that you start with.
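
A minimal base-R sketch of the requested workflow, not simputation's API: fit a model on one data set and use it to fill missing values in another. The data frames X1 and X2 and the columns x and y are made up for illustration.

```r
# Hypothetical training data X1 and target data X2 (illustrative names).
set.seed(1)
X1 <- data.frame(x = 1:20)
X1$y <- 2 * X1$x + rnorm(20)
X2 <- data.frame(x = c(3, 8, 15), y = c(NA, 16.5, NA))

# Fit the imputation model on X1 only.
fit <- lm(y ~ x, data = X1)

# Apply the stored model to the missing cells of X2; observed values stay as they are.
miss <- is.na(X2$y)
X2$y[miss] <- predict(fit, newdata = X2[miss, , drop = FALSE])
X2
```

The fitted object can be kept and reused on further data sets (X3, ...) in the same way.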

impute_rhd: pool = "complete": If a record has multiple missings, imputations are taken from different donors

The package documentation states that for impute_rhd with pool = "complete", when "a record has multiple missings, all imputations are taken from a single donor". In practice this does not hold.

Example:
test <- data.frame(seq(1,15), seq(1,15))
colnames(test) <- c("num1", "num2")
test$num1[2:6] <- NA
test$num2[4:8] <- NA
set.seed(1000)
head(test %>% impute_rhd(num1 + num2 ~ 1, pool = "complete"), 10)

image
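
For comparison, here is a base-R sketch of the documented behaviour, where each incomplete record takes all its missing values from one randomly chosen complete donor. The function name impute_single_donor is hypothetical and this is not the package's implementation.

```r
# Hypothetical helper: one random complete donor per incomplete record.
impute_single_donor <- function(dat, vars) {
  complete <- stats::complete.cases(dat[vars])
  donors <- which(complete)
  stopifnot(length(donors) > 0)
  for (i in which(!complete)) {
    # draw a single donor row for this recipient
    d <- donors[sample.int(length(donors), 1)]
    # copy every missing field of the recipient from that same donor
    miss <- vars[is.na(unlist(dat[i, vars]))]
    dat[i, miss] <- dat[d, miss]
  }
  dat
}

test <- data.frame(num1 = seq(1, 15), num2 = seq(1, 15))
test$num1[2:6] <- NA
test$num2[4:8] <- NA
set.seed(1000)
out <- impute_single_donor(test, c("num1", "num2"))
head(out, 10)
```

Because every donor in this example has num1 equal to num2, rows 4 to 6 (where both values are missing) end up with num1 equal to num2 when a single donor is used.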

Error by Interacting with tibble 1.x

From: https://blog.rstudio.com/2016/03/24/tibble-1-0-0/

Interacting with legacy code
A handful of functions don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame.

With this workaround, the impute_shd and impute_rhd functions work as expected. Without it, they fail as shown below.

library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(simputation)

dat <- as_tibble(iris)
# empty a few fields
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA

dat %>% impute_shd(Sepal.Length ~ Sepal.Width + Species) %>% head(3)
#> Error: Can't use matrix or array for column indexing
dat %>% impute_rhd(Sepal.Length ~ Sepal.Width + Species) %>% head(3)
#> Error: Can't use matrix or array for column indexing

Issue with accessing the repo to install the package

@markvanderloo When I use the following commands to install the package from GitHub

library(devtools)
install_github("markvanderloo/simputation")

I get the following error:

Error: Failed to install 'unknown package' from GitHub:
  HTTP error 404.
  No commit found for the ref MultipleTimeSeries

  Did you spell the repo owner (`markvanderloo`) and repo name (`simputation`) correctly?
  - If spelling is correct, check that you have the required permissions to access the repo.

Is there an alternative way to get the updated version with all the fixes? The released version loaded by library(simputation) still doesn't include the current fixes, so I tried to install it directly from GitHub.

Choose imputation backend

There are several excellent packages offering imputation methodology, for example VIM which scales well and offers very detailed control over imputation methods.

It would be good to make the impute_xxx functions flexible so one can choose an imputation backend and use the ellipsis to pass extra arguments to the backend. Possible options are at least:

  • kNN
  • hotdeck methods

Back tick `variable names` fail in imputation formulae

Here's a reprex that highlights the issue:

library(tibble)
library(magrittr)
library(simputation)
tibble::tribble(
  ~`a col`, ~`b col`,
         1, NA,
         3, 3,
         5, 5,
         7, 7
) %>% 
  as.data.frame() %>%
  impute_knn(`b col` ~ `a col`, k = 3)  
#> Error in `[.data.frame`(dat, imp_vars): undefined columns selected

tibble::tribble(
  ~a.col, ~b.col,
  1, NA,
  3, 3,
  5, 5,
  7, 7
) %>% 
  as.data.frame() %>%
  impute_knn(b.col ~ a.col, k = 3)  
#>   a.col b.col
#> 1     1     3
#> 2     3     3
#> 3     5     5
#> 4     7     7

I think the issue is caused by the extraction of formula components in get_imputed and get_predictors. I see that 'get_imputed(dat, `b col` ~ `a col`)' returns "`b col`" (with back ticks) rather than "b col", which is what is required to index the column.
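
One way to obtain names without back ticks is base R's all.vars(), which returns the variable names of an expression as plain strings. The sketch below only illustrates that idea on the reprex data. It is not the package's get_imputed or get_predictors, and it does not handle a `.` on the right-hand side.

```r
# Data frame with non-syntactic column names, as in the reprex above.
dat <- data.frame(`a col` = c(1, 3, 5, 7), `b col` = c(NA, 3, 5, 7),
                  check.names = FALSE)
f <- `b col` ~ `a col`

imputed <- all.vars(f[[2]])     # left-hand side:  "b col"
predictors <- all.vars(f[[3]])  # right-hand side: "a col"

# These strings index the columns directly.
dat[c(imputed, predictors)]
```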

add a where argument

impute_lm(y~x|z, where x > 0)

would first select records where x>0, fit the model, impute, and return the whole dataset.
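
A rough base-R sketch of the proposed behaviour, using lm() directly. The name impute_lm_where and its argument order are hypothetical, and the grouping part (| z) is left out.

```r
# Hypothetical helper: fit and impute only where the condition holds,
# then return the whole data set.
impute_lm_where <- function(dat, formula, where) {
  # evaluate the condition inside the data; NA counts as "not selected"
  sel <- eval(substitute(where), dat, parent.frame())
  sel <- !is.na(sel) & sel
  target <- all.vars(formula[[2]])[1]
  # fit on the selected records only (lm drops incomplete rows itself)
  fit <- lm(formula, data = dat[sel, , drop = FALSE])
  # impute missing targets among the selected records
  fill <- sel & is.na(dat[[target]])
  dat[[target]][fill] <- predict(fit, newdata = dat[fill, , drop = FALSE])
  dat
}

dat <- data.frame(x = c(-2, -1, 1, 2, 3, 4), y = c(NA, 5, 2, NA, 6, 8))
out <- impute_lm_where(dat, y ~ x, x > 0)
out
```

In this toy data the missing y at x = 2 is imputed from the records with x > 0, while the missing y at x = -2 is left untouched.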

Problem with impute_rhd (random hot deck) for multiple values

Hello,
I have a problem with the random hot deck in the simputation package.
I want to impute several variables that are linked (they sum to 100) with the random hot deck, using three grouping criteria and the option pool = "complete":
capit_imput=impute_rhd(capit_mod,CAPITAL_W+CAPITAL_PERSMOR+CAPITAL_HORSW_FAM+CAPITAL_HORW_HORSFAM~ SIEGE_DEP+CDEX_COEF2017+OTE64_COEF2017,pool="complete")

But for some units, it seems that only one value is imputed.

The first unit with missing values (ind=TRUE) has to be imputed from one of two potential donors (SIEGE_DEP=10, CDEX_COEF2017=11, OTE64_COEF2017=1600).
image

It seems that the second unit is the donor, with CAPITAL_W=10, but the imputed CAPITAL_PERSMOR is not 90.

Can you explain why?
And when I specify the model as
capit_imput=impute_rhd(capit_mod,CAPITAL_PERSMOR+CAPITAL_W+CAPITAL_HORSW_FAM+CAPITAL_HORW_HORSFAM~ SIEGE_DEP+CDEX_COEF2017+OTE64_COEF2017,pool="complete")

the record is imputed correctly, as if the order of the variables mattered with the "complete" option.
image

Thanks for your help

impute_rhd fails with "incorrect number of probabilities" when using pool = "multivariate"

I am trying to use impute_rhd with distinct donor sets per missingness pattern, but I am stuck on this error message that I don't understand. I hope the reprex below helps, and would appreciate any advice on how to work around this.

library(simputation)
iris_na <- mice::ampute(iris)
iris_na$amp |> impute_rhd(Sepal.Length + Sepal.Width ~ 1, pool = "multivariate")
#> Error in sample.int(length(x), size, replace, prob): incorrect number of probabilities
x <- iris_na$amp |> impute_rhd(Sepal.Length + Sepal.Width ~ 1)

Created on 2024-02-07 with reprex v2.0.2

providing a variable name by a variable containing the string

Hi,
I wonder if this package can handle a !!as.name-like way of specifying a variable, where the column name is supplied through a variable that holds it as a string.

Ex.
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]

instead of directly specifying the column name:
abundance ~ mean(abundance) | group

I'd like to do something like:
abundance_column_name <- "abundance"
!!as.name(abundance_column_name) ~ mean(!!as.name(abundance_column_name)) | group

Is this possible with the current implementation?

best,
hee jong kim
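
One general R approach, independent of whether the package supports !! itself, is to build the formula from strings before passing it on. The sketch below uses only base R. Whether a given impute_ function accepts the result in the same way as a literal formula is an assumption to verify.

```r
abundance_column_name <- "abundance"

# Assemble the formula text, then convert it to a formula object.
# Names that are not syntactically valid would need back ticks in the string.
f <- as.formula(sprintf("%s ~ mean(%s) | group",
                        abundance_column_name, abundance_column_name))
f
```

The resulting object has the same structure as abundance ~ mean(abundance) | group typed by hand.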

Installation problem in R

When I try to install the package, the console reports that the package isn't available for R version 3.6.3.

install.packages("simputation", dependencies=TRUE)
Installing package into ‘/home/username/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘simputation’ is not available (for R version 3.6.3)

allow custom aggregation method for randomForest

By predicting with predict.all=TRUE we get the full matrix of per-tree predictions. Row-wise aggregation can then be customized, but it may cost a lot of memory because the prediction matrix has dimensions nrow(newdata) x ntree.

Using impute_mf with a tibble

It looks like the missForest package isn't tibble friendly. If you want to pass a tibble into the simputation function impute_mf, the following error pops up.

library(tidyverse)
library(simputation)
library(missForest)
#> Loading required package: randomForest
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
#> 
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:simputation':
#> 
#>     na.roughfix
#> The following object is masked from 'package:dplyr':
#> 
#>     combine
#> The following object is masked from 'package:ggplot2':
#> 
#>     margin
#> Loading required package: foreach
#> 
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#> 
#>     accumulate, when
#> Loading required package: itertools
#> Loading required package: iterators

df=tibble::tibble(y=rnorm(100),x1=rnorm(100),x2=rnorm(100))
for(i in seq_along(names(df))) df[sample(1:nrow(df),10),i]=NA
df
#> # A tibble: 100 x 3
#>           y      x1      x2
#>       <dbl>   <dbl>   <dbl>
#>  1   0.252    0.418  1.08  
#>  2   0.665    0.565 -1.73  
#>  3   0.592   -0.264 -0.356 
#>  4  NA       -1.36   0.884 
#>  5   0.0819  -1.42   0.481 
#>  6   0.407   NA      0.732 
#>  7  -0.750    0.862  0.409 
#>  8   1.21    -1.33  -0.0501
#>  9  NA        0.644  0.206 
#> 10   1.38    -2.77  -0.519 
#> # ... with 90 more rows

# This will throw an error because df is a tibble and missForest
# expects a data frame as an input for its xmis argument

df %>% impute_mf(formula=y~.)  
#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA

#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA

#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA
#>   missForest iteration 1 in progress...
#> Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
#> mtry, : The response has five or fewer unique values. Are you sure you want
#> to do regression?
#> Warning: Could not execute missForest::missForest: length of response must be the same as predictors
#>  Returning original data
#> # A tibble: 100 x 3
#>           y      x1      x2
#>       <dbl>   <dbl>   <dbl>
#>  1   0.252    0.418  1.08  
#>  2   0.665    0.565 -1.73  
#>  3   0.592   -0.264 -0.356 
#>  4  NA       -1.36   0.884 
#>  5   0.0819  -1.42   0.481 
#>  6   0.407   NA      0.732 
#>  7  -0.750    0.862  0.409 
#>  8   1.21    -1.33  -0.0501
#>  9  NA        0.644  0.206 
#> 10   1.38    -2.77  -0.519 
#> # ... with 90 more rows

# This runs okay because the tibble is converted to a data.frame
# before it is passed to the missForest function.

df_okay <- df %>% as.data.frame() %>% impute_mf(y~.)

head(df_okay)

# This adjustment to the impute_mf function makes the initial code
# run without throwing an error. 

impute_mf_bcj<-function (dat, formula, ...) 
{
  stopifnot(inherits(formula, "formula"))
  if (simputation:::not_installed("missForest")) 
    return(dat)
  imputed <- simputation:::get_imputed(formula, dat)
  predictors <- simputation:::get_predictors(formula, dat, ...)
  vars <- unique(c(imputed, predictors))
  imp <- tryCatch(missForest::missForest(as.data.frame(dat[vars]), ...)[[1]], 
    error = function(e) {
      # warnf is internal to simputation, so use base warning() here
      warning(sprintf("Could not execute missForest::missForest: %s\n Returning original data", 
        e$message), call. = FALSE)
      dat
    })
  if (length(imputed) == 0) {
    dat[vars] <- imp[vars]
  }
  else {
    dat[imputed] <- imp[imputed]
  }
  dat
}

df %>% impute_mf_bcj(formula=y~.)
#>   missForest iteration 1 in progress...done!
#>   missForest iteration 2 in progress...done!
#>   missForest iteration 3 in progress...done!
#>   missForest iteration 4 in progress...done!
#> # A tibble: 100 x 3
#>          y      x1      x2
#>      <dbl>   <dbl>   <dbl>
#>  1  0.252    0.418  1.08  
#>  2  0.665    0.565 -1.73  
#>  3  0.592   -0.264 -0.356 
#>  4  0.775   -1.36   0.884 
#>  5  0.0819  -1.42   0.481 
#>  6  0.407   NA      0.732 
#>  7 -0.750    0.862  0.409 
#>  8  1.21    -1.33  -0.0501
#>  9  1.07     0.644  0.206 
#> 10  1.38    -2.77  -0.519 
#> # ... with 90 more rows

Created on 2019-01-20 by the reprex package (v0.2.1)

design decision for EMB/missForest etc

Interpretation of formula objects for methods that impute every variable simultaneously requires care since there is no difference between predictors and predicted. I see two options.

  1. Every variable in the formula gets imputed.
  2. We allow this:
x + y ~ z + w + q

so that (x,y,z,w,q) are used in the model, but only (x,y) are copied into the output dataset. Doing

 ~ x + y

would impute x and y but without taking z, w and q into account when modeling the imputations.

Imputation fails with long formulas

Thanks for the really nicely designed package! In trying it out, I noticed the parsing in impute_lm or impute_rf will fail if the formula is long. Here's an example where I've made the iris dataset have long variable names.

I assume this is happening because the formula is hitting the width.cutoff of deparse, though I haven't explored in detail.

library(simputation)
# Demo from the docs:
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA

# But with long variable names:
names(dat) <- c(
  "this_is_very_long_name_for_sepal_length",
  "this_is_very_long_name_for_sepal_width",
  "this_is_very_long_name_for_petal_length",
  "this_is_very_long_name_for_petal_width",
  "this_is_very_long_name_for_species"
  )

# The basic lm() call is fine:
form <- this_is_very_long_name_for_sepal_length ~ 
  this_is_very_long_name_for_sepal_width + 
  this_is_very_long_name_for_species + 
  this_is_very_long_name_for_petal_width

lm(form, data=dat)
#> 
#> Call:
#> lm(formula = form, data = dat)
#> 
#> Coefficients:
#>                                  (Intercept)  
#>                                       2.5564  
#>       this_is_very_long_name_for_sepal_width  
#>                                       0.6912  
#> this_is_very_long_name_for_speciesversicolor  
#>                                       0.9590  
#>  this_is_very_long_name_for_speciesvirginica  
#>                                       1.2029  
#>       this_is_very_long_name_for_petal_width  
#>                                       0.3816

# But this fails:
da1 <- impute_lm(dat, form)
#> Error in parse(text = x, keep.source = FALSE): <text>:2:0: unexpected end of input
#> 1: this_is_very_long_name_for_sepal_length ~ this_is_very_long_name_for_sepal_width + this_is_very_long_name_for_species + 
#>    ^

Created on 2019-12-10 by the reprex package (v0.3.0)
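
The suspected mechanism can be checked in base R: deparse() returns a character vector with one element per output line, so a long formula is split into several strings, and parsing only the first one fails. Collapsing the pieces before parsing avoids that. This is a sketch of the general pattern, not the package's actual fix.

```r
form <- this_is_very_long_name_for_sepal_length ~
  this_is_very_long_name_for_sepal_width +
  this_is_very_long_name_for_species +
  this_is_very_long_name_for_petal_width

# deparse() breaks long expressions into several strings.
pieces <- deparse(form)
length(pieces)

# Collapsing into one string before parsing keeps the formula intact.
one_string <- paste(pieces, collapse = " ")
rebuilt <- eval(parse(text = one_string, keep.source = FALSE)[[1]])
all.vars(rebuilt)
```

In R 4.0.0 and later, deparse1() performs the collapse in one call.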

fault in predictive mean matching

The impute_pmm method matches on the distance between observed donor values and the predicted values for the records to impute, while it should match on the distance between predicted donor values and the predicted values for the records to impute.

Example provided by Susie Jentoft by e-mail:

dat <- iris[1:15,]
dat[8,1] <- NA
impute_pmm(dat, Sepal.Length ~ Sepal.Width)

The value 5 is imputed, while 4.6 is expected.
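
The intended matching rule can be sketched in base R with lm(): predict for donors and recipients alike, find the donor whose predicted value is closest, and copy that donor's observed value. The function pmm_sketch is hypothetical and picks the first donor on ties. It is not the package's implementation.

```r
# Hypothetical helper illustrating predictive mean matching with a linear model.
pmm_sketch <- function(dat, formula) {
  target <- all.vars(formula[[2]])[1]
  fit <- lm(formula, data = dat)
  recipients <- which(is.na(dat[[target]]))
  donors <- which(!is.na(dat[[target]]))
  # predictions for donors and recipients come from the same model
  pred_donor <- predict(fit, newdata = dat[donors, , drop = FALSE])
  pred_recip <- predict(fit, newdata = dat[recipients, , drop = FALSE])
  for (j in seq_along(recipients)) {
    # donor whose PREDICTED value is closest to the recipient's prediction
    nearest <- which.min(abs(pred_donor - pred_recip[j]))
    if (length(nearest) == 1) {
      # impute the donor's OBSERVED value
      dat[[target]][recipients[j]] <- dat[[target]][donors[nearest]]
    }
  }
  dat
}

dat <- iris[1:15, ]
dat[8, 1] <- NA
out <- pmm_sketch(dat, Sepal.Length ~ Sepal.Width)
out[8, "Sepal.Length"]
```

On the example above, the closest donors are the records that share the recipient's Sepal.Width of 3.4, so the imputed value comes from one of them rather than from the donor whose observed Sepal.Length happens to be nearest to the prediction.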

Error: Subscript `ina` is a matrix

Hi,

I’m getting the following error, and I wonder if you could help me understand its cause:

library(dplyr)
library(simputation)

kdata <- tribble(
    ~age, ~ct, ~pfratio, ~bmi,
    56,   86,   130,   30,
    58,   NA,   110,   NA,
    78,   NA,   NA,    28,
    54,   NA,   NA,    NA,
    45,   45,   230,   28,
    54,   45,   NA,    29
)

impute_knn(
    kdata,
        bmi ~ .,
        pool = "univariate"
    )
#> Warning: Requested k = 5 while 4 donors present. Using k = 4.
#> Error: Subscript `ina` is a matrix, the data `donors[ina]` must have size 1.

Created on 2021-05-30 by the reprex package (v2.0.0)

The same happens if I add more variables to the formula’s left-hand side (e.g., bmi + ct + pfratio ~ .).

I understand the warning that appears in this reprex. However, my actual data is in the hundreds of observations, yet it does have its fair share of NAs, and occasionally there can be up to 3 NAs per row. Is the error related to NAs in predictor variables?

Thanks!

error with using weight variable

When I try the following command

imp_simpuatation<-impute_rhd(
  data,
  GI~age+sex,
  pool = "univariate",
  prob = data$SamplingWeight
)

I get the following error:

Error in impute_rhd(data, earnings ~ AG + sex, pool = "univariate", prob = data$SamplingWeight) : 
  length(prob) != nrow(dat) is not TRUE

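I stepped through the function with the debugger, and I think there is a bug in the package at the line shown here:
image

The error message says that length(prob) != nrow(dat) "is not TRUE", which means the check fails exactly when the lengths match. I believe the highlighted line should be
stopifnot(length(prob) == nrow(dat))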

add proper ratio-imputation

Ratio imputation is based on a weighted simple regression Y = bX with weights 1/X. It would be nice to have an impute_ratio function. Also, lm makes a different choice in selecting data under missingness than

b <- mean(Y,na.rm=TRUE)/mean(X,na.rm=TRUE)
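
The difference can be made concrete in base R. Minimizing sum((Y - bX)^2 / X) gives b = sum(Y) / sum(X) over the records where both values are observed, which is what lm() computes after dropping incomplete rows. The marginal-means estimate uses every observed Y and every observed X separately. The data below are simulated for illustration only, and X is assumed positive.

```r
set.seed(42)
X <- runif(30, min = 1, max = 10)
Y <- 3 * X + rnorm(30)
Y[c(2, 5, 9)] <- NA
X[c(5, 12)] <- NA

# Weighted regression through the origin; lm() drops incomplete rows.
b_wls <- unname(coef(lm(Y ~ 0 + X, weights = 1 / X)))

# Same estimate in closed form, using pairwise-complete records only.
ok <- !is.na(X) & !is.na(Y)
b_pairs <- sum(Y[ok]) / sum(X[ok])

# Estimate from marginal means, which uses different records for Y and X.
b_marginal <- mean(Y, na.rm = TRUE) / mean(X, na.rm = TRUE)

# Ratio imputation: fill missing Y where X is available.
fill <- is.na(Y) & !is.na(X)
Y_imp <- Y
Y_imp[fill] <- b_pairs * X[fill]

c(b_wls = b_wls, b_pairs = b_pairs, b_marginal = b_marginal)
```

Records where X is also missing (record 5 here) cannot be imputed this way and stay NA.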

Use on an image?

I need to teach imputation soon and am happy to have found this package!

I've been trying to think up a very simple, compelling, and visual example to work through and someone suggested working with an image, where certain pixels are missing.

Have you ever done that (or seen a nice example somewhere)? Even better, using this package?

If not, do you have a hunch whether this will work out nicely? Is it clear to you in advance that this is either a great or terrible idea? Thanks for any wisdom.
