markvanderloo / simputation
Making imputation easy
License: GNU General Public License v3.0
It would be great to be able to build imputation models on data set X1 and apply them to other data sets (X1, X2, ...). From what I can tell, you can only impute the same data that you start with.
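The request above can be sketched in base R, outside simputation: fit a model on one data set, keep the fitted object, and use it to fill missing values in another. The data frames `X1` and `X2` and the helper `apply_imputation` are hypothetical names used only for illustration; this is not the package's API.

```r
# Hypothetical data: X1 is used for fitting, X2 receives the imputations.
X1 <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 8.1, 9.8))
X2 <- data.frame(x = c(1.5, 2.5, 3.5), y = c(NA, 5.0, NA))

# Fit the imputation model once, on X1 only.
model <- lm(y ~ x, data = X1)

# Hypothetical helper: fill missing y in any data set using the stored model.
apply_imputation <- function(dat, model) {
  miss <- is.na(dat$y)
  dat$y[miss] <- predict(model, newdata = dat[miss, , drop = FALSE])
  dat
}

X2_imputed <- apply_imputation(X2, model)
X2_imputed
```

Observed values in `X2` are left untouched; only the missing entries are replaced by predictions from the model trained on `X1`.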
The package documentation states that for impute_rhd, when pool is "complete" and "a record has multiple missings, all imputations are taken from a single donor". However, this does not seem to happen in practice.
Example:
library(simputation)
library(magrittr)  # provides %>%
test <- data.frame(seq(1,15), seq(1,15))
colnames(test) <- c("num1", "num2")
test$num1[2:6] <- NA
test$num2[4:8] <- NA
set.seed(1000)
head(test %>% impute_rhd(num1 + num2 ~ 1, pool = "complete"), 10)
From: https://blog.rstudio.com/2016/03/24/tibble-1-0-0/
Interacting with legacy code
A handful of functions don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame.
With this correction, the impute_shd and impute_rhd functions work as expected.
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(simputation)
dat <- as_tibble(iris)
# empty a few fields
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA
dat %>% impute_shd(Sepal.Length ~ Sepal.Width + Species) %>% head(3)
#> Error: Can't use matrix or array for column indexing
dat %>% impute_rhd(Sepal.Length ~ Sepal.Width + Species) %>% head(3)
#> Error: Can't use matrix or array for column indexing
@markvanderloo When I use this command to install the package from GitHub:
library(devtools)
install_github("markvanderloo/simputation")
I get the following error:
Error: Failed to install 'unknown package' from GitHub:
HTTP error 404.
No commit found for the ref MultipleTimeSeries
Did you spell the repo owner (`markvanderloo`) and repo name (`simputation`) correctly?
- If spelling is correct, check that you have the required permissions to access the repo.
Is there an alternative way to get the updated version with all the fixes? The released version loaded by library(simputation) still doesn't include the current fixes, so I tried to install it directly from GitHub.
Regarding our brief conversation after your tutorial: https://mlr.mlr-org.com/reference/impute.html
Everything is in one huge doc now. Should split things up as in the list in the README file.
This generalizes the (grouped) median approach. Needs dependency on the quantreg package.
yep. should add it.
The ... argument is not passed to missForest.
There are several excellent packages offering imputation methodology, for example VIM which scales well and offers very detailed control over imputation methods.
It would be good to make the impute_xxx functions flexible so one can choose an imputation backend and use the ellipsis to pass extra arguments to the backend. Possible options are at least:
Here's a reprex that highlights the issue:
library(tibble)
library(magrittr)
library(simputation)
tibble::tribble(
~`a col`, ~`b col`,
1, NA,
3, 3,
5, 5,
7, 7
) %>%
as.data.frame() %>%
impute_knn(`b col` ~ `a col`, k = 3)
#> Error in `[.data.frame`(dat, imp_vars): undefined columns selected
tibble::tribble(
~a.col, ~b.col,
1, NA,
3, 3,
5, 5,
7, 7
) %>%
as.data.frame() %>%
impute_knn(b.col ~ a.col, k = 3)
#> a.col b.col
#> 1 1 3
#> 2 3 3
#> 3 5 5
#> 4 7 7
I think the issue is caused by the extraction of formula components in get_imputed and get_predictors. I see that 'get_imputed(dat, `b col` ~ `a col`)' returns "`b col`" rather than "b col", which is what is required to index the column.
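The difference between a backticked and a bare name can be reproduced with base R alone. The sketch below does not show how get_imputed or get_predictors work internally; it only illustrates, as an assumption about the likely cause, that deparsing a formula keeps the backticks while all.vars() returns names that can index columns.

```r
f <- `b col` ~ `a col`

# Deparsing the whole formula keeps the backticks around non-syntactic names.
deparse(f)

# all.vars() returns the bare variable names instead.
vars <- all.vars(f)
vars

# Bare names can be used to index columns; backticked strings cannot.
dat <- data.frame(`a col` = c(1, 3, 5, 7), `b col` = c(NA, 3, 5, 7),
                  check.names = FALSE)
dat[vars[1]]
```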
impute_lm(y~x|z, where x > 0) would first select records where x>0, fit the model, impute, and return the whole dataset.
Hello,
I have a problem with the random hot deck in the simputation package.
I want to impute several linked variables (they sum to 100) with the random hot deck on three criteria (with the option "complete"):
capit_imput=impute_rhd(capit_mod,CAPITAL_W+CAPITAL_PERSMOR+CAPITAL_HORSW_FAM+CAPITAL_HORW_HORSFAM~ SIEGE_DEP+CDEX_COEF2017+OTE64_COEF2017,pool="complete")
But for some units, it seems that only one value is imputed.
The first missing unit (with ind=TRUE) has to be imputed from two potential donors (SIEGE_DEP=10, CDEX_COEF2017=11, OTE64_COEF2017=1600).
It seems that the second unit is the donor, with CAPITAL_W=10, but CAPITAL_PERSMOR differs from 90.
Can you explain why?
And when I specify the model as
capit_imput=impute_rhd(capit_mod,CAPITAL_PERSMOR+CAPITAL_W+CAPITAL_HORSW_FAM+CAPITAL_HORW_HORSFAM~ SIEGE_DEP+CDEX_COEF2017+OTE64_COEF2017,pool="complete")
the record is imputed correctly, as if the order of the variables mattered with the "complete" option.
Thanks for your help.
I am trying to use impute_rhd with distinct donor sets per missingness pattern, but I am stuck on an error message that I don't understand. I hope the reprex below helps, and I would appreciate any advice on how to work around this.
library(simputation)
iris_na <- mice::ampute(iris)
iris_na$amp |> impute_rhd(Sepal.Length + Sepal.Width ~ 1, pool = "multivariate")
#> Error in sample.int(length(x), size, replace, prob): incorrect number of probabilities
x <- iris_na$amp |> impute_rhd(Sepal.Length + Sepal.Width ~ 1)
Created on 2024-02-07 with reprex v2.0.2
Hi,
I wonder if this package supports a !!as.name-like way of specifying a variable, where the variable name is supplied as a string.
Ex.
IMPUTED_VARIABLES ~ MODEL_SPECIFICATION [ | GROUPING_VARIABLES ]
instead of directly specifying the column name:
abundance ~ mean(abundance) | group
I'd like to do something like:
abundance_column_name <- "abundance"
!!as.name(abundance_column_name) ~ mean(!!as.name(abundance_column_name)) | group
will it be possible with the current implementation?
best,
hee jong kim
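One way to approach the question above with base R only, independent of any tidy-evaluation support in the package, is to build the formula from text. The sketch assumes the column name is syntactically valid; names containing spaces would need backticks added to the string.

```r
abundance_column_name <- "abundance"

# Build the formula text first, then convert it to a formula object.
f <- as.formula(sprintf("%s ~ mean(%s) | group",
                        abundance_column_name, abundance_column_name))
f
```

The result is an ordinary formula object, so it can be passed wherever a literal formula would be written.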
'nuff said
When I try to install the package, the console reports that it is not available for R version 3.6.3:
install.packages("simputation", dependencies=TRUE)
Installing package into ‘/home/username/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘simputation’ is not available (for R version 3.6.3)
The readme links to a blogpost and slides, but the links are currently dead.
The links in question:
http://www.markvanderloo.eu/yaRb/2016/09/13/announcing-the-simputation-package-make-imputation-simple/
http://www.markvanderloo.eu/files/statistics/user2017markvanderloo.pdf
By predicting with predict.all=TRUE we get the matrix of predictions. Row-wise aggregation can then be customized, but it may cost a lot of memory, since the matrix of predictions has dimension nrow(newdata) x ntree.
It looks like the missForest package isn't tibble friendly. If you want to pass a tibble into the simputation function impute_mf, the following error pops up.
library(tidyverse)
library(simputation)
library(missForest)
#> Loading required package: randomForest
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
#>
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:simputation':
#>
#> na.roughfix
#> The following object is masked from 'package:dplyr':
#>
#> combine
#> The following object is masked from 'package:ggplot2':
#>
#> margin
#> Loading required package: foreach
#>
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#>
#> accumulate, when
#> Loading required package: itertools
#> Loading required package: iterators
df=tibble::tibble(y=rnorm(100),x1=rnorm(100),x2=rnorm(100))
for(i in seq_along(names(df))) df[sample(1:nrow(df),10),i]=NA
df
#> # A tibble: 100 x 3
#> y x1 x2
#> <dbl> <dbl> <dbl>
#> 1 0.252 0.418 1.08
#> 2 0.665 0.565 -1.73
#> 3 0.592 -0.264 -0.356
#> 4 NA -1.36 0.884
#> 5 0.0819 -1.42 0.481
#> 6 0.407 NA 0.732
#> 7 -0.750 0.862 0.409
#> 8 1.21 -1.33 -0.0501
#> 9 NA 0.644 0.206
#> 10 1.38 -2.77 -0.519
#> # ... with 90 more rows
# This will throw an error because df is a tibble and missForest
# expects a data frame as an input for its xmis argument
df %>% impute_mf(formula=y~.)
#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA
#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA
#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA
#> missForest iteration 1 in progress...
#> Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
#> mtry, : The response has five or fewer unique values. Are you sure you want
#> to do regression?
#> Warning: Could not execute missForest::missForest: length of response must be the same as predictors
#> Returning original data
#> # A tibble: 100 x 3
#> y x1 x2
#> <dbl> <dbl> <dbl>
#> 1 0.252 0.418 1.08
#> 2 0.665 0.565 -1.73
#> 3 0.592 -0.264 -0.356
#> 4 NA -1.36 0.884
#> 5 0.0819 -1.42 0.481
#> 6 0.407 NA 0.732
#> 7 -0.750 0.862 0.409
#> 8 1.21 -1.33 -0.0501
#> 9 NA 0.644 0.206
#> 10 1.38 -2.77 -0.519
#> # ... with 90 more rows
# This runs okay because the tibble is converted to a data.frame
# before it is passed to the missForest function.
df_okay <- df %>% as.data.frame() %>% impute_mf(y~.)
head(df_okay)
# This adjustment to the impute_mf function makes the initial code
# run without throwing an error.
impute_mf_bcj <- function(dat, formula, ...) {
  stopifnot(inherits(formula, "formula"))
  if (simputation:::not_installed("missForest"))
    return(dat)
  imputed <- simputation:::get_imputed(formula, dat)
  predictors <- simputation:::get_predictors(formula, dat, ...)
  vars <- unique(c(imputed, predictors))
  # Convert to a plain data.frame so missForest also accepts tibbles.
  imp <- tryCatch(missForest::missForest(as.data.frame(dat[vars]), ...)[[1]],
    error = function(e) {
      # warning(sprintf()) replaces warnf, which is not available outside the package.
      warning(sprintf("Could not execute missForest::missForest: %s\n Returning original data",
        e$message), call. = FALSE)
      dat
    })
  if (length(imputed) == 0) {
    dat[vars] <- imp[vars]
  } else {
    dat[imputed] <- imp[imputed]
  }
  dat
}
df %>% impute_mf_bcj(formula=y~.)
#> missForest iteration 1 in progress...done!
#> missForest iteration 2 in progress...done!
#> missForest iteration 3 in progress...done!
#> missForest iteration 4 in progress...done!
#> # A tibble: 100 x 3
#> y x1 x2
#> <dbl> <dbl> <dbl>
#> 1 0.252 0.418 1.08
#> 2 0.665 0.565 -1.73
#> 3 0.592 -0.264 -0.356
#> 4 0.775 -1.36 0.884
#> 5 0.0819 -1.42 0.481
#> 6 0.407 NA 0.732
#> 7 -0.750 0.862 0.409
#> 8 1.21 -1.33 -0.0501
#> 9 1.07 0.644 0.206
#> 10 1.38 -2.77 -0.519
#> # ... with 90 more rows
Created on 2019-01-20 by the reprex package (v0.2.1)
Interpretation of formula objects for methods that impute every variable simultaneously requires care since there is no difference between predictors and predicted. I see two options.
x + y ~ z + w + q
so that (x,y,z,w,q) are used in the model, but only (x,y) are copied into the output dataset. Doing
~ x + y
would impute x and y but without taking z, w and q into account when modeling the imputations.
So information from e.g. dplyr::group_by is lost on imputation.
See #17
either parametric e ~ N(mu, sigma) or nonparametric e ~ residuals(m)
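The two residual options named above can be sketched in base R. This is an illustration on hypothetical data, not the package's implementation; it assumes mu = 0 and takes sigma to be the residual standard deviation of the fitted model m.

```r
set.seed(1)
dat <- data.frame(x = 1:20)
dat$y <- 2 * dat$x + rnorm(20)
dat$y[c(3, 11)] <- NA

m <- lm(y ~ x, data = dat)          # fitted on the observed records only
miss <- is.na(dat$y)
pred <- predict(m, newdata = dat[miss, , drop = FALSE])

# Parametric: draw e from a normal distribution (assumed mean 0, sd = sigma(m)).
e_norm <- rnorm(sum(miss), mean = 0, sd = sigma(m))

# Nonparametric: draw e from the observed model residuals.
e_obs <- sample(residuals(m), sum(miss), replace = TRUE)

y_parametric    <- pred + e_norm
y_nonparametric <- pred + e_obs
```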
'nuff said
Thanks for the really nicely designed package! In trying it out, I noticed the parsing in impute_lm or impute_rf will fail if the formula is long. Here's an example where I've made the iris dataset have long variable names.
I assume this is happening because the formula is hitting the width.cutoff of deparse, though I haven't explored in detail.
library(simputation)
# Demo from the docs:
dat <- iris
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA
# But with long variable names:
names(dat) <- c(
"this_is_very_long_name_for_sepal_length",
"this_is_very_long_name_for_sepal_width",
"this_is_very_long_name_for_petal_length",
"this_is_very_long_name_for_petal_width",
"this_is_very_long_name_for_species"
)
# The basic lm() call is fine:
form <- this_is_very_long_name_for_sepal_length ~
this_is_very_long_name_for_sepal_width +
this_is_very_long_name_for_species +
this_is_very_long_name_for_petal_width
lm(form, data=dat)
#>
#> Call:
#> lm(formula = form, data = dat)
#>
#> Coefficients:
#> (Intercept)
#> 2.5564
#> this_is_very_long_name_for_sepal_width
#> 0.6912
#> this_is_very_long_name_for_speciesversicolor
#> 0.9590
#> this_is_very_long_name_for_speciesvirginica
#> 1.2029
#> this_is_very_long_name_for_petal_width
#> 0.3816
# But this fails:
da1 <- impute_lm(dat, form)
#> Error in parse(text = x, keep.source = FALSE): <text>:2:0: unexpected end of input
#> 1: this_is_very_long_name_for_sepal_length ~ this_is_very_long_name_for_sepal_width + this_is_very_long_name_for_species +
#> ^
Created on 2019-12-10 by the reprex package (v0.3.0)
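The width.cutoff hypothesis from this report can be checked without simputation. The sketch below shows that deparse() splits a long formula into several strings, and that collapsing them before parsing restores a complete formula. Whether the package parses formulas exactly this way is an assumption.

```r
form <- this_is_very_long_name_for_sepal_length ~
  this_is_very_long_name_for_sepal_width +
  this_is_very_long_name_for_species +
  this_is_very_long_name_for_petal_width

# A long formula is deparsed into more than one string.
pieces <- deparse(form)
length(pieces)

# Collapsing all pieces gives one complete, parseable string.
whole <- paste(pieces, collapse = " ")
reparsed <- eval(parse(text = whole, keep.source = FALSE)[[1]])
all.vars(reparsed)
```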
The impute_pmm method uses the distance between donor and imputed values, while it should use the distance between predicted donor values and imputed values.
Example provided by Susie Jentoft by e-mail:
dat <- iris[1:15,]
dat[8,1] <- NA
impute_pmm(dat, Sepal.Length ~ Sepal.Width)
The value 5 is imputed, while 4.6 is expected.
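For comparison, matching on predicted values can be sketched in base R for the same data. This is a minimal illustration under the assumption that the nearest donor is chosen by absolute distance between predictions, with ties going to the first donor; it is not the package's code.

```r
dat <- iris[1:15, ]
dat[8, 1] <- NA

m <- lm(Sepal.Length ~ Sepal.Width, data = dat)   # fitted on complete records
pred <- predict(m, newdata = dat)                  # predictions for all records

recipient <- which(is.na(dat$Sepal.Length))
donors <- which(!is.na(dat$Sepal.Length))

# Match on predicted values; which.min() resolves ties to the first donor.
nearest <- donors[which.min(abs(pred[donors] - pred[recipient]))]
dat$Sepal.Length[recipient] <- dat$Sepal.Length[nearest]
dat$Sepal.Length[8]
```

With a single predictor, the closest prediction belongs to a donor with the same Sepal.Width as the recipient, so the observed Sepal.Length of that donor is imputed.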
The compatibility with dplyr groups broke due to a change in the implementation of dplyr.
Hi,
I’m getting the following error, and I wonder if you could help me understand its cause:
library(dplyr)
library(simputation)
kdata <- tribble(
~age, ~ct, ~pfratio, ~bmi,
56, 86, 130, 30,
58, NA, 110, NA,
78, NA, NA, 28,
54, NA, NA, NA,
45, 45, 230, 28,
54, 45, NA, 29
)
impute_knn(
kdata,
bmi ~ .,
pool = "univariate"
)
#> Warning: Requested k = 5 while 4 donors present. Using k = 4.
#> Error: Subscript `ina` is a matrix, the data `donors[ina]` must have size 1.
Created on 2021-05-30 by the reprex package (v2.0.0)
The same happens if I add more variables to the formula’s left-hand side (e.g., bmi + ct + pfratio ~ .).
I understand the warning that appears in this reprex. However, my actual data has hundreds of observations, yet it does have its fair share of NAs, and occasionally there can be up to 3 NAs per row. Is the error related to NAs in predictor variables?
Thanks!
When I try the following command:
imp_simpuatation<-impute_rhd(
data,
GI~age+sex,
pool = "univariate",
prob = data$SamplingWeight
)
I get the following error:
Error in impute_rhd(data, earnings ~ AG + sex, pool = "univariate", prob = data$SamplingWeight) :
length(prob) != nrow(dat) is not TRUE
I stepped through the function with a debugger and I think there is a bug in the package, which I believe is here:
I believe the highlighted line should be
stopifnot(length(prob) == nrow(dat))
Ratio imputation is based on a weighted single regression Y = bX with weights 1/X. It would be nice to have an impute_ratio function. Also, lm makes a different choice in selecting data under missingness than
b <- mean(Y,na.rm=TRUE)/mean(X,na.rm=TRUE)
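The difference between the two estimators can be seen on a small example. With weights 1/X and no intercept, the weighted least squares slope reduces to sum(Y)/sum(X) over the records lm actually uses, which are the complete cases. The data below are hypothetical.

```r
# Hypothetical data with missing values in both X and Y.
X <- c(10, 20, NA, 40, 50)
Y <- c(21, NA, 62, 79, 102)

# Weighted regression through the origin, Y = bX with weights 1/X.
# lm() drops every record where X or Y is missing.
fit <- lm(Y ~ 0 + X, weights = 1 / X)
b_lm <- unname(coef(fit))

# On the complete cases this equals sum(Y) / sum(X).
cc <- complete.cases(X, Y)
b_cc <- sum(Y[cc]) / sum(X[cc])

# The ratio of means uses all observed X and all observed Y separately.
b_means <- mean(Y, na.rm = TRUE) / mean(X, na.rm = TRUE)

c(lm = b_lm, complete_cases = b_cc, ratio_of_means = b_means)
```

Here the regression estimate uses three records, while the ratio of means uses four values of each variable, so the two slopes differ.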
I need to teach imputation soon and am happy to have found this package!
I've been trying to think up a very simple, compelling, and visual example to work through and someone suggested working with an image, where certain pixels are missing.
Have you ever done that (or seen a nice example somewhere)? Even better, using this package?
If not, do you have a hunch whether this will work out nicely? Is it clear to you in advance that this is either a great or terrible idea? Thanks for any wisdom.
d <- data.frame(x=c(NA,'a','b'),stringsAsFactors = FALSE)
impute_proxy(d,x ~ "w")