ajdamico / convey Goto Github PK
View Code? Open in Web Editor NEWvariance of distribution measures estimation of survey data
License: GNU General Public License v3.0
variance of distribution measures estimation of survey data
License: GNU General Public License v3.0
it looks like the user-level functions svy*
should have the na.rm=
parameter (default to FALSE like the survey package) and then that na.rm= parameter should get passed to the lower-level functions.
you can see how the na.rm
object gets used in
survey:::svymean.survey.design
library(survey)
library(vardpoor)
data(eusilc)
des_eusilc <- svydesign( ids = ~rb030 , strata = ~db040 , weights = ~rb050 , data = eusilc )
des_eusilc <- convey_prep( des_eusilc )
svyarpr( ~ py010n , design = des_eusilc )
svyarpr( ~ py010n , design = des_eusilc , na.rm = TRUE )
svyarpr( ~ py010n , design = des_eusilc_rep )
svyarpr( ~ py010n , design = des_eusilc_rep , na.rm = TRUE )
they are currently in two separate locations:
https://github.com/DjalmaPessoa/convey/blob/master/R/all_funs.R#L342
https://github.com/DjalmaPessoa/convey/blob/master/R/all_funs.R#L436
i'm not sure why it's a standalone script? can it be used within testthat? (maybe it already is and i'm just misunderstanding)
@DjalmaPessoa could you add travis-ci to convey? it will check that the build is OK for cran and all of the testthat
scripts as well, automatically. here's a write-up of how to add it..
http://juliasilge.com/blog/Beginners-Guide-to-Travis/
and more detail here
http://r-pkgs.had.co.nz/check.html#travis
working with data frame
pnad2011<- dbGetQuery( db , 'select v4618 , v4617 , pre_wgt , v4609 , v4720, v8005 from pnad2011' )
pnad2011_des <- svydesign(
id = ~v4618 ,
strata = ~v4617 ,
data = pnad2011 ,
weights = ~pre_wgt ,
nest = TRUE
)
pop.post<- data.frame(v4609= as.character(unique(pnad2011$v4609)), Freq= unique(pnad2011$v4609))
pnad2011_des_pos <- postStratify(pnad2011_des, ~v4609, pop.post)
pnad2011_des_pos_sub <- subset(pnad2011_des_pos, v4720!=0 & is.na(v4720)==FALSE & v8005>=15 )
library(convey)
pnad2011_des_pos_sub<-convey_prep(pnad2011_des_pos_sub)
svypoormed(~as.numeric(v4720),pnad2011_des_pos_sub, na.rm=TRUE )
svygini (~as.numeric(v4720),pnad2011_des_pos_sub, na.rm=TRUE)
densfun (~as.numeric(v4720), pnad2011_des_pos_sub, 1000, fun = "F", na.rm=TRUE)
Perhaps it would be useful to include the following information:
https://github.com/DjalmaPessoa/convey/blob/master/R/all_funs.R#L35
should h_fun include a na.rm=` parameter to deal with this possibility?
des_eusilc <- survey:::svydesign(ids = ~rb030, strata =~db040, weights = ~rb050, data = eusilc)
des_eusilc_rep <- as.svrepdesign(des_eusilc, type= "bootstrap")
des_eusilc_rep <- convey_prep(des_eusilc_rep)
b1 <- svyarpr(~eqIncome, design = des_eusilc_rep, 0.5, 0.6)
confint(b1)
hi, you should be able to copy over what djalma has already done for the other functions pretty easily.. thanks!
https://github.com/DjalmaPessoa/convey/tree/master/tests/testthat
https://github.com/DjalmaPessoa/convey/blob/master/all_funs.R#L267-L289
# CORRECT method to break out by groups
dati <- data.frame(IDd = 1:nrow(eusilc), eusilc)
d1 <- linarpt(Y="eqIncome", id="IDd", weight = "rb050", Dom = "db040",
dataset = dati, percentage = 60, order_quant=50)
# INCORRECT SUBSET
ics <- subset(dati,db040=='Tyrol')
d2 <- linarpt(Y="eqIncome", id="IDd", weight = "rb050", Dom = NULL,
dataset = ics , percentage = 60, order_quant=50)
# your function matches the incorrect subsetting technique
d3 <- svyarpt(~eqIncome, subset( des_eusilc , db040 == 'Tyrol' ) , .5, .6)
# this is the correct way to use `linarpt`
table( d1$lin$lin_arpt__db040.Tyrol )
# NOTICE that the zeroes are maintained!
# this is the incorrect way to use `linarpt`
summary(d2$lin)
# your function matches the incorrect method
table(d3$lin)
this means convey functions need to "reach around" the svyby in order to get the full design object with sys.call() or something similar so that the SE_lin can be run within the main wrapper function rather than secondarily. once that's been implemented, re-integrate the correct print methods
https://github.com/DjalmaPessoa/convey/blob/master/R/all_funs.R#L46
i am not sure if it should look like line #96 in the same script?
https://github.com/DjalmaPessoa/convey/blob/master/R/all_funs.R#L94
library(vardpoor)
data(eusilc)
dati = data.frame(1:nrow(eusilc), eusilc)
colnames(dati)[1] <- "IDd"
library(survey)
# create a design object
#des_eusilc <- svydesign(ids = ~db040, weights = ~rb050, data = eusilc)
des_eusilc <- svydesign(ids = ~rb030, strata =~db040, weights = ~rb050, data = eusilc)
des_eusilc_rep <- as.svrepdesign(des_eusilc, type = "bootstrap")
library(convey)
des_eusilc <- convey_prep(des_eusilc)
des_eusilc_rep<- convey_prep(des_eusilc_rep)
svyby(~eqIncome, by = ~db040, des_eusilc, FUN=svyfgt, alpha=1, deff = FALSE)
svyby(~eqIncome, by = ~db040, des_eusilc_rep, FUN=svyfgt, alpha=1, deff = FALSE)
## the se estimates are too different, probably an error in the svyfgt.survey.design
including github issues and all methods
prevent dbi-backed svydesign & svrepdesign objects from being used in any convey functions
There are some useful information not yet contained in the output of the functions.
One possibility would to include additional attributes like for instance:
svyarpr - threshold , quantile
svyarpt - quantile, order, percent
svypoormed - arpr
svyqsr - upper quantile, lower quantile, upper total, lower total
svyrmir - median of income of people older than 65, median of people younger than 65
svyrmpg - median poor
svyfgt - type of poverty threshold, value of threshold
this is what recall for the moment.
https://github.com/DjalmaPessoa/convey/blob/master/R/all_funs.R#L225
rather than passing in an object
you will need to pass in an expression
that will then update the survey design. remember all the trouble we had with t=
expecting one length and object
being a different length? this is the root of the problem..
does it make sense to eliminate SE_lin2
and write the appropriate code directly into each of the main functions? appropriate code means keeping everything within the data set, never creating an external object
that is separated from the design
this might mean you need to use bquote
and model.matrix
more often (which is what lumley does) but i am not sure
library(vardpoor)
data(eusilc)
dati = data.frame(1:nrow(eusilc), eusilc)
colnames(dati)[1] <- "IDd"
library(survey)
library(convey)
des_eusilc <- svydesign(ids = ~db040, weights = ~rb050, data = eusilc)
set.seed(123)
nas<-sample(rownames(eusilc),100,replace=FALSE)
eusilc$eqIncome0<-eusilc$eqIncome
eusilc[nas, "eqIncome0"]<-NA
nas1<-sample(rownames(eusilc),50,replace=FALSE)
eusilc$age0<-eusilc$age
eusilc[nas1, "age0"]<-NA
des_eusilc <- svydesign(ids = ~rb030, strata =~db040, weights = ~rb050, data = eusilc)
des_eusilc_rep <- as.svrepdesign(des_eusilc, type = "bootstrap")
library(convey)
des_eusilc <- convey_prep(des_eusilc)
des_eusilc_rep<- convey_prep(des_eusilc_rep)
svyrmir.survey.design( ~eqIncome0 , design = des_eusilc ,age = ~age0, agelim=65, na.rm=TRUE)
the way that you are slimming the data.frame object inside of all of the functions down to only the variables for the current computation will not work if the variable is constructed on-the-fly.
here is a reproducible example that produces the bug. notice the as.numeric() causes the problem. this should work if we are going to align with the survey package
library(convey)
library(survey)
library(vardpoor)
data(eusilc)
# linearized design
des_eusilc <- svydesign( ids = ~rb030 , strata = ~db040 , weights = ~rb050 , data = eusilc )
des_eusilc <- convey_prep( des_eusilc )
# works
svyarpt( ~eqIncome , design = des_eusilc )
# breaks
svyarpt( ~as.numeric(eqIncome) , design = des_eusilc )
you can view the point that this breaks by typing
debug(convey:::svyarpt.survey.design)
here are the lines leading up the error
Browse[2]>
debug: inc <- terms.formula(formula)[[2]]
Browse[2]>
debug: df <- model.frame(design)
Browse[2]>
debug: incvar <- df[[as.character(inc)]]
Browse[2]>
Error in .subset2(x, i, exact = exact) : no such index at level 1
look at the code inside
survey:::svymean.survey.design
and also
survey:::svymean.survey.design2
for ideas about how to work around this issue?
@DjalmaPessoa @guilhermejacob what do you think about adding a message at library(convey)
startup that alerts users to the presence of vignettes? some discussion here, thanks
just replacing does not work:
i have attached an R file that attempts to reproduce the gini coefficient with the survey of consumer finances but fails.. any idea?
showing simple usage of main convey
functions
https://github.com/ajdamico/asdfree/tree/master/Panel%20Study%20of%20Income%20Dynamics
library(devtools)
install_github( "djalmapessoa/convey" )
library(convey)
library(vardpoor)
data(eusilc)
library(survey)
des_eusilc <- svydesign(ids=~db040, weights=~rb050, data=eusilc)
des_eusilc <- convey_prep( des_eusilc )
gini_eqIncome <- svygini(~eqIncome, design=des_eusilc)
svygini( ~ eqIncome , des_eusilc )
svyby( ~ eqIncome , ~ rb090 , des_eusilc , svygini )
svygini( ~ eqIncome , subset( des_eusilc , rb090 == 'male' ) )
svygini( ~ eqIncome , subset( des_eusilc , rb090 == 'female' ) )
see the examples at
https://github.com/DjalmaPessoa/convey/blob/master/R/svyrmir.R#L42-L52
could you update these with a variable that makes sense? thanks! each of these break with
> svyrmir( ~eqIncome , design = des_eusilc_rep )
Error in terms.formula(age) : argument "age" is missing, with no default
>
> # linearized design using a variable with missings
> svyrmir( ~ py010n , design = des_eusilc )
Error in terms.formula(age) : argument "age" is missing, with no default
> svyrmir( ~ py010n , design = des_eusilc , na.rm = TRUE )
Error in terms.formula(age) : argument "age" is missing, with no default
> # replicate-weighted design using a variable with missings
> svyrmir( ~ py010n , design = des_eusilc_rep )
Error in terms.formula(age) : argument "age" is missing, with no default
> svyrmir( ~ py010n , design = des_eusilc_rep , na.rm = TRUE )
Error in terms.formula(age) : argument "age" is missing, with no default
>
https://github.com/DjalmaPessoa/convey/blob/master/.Rhistory
i am not sure why it's there? if it has a purpose, that's fine :)
i have changed it in future commits, so now lin
gets passed in rather than lin_qsr
since i think it was wrong when comp=TRUE
?
The functions svygpg has the sex and svyrmir has the age as extra
arguments. Perhaps the NA treatment has to be extended also to these
variables.
For svyrmir:
svyrmir( ~eqIncome , subset(des_eusilc,db040=="Tyrol") , age = ~age,
agelim=65, na.rm=TRUE)
works, but for
svyby(~eqIncome,by=~db040, design=des_eusilc, FUN=svyrmir, age = ~age,
agelim=65, na.rm=TRUE, deff=FALSE )
I got:
Error in svyrmir.survey.design(data, design[byfactor %in% byfactor[i], :
unused argument (deff = deff). Any suggestion?
It will be needed to revise all tests. Most of them used the function SE_lin2 which was deleted from convey
library(convey)
options( monetdb.debug.query = TRUE )
setwd( "C:\Djalma\PnadMonetdb\PNAD" )
pnad.dbfolder <- paste0( getwd() , "/MonetDB" )
db <- dbConnect( MonetDBLite() , pnad.dbfolder )
dbListTables(db)
options(survey.lonely.psu = "adjust")
source_url( "https://raw.githubusercontent.com/ajdamico/asdfree/master/Pesquisa%20Nacional%20por%20Amostra%20de%20Domicilios/pnad.survey.R" , prompt = FALSE )
sample.pnad <-
svydesign(
id = ~v4618 ,
strata = ~v4617 ,
data = 'pnad2011' ,
weights = ~pre_wgt ,
nest = TRUE ,
dbtype = "MonetDBLite" ,
dbname = pnad.dbfolder
)
y <-
pnad.postStratify(
design = sample.pnad ,
strata.col = 'v4609' ,
oldwgt = 'pre_wgt'
)
y.sub <- subset (y, !is.na(v4720) & v4720!=0 & v8005>=15)
dim(y)
dim(y.sub)
Now check:
dim.des <-function (formula, design, na.rm = FALSE, ...)
{
incvar <- model.frame(formula, design$variables, na.action = na.pass)[[1]]
w <- 1/design$prob
ncom <- names(w)
if (na.rm) {
nas <- is.na(incvar)
design <- design[!nas, ]
incvar <- incvar[!nas]
w <- w[!nas]
}
list(dim(design), length(incvar))
}
dim.des(~ v4720,y.sub, na.rm=TRUE)
for each of the main svy*() functions you have written,
a1 <- svy*( ~variable , design )
run on linearized design
a2 <- svyby( ~ variable , ~ byvar , design , svy*() )
run on linearized design
b1 <- svy*( ~variable , design )
run on svyrep design
b2 <- svyby( ~ variable , ~ byvar , design , svy*() )
run on svyrep design
and then could you also write testing code that checks and confirms that
coef
and SE
and confint
all work on the outputted objects a1
, a2
, b1
, and b2
?
so for each function, 4 examples x 3 helper functions
could you clean it up so it's more similar to the others? i am not sure if it is like that for a reason?
could you also add a replicate-weighted example?
If the �goal is only to show the NA treatment, we could create a new variable inserting NA values in the variable eqincome:
indNA<- rbinom(nrow(eusilc),1, .10)
eusilc$eqincome.miss <- eusilc$iqincome
eusilc$eqincome.miss[indNA==1] <- NA
and use the variable eqincome.miss instead of py010n in the svyrmir example?
showing simple usage of main convey
functions
https://github.com/ajdamico/asdfree/tree/master/Censo%20Demografico
after you write the first draft, please change the assignee to me and i will clean up
example using pnad2011
library(RSQLite)
library(downloader)
library(survey)
library(convey)
setwd('C:\\Djalma\\PNAD2012')
pnad.dbname <- "pnad.db"
db<- dbConnect(SQLite(),pnad.dbname )
dbListTables(db)
dbListFields(db, "pnad2011")
options( survey.lonely.psu = "adjust" )
source_url( "https://raw.github.com/ajdamico/usgsd/master/Pesquisa Nacional por Amostra de Domicilios/pnad.survey.R" , prompt = FALSE )
create design
sample.pnad <-
svydesign(
id = ~v4618 ,
strata = ~v4617 ,
data = "pnad2011" ,
weights = ~pre_wgt ,
nest = TRUE ,
dbtype = "SQLite" ,
dbname = "pnad.db"
)
y <-
pnad.postStratify(
design = sample.pnad ,
strata.col = 'v4609' ,
oldwgt = 'pre_wgt'
)
filter used by IBGE to define Gini index
ysub<- subset(y, v4720!=0 & is.na(v4720)==FALSE & v8005>=15)
ysub<-convey_prep(ysub)
estimate the mean of v4720
svymean(~as.numeric(v4720), ysub, na.rm=TRUE )
svyquantile(~as.numeric(v4720), ysub, .5, na.rm=TRUE )
estimate the arpt:
svyarpt(~as.numeric(v4720), ysub, na.rm=TRUE )
svygini(~as.numeric(v4720), ysub, na.rm=TRUE)
svyarpt is breaking. After some debuging, it first breaks at htot <- h_fun(incvec, wf)
invec and wf have different length. I don't see why?
library(vardpoor)
data(eusilc)
dati = data.frame(1:nrow(eusilc), eusilc)
colnames(dati)[1] <- "IDd"
library(survey)
# create a design object
des_eusilc <- svydesign(ids = ~rb030, strata =~db040, weights = ~rb050, data = eusilc)
des_eusilc_rep <- as.svrepdesign(des_eusilc, type = "bootstrap")
library(convey)
des_eusilc <- convey_prep(des_eusilc)
des_eusilc_rep<- convey_prep(des_eusilc_rep)
# svyfgt for a domain:
svyfgt(~eqIncome, subset(des_eusilc, db040=="Tyrol"), alpha=0)
for example, svyarpr.svyrep.design contains a ComputeArpr
function
https://github.com/DjalmaPessoa/convey/blob/master/R/svyarpr.R#L119
where it only gets used once within the function. i'm confused why these aren't placed outside of the svyrep.design functions? and - since they're only used once - i'm not sure why they need to be functions? (you may have a good reason) :)
still get
Error in UseMethod("SE_lin2", design) :
no applicable method for 'SE_lin2' applied to an object of class
"c('survey.design2', 'survey.design')"
in svyarpt.
library(RSQLite)
library(downloader)
library(survey)
library(convey)
setwd('C:\\Djalma\\PNAD2012')
pnad.dbname <- "pnad.db"
db<- dbConnect(SQLite(),pnad.dbname )
dbListTables(db)
dbListFields(db, "pnad2011")
options( survey.lonely.psu = "adjust" )
source_url( "https://raw.github.com/ajdamico/usgsd/master/Pesquisa Nacional por Amostra de Domicilios/pnad.survey.R" , prompt = FALSE )
create design:
sample.pnad <-
svydesign(
id = ~v4618 ,
strata = ~v4617 ,
data = "pnad2011" ,
weights = ~pre_wgt ,
nest = TRUE ,
dbtype = "SQLite" ,
dbname = "pnad.db"
)
post-stratify:
y <-
pnad.postStratify(
design = sample.pnad ,
strata.col = 'v4609' ,
oldwgt = 'pre_wgt'
)
filter used by IBGE to define Gini index:
ysub<- subset(y, v4720!=0 & is.na(v4720)==FALSE & v8005>=15)
ysub<-convey_prep(ysub)
estimate the mean of v4720:
svymean(~as.numeric(v4720), ysub, na.rm=TRUE )
estimate the arpt:
svyarpt(~as.numeric(v4720), ysub, na.rm=TRUE )
got the error:
Error in .subset2(x, i, exact = exact) : no such index at level 1
any idea?
you previously had lines like
#' arpt_eqIncome <-svyarpt(~eqIncome, design=des_eusilc, .5, .6,comp=TRUE)
but the 0.5 and 0.6 and comp=TRUE are just the default for the function, so they are unnecessary. do you want to add examples to each svy*() function that show what the non-default parameters do? for example, what is the difference between
svyarpt( ~eqIncome , des_eusilc )
and
svyarpt( ~eqIncome , des_eusilc , order = 0.5 )
and
svyarpt( ~eqIncome , des_eusilc , comp = TRUE )
some notes about how each parameter changes each function would make it easier for users.. if you do not add notes, then the svy*() examples should only use the defaults, i think?
using the example code from ?svyrmir
library(survey)
library(vardpoor)
data(eusilc) ; names( eusilc ) <- tolower( names( eusilc ) )
# linearized design
des_eusilc <- svydesign( ids = ~rb030 , strata = ~db040 , weights = ~rb050 , data = eusilc )
svyrmir( ~eqincome , design = des_eusilc , age = ~age , agelim = 65 , med_old = TRUE )
# replicate-weighted design
des_eusilc_rep <- survey:::as.svrepdesign( des_eusilc , type = "bootstrap" )
svyrmir( ~eqincome , design = des_eusilc_rep, age= ~age, agelim = 65, med_old = TRUE )
# linearized design using a variable with missings
svyrmir( ~ py010n , design = des_eusilc,age= ~age, agelim = 65)
svyrmir( ~ py010n , design = des_eusilc , age= ~age, agelim = 65, na.rm = TRUE )
# replicate-weighted design using a variable with missings
svyrmir( ~ py010n , design = des_eusilc_rep,age= ~age, agelim = 65 )
svyrmir( ~ py010n , design = des_eusilc_rep ,age= ~age, agelim = 65, na.rm = TRUE )
are these zeroes expected?
svyrmir( ~ py010n , design = des_eusilc , age= ~age, agelim = 65, na.rm = TRUE )
rmir SE
py010n 0 0.0013svyrmir( ~ py010n , design = des_eusilc_rep ,age= ~age, agelim = 65, na.rm = TRUE )
rmir SE
py010n 0 0
i believe this is necessary for the pnad?
the survey::svyCprod
function uses it, but i'm not sure if you carried it over?
https://github.com/DjalmaPessoa/convey/blob/2778c24fa906701dc33e85c94cc1bdb58c0acb26/R/svyarpt.R#L65
i do not believe you need to deal with post-stratification for replicate-weighted designs, but i am not 100% sure?
on the ?help pages, it looks like the examples only show a usage case for the linearized designs?
does it make sense to add an example that shows the functions work properly on replicate-weighted survey design objects as well?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.