Polytomous Variable Latent Class Analysis (R package)

Home Page: https://dlinzer.github.io/poLCA/

poLCA

Polytomous Variable Latent Class Analysis

poLCA is a software package for the estimation of latent class models and latent class regression models for polytomous outcome variables, implemented in the R statistical computing environment.

Latent class analysis (also known as latent structure analysis) can be used to identify clusters of similar "types" of individuals or observations from multivariate categorical data, to estimate the characteristics of these latent groups, and to return the probability that each observation belongs to each group. These models are also helpful in investigating sources of confounding and nonindependence among a set of categorical variables, as well as for density estimation in cross-classification tables. Typical applications include the analysis of opinion surveys; rater agreement; lifestyle and consumer choice; and other social and behavioral phenomena.

The basic latent class model is a finite mixture model in which the component distributions are assumed to be multi-way cross-classification tables with all variables mutually independent. The model stratifies the observed data by a theoretical latent categorical variable, attempting to eliminate any spurious relationships between the observed variables. The latent class regression model makes it possible for the researcher to further estimate the effects of covariates (or "concomitant" variables) on predicting latent class membership.

poLCA uses expectation-maximization and Newton-Raphson algorithms to find maximum likelihood estimates of the parameters of the latent class and latent class regression models.
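To make the estimation idea concrete, here is a minimal base-R sketch of the expectation-maximization step for the basic latent class model (no covariates). It is not poLCA's implementation: the function name `lca_em` and its arguments are hypothetical, it uses a single random start, it omits the Newton-Raphson step used for the regression model, and it assumes complete data coded as integers from 1 upward.

```r
# Minimal EM sketch for a basic latent class model (illustrative only).
# y: integer matrix (N x J) with entries in 1..K_j; R: number of latent classes.
lca_em <- function(y, R, max_iter = 500, tol = 1e-8) {
  N <- nrow(y)
  J <- ncol(y)
  K <- apply(y, 2, max)                      # categories per item
  P <- rep(1 / R, R)                         # starting class shares
  probs <- lapply(K, function(k) {           # random starting response probabilities
    m <- matrix(runif(R * k), nrow = R)
    m / rowSums(m)
  })
  llik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: class-conditional likelihood under local independence
    lik <- matrix(1, N, R)
    for (j in seq_len(J)) {
      lik <- lik * t(probs[[j]][, y[, j], drop = FALSE])
    }
    joint <- sweep(lik, 2, P, "*")
    llik <- sum(log(rowSums(joint)))         # log-likelihood at current parameters
    post <- joint / rowSums(joint)           # posterior class membership
    # M-step: update class shares and response probabilities
    P <- colMeans(post)
    for (j in seq_len(J)) {
      for (k in seq_len(K[j])) {
        probs[[j]][, k] <- colSums(post[y[, j] == k, , drop = FALSE])
      }
      probs[[j]] <- probs[[j]] / rowSums(probs[[j]])
    }
    if (llik - llik_old < tol) break
    llik_old <- llik
  }
  list(P = P, probs = probs, posterior = post, llik = llik, iterations = iter)
}

# Usage on simulated data (three items with 3, 2 and 4 categories)
set.seed(42)
y <- cbind(sample.int(3, 200, replace = TRUE),
           sample.int(2, 200, replace = TRUE),
           sample.int(4, 200, replace = TRUE))
fit <- lca_em(y, R = 2)
```

Because EM only finds a local maximum, a real analysis would rerun this from several starting values, which is what the `nrep` argument of `poLCA()` is for.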

Package authors

Drew A. Linzer

Jeffrey Lewis

Installation

To install the package directly through R, type

install.packages("poLCA", dependencies = TRUE)

and select a CRAN mirror. Once the installation is complete, enter

library(poLCA)

to load the package into memory for use.

poLCA is distributed through the Comprehensive R Archive Network, CRAN. The compiled package source and MacOS and Windows binary files can be downloaded from https://cran.r-project.org/web/packages/poLCA.

The poLCA package appears in CRAN Task Views for Cluster Analysis & Finite Mixture Models, and Psychometric Models and Methods. poLCA is provided free of charge, subject to version 2 of the GPL or any later version.

Documentation

A user's manual is available for download (PDF). The package is also documented internally upon installation. For help in R, type

?poLCA

Citation

Users of poLCA are requested to cite the software package as:

Linzer, Drew A. and Jeffrey Lewis. 2022. "poLCA: Polytomous Variable Latent Class Analysis." R package version 1.6. https://dlinzer.github.com/poLCA.

and

Linzer, Drew A. and Jeffrey Lewis. 2011. "poLCA: an R Package for Polytomous Variable Latent Class Analysis." Journal of Statistical Software. 42(10): 1-29. https://www.jstatsoft.org/v42/i10

Contact

Please direct all inquiries, comments, and reports of bugs to [email protected].

polca's Issues

add caic

Add another information criterion, the consistent AIC (CAIC):

CAIC = -2*log(L) + p*(log(n) + 1)

Source: Bozdogan, H. Psychometrika (1987) 52: 345. https://doi.org/10.1007/BF02294361
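Until such an option exists, the formula above can be applied to a fitted model by hand. The sketch below assumes that `n` is the number of observations used in estimation and that the fit object exposes `llik`, `npar` and `N`; the helper name `caic` is hypothetical. The example values are the log-likelihood, parameter count and sample size printed in the two-class fit quoted in the `poLCA.entropy()` issue on this page.

```r
# CAIC from a fitted model; `fit` is any list with llik, npar and N
# (assumed here to mirror the corresponding elements of a poLCA fit).
caic <- function(fit) {
  -2 * fit$llik + fit$npar * (log(fit$N) + 1)
}

fit <- list(llik = -244.5202, npar = 11, N = 100)
caic(fit)
```

Since BIC is `-2*log(L) + p*log(n)`, the result is simply BIC plus the number of parameters.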

Add local independence tools

Is it possible to add statistics for checking the local independence assumption? The most commonly used are bivariate residuals, modification indices, and expected parameter change.
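As a rough illustration of the first of these, the sketch below computes a bivariate residual for one pair of items as the Pearson chi-square between the observed two-way table and the table implied by the model, divided by `(K1 - 1) * (K2 - 1)`. This is one common convention, and software packages differ in the details. The function name and arguments are hypothetical, and complete data coded from 1 upward is assumed.

```r
# Bivariate residual for a pair of items (illustrative sketch).
# y1, y2: integer responses; P: class shares;
# probs1, probs2: class-by-category response probability matrices.
bivariate_residual <- function(y1, y2, P, probs1, probs2) {
  N <- length(y1)
  observed <- unclass(table(factor(y1, levels = seq_len(ncol(probs1))),
                            factor(y2, levels = seq_len(ncol(probs2)))))
  # Model-implied counts: N * sum_r P_r * p1_r(k) * p2_r(l)
  expected <- N * t(probs1) %*% diag(P, nrow = length(P)) %*% probs2
  pearson <- sum((observed - expected)^2 / expected)
  pearson / ((ncol(probs1) - 1) * (ncol(probs2) - 1))
}

# One-class model with uniform binary items; responses perfectly dependent
p_uniform <- matrix(0.5, nrow = 1, ncol = 2)
bvr <- bivariate_residual(c(1, 1, 2, 2), c(1, 1, 2, 2), 1, p_uniform, p_uniform)
```

With a real fit, the inputs would come from the class shares and item response probabilities of the fitted model (for example `lc$P` and two elements of `lc$probs`). Large values flag item pairs whose association the latent classes do not explain.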

Covariance matrix for basic model parameters?

Hi,

Is it possible to extract the full covariance matrices VAR(\pi) and VAR(p), instead of just their diagonal entries in probs.se, please? Thanks!

about multiple group LCA

I want to fit a multi-group LCA with poLCA. However, I am new to the syntax and cannot figure out how to do it. I am working from the paper by Lanza 2013 (https://www.ncbi.nlm.nih.gov/pubmed/21318625).

In this paper, the authors propose two methods for investigating the effect of a treatment on an outcome in subgroups identified by LCA: 1) a classify-analyze approach and 2) a model-based approach. I want to implement both approaches in R and wrote the following code:
library(poLCA)
set.seed(8)
# class-conditional response probabilities (rows = classes) for items Y1..Y5
probs <- list(matrix(c(0.6,0.2,0.2, 0.6,0.3,0.1, 0.3,0.1,0.6 ),ncol=3,byrow=TRUE), # Y1
matrix(c(0.2,0.8, 0.7,0.3, 0.3,0.7 ),ncol=2,byrow=TRUE), # Y2
matrix(c(0.3,0.6,0.1, 0.1,0.3,0.6, 0.3,0.6,0.1 ),ncol=3,byrow=TRUE), # Y3
matrix(c(0.1,0.1,0.5,0.3, 0.5,0.3,0.1,0.1, 0.3,0.1,0.1,0.5),ncol=4,byrow=TRUE), # Y4
matrix(c(0.1,0.2,0.7, 0.1,0.8,0.1, 0.8,0.1,0.1 ),ncol=3,byrow=TRUE)) # Y5
simdat <- poLCA.simdata(N=1000,probs,P=c(0.2,0.3,0.5))
trt <- as.factor(sample(c("trt","ctrl"),replace=TRUE,size=1000))
# outcome depends on treatment, true class, and their interaction
z <- 1 - as.numeric(trt) - 2*simdat$trueclass + 0.5*as.numeric(trt)*simdat$trueclass
pr <- 1/(1+exp(-z))
outcome <- rbinom(1000,1,pr)
dat <- data.frame(simdat$dat,trt=trt,outcome=outcome)
# classify-analyze approach
f1 <- cbind(Y1,Y2,Y3,Y4,Y5)~1
lc1 <- poLCA(f1,simdat$dat,nclass=3,nrep=5)
mod <- glm(outcome~trt*as.factor(lc1$predclass),
family="binomial")
summary(mod)
# model-based approach
f2 <- cbind(Y1,Y2,Y3,Y4,Y5)~outcome
dat.trt <- dat[dat$trt=="trt",]
dat.ctrl <- dat[dat$trt=="ctrl",]
lc2.trt <- poLCA(f2,dat.trt,nclass=3,nrep=5)
lc2.ctrl <- poLCA(f2,dat.ctrl,nclass=3,nrep=5)
table(lc2.trt$predclass,dat.trt$outcome)
# elements 4:6 of the 3 x 2 row-proportion table are the outcome = 1 column
prop <- rbind(ctrl=prop.table(table(lc2.ctrl$predclass,dat.ctrl$outcome),1)[4:6],
trt=prop.table(table(lc2.trt$predclass,dat.trt$outcome),1)[4:6])
colnames(prop) <- c("class 1","class 2","class 3")
barplot(prop,beside=TRUE,
legend.text=c("ctrl","trt"))

However, it appears that the model-based approach is wrong. How can I implement the model-based approach with poLCA?

add sample-size adjusted bic

Add another information criterion, the sample-size adjusted BIC (SABIC):

n*=(n+2)/24
SABIC = -2*log(L)+p*log(n*)

Source: Sclove SL. Application of model-selection criteria to some problems in multivariate analysis. Psychometrika. 1987;52(3):333–343.
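The two-line definition above translates directly into R. This is a sketch under the assumption that `n` is the number of observations used in estimation and that `llik`, `npar` and `N` are available from the fit; the helper name `sabic` is hypothetical. The example values are those printed for the two-class fit in the `poLCA.entropy()` issue on this page.

```r
# Sample-size adjusted BIC; `fit` is any list with llik, npar and N
# (assumed here to mirror the corresponding elements of a poLCA fit).
sabic <- function(fit) {
  n_star <- (fit$N + 2) / 24
  -2 * fit$llik + fit$npar * log(n_star)
}

fit <- list(llik = -244.5202, npar = 11, N = 100)
sabic(fit)
```

Because `log((n + 2) / 24)` is smaller than `log(n)`, the penalty per parameter is lighter than under the ordinary BIC.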

`poLCA.entropy()` crashes when observable variables have different number of categories

When a LCA model is fit to variables with different number of categories, poLCA.entropy() crashes when trying to create a data frame of response probabilities:

library(poLCA)
#> Loading required package: scatterplot3d
#> Loading required package: MASS

set.seed(786154)

toy_data <- data.frame(
  i1 = sample.int(4, size = 100, replace = TRUE),
  i2 = sample.int(3, size = 100, replace = TRUE)
)

lca_fit <- poLCA(
  cbind(i1, i2) ~ 1,
  data = toy_data
)
#> Conditional item response (column) probabilities,
#>  by outcome variable, for each class (row) 
#>  
#> $i1
#>            Pr(1)  Pr(2)  Pr(3)  Pr(4)
#> class 1:  0.2458 0.0649 0.3840 0.3053
#> class 2:  0.2891 0.4139 0.0191 0.2779
#> 
#> $i2
#>            Pr(1)  Pr(2)  Pr(3)
#> class 1:  0.3984 0.1454 0.4562
#> class 2:  0.3119 0.4934 0.1947
#> 
#> Estimated class population shares 
#>  0.4409 0.5591 
#>  
#> Predicted class memberships (by modal posterior prob.) 
#>  0.44 0.56 
#>  
#> ========================================================= 
#> Fit for 2 latent classes: 
#> ========================================================= 
#> number of observations: 100 
#> number of estimated parameters: 11 
#> residual degrees of freedom: 0 
#> maximum log-likelihood: -244.5202 
#>  
#> AIC(2): 511.0404
#> BIC(2): 539.6973
#> G^2(2): 0.7217094 (Likelihood ratio/deviance statistic) 
#> X^2(2): 0.7257384 (Chi-square goodness of fit) 
#> 

poLCA.entropy(lca_fit)
#> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 4, 3

Created on 2021-03-22 by the reprex package (v1.0.0)

Session info

sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  Spanish_Spain.1252          
#>  ctype    Spanish_Spain.1252          
#>  tz       Europe/Paris                
#>  date     2021-03-22                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package       * version date       lib source        
#>  assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.0.3)
#>  cli             2.3.1   2021-02-23 [1] CRAN (R 4.0.4)
#>  digest          0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
#>  evaluate        0.14    2019-05-28 [1] CRAN (R 4.0.3)
#>  fs              1.5.0   2020-07-31 [1] CRAN (R 4.0.3)
#>  glue            1.4.2   2020-08-27 [1] CRAN (R 4.0.3)
#>  highr           0.8     2019-03-20 [1] CRAN (R 4.0.3)
#>  htmltools       0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#>  knitr           1.31    2021-01-27 [1] CRAN (R 4.0.3)
#>  magrittr        2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
#>  MASS          * 7.3-53  2020-09-09 [2] CRAN (R 4.0.4)
#>  poLCA         * 1.4.1   2014-01-10 [1] CRAN (R 4.0.4)
#>  ps              1.6.0   2021-02-28 [1] CRAN (R 4.0.4)
#>  reprex          1.0.0   2021-01-27 [1] CRAN (R 4.0.3)
#>  rlang           0.4.10  2020-12-30 [1] CRAN (R 4.0.3)
#>  rmarkdown       2.7     2021-02-19 [1] CRAN (R 4.0.4)
#>  rstudioapi      0.13    2020-11-12 [1] CRAN (R 4.0.3)
#>  scatterplot3d * 0.3-41  2018-03-14 [1] CRAN (R 4.0.3)
#>  sessioninfo     1.1.1   2018-11-05 [1] CRAN (R 4.0.3)
#>  stringi         1.5.3   2020-09-09 [1] CRAN (R 4.0.3)
#>  stringr         1.4.0   2019-02-10 [1] CRAN (R 4.0.3)
#>  withr           2.4.1   2021-01-26 [1] CRAN (R 4.0.3)
#>  xfun            0.21    2021-02-10 [1] CRAN (R 4.0.3)
#>  yaml            2.2.1   2020-02-01 [1] CRAN (R 4.0.3)
#> 
#> [1] C:/Users/Mori.P16/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.4/library
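The error message suggests that the per-item category indices are being combined into a data frame, which fails when the items have different numbers of categories. As a possible workaround, the entropy of the fitted cross-classification table can be computed directly from the estimated class shares and response probabilities. The sketch below is not the package's code: the function name `lca_entropy` is hypothetical, and it assumes the entropy of interest is `-sum(p * log(p))` over all cells of the model-implied table.

```r
# Entropy of the model-implied cross-classification table (illustrative sketch).
# probs: list of class-by-category matrices; P: vector of class shares.
lca_entropy <- function(probs, P) {
  K.j <- sapply(probs, ncol)
  # lapply keeps one index vector per item, even when lengths differ
  cells <- expand.grid(lapply(K.j, seq_len))
  cell_prob <- rep(0, nrow(cells))
  for (r in seq_along(P)) {
    pr <- rep(P[r], nrow(cells))
    for (j in seq_along(probs)) {
      pr <- pr * probs[[j]][r, cells[[j]]]
    }
    cell_prob <- cell_prob + pr
  }
  -sum(cell_prob * log(cell_prob), na.rm = TRUE)
}

# Rounded estimates from the reprex above
probs <- list(
  i1 = matrix(c(0.2458, 0.0649, 0.3840, 0.3053,
                0.2891, 0.4139, 0.0191, 0.2779), nrow = 2, byrow = TRUE),
  i2 = matrix(c(0.3984, 0.1454, 0.4562,
                0.3119, 0.4934, 0.1947), nrow = 2, byrow = TRUE)
)
entropy <- lca_entropy(probs, P = c(0.4409, 0.5591))
```

With a fitted model, the same call would be `lca_entropy(lca_fit$probs, lca_fit$P)`.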

add p-values for Chisq and Gsq

Calculate the p-values for Chisq and Gsq like:

C <- max(K.j) # number of categories
I <- J # number of items
df <- C^I - ret$npar - 1 # Degrees of freedom
Chisq.pvalue <- 1-pchisq(ret$Chisq,df)
Gsq.pvalue <- 1-pchisq(ret$Gsq,df)
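The snippet above relies on variables internal to `poLCA()` (`K.j`, `J`, `ret`), and `C^I` counts the cells of the table correctly only when every item has the same number of categories. A self-contained variant, offered as a sketch, takes the number of cells as the product of the category counts. The function name `gof_pvalues` is hypothetical, and the fit is assumed to expose `probs`, `npar`, `Chisq` and `Gsq`.

```r
# p-values for the Pearson and likelihood-ratio fit statistics (illustrative sketch).
gof_pvalues <- function(fit) {
  K.j <- sapply(fit$probs, ncol)            # categories per item
  df <- prod(K.j) - fit$npar - 1            # cells minus parameters minus one
  if (df <= 0) {
    return(c(df = df, Chisq.pvalue = NA_real_, Gsq.pvalue = NA_real_))
  }
  c(df = df,
    Chisq.pvalue = pchisq(fit$Chisq, df, lower.tail = FALSE),
    Gsq.pvalue = pchisq(fit$Gsq, df, lower.tail = FALSE))
}

# Hypothetical one-class fit to items with 4 and 3 categories
mock_fit <- list(probs = list(matrix(0.25, 1, 4), matrix(1 / 3, 1, 3)),
                 npar = 5, Chisq = 3, Gsq = 3.2)
res <- gof_pvalues(mock_fit)
```

The chi-square approximation is unreliable when the table is sparse, so these p-values are best treated as a rough guide for models with many items.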

Error in ginv(info) : could not find function "ginv"

The call to ginv() needs to be qualified as MASS::ginv().
Otherwise you get the error

Error in ginv(info) : could not find function "ginv"

if you run poLCA::poLCA() (without attaching via library(poLCA))
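For illustration, a namespace-qualified call works whether or not MASS is attached, as long as it is installed. The matrix below is a hypothetical stand-in for an information matrix and is deliberately singular, which is the case where a generalized inverse is needed.

```r
# Namespace-qualified call: no library(MASS) required
info <- matrix(c(2, 2, 2, 2), nrow = 2)   # hypothetical singular matrix
V <- MASS::ginv(info)

# Moore-Penrose property: info %*% V %*% info reproduces info
all.equal(info %*% V %*% info, info)
```

Until the call is qualified inside the package, attaching MASS beforehand with `library(MASS)` should also make `ginv()` visible on the search path.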

add Cressie-Read statistic

Add the Cressie-Read statistic as another goodness-of-fit statistic.
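As a sketch of what this could look like, the power-divergence statistic is `2 / (lambda * (lambda + 1)) * sum(O * ((O / E)^lambda - 1))`, with `lambda = 2/3` being the value recommended by Cressie and Read. The function name is hypothetical. The formula in this form assumes that observed and expected counts have the same total over the cells supplied and that all expected counts are positive.

```r
# Cressie-Read power-divergence statistic (illustrative sketch).
# observed, expected: cell counts of equal length; lambda: power parameter.
cressie_read <- function(observed, expected, lambda = 2 / 3) {
  stopifnot(length(observed) == length(expected),
            all(expected > 0),
            lambda != 0, lambda != -1)
  2 / (lambda * (lambda + 1)) *
    sum(observed * ((observed / expected)^lambda - 1))
}

obs <- c(10, 20, 30)
exp_counts <- c(20, 20, 20)
cr <- cressie_read(obs, exp_counts)
```

Setting `lambda = 1` reproduces the Pearson chi-square statistic, which gives a quick sanity check. With a fitted model, the observed and expected counts would presumably come from the table of observed versus predicted cell counts in the fit object.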

add aic3

New simulation studies have shown that AIC_3 is a good information criterion for the model. Therefore I would recommend to output this index.

AIC_3 = -2lnL + 3n_p

See: Bacher, J., Pöge, A., & Wenzig, K. (2010). Clusteranalyse. Anwendungsorientierte Einführung in Klassifikationsverfahren. München: Oldenbourg.
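In the meantime the index is easy to derive from existing output, since it differs from AIC only in the penalty weight: AIC_3 equals AIC plus the number of parameters. The sketch below assumes the fit exposes `llik`, `npar` and `aic`; the helper name `aic3` is hypothetical, and the values are those printed for the two-class fit in the `poLCA.entropy()` issue on this page.

```r
# AIC with penalty factor 3; `fit` is any list with llik and npar
aic3 <- function(fit) {
  -2 * fit$llik + 3 * fit$npar
}

fit <- list(llik = -244.5202, npar = 11, aic = 511.0404)
aic3(fit)

# Equivalent shortcut from the reported AIC
fit$aic + fit$npar
```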

bad starting values when nrep>1

I came across an issue, and wrote a small hack to circumvent it that seems to work--here goes:

When running a model with nrep>1, I'll sometimes get output that looks like this:

...
Model 78: llik = -22780 ... best llik = -19583
Model 79: llik = -21839 ... best llik = -19583
Error in poLCA:::poLCA.ylik.C(vp, y) :
NA/NaN/Inf in foreign function call (arg 1)

I'm not entirely sure what the error is, but (since the model fit the first 79 times) I assume it has to do with bad randomly-chosen starting values. In this situation, I'd rather the software just discard the bad result and move on, rather than crash entirely. By tweaking the poLCA code as follows, it does just that.
First, a little function (I bet there's a more elegant way of doing this):

tryNA <- function(x){
    x <- try(x)
    if(inherits(x,'try-error')) return(NA)
    x
}

Then replace the line

llik[iter] <- sum(log(rowSums(prior * poLCA.ylik.C(vp,
                    y))))

with

llik[iter] <- tryNA(sum(log(rowSums(prior *poLCA.ylik.C(vp,
                      y)))))

and now everything seems to work just fine.
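The same idea can be written with `tryCatch()`, which avoids printing the error message and makes the fallback value explicit. The sketch below is self-contained and does not touch poLCA: `fit_once` is a hypothetical stand-in for one estimation run that fails for some starting values.

```r
# Return NA instead of stopping when an expression fails
try_na <- function(expr) {
  tryCatch(expr, error = function(e) NA_real_)
}

# Hypothetical stand-in for a single estimation run
fit_once <- function(start) {
  if (start < 0.2) stop("NA/NaN/Inf in foreign function call (arg 1)")
  -20000 + 1000 * start
}

starts <- c(0.1, 0.5, 0.9, 0.15)
lliks <- vapply(starts, function(s) try_na(fit_once(s)), numeric(1))

# Keep the best log-likelihood among the runs that succeeded
best <- max(lliks, na.rm = TRUE)
```

Silently discarding failed runs hides how often they occur, so it may be worth counting the `NA` results and reporting them alongside the best fit.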

Add complex survey correction to estimation

Would it be legitimate to use survey::svymle to maximize the likelihood at each iteration of the optimization algorithm, in order to apply complex survey adjustments (weights, clustering, stratification) to the results?

Entropy calculation error (vector memory exhausted)

Hello,

I'm using poLCA in my research at the moment & have had some issues using the poLCA.entropy() function. Specifically, when I attempt to run my entire model through this function, vector memory is exhausted.

I have tried:

Allocating more memory to R (does not work)
Running a model with ~25% of the original variables (does work)
Running a model with ~50% of the original variables (does work)
Running a model with ~75% of the original variables (does work)

I am working with data from roughly 1,900 participants across 15 variables. I cease being able to calculate entropy once 12 variables are included in the model.

Is this a known issue? Is there a maximum amount of data which can be input to poLCA for entropy to be calculated?

I am running R version 4.0.3 in R Studio version 1.4.1103 on macOS Monterey (12.2.1).

Thank you!
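One plausible explanation is that the entropy is computed over every cell of the full cross-classification table, whose size is the product of the category counts and therefore grows exponentially with the number of variables. The arithmetic below is a sketch under assumptions: the report does not state the number of categories, so five per variable is hypothetical, and the memory figure counts only a grid of 4-byte integer indices.

```r
# Number of cells in the full cross-classification table
cells_needed <- function(K.j) prod(as.numeric(K.j))

# Approximate size in GB of an integer index grid with one column per variable
grid_gb <- function(K.j) cells_needed(K.j) * length(K.j) * 4 / 1024^3

K12 <- rep(5, 12)   # hypothetical: 12 variables, 5 categories each
K15 <- rep(5, 15)   # hypothetical: 15 variables, 5 categories each

cells_needed(K12)   # 244140625 cells
grid_gb(K12)        # about 11 GB for the index grid alone
cells_needed(K15)   # about 3.05e10 cells
```

Under these assumptions the limit is the number of distinct response patterns rather than the number of participants, which would explain why allocating more memory does not help.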