ktw5691 / psychtm


Text Mining for Psychological Research

License: Other

R 59.18% C++ 37.01% C 3.81%

psychtm's Issues

Write function to compute log-likelihood of sLDA model

Fit logistic regression (no topic model component) for use in model comparison

Write functions to compute posterior mean/median estimates of theta and beta matrices since upcoming update will no longer return chains for them

Have models return predictive posterior likelihood and compute WAIC and SE(WAIC_1 - WAIC_0) separately

Comparing models with/without topics

How can models fitted with/without topics be compared?

For instance, in the following code:

with_category = gibbs_sldax(log(HoursActual) ~ log(HoursEstimate)+Category+ProjectCode, data = cl_info,
                         m = 450, burn = 300,
                         docs = docs_vocab$documents,
                         V = vocab_len,
                         K = 10, model = "sldax", display_progress = TRUE)

What would be the impact on goodness of fit of fitting only:
log(HoursActual) ~ log(HoursEstimate)+Category+ProjectCode?

That is, how can the impact of topics be separated out from the other covariates?
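One possible way to approach this comparison is to fit the covariates-only model separately and compare the two fits on a pointwise criterion such as WAIC. The following is a minimal sketch, assuming per-document WAIC contributions are available for both models; `compare_waic` and the toy vectors are hypothetical and not part of psychtm.

```r
# Hypothetical helper (not part of psychtm): compare two models from their
# per-document WAIC contributions.
compare_waic <- function(waic_i_1, waic_i_0) {
  stopifnot(length(waic_i_1) == length(waic_i_0))
  d <- waic_i_1 - waic_i_0            # pointwise differences
  n <- length(d)
  c(diff = sum(d),                    # WAIC_1 - WAIC_0
    se   = sqrt(n * var(d)))          # standard error of the difference
}

# Toy pointwise values, for illustration only
waic_with_topics <- c(2.1, 1.8, 2.5, 1.9)
waic_covs_only   <- c(2.4, 2.0, 2.4, 2.3)
res <- compare_waic(waic_with_topics, waic_covs_only)
res
```

A negative `diff` that is large relative to `se` would favour the model with topics, under the usual convention that lower WAIC is better.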

Return topics and/or topic empirical frequencies for LDA/sLDA

AIC in summary output

A quality of fit metric, such as AIC, would be useful output from summary, assuming that upstream functions generate the necessary information.

Make missing data check specific to variables used in SLDA or SLDAX models

Currently, gibbs_sldax() checks for missing data in ANY variable in the data frame supplied to the data argument. It will stop and warn the user about missing values even if the variables being used for modeling have no missing data.
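A narrower check could look only at the variables named in the model formula. This is a minimal sketch, assuming the formula and data frame are both available at the point of the check; `check_model_na` is a hypothetical helper, not an existing psychtm function.

```r
# Hypothetical helper: check for NAs only in the variables named in the formula.
check_model_na <- function(formula, data) {
  vars <- intersect(all.vars(formula), names(data))
  has_na <- vapply(data[vars], anyNA, logical(1))
  if (any(has_na)) {
    stop("Missing values in model variable(s): ",
         paste(vars[has_na], collapse = ", "))
  }
  invisible(TRUE)
}

toy <- data.frame(y = c(1, 2, 3), x = c(0.5, 1.5, 2.5), unused = c(NA, 1, 2))
check_model_na(log(y) ~ x, toy)   # passes: 'unused' is not in the formula
```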

waic_all apparently unnecessary parameter

Being clueless, I typed in the code from the help page, i.e., iter=5, and it worked, or rather it produced numbers without any complaint.

Then I noticed that the output from summary contained a chain value (for my code, 1), so I tried that. It produced different output.

Then I noticed that in the example, the call to gibbs_mlr included the argument m=5. My call to gibbs_sldax has m=450, and waic_all produces different output when given 450.

The second parameter of waic_all, l_pred, is an 'iter' x D matrix.

The value of iter can be obtained from the number of rows of l_pred. Why is the first parameter, iter, needed?

The help is not helpful with regard to legitimate values of iter. Not giving one does produce an error message.
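To illustrate the point about `iter`, here is a minimal sketch of a WAIC computation that infers the number of draws from the matrix itself. It assumes `l_pred` holds pointwise predictive likelihoods on the natural (not log) scale; `waic_from_lpred` is a hypothetical function and not psychtm's `waic_all`.

```r
# Hypothetical helper (not psychtm's waic_all): WAIC from an iter x D matrix
# of pointwise predictive likelihoods, with iter inferred from the matrix.
waic_from_lpred <- function(l_pred) {
  iter <- nrow(l_pred)                          # no separate iter argument
  stopifnot(iter >= 2)                          # a variance needs >= 2 draws
  lppd_i   <- log(colMeans(l_pred))             # log pointwise predictive density
  p_waic_i <- apply(log(l_pred), 2, var)        # pointwise effective parameters
  waic_i   <- -2 * (lppd_i - p_waic_i)
  c(waic = sum(waic_i), p_waic = sum(p_waic_i))
}

set.seed(1)
l_pred <- matrix(runif(5 * 3, min = 0.1, max = 0.9), nrow = 5)  # toy values
out <- waic_from_lpred(l_pred)
out
```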

Add option to write chain to file to reduce memory requirements

bigmemory R package?

Error in `gibbs_logistic()`

Bad initialization of regression coefficients can start the algorithm with infinite values from which it never recovers.
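The issue does not say where the infinite values arise. One plausible source is the Bernoulli log-likelihood evaluated at extreme starting coefficients, so the sketch below contrasts a naive evaluation with one computed on the log scale. All function names here are hypothetical and are not taken from `gibbs_logistic()`.

```r
# log(1 / (1 + exp(-eta))) computed without overflow
log_sigmoid <- function(eta) -(pmax(-eta, 0) + log1p(exp(-abs(eta))))

# Bernoulli log-likelihood on the log scale (stable for extreme eta)
loglik_stable <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  sum(y * log_sigmoid(eta) + (1 - y) * log_sigmoid(-eta))
}

# Naive version: probabilities round to exactly 0 or 1 for extreme eta
loglik_naive <- function(beta, X, y) {
  p <- plogis(drop(X %*% beta))
  sum(y * log(p) + (1 - y) * log(1 - p))
}

X <- cbind(1, c(-2, 0, 3))
y <- c(1, 0, 0)
ll_zero   <- loglik_stable(c(0, 0), X, y)       # zero start: 3 * log(0.5)
ll_stable <- loglik_stable(c(500, 500), X, y)   # extreme start, still finite
ll_naive  <- loglik_naive(c(500, 500), X, y)    # extreme start, not finite
```

Starting the coefficients at zero, or evaluating the likelihood on the log scale, are two ways such a sampler might avoid an unrecoverable start.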

Documentation fix: est_beta() documentation is incorrect

Print new line between iteration output for est_theta()

Update the prep_docs function to remove stop words and do stemming

Having to do stop-word removal and stemming is a hassle.

It would be helpful to do this automatically (by default), as the textProcessor function in the stm package does.

Add support for 1-topic model?

Release psychtm 2020.1

First release:

usethis::use_cran_comments()
Update (aspirational) install instructions in README
Proofread Title: and Description:
Check that all exported functions have @return and @examples
Check that Authors@R: includes a copyright holder (role 'cph')
Check licensing of included files
Review https://github.com/DavisVaughan/extrachecks

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Write vignettes

Provide vignette to illustrate package use.

Implement label switching correction

Need to use one of topic assignment samples, topic proportion samples, or vocabulary samples.
Keep the permutation indicators and use them to realign all relevant model parameters (regression coefficients, topic proportions, topic-vocabulary distributions).
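The realignment step could be sketched as follows, assuming a permutation for the draw has already been chosen by whatever matching rule is adopted (that choice is not shown here). `realign_draw` and its argument names are hypothetical.

```r
# Hypothetical helper: realign one posterior draw given a permutation 'perm',
# where perm[k] is the sampled topic that corresponds to reference topic k.
realign_draw <- function(theta, beta, eta_topics, perm) {
  list(theta = theta[, perm, drop = FALSE],   # D x K topic proportions
       beta  = beta[perm, , drop = FALSE],    # K x V topic-vocabulary distributions
       eta   = eta_topics[perm])              # K topic regression coefficients
}

theta <- matrix(c(0.7, 0.3,
                  0.2, 0.8), nrow = 2, byrow = TRUE)
beta  <- matrix(c(0.9, 0.1,
                  0.4, 0.6), nrow = 2, byrow = TRUE)
eta   <- c(1.5, -0.5)
out <- realign_draw(theta, beta, eta, perm = c(2, 1))  # swap the two topics
```

Applying the same `perm` to every topic-indexed parameter keeps the regression coefficients attached to the topics they were estimated for.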

Write function for gibbs_sldax() to accept a data.frame for x

Write model summary functions for LDA/sLDA

Break up Gibbs sampling functions into subfunctions

Package checking using the goodpractice R package (which uses the cyclocomp package) flagged the gibbs_*() functions as having high cyclomatic complexity. This could be reduced by writing more helper functions to take over tasks currently done inside the gibbs_*() functions.

Write validator functions for S4 classes

See https://adv-r.hadley.nz/s4.html#s4-classes Section 15.3.5
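For reference, a validator written with `setValidity()` might look like the sketch below. The class `TopicFit` and its slots are hypothetical stand-ins; psychtm's actual class definitions may differ.

```r
library(methods)

# Hypothetical class for illustration only.
setClass("TopicFit", slots = c(ntopics = "numeric", theta = "matrix"))

setValidity("TopicFit", function(object) {
  msgs <- character(0)
  if (length(object@ntopics) != 1 || object@ntopics < 1) {
    msgs <- c(msgs, "'ntopics' must be a single positive number")
  } else if (ncol(object@theta) != object@ntopics) {
    msgs <- c(msgs, "'theta' must have one column per topic")
  }
  if (length(msgs) > 0) msgs else TRUE   # TRUE means the object is valid
})

fit <- new("TopicFit", ntopics = 2, theta = matrix(0.5, nrow = 3, ncol = 2))
validObject(fit)
```

With the validator registered, `new()` raises an error as soon as the slots are inconsistent, rather than leaving the mismatch to surface later in a summary or plotting function.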

gg_coef() does not correctly add variable names

Possible bug: returned chain length does not respect desired behavior of m - burnin if burnin or m are large

Meaningless output about eta and zbar(d1) when fitting LDA

With verbose = TRUE and model = "lda", output:
1 eta: 0
500 zbar d1: 0 6.9351e-310 nan
1000 zbar d1: 0 6.9351e-310 nan
...

Write LDA sampler in Rcpp

Should be able to adapt code for sLDA sampler

Add default of 2.38 for proposal_sd arg to gibbs_sldax_logit()

is.na check is too broad

The check in prep_docs: if (any(is.na(data))) stop("Cannot handle missing values in 'data'") is too broad. Only the columns specified in col need to be checked.

Having to remove NA from columns that are not referenced by the called psychtm function is a pain.

Error using get_toptopics(): "No slot `theta` for class Sldax"

Add Piepel (1986) contrasts to `post_regression()`

Currently, the default contrasts computed in post_regression use equal weighting for all reference topics being tested against for each topic effect. It would be nice to implement Piepel-type (1986) contrasts to take into account possible range restriction in the empirical topic proportions as well as their sampling variability.

# Example usage
post_regression(sldax_fit, ctype = c("equal", "piepel"))

Expand unit tests

Expand unit tests: next target is 25%.

See tinytest for C++ code unit tests.

Compute DIC for model comparison

Automatically remove rows containing too few words

The message:
"Each document (row) in 'docs' must contain
at least 2 positive integers"
is not helpful. Why not say: "The text must contain at least two words."

Even better, why not automatically remove rows that don't contain some minimum number of words?

It took me 30 minutes to figure out which columns I needed to remove in the original data. For code+data see: https://shape-of-code.com/2022/01/23/including-natural-language-text-topics-in-a-regression-model/
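A filtering step along these lines could report and drop the offending rows. The sketch assumes `docs` is a D x max_n integer matrix padded with zeros, which is consistent with the error message but is still an assumption; `drop_short_docs` is a hypothetical helper.

```r
# Hypothetical helper: drop documents with fewer than 'min_words' word indices.
# Assumes 'docs' is a D x max_n integer matrix padded with 0s.
drop_short_docs <- function(docs, data, min_words = 2) {
  n_words <- rowSums(docs > 0)
  keep <- n_words >= min_words
  if (any(!keep)) {
    message("Dropping ", sum(!keep), " document(s) with fewer than ",
            min_words, " words: row(s) ", paste(which(!keep), collapse = ", "))
  }
  list(docs = docs[keep, , drop = FALSE],
       data = data[keep, , drop = FALSE],   # keep covariates aligned with docs
       dropped = which(!keep))
}

docs <- rbind(c(3, 1, 4), c(2, 0, 0), c(5, 9, 0))
dat  <- data.frame(y = c(1.2, 0.7, 2.3))
out  <- drop_short_docs(docs, dat)
```

Returning the dropped row indices lets the caller see exactly which original records were removed.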

Write extractor functions for slots of S4 objects

Suggested by @alexbrodersen
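An extractor is typically a generic plus a method, so that users never need `@` directly. A minimal sketch, using a hypothetical class `ToyFit` and a hypothetical generic `ntopics()` rather than psychtm's real classes:

```r
library(methods)

# Hypothetical class and generic for illustration only.
setClass("ToyFit", slots = c(ntopics = "numeric"))

setGeneric("ntopics", function(object) standardGeneric("ntopics"))
setMethod("ntopics", "ToyFit", function(object) object@ntopics)

fit <- new("ToyFit", ntopics = 10)
ntopics(fit)
```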

Implement Bayesian $R^2$ computation

A Bayesian $R^2$ metric for supervised models with a continuous outcome (MLR, SLDA, SLDAX) would be helpful.

This is straightforward for MLR.
For SLDA and SLDAX, this requires computing empirical topic proportions (or keeping the topic assignments and then computing after sampling, which can be memory expensive).
- Alternatively, an approximate $R^2$ could be computed using $\hat{\Theta}$ instead.

# Example usage
bayes_r2(sldax_fit)
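For the MLR case, one common definition computes, for each posterior draw, the variance of the fitted values divided by that variance plus the residual variance. The sketch below assumes posterior draws of the coefficients and of the residual variance are available as a matrix and a vector; `bayes_r2_mlr` and the toy draws are hypothetical and not part of psychtm.

```r
# Hypothetical helper (not part of psychtm): Bayesian R^2 per posterior draw
# for a linear model, using var(fit) / (var(fit) + sigma2).
bayes_r2_mlr <- function(X, coef_draws, sigma2_draws) {
  fitted  <- X %*% t(coef_draws)        # n x S matrix of fitted values
  var_fit <- apply(fitted, 2, var)      # variance of fitted values per draw
  var_fit / (var_fit + sigma2_draws)
}

set.seed(42)
X <- cbind(1, rnorm(50))
coef_draws   <- cbind(rnorm(200, 1, 0.1), rnorm(200, 2, 0.1))  # toy draws, S x p
sigma2_draws <- rgamma(200, shape = 20, rate = 20)              # toy draws
r2 <- bayes_r2_mlr(X, coef_draws, sigma2_draws)
quantile(r2, c(0.025, 0.5, 0.975))
```

For SLDA and SLDAX the design matrix would also need the empirical topic proportions (or an estimate such as $\hat{\Theta}$), as described above.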

Consider reducing `Rcpp` load/compile overhead by switching to lighter Rcpp headers

Rcpp v.1.0.8 adds support for alternative degrees of full functionality from Rcpp:
- inst/include/Rcpp/Rcpp: Added as new entry point
- inst/include/Rcpp/Light: Added as lighter-weight entry point
- inst/include/Rcpp/Lighter: Idem
- inst/include/Rcpp/Lightest: Idem
See changed files for specific differences.

Integrate through GitHub Pages?
Or through personal site?

Add test for equality of number of rows in `docs` and `data` arguments to gibbs_sldax()

If they mismatch:

Error in .gibbs_sldax_cpp(docs, V, m, burn, thin, K, model, y, x, mu0,  : join_rows() / join_horiz(): number of rows must be the same
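A pre-check in R, run before the C++ sampler is called, could replace this low-level message with a readable one. This is a minimal sketch; `check_rows_match` is a hypothetical helper.

```r
# Hypothetical pre-check: fail early with a readable message.
check_rows_match <- function(docs, data) {
  if (nrow(docs) != nrow(data)) {
    stop("'docs' has ", nrow(docs), " rows but 'data' has ", nrow(data),
         " rows; they must describe the same documents.", call. = FALSE)
  }
  invisible(TRUE)
}

docs <- matrix(1L, nrow = 3, ncol = 4)
dat  <- data.frame(y = c(0.1, 0.2))
msg  <- tryCatch(check_rows_match(docs, dat), error = conditionMessage)
msg
```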

Memory efficiency: add option to just return an estimate of theta and beta, not the entire chain of topic draws (D x max_n x chain_length)

I think this would be doable. The posterior mean estimates should just be weighted sums that could be updated at each non-thinned iteration after burn-in.
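The update described above can be sketched as a running mean that needs only the current estimate and the count of retained draws. The names and the toy draws are hypothetical; in the package this update would presumably live in the C++ sampler.

```r
# Running posterior mean: update after each retained draw instead of storing
# the whole chain. 'draw' and 'running_mean' have the same dimensions.
update_mean <- function(running_mean, draw, n_kept) {
  running_mean + (draw - running_mean) / n_kept
}

set.seed(7)
draws <- replicate(100, matrix(runif(6), nrow = 2), simplify = FALSE)  # toy draws
theta_mean <- matrix(0, nrow = 2, ncol = 3)
for (s in seq_along(draws)) {
  theta_mean <- update_mean(theta_mean, draws[[s]], s)
}
```

Memory use is then fixed at the size of one draw, regardless of chain length. Posterior medians cannot be updated this way and would still need stored draws or an approximation.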
