psychtm's People

Contributors

ktw5691

psychtm's Issues

Comparing models with/without topics

How can models fitted with/without topics be compared?

For instance, in the following code:

with_category <- gibbs_sldax(log(HoursActual) ~ log(HoursEstimate) + Category + ProjectCode,
                             data = cl_info,
                             m = 450, burn = 300,
                             docs = docs_vocab$documents,
                             V = vocab_len,
                             K = 10, model = "sldax", display_progress = TRUE)

what would be the impact on goodness of fit of fitting just:
log(HoursActual) ~ log(HoursEstimate) + Category + ProjectCode?

That is, how can the impact of topics be separated out from the other covariates?
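
One possible way to separate out the contribution of topics is to fit both models and compare them on an information criterion such as WAIC. The sketch below is plain base R and does not call psychtm. It assumes each fit can supply a draws-by-documents matrix of pointwise log predictive densities; the helper names `waic_pointwise()` and `compare_waic()` and the simulated matrices are hypothetical stand-ins, not package output.

```r
# Pointwise WAIC contributions from an S x D matrix of log predictive
# densities (S posterior draws in rows, D observations in columns).
waic_pointwise <- function(l_pred) {
  stopifnot(is.matrix(l_pred), nrow(l_pred) >= 2)
  # Log of the posterior-mean predictive density, via log-sum-exp for stability
  col_max <- apply(l_pred, 2, max)
  lppd <- col_max + log(colMeans(exp(sweep(l_pred, 2, col_max))))
  # Effective number of parameters: variance of the log density over draws
  p_waic <- apply(l_pred, 2, var)
  -2 * (lppd - p_waic)
}

# Difference in WAIC (model a minus model b) with a standard error
# based on the paired pointwise differences.
compare_waic <- function(l_pred_a, l_pred_b) {
  stopifnot(ncol(l_pred_a) == ncol(l_pred_b))
  d <- waic_pointwise(l_pred_a) - waic_pointwise(l_pred_b)
  c(waic_diff = sum(d), se = sqrt(length(d) * var(d)))
}

# Simulated stand-ins for two fits (hypothetical data)
set.seed(1)
D <- 50; S <- 200
y <- rnorm(D)
mu_draws <- matrix(rnorm(S, 0, 0.1), S, D)   # one mean draw per row
y_mat <- matrix(y, S, D, byrow = TRUE)
ll_good <- dnorm(y_mat, mean = mu_draws, sd = 1, log = TRUE)
ll_bad  <- dnorm(y_mat, mean = mu_draws + 2, sd = 1, log = TRUE)
res <- compare_waic(ll_good, ll_bad)
print(res)
```

A negative `waic_diff` favours the first model; a difference that is small relative to `se` suggests the extra terms (here, the topics) add little predictive value.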

AIC in summary output

A quality of fit metric, such as AIC, would be useful output from summary, assuming that upstream functions generate the necessary information.

waic_all apparently unnecessary parameter

Being clueless, I typed in the code from the help page, i.e., iter=5, and it worked, or rather it produced numbers without any complaint.

Then I noticed that the output from summary contained a chain value (for my code, 1), so I tried that. It produced different output.

Then I noticed that in the example, the call to gibbs_mlr included the argument m=5. My call to gibbs_sldax has m=450, and waic_all produces different output when given 450.

The second parameter of waic_all, l_pred, is an 'iter' x D matrix.

The value of iter can be obtained from the number of rows of l_pred. Why is the first parameter, iter, needed?

The help page does not say which values of iter are legitimate. Omitting it does produce an error message.
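
A minimal sketch of the suggested interface, assuming only the argument order described above (`iter` first, then an iter x D matrix `l_pred`). The wrapper `with_iter_from_rows()` and the stand-in `dummy_waic()` are hypothetical; the stand-in is not the psychtm implementation.

```r
# Hypothetical wrapper: derive `iter` from the matrix instead of asking
# the caller to supply it separately.
with_iter_from_rows <- function(waic_fun) {
  function(l_pred) {
    stopifnot(is.matrix(l_pred), nrow(l_pred) >= 1)
    waic_fun(nrow(l_pred), l_pred)
  }
}

# Stand-in with the same (iter, l_pred) argument order, for illustration only
dummy_waic <- function(iter, l_pred) c(iter = iter, D = ncol(l_pred))

waic_auto <- with_iter_from_rows(dummy_waic)
out <- waic_auto(matrix(0, nrow = 450, ncol = 3))
print(out)
```

With such a wrapper the caller cannot pass a value of `iter` that disagrees with the matrix.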

Error in `gibbs_logistic()`

Bad initialization of regression coefficients can start the algorithm with infinite values from which it never recovers.
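
The issue does not say where the infinite values arise. As one illustration of this kind of failure, the sketch below assumes a naive logistic log-likelihood of the form `y * eta - log(1 + exp(eta))`, which overflows for extreme starting coefficients, and contrasts it with a version built on `plogis(..., log.p = TRUE)`. The function names and the toy data are hypothetical and are not taken from `gibbs_logistic()`.

```r
# Naive log-likelihood: exp(eta) overflows to Inf for large eta
loglik_naive <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  sum(y * eta - log(1 + exp(eta)))
}

# Stable version: log P(y = 1) = log plogis(eta), log P(y = 0) = log plogis(-eta)
loglik_stable <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  sum(plogis(ifelse(y == 1, eta, -eta), log.p = TRUE))
}

X <- cbind(1, c(-1, 1))     # intercept plus one covariate
y <- c(0, 1)
beta_extreme <- c(0, 1000)  # a badly scaled starting value
beta_zero <- c(0, 0)        # a conservative starting value

print(loglik_naive(beta_extreme, X, y))   # not finite
print(loglik_stable(beta_extreme, X, y))  # finite
print(loglik_stable(beta_zero, X, y))
```

Starting the coefficients at zero, or checking `is.finite()` on the initial log-likelihood before sampling, are two simple guards under these assumptions.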

Release psychtm 2020.1

First release:

Prepare for release:

  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Implement label switching correction

  • Detect label switching using one of: the topic assignment samples, the topic proportion samples, or the topic-vocabulary samples.
  • Keep the permutation indicators and use them to realign all relevant model parameters (regression coefficients, topic proportions, topic-vocabulary distributions).
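
A minimal sketch of the realignment step, assuming the topic-vocabulary matrix is used to detect switching and that a reference draw is available. It searches all permutations by brute force, which is only practical for a small number of topics; for larger K an assignment-problem solver would be needed. The names `all_perms()` and `align_topics()` and the toy matrices are hypothetical.

```r
# All permutations of 1..k as rows of a matrix (brute force, small k only)
all_perms <- function(k) {
  if (k == 1) return(matrix(1, 1, 1))
  prev <- all_perms(k - 1)
  out <- lapply(seq_len(k), function(pos) {
    # Insert k at position `pos` of every shorter permutation
    t(apply(prev, 1, function(p) append(p, k, after = pos - 1)))
  })
  do.call(rbind, out)
}

# Permutation p such that beta_draw[p, ] is closest to beta_ref
# (both are K x V topic-vocabulary matrices).
align_topics <- function(beta_ref, beta_draw) {
  perms <- all_perms(nrow(beta_ref))
  cost <- apply(perms, 1, function(p) {
    sum((beta_draw[p, , drop = FALSE] - beta_ref)^2)
  })
  perms[which.min(cost), ]
}

beta_ref <- rbind(c(0.7, 0.1, 0.1, 0.1),
                  c(0.1, 0.7, 0.1, 0.1),
                  c(0.1, 0.1, 0.7, 0.1))
beta_draw <- beta_ref[c(3, 1, 2), ]   # same topics, labels switched
eta_draw <- c(0.3, 0.1, 0.2)          # coefficients in the switched order

p <- align_topics(beta_ref, beta_draw)
eta_aligned <- eta_draw[p]            # reuse p for every topic-indexed parameter
print(p)
print(eta_aligned)
```

The same permutation `p` would be applied to the regression coefficients, the columns of the topic proportions, and the rows of the topic-vocabulary matrix for that draw.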

Break up Gibbs sampling functions into subfunctions

Package checking using the goodpractice R package (which uses the cyclocomp package) flagged the gibbs_*() functions as having high cyclomatic complexity. This could be reduced by writing more helper functions to take over tasks from the gibbs_*() functions.

is.na check is too broad

The check in prep_docs: if (any(is.na(data))) stop("Cannot handle missing values in 'data'") is too broad. Only the columns specified in col need to be checked.

Having to remove NA from columns that are not referenced by the called psychtm function is a pain.
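
A sketch of the narrower check, assuming `col` names (or indexes) the column or columns that are actually used. The helper name `check_no_missing()` and the example data frame are hypothetical.

```r
# Only the referenced columns are inspected for missing values
check_no_missing <- function(data, col) {
  has_na <- vapply(data[col], anyNA, logical(1))
  if (any(has_na)) {
    stop("Cannot handle missing values in column(s): ",
         paste(names(has_na)[has_na], collapse = ", "))
  }
  invisible(TRUE)
}

df <- data.frame(text = c("fix login bug", "update docs"),
                 other = c(NA, 1),
                 stringsAsFactors = FALSE)

ok <- check_no_missing(df, "text")   # passes: NA is only in an unused column
bad <- tryCatch(check_no_missing(df, "other"),
                error = function(e) conditionMessage(e))
print(bad)
```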

Add Piepel (1986) contrasts to `post_regression()`

Currently, the default contrasts computed in post_regression use equal weighting for all reference topics being tested against for each topic effect. It would be nice to implement Piepel-type (1986) contrasts to take into account possible range restriction in the empirical topic proportions as well as their sampling variability.

# Example usage
post_regression(sldax_fit, ctype = c("equal", "piepel"))

Expand unit tests

Expand unit tests: next target is 25%.

  • See tinytest for C++ code unit tests.

Automatically remove rows containing too few words

The message:
"Each document (row) in 'docs' must contain
at least 2 positive integers"
is not helpful. Why not say: "The text must contain at least two words."

Even better, why not automatically remove rows that don't contain some minimum number of words?

It took me 30 minutes to figure out which columns I needed to remove in the original data. For code+data see: https://shape-of-code.com/2022/01/23/including-natural-language-text-topics-in-a-regression-model/
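
The automatic removal suggested above could look roughly like the following, assuming the text is held in a character column and that words are separated by whitespace. The helper `drop_short_docs()` is hypothetical, and the word count here is taken on the raw text, so it may differ from the count the package sees after its own preprocessing.

```r
# Drop rows whose text has fewer than `min_words` whitespace-separated words,
# and report which rows were removed.
drop_short_docs <- function(data, col, min_words = 2) {
  text <- as.character(data[[col]])
  n_words <- lengths(strsplit(trimws(text), "\\s+"))
  n_words[is.na(text)] <- 0L          # treat missing text as zero words
  keep <- n_words >= min_words
  if (any(!keep)) {
    message("Dropping ", sum(!keep), " row(s) with fewer than ", min_words,
            " words: ", paste(which(!keep), collapse = ", "))
  }
  data[keep, , drop = FALSE]
}

df <- data.frame(id = 1:4,
                 text = c("fix login bug", "typo", "", NA),
                 stringsAsFactors = FALSE)
out <- drop_short_docs(df, "text")
print(out)
```

Reporting the dropped row numbers addresses the original complaint that it was hard to find which rows caused the error.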

Implement Bayesian $R^2$ computation

A Bayesian $R^2$ metric for supervised models with a continuous outcome (MLR, SLDA, SLDAX) would be helpful.

  • This is straightforward for MLR.
  • For SLDA and SLDAX, this requires computing empirical topic proportions (or keeping the topic assignments and then computing after sampling, which can be memory expensive).
    • Alternatively, an approximate $R^2$ could be computed using $\hat{\Theta}$ instead.

# Example usage
bayes_r2(sldax_fit)
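
For the MLR case, a minimal sketch is given below, assuming the residual-based definition of Bayesian $R^2$ from Gelman et al. (2019): for each posterior draw, the variance of the fitted values divided by the sum of that variance and the variance of the residuals. The function `bayes_r2_mlr()`, its arguments, and the simulated draws are hypothetical and do not reflect the proposed `bayes_r2()` interface.

```r
# One R^2 value per posterior draw.
# beta_draws: S x p matrix of coefficient draws; X: n x p design matrix; y: length n
bayes_r2_mlr <- function(beta_draws, X, y) {
  y_hat <- beta_draws %*% t(X)                    # S x n fitted values
  var_fit <- apply(y_hat, 1, var)
  var_res <- apply(sweep(y_hat, 2, y), 1, var)    # variance of (y_hat - y) per draw
  var_fit / (var_fit + var_res)
}

# Simulated data and stand-in posterior draws (hypothetical)
set.seed(42)
n <- 100; S <- 500
X <- cbind(1, rnorm(n))
y <- drop(X %*% c(1, 2)) + rnorm(n)
beta_draws <- cbind(rnorm(S, 1, 0.1), rnorm(S, 2, 0.1))

r2 <- bayes_r2_mlr(beta_draws, X, y)
print(quantile(r2, c(0.025, 0.5, 0.975)))
```

For SLDA and SLDAX the design matrix would also need the topic proportions for each draw, which is the memory cost noted in the list above.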

`pkgdown` site

Set up a website for the package using pkgdown.

  • Integrate through GitHub Pages?
  • Or through personal site?
