stan-dev / posterior Goto Github PK

View Code? Open in Web Editor NEW

165.0 15.0 23.0 4.07 MB

The posterior R package

Home Page: https://mc-stan.org/posterior/

License: Other

R 100.00%

r-package bayesian mcmc bayes

posterior's People

Contributors

Stargazers

Watchers

posterior's Issues

.iteration / .chain / .draw versus .chain / .iteration / .draw?

Minor point, but I was wondering if there was a rationale behind the .iteration / .chain / .draw ordering of columns in draws_df --- the order .chain / .iteration / .draw strikes me as more natural (more general -> more specific)?

Update README

We should update the README before the beta release. @jgabry do you think this is something you could do? My english writing is still much less clear than yours and @mjskay's.

Glossary of nouns and verbs for the package

Pinging off of @jgabry's comment in #1 I thought I'd open an issue to gather thoughts on names of basic concepts in the package. Since it's easy to change names (as long as we do it before an official release) I don't think we should wait on finalizing these before continuing to build stuff, but I wanted to make sure we keep track of what names we need to discuss.

I suggest we edit this first comment to keep track of naming decisions and use discussion on this issue to make those decisions. For now I've filled in what we seem to have so far, not to suggest that these are set in stone at all.

core nouns / concepts

chain: chain index
iteration: within-chain index
draw: unique index across all chains
variable: a single variable / parameter / etc
draws: a collection of variables, chain and iteration info, and draws

verbs / operations

as_draws_{format}(): convert draws to {format}
subset_draws(), subset(): select a subset of variables/iterations/chains/draws
thin_draws(): thin draws
summarise_draws(), summarize_draws(), summary(): compute summary measures for each variable
extract_variable_matrix(): select a given variable and return its draws in a iterations x chains format; for use in convergence diagnostics and other summary measures
repair_draws(): repair indices (iterations, chains, draws) of draws objects to be consistent after subsetting or related operations
order_draws(): (re-)order draws objects after subsetting or related operations
bind_draws: bind multiple draws objects together
rename_variables(): change names of variables
mutate_variables(): transform variables and add them to the draws object

If I've missed anything please edit this issue to add it.

summarise_draws errors when called from another package

I noticed this when trying to incorporate posterior into cmdstanr, but here's an example that doesn't require cmdstanr:

# simulate calling posterior from another package
detach(package:posterior) # or just start a new session and don't load the package
f <- function(...) {
  draws <- posterior::example_draws()
  posterior::summarise_draws(draws, ...)
}
Error: value for ‘rhat’ not found

@paul-buerkner looking at the code here

https://github.com/jgabry/posterior/blob/c9ec5096474a083664040099e3827360d03c9390/R/summarise_draws.R#L79-L84

maybe this is because the functions like rhat aren't being looked for in the posterior package namespace?

ESS bug

> x <- as.data.frame(matrix(rnorm(10000), ncol = 10))
> posterior::ess_bulk(x)
[1] 9007.076
> posterior::ess_tail(x)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) : 
  undefined columns selected

Which diagnostics are included?

At a minimum I’d say

Effective sample sizes (new robust version)
R-hat (new robust version)
Monte Carlo standard errors

Questions:

Which other general MCMC diagnostics are worth including?
Do we include algorithm-specific diagnostics? For example for hmc and variants.
For things like ESS and R-hat that have new versions, do we include the old versions for comparison?

should we eventually move this repo to stan-dev?

Here are some pros and cons of moving this package eventually from a personal repo to the stan-dev org. I think can think of more pros than cons but let me know if you think of more, in particular cons that I haven't thought of.

Cons:

it's not Stan-specific
possibly others, but they don't immediately come to mind

Pros:

users would be able to ask questions about the package on the Stan forums (this may also be able to be negotiated even without moving to stan-dev, but it woudn't be automatic)
more attention from GitHub users and potential contributors if repo is in stan-dev (probably, right?)
we are planning to use this together with/in many Stan related packages
stan-dev also has packages that can be used with any MCMC software (e.g. loo, bayesplot, even shinystan actually, etc.) so it's not new for stan-dev to host packages that are useful for Stan users and devs and also for non-Stan users and devs

Anyway, no pressure to agree with me on this! And this isn't an urgent question. Just wanted to raise the possibility.

Convert mcmc and mcmc.list objects to posterior formats

The mcmc and mcmc.list objects of the coda package have been used in a lot of places. It would be beneficial to transform them into formats the posterior package supports natively. However, for reasons detailed in #4, we don't want to fully support those formats, that is, don't want to write all the methods and transformation functions for them.

Arrays of draws of arbitrary dimension

In brms, I need draws arrays in various places that hold mulitple dimensions of draws and I feel we should support them natively in posterior. Such arrays should have the following properties:

Iterations/draws are always the first dimension
All other dimensions can be freely named and otherwise receive default names
no chain information is stored unless manually done be the user (in contrast to the current draws_array class, which will remain in the current form)
subsetting should not lead to dropping dimensions
methods subset, summarise_draws, etc. work canonically on these objects

We should come up with a good name for this class. The question is mostly which of the two array classes should be called draws_array and which should receive another name.

summary with one line per _named_ parameter

The discussion in #32 reminded me that one thing @andrewgelman and others have been wanting for a while (and that I think would also be useful as an alternative to the standard summary output) is a summary where each vector/matrix/array parameter only occupies a single line just like scalars. That is, each line corresponds to a named parameter rather than a parameter element.

(This could be its own thing or just an option to a custom print method for our existing summary objects.)

The quantities to display are debatable (e.g. maybe min of all the individual elements’ ESS, max of the Rhats, etc), so it would be good to get input from Andrew and anyone else interested.

How to handle discrete parameters?

Im not sure if you have discussed this before? I guess it is important to handle since many non-stan models may have discrete parameters.

summarize_draws: column name incorrect if only one value supplied to `probs`

This results in the column name .quantile instead of q25:

summarize_draws(example_draws(), probs = 0.25)

# A tibble: 10 x 9
   variable  mean median    sd   mad .quantile  rhat ess_bulk ess_tail
   <chr>    <dbl>  <dbl> <dbl> <dbl>     <dbl> <dbl>    <dbl>    <dbl>
 1 mu        4.56   4.49  3.36  3.45      2.26 1.00      881.     300.
 2 tau       3.85   2.90  3.32  2.65      1.46 0.998     386.     311.
 3 theta[1]  6.57   5.47  6.45  4.92      2.65 1.000     552.     272.
 4 theta[2]  4.74   4.53  4.63  4.14      1.85 1.04      767.     344.
 5 theta[3]  4.22   4.52  5.03  4.63      1.20 1.02      554.     246.
 6 theta[4]  4.79   4.95  4.45  4.65      1.74 0.998     656.     370.
 7 theta[5]  3.75   3.85  4.89  4.25      1.10 1.000     609.     326.
 8 theta[6]  4.28   4.36  4.88  4.65      1.39 0.998     644.     305.
 9 theta[7]  6.53   6.18  5.38  4.52      3.27 0.995     624.     345.
10 theta[8]  5.00   4.52  5.21  4.55      1.81 1.01      618.     332.

Add class `draws_summary` to object returned by `summarise_draws()`?

It could be useful to have a custom class for the tibble returned by summarise_draws(), e.g. for defining print methods and other things like that. Is there a downside to doing this?

Round summaries according to Monte Carlo standard error

@avehtari and I discussed the option today to add an option to round summaries in summarise_draws according to the corresponding MCSE, to reflect the actual precision of the estimates. This should not be activated by default and we need to think of a good argument name for it.

Adding generics of general Bayesian methods

We discussed at some point, that we want to have a common place where a lot of the generics should life that packages such as rstanarm, brms, etc. use. Right now, a lot of them are in rstantools but I don't think this is a good place for them as the methods are not only relevant from rstan based packages. Some methods are better suited for special packages and as such should life there, for instance, in bayesplot for all plotting related generics and in loo for all cross-validation related procedures.

Please amend this initial comment to add more methods that we may want to put into posterior.

posterior_predict compute predictions and return an arrary of draws
posterior_linpred compute linear predictor values and return an arrary of draws
posterior_epred compute means of the posterior predictive distribution and return an array of draws (see paul-buerkner/brms#644)
predictive_error computed differences between observed and predicted responses and return an array of draws
posterior_interval basically summarise_draws with just two quantiles
predictive_interval basically summarise_draws with just two quantiles after calling posterior_predict

Should duplicate variable names be supported?

I am currently implementing rename_variables(). To simplify things I am also implementing variables<-() (in the same way that names<-() and whatnot work). In the process of writing tests for these functions I've noticed that duplicate variable names are not handled consistently across types:

> m = matrix(11:20, ncol = 2, dimnames = list(NULL, c("a", "a")))
> variables(as_draws_matrix(m))
[1] "a" "a"
> variables(as_draws_array(m))
[1] "a" "a"
> variables(as_draws_list(m))
[1] "a"  "V2"
> variables(as_draws_df(m))
[1] "a"  "V2"

Since in theory all of these formats could support duplicate names, I think the only question is if we want to support duplicate names. Natural choices seem to be:

Support duplicate names (and adjust as_draws_list / as_draws_df accordingly)
Make duplicate names throw an error

confused by different summary output for list and array formats

@paul-buerkner Am I missing something or is what I'm seeing below weird? I labelled this as "bug" but maybe it's not and I'm just overlooking something simple.

Presumably summarise_draws() should return the identical diagnostic results for all the formats that keep the chain information and different results for the matrix format (for any diagnostics that use chain info). This is what happens except for the list format, which gives very different results:

x <- draws_eight_schools
x_arr <- as_draws_array(x)
x_list <- as_draws_list(x)
x_df <- as_draws_df(x)
x_mat <- as_draws_matrix(x)

Here's for matrix (different than others but makes sense since it thinks it's all one chain):

summarise_draws(x_mat, c("rhat", "ess_bulk", "ess_tail"))
# A tibble: 10 x 4
   variable  rhat ess_bulk ess_tail
   <chr>    <dbl>    <dbl>    <dbl>
 1 mu       1.00      801.     269.
 2 tau      1.00      337.     306.
 3 theta[1] 1.00      512.     263.
 4 theta[2] 1.03      669.     285.
 5 theta[3] 0.999     528.     226.
 6 theta[4] 0.999     615.     329.
 7 theta[5] 1.00      570.     309.
 8 theta[6] 1.000     598.     294.
 9 theta[7] 1.00      538.     323.
10 theta[8] 1.01      587.     309.

Here's for array (different than matrix, which makes sense):

summarise_draws(x_arr, c("rhat", "ess_bulk", "ess_tail"))
# A tibble: 10 x 4
   variable  rhat ess_bulk ess_tail
   <chr>    <dbl>    <dbl>    <dbl>
 1 mu       1.00      878.     300.
 2 tau      0.998     387.     311.
 3 theta[1] 1.000     551.     272.
 4 theta[2] 1.04      765.     344.
 5 theta[3] 1.02      553.     246.
 6 theta[4] 0.998     655.     370.
 7 theta[5] 1.000     608.     326.
 8 theta[6] 0.998     643.     305.
 9 theta[7] 0.995     622.     345.
10 theta[8] 1.01      618.     332.

Here's for df (same as array, which makes sense):

summarise_draws(x_df, c("rhat", "ess_bulk", "ess_tail"))
# A tibble: 10 x 4
   variable  rhat ess_bulk ess_tail
   <chr>    <dbl>    <dbl>    <dbl>
 1 mu       1.00      878.     300.
 2 tau      0.998     387.     311.
 3 theta[1] 1.000     551.     272.
 4 theta[2] 1.04      765.     344.
 5 theta[3] 1.02      553.     246.
 6 theta[4] 0.998     655.     370.
 7 theta[5] 1.000     608.     326.
 8 theta[6] 0.998     643.     305.
 9 theta[7] 0.995     622.     345.
10 theta[8] 1.01      618.     332.

But here's for list (very different from the others):

summarise_draws(x_list, c("rhat", "ess_bulk", "ess_tail"))
# A tibble: 10 x 4
   variable  rhat ess_bulk ess_tail
   <chr>    <dbl>    <dbl>    <dbl>
 1 mu       1.00     3204.    1133.
 2 tau      1.00     1383.    1226.
 3 theta[1] 1.00     2049.    1069.
 4 theta[2] 1.01     2690.    1289.
 5 theta[3] 0.998    2119.    1048.
 6 theta[4] 0.998    2466.    1449.
 7 theta[5] 1.00     2281.    1246.
 8 theta[6] 0.999    2408.    1180.
 9 theta[7] 1.00     2195.    1335.
10 theta[8] 1.00     2348.    1246.

Example dataset with a matrix and >1 chains

Following on #18, it might be helpful to have an example dataset with a 2-or-more dimensional variable (maybe a covariance matrix) and 2 or more chains for the purposes of examples and testing. Anyone have a good (and small) example like that on hand?

support data frame and tibble?

Really just tibble but let users get regular data frame if they want, e.g. as_draws_df() just wraps as_draws_tibble(), etc.

Bind multiple draws objects

Kind of the inverse operation to subset, we should allow to find draws objects together. This involves (at least) two different kinds of bind operations.

bind variables together, that is, extend the variables dimension after maing sure the draws objects have the same number of draws (and iterations and chains). If draws are in the draws_df format this basically corresponds to a cbind operation.
bind draws together, that is extend the draws (or chains and iterations) dimension. If draws are in the draws_df format this basically corresponds to a rbind operation.

I am not sure yet about good names for these two functions. Some ad-hoc ones could be bind_* where * could be one of variables, chains, iterations, or draws. What do you think?

Rename and mutate variables

For users it might be convenient to have rename and mutate functionality to make changes to variables similar to how we do it with the related dplyr functions. To avoid function masking, we may want to name them rename_variables and mutate_variables which would also be consistent with our naming conventions.

If we decide to implement such functions, we need to think of a proper backend for non-standard evaluation. We could of course use the one of tidyverse but I am not sure how much additional dependencies this implies. @mjskay you may have more experience with this than I have.

autocorrelation values, autocorrelation time, Geyer's truncation lag

It would be good to have for diagnostics

autocorrelation values up to some lag (to be checked as numbers or as a plot)
autocorrelation time (this is related to ESS, but can be useful separately to choose thin value)
Geyer's truncation lag (this is a safer option for thin value)
lag which has 95% (or user defined) cumulative autocorrelation time (produces smaller but almost as good thin value as Geyer's rule)

And make a thin function using subsetting method.

Vignette(s)

We don’t necessarily need it for the beta release, but just starting an issue to decide what content we should include in a vignette or several. I could imagine one long vignette or perhaps several smaller vignettes focused on separate things e.g. one about the formats and one about summaries and diagnostics, etc.).

Tentative list (add more items by editing this comment):

Vignette demonstrating the different draws formats and manipulating them (e.g., subset, mutate, etc.)
Vignette demonstrating summaries and diagnostics
Vignette explaining how to interpret and use MCSEs (see also #44)

Choose license

I don't have a strong opinion about it but it may be sensible to not use GPL3 as this causes a lot of trouble in industry (not necessarily for good reasons but still). GPL2 seems fine as does more permissive licenses as used by most tidyverse packages for instance.

Are these functions being used anywhere?

@paul-buerkner It seems that internal functions like .as_draws_list, .as_draws_array, .as_draws_df, etc. (i.e., the ones with names starting with .), don't seem to be called by any other functions. At least if I search for those function names in the entire package source code the only hits I get are their implementations but nothing else. Are these just older versions that can be safely removed or are they actually used somewhere that I'm just missing?

Repairing indices in draws objects

When working with standard indexing operations it is easy to leave the iteration, chain, etc. indices in weird (most likely unwanted) states. For instance,

# example draws
x <- round(rnorm(200), 2)
x <- array(x, dim = c(10, 4, 5))
dimnames(x) <- list(NULL, NULL, paste0("theta", 1:5))

post <- as_draws_array(x)
(post_sub <- post[c(4, 6), , ])

gives us a valid draws object but with two iterations now named 4 and 6 instead of 1 and 2.
What I am proposing is two define a repair_indices function (we may name it differently of course),
that repairs the indices to be continuously numbered again. That is,

repair_indices(post_sub)

gives us iteration indices 1 and 2 again (instead of 4 and 6) as previously. This function should also be used automatically in subset to ensure validitidy and continuous numbering of both input and output. In fact, I would argue, the only index structure that our internal functions should expect and output is the repaired form, but this is my discussion point (1) for which I would like to have your input. I have implemented a prototype of repair_indices for us to play around with.

Discussion point (2) is how to handle broken orderings. For instance

(post_sub <- post[c(6, 4), , ])

would result in the "wrong" ordering ot the two iterations. The question is now how repair_indices (and therefore functions which depend on it) should handle those. Shall we have the mapping 6 = 1, and 4 = 2, that is just take the supplied order (not considering iteration names), or shall we have 4 = 1 and 6 = 2, that is reorder samples according to the iteration names. Here I am less certain about what is preferable. The second one may be very surprising to users as functions may "randomly" change the ordering I guess. Any thoughts from your side?

rv-like interface

I wanted to gather conversations about a potential rv-like interface here so as not to derail other conversations (like #4).

I think a solid rv-like interface would be incredibly useful. More specifically, something that:

Acts as a high-level interface to "random variables" which can be vectors, matrices, arrays, etc.
Allows math operations on those
Supports nice syntax for thing like P() and E()

Rv already has those down, but would be even better if it:

Used a naming scheme consistent with ours
Supported pretty-printing in tibbles
Kept chain information

With all of those requirements in place, I could see tidybayes moving largely towards using tables of these rv-like objects. It would be very useful for a lot of the posterior manipulation/summarization/visualization tasks tidybayes is designed for.

If you all are interested in supporting such a format here, then the question is what's the best way to get there? The options might be:

Reach out to the rv maintainer and ask if they are willing to let us take it over and make backwards-incompatible changes to it, followed by a new major release deprecating some stuff.
Bring some of the existing rv code that we want to build on into this package, come up with a new class name (to not clobber "rv") and go from there.
Start from scratch here.
Start from scratch in a different package and add that dependency here.

(1) Could work if the new maintainer is not planning much with the package and if there aren't a lot of users. Currently the package has ~400 downloads/month and no revdeps on CRAN.
(2) Would be doable depending on license preferences (it is GPL-2). This could also be aided by the fact that rv looks to have been written by one of Andrew Gelman's former students, Jouni Kerman (@jgabry do you know him?).

I would be willing to float (1) to the current rv maintainer, Joseph Stachelek --- I've interacted with him once or twice on twitter and github so it wouldn't be a complete cold email (unless either of you know him better). If we'd rather go for (2) or (3) it might be good to reach out to Jouni Kerman to get his thoughts (either on using his code or on things he would have done differently if he wrote the package again).

version numbers for alpha and beta releases?

Now that the repo is public should we start using version numbers and tagging releases? For example, we could do something like 0.0.1 for the alpha release (incrementing the 1 if we make important changes before beta), 0.1.0 for the beta release (same comment about incrementing), and 1.0.0 for the first CRAN release. Thoughts?

Should we have both `summarise_` and `summarize_`?

To accommodate spelling conventions in different regions of the world dplyr has both summarise() and summarize(). I think it would be good to do that in this package too. Thoughts?

Subsetting draws vs. subsetting iterations

I currently think we should have both a subset_draws and a subset_iterations methods for consistency with our currenty naming conventions #5. Whay I am unsure about is what these methods should do for specific formats.

For instance, how shall we handle subsetting draws in the draws_array format. If we took draws literally there, we would have to insert missing values when the subsetted draws are not symmetric over chains.

For other formats, such as data.frames with .iteration and .draw we can easily subset for both, but when we subset via draws and then try to transform into a draws_array we may end up running to the same problem as above.

Any suggestion on how to handle this?

Representation as json

Im impressed by the progress. Since Im finalizing the posteriordb beta, I realized that it would be good to formalize also how a posterior is represented as json. This would also enable cross-language representations.

Have you thought about this?

I currently represent draws as a named object of numeric arrays. Think a named list of a vector in R, with one vector per parameter.

Do the different representations need custom classes?

@paul-buerkner and I were discussing this today. it should be possible to do the conversions between representations without using methods by just checking the input objects, but we could use classes and methods if we want. Both have their appeal. We also have to consider the diagnostics and summaries. Without clases and methods we’d end up doing lots of checking of inputs in many different parts of the package, whereas the methods know the types of their inputs already. Anyway, just something to consider and we can proceed with implementing the internals of the conversion, summary, and diagnostic functions without having deciding this.

Rethinking the package name for a second

I love the name posterior and am strongly in favor of it but I still want to bring up this issue. In the process of the initial discussions, we bascially switched our naming conventions away from posterior_ to draws_ as the draws we deal with may not necessarily come from a posterior distribution. Accordingly, a somewhat more accurate package name would be draws but it is much less catchy and more difficult to remember.

As I said above, I would like to keep the name posterior but make sure we have thought about it at least for a second before openly releasing it.

I would appreciate your thoughts on this! :-)

Is there a reason `.iter` is int but `.draw` is dbl?

See types in the print output below.

> print(as_draws_df(eight_schools))
# A tibble: 400 x 13
   .chain .iteration .draw    mu    tau `theta[1]` `theta[2]` `theta[3]` `theta[4]`
    <int>      <int> <dbl> <dbl>  <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1      1          1     1 -4.78  1.62      -1.61       -4.96      -5.41    -2.88  
 2      1          2     2  6.92  3.40      11.0         9.34       8.60     8.37  
...

Is there a reason or just an oversight?

Which representations of posterior draws are included?

Possibilities:

matrix (chains merged)
array (with chain dimension, maybe use new rray types?)
list (chains merged, in the style of rstan::extract() when it returns a list)
list of lists style 1 (separate chains, i.e., same as list but with separate sub-lists for each chain)
maybe a list of lists style 2 (nested lists corresponding to grouping structure of params, e.g. x$group_var$slope$level)
tidy data frames (maybe just tibbles?, coordinate with tidybayes. does @mjskay want to coauthor this package?)

Questions:

Which of the above types do we include?
Are there other types not listed to include?
Will we need these types to have additional classes (e.g. 'posterior_matrix') or can/should we try to avoid that?

Subset variables via regular expressions

Not sure what would be the best interface for that in subset but it would definitely be nice to extract variables using regular expressions. We could either have a new argument variable_regex to specifc the regular expression or a flag (called regex for instance) to indicate whether variable should be interepreted as a regular expression. What would you prefer? Or would you prefer yet another approach?

Get travis and codecov working

@jgabry can you activate travis for the posterior package? I cannot do it myself since it lives on your github page.

ESS and Rhat with NA in draws

See stan-dev/rstan#706

Which summaries are included?

mean, median
sd, mad_sd
quantiles

Questions:

Which other summaries to include?
What are the naming conventions?
What is (are) the output type(s)

Improve documentation to an acceptable level

This should include at least

basic templating of documentation snippets
references between documentation pages
everything required for R CMD Check to pass without documentation related warnings or errors

Specifically, the following parts need to be documented:

convergence diagnostics
as_draws_{format} generics/methods
summarise_draws generic/methods
subset methods
repair_draws generic/methods
order_draws generic/methods
extract_one_variable_matrix generic/methods
package doc page in posterior-package.R

Should we rename `draws_df` to `draws_tibble` (or `draws_tbl`)?

Since we're using tibbles I guess the names draws_tibble, draws_tbl, or draws_tbl_df would be more consistent with the naming convention of draws_{class}. What do you think? Personally I like the name draws_df better (it just looks nicer I think) but perhaps we should change it?

Default print() output

As I've been writing rename_variables() I've found it's a little awkward to work with draws objects when the default print output at the console is typically gigantic. This also makes examples a little verbose, as it feels necessary to call summarise_draws() constantly.

Two thoughts:

Any objections to making the default print() for draws objects call summarise_draws()?
If we agree to do (1), we will now have three ways of getting the same info (print, summary, and summarise_draws). That possibly feels a bit overkill? I can see how they are typically used in different ways, so having them all as aliases is probably fine, but it is worth considering.

Add a 'draws' parent class for all support formats

Compound subsetting?

The "subset" operation is currently a fairly generic "slice" operation, right? In which case, I wonder if it makes sense to allow combined subsetting / slicing operations, like:

subset(posterior, chain = 1, iteration = 1:100, variable = c("x", "y"))

This might be a more straightforward API than a separate function for slicing along each dimension. Although it would have to enforce no slicing along draws and chains/iterations simultaneously (or whatever the solution to #6 is).

The current syntax may lead to constructions like this:

posterior %>%
  subset_chain(1) %>%
  subset_iteration(1:100) %>%
  subset_variable(c("x", "y"))

Which I think is also nice, but may be more verbose than the compound version without additional clarity. subset() is also already a base-R generic intended for this kind of slicing, I think.

Potential 'draws' name ambiguities

Currently, we use 'draws' for two different things, the object themselves and and draw indices. I think this is acceptable in general and induces not too much confusion. However, some function names are not optimal in that regard. For instance, one would assume that the draws() function, if it exists, creates a draws object, not that it returns the draw indices of an existing object.

Accordingly, I think we should at least rethink how we call the draw index extraction function to avoid any uncessary confusion.

Extractor functions of iterations, chains etc.

I think we need extractor function of iterations, chains, draws, and variables. Two questions:

How do we name them? Simply iterations, chains, draws and variables?
For the former 3, shall we return a vector of indices, e.g., (1, 2, 3, 4) in case of 4 chains, or simply the number of chains, etc.?

Depending on the answer for (2), we may want to go for niterations etc. as names to indicate that is is about the number of iterations not the indices themselves. Or should we even support both?

Convert a draws_list to a draws_df

> class(x)
[1] "pdb_gold_standard_draws" "draws_list"              "draws"                  
[4] "list"                   
> posterior::as_draws_df(x)
Error in class(out) <- class_draws_df() : 
  attempt to set an attribute on NULL
>   xdf <- posterior::as_draws_matrix(x)
Error in class(out) <- class_draws_df() : 
  attempt to set an attribute on NULL
> packageVersion("posterior")
[1] ‘0.0.1’

Include an example draws object

We could, for instance, use some posterior draws of the eight-schools example.

Add a basic set of unit tests

This should at least cover all the transformation and post-processing functions.

as_draws_{format} methods
summarise_draws and aliases
convergence functions
extract_one_variable_matrix methods
iterations/chains etc. extracting functions
rename_variables
subset and thin methods
repair_draws and order_draws methods

Use `subset_draws()` with alias `subset()` for consistency

Because we already have the traditional summary as an alias for summarise_draws, I propose using subset_draws with subset as an alias. (This is also consistent with names like repair_draws, order_draws, example_draws, etc.).

stan-dev / posterior Goto Github PK

posterior's People

Contributors

Stargazers

Watchers

Forkers

posterior's Issues

core nouns / concepts

verbs / operations

Recommend Projects

Recommend Topics

Recommend Org