stan-dev / posterior Goto Github PK
View Code? Open in Web Editor NEWThe posterior R package
Home Page: https://mc-stan.org/posterior/
License: Other
The posterior R package
Home Page: https://mc-stan.org/posterior/
License: Other
Minor point, but I was wondering if there was a rationale behind the .iteration / .chain / .draw ordering of columns in draws_df --- the order .chain / .iteration / .draw strikes me as more natural (more general -> more specific)?
Pinging off of @jgabry's comment in #1 I thought I'd open an issue to gather thoughts on names of basic concepts in the package. Since it's easy to change names (as long as we do it before an official release) I don't think we should wait on finalizing these before continuing to build stuff, but I wanted to make sure we keep track of what names we need to discuss.
I suggest we edit this first comment to keep track of naming decisions and use discussion on this issue to make those decisions. For now I've filled in what we seem to have so far, not to suggest that these are set in stone at all.
chain
: chain indexiteration
: within-chain indexdraw
: unique index across all chainsvariable
: a single variable / parameter / etcdraws
: a collection of variables, chain and iteration info, and drawsas_draws_{format}()
: convert draws to {format}subset_draws()
, subset()
: select a subset of variables/iterations/chains/drawsthin_draws()
: thin drawssummarise_draws()
, summarize_draws()
, summary()
: compute summary measures for each variableextract_variable_matrix()
: select a given variable and return its draws in a iterations x chains format; for use in convergence diagnostics and other summary measuresrepair_draws()
: repair indices (iterations, chains, draws) of draws objects to be consistent after subsetting or related operationsorder_draws()
: (re-)order draws objects after subsetting or related operationsbind_draws
: bind multiple draws objects togetherrename_variables()
: change names of variablesmutate_variables()
: transform variables and add them to the draws objectIf I've missed anything please edit this issue to add it.
I noticed this when trying to incorporate posterior into cmdstanr, but here's an example that doesn't require cmdstanr:
# simulate calling posterior from another package
detach(package:posterior) # or just start a new session and don't load the package
f <- function(...) {
draws <- posterior::example_draws()
posterior::summarise_draws(draws, ...)
}
Error: value for ‘rhat’ not found
@paul-buerkner looking at the code here
maybe this is because the functions like rhat
aren't being looked for in the posterior package namespace?
> x <- as.data.frame(matrix(rnorm(10000), ncol = 10))
> posterior::ess_bulk(x)
[1] 9007.076
> posterior::ess_tail(x)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
undefined columns selected
At a minimum I’d say
Questions:
Here are some pros and cons of moving this package eventually from a personal repo to the stan-dev org. I think can think of more pros than cons but let me know if you think of more, in particular cons that I haven't thought of.
Cons:
Pros:
Anyway, no pressure to agree with me on this! And this isn't an urgent question. Just wanted to raise the possibility.
The mcmc and mcmc.list objects of the coda package have been used in a lot of places. It would be beneficial to transform them into formats the posterior package supports natively. However, for reasons detailed in #4, we don't want to fully support those formats, that is, don't want to write all the methods and transformation functions for them.
In brms, I need draws arrays in various places that hold mulitple dimensions of draws and I feel we should support them natively in posterior. Such arrays should have the following properties:
draws_array
class, which will remain in the current form)subset
, summarise_draws
, etc. work canonically on these objectsWe should come up with a good name for this class. The question is mostly which of the two array classes should be called draws_array
and which should receive another name.
The discussion in #32 reminded me that one thing @andrewgelman and others have been wanting for a while (and that I think would also be useful as an alternative to the standard summary output) is a summary where each vector/matrix/array parameter only occupies a single line just like scalars. That is, each line corresponds to a named parameter rather than a parameter element.
(This could be its own thing or just an option to a custom print method for our existing summary objects.)
The quantities to display are debatable (e.g. maybe min of all the individual elements’ ESS, max of the Rhats, etc), so it would be good to get input from Andrew and anyone else interested.
Im not sure if you have discussed this before? I guess it is important to handle since many non-stan models may have discrete parameters.
This results in the column name .quantile
instead of q25
:
summarize_draws(example_draws(), probs = 0.25)
# A tibble: 10 x 9
variable mean median sd mad .quantile rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mu 4.56 4.49 3.36 3.45 2.26 1.00 881. 300.
2 tau 3.85 2.90 3.32 2.65 1.46 0.998 386. 311.
3 theta[1] 6.57 5.47 6.45 4.92 2.65 1.000 552. 272.
4 theta[2] 4.74 4.53 4.63 4.14 1.85 1.04 767. 344.
5 theta[3] 4.22 4.52 5.03 4.63 1.20 1.02 554. 246.
6 theta[4] 4.79 4.95 4.45 4.65 1.74 0.998 656. 370.
7 theta[5] 3.75 3.85 4.89 4.25 1.10 1.000 609. 326.
8 theta[6] 4.28 4.36 4.88 4.65 1.39 0.998 644. 305.
9 theta[7] 6.53 6.18 5.38 4.52 3.27 0.995 624. 345.
10 theta[8] 5.00 4.52 5.21 4.55 1.81 1.01 618. 332.
It could be useful to have a custom class for the tibble returned by summarise_draws()
, e.g. for defining print methods and other things like that. Is there a downside to doing this?
@avehtari and I discussed the option today to add an option to round summaries in summarise_draws
according to the corresponding MCSE, to reflect the actual precision of the estimates. This should not be activated by default and we need to think of a good argument name for it.
We discussed at some point, that we want to have a common place where a lot of the generics should life that packages such as rstanarm, brms, etc. use. Right now, a lot of them are in rstantools but I don't think this is a good place for them as the methods are not only relevant from rstan based packages. Some methods are better suited for special packages and as such should life there, for instance, in bayesplot for all plotting related generics and in loo for all cross-validation related procedures.
Please amend this initial comment to add more methods that we may want to put into posterior.
posterior_predict
compute predictions and return an arrary of drawsposterior_linpred
compute linear predictor values and return an arrary of drawsposterior_epred
compute means of the posterior predictive distribution and return an array of draws (see paul-buerkner/brms#644)predictive_error
computed differences between observed and predicted responses and return an array of drawsposterior_interval
basically summarise_draws with just two quantilespredictive_interval
basically summarise_draws with just two quantiles after calling posterior_predict
I am currently implementing rename_variables()
. To simplify things I am also implementing variables<-()
(in the same way that names<-()
and whatnot work). In the process of writing tests for these functions I've noticed that duplicate variable names are not handled consistently across types:
> m = matrix(11:20, ncol = 2, dimnames = list(NULL, c("a", "a")))
> variables(as_draws_matrix(m))
[1] "a" "a"
> variables(as_draws_array(m))
[1] "a" "a"
> variables(as_draws_list(m))
[1] "a" "V2"
> variables(as_draws_df(m))
[1] "a" "V2"
Since in theory all of these formats could support duplicate names, I think the only question is if we want to support duplicate names. Natural choices seem to be:
as_draws_list
/ as_draws_df
accordingly)@paul-buerkner Am I missing something or is what I'm seeing below weird? I labelled this as "bug" but maybe it's not and I'm just overlooking something simple.
Presumably summarise_draws()
should return the identical diagnostic results for all the formats that keep the chain information and different results for the matrix format (for any diagnostics that use chain info). This is what happens except for the list format, which gives very different results:
x <- draws_eight_schools
x_arr <- as_draws_array(x)
x_list <- as_draws_list(x)
x_df <- as_draws_df(x)
x_mat <- as_draws_matrix(x)
Here's for matrix (different than others but makes sense since it thinks it's all one chain):
summarise_draws(x_mat, c("rhat", "ess_bulk", "ess_tail"))
# A tibble: 10 x 4
variable rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl>
1 mu 1.00 801. 269.
2 tau 1.00 337. 306.
3 theta[1] 1.00 512. 263.
4 theta[2] 1.03 669. 285.
5 theta[3] 0.999 528. 226.
6 theta[4] 0.999 615. 329.
7 theta[5] 1.00 570. 309.
8 theta[6] 1.000 598. 294.
9 theta[7] 1.00 538. 323.
10 theta[8] 1.01 587. 309.
Here's for array (different than matrix, which makes sense):
summarise_draws(x_arr, c("rhat", "ess_bulk", "ess_tail"))
# A tibble: 10 x 4
variable rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl>
1 mu 1.00 878. 300.
2 tau 0.998 387. 311.
3 theta[1] 1.000 551. 272.
4 theta[2] 1.04 765. 344.
5 theta[3] 1.02 553. 246.
6 theta[4] 0.998 655. 370.
7 theta[5] 1.000 608. 326.
8 theta[6] 0.998 643. 305.
9 theta[7] 0.995 622. 345.
10 theta[8] 1.01 618. 332.
Here's for df (same as array, which makes sense):
summarise_draws(x_df, c("rhat", "ess_bulk", "ess_tail"))
# A tibble: 10 x 4
variable rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl>
1 mu 1.00 878. 300.
2 tau 0.998 387. 311.
3 theta[1] 1.000 551. 272.
4 theta[2] 1.04 765. 344.
5 theta[3] 1.02 553. 246.
6 theta[4] 0.998 655. 370.
7 theta[5] 1.000 608. 326.
8 theta[6] 0.998 643. 305.
9 theta[7] 0.995 622. 345.
10 theta[8] 1.01 618. 332.
But here's for list (very different from the others):
summarise_draws(x_list, c("rhat", "ess_bulk", "ess_tail"))
# A tibble: 10 x 4
variable rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl>
1 mu 1.00 3204. 1133.
2 tau 1.00 1383. 1226.
3 theta[1] 1.00 2049. 1069.
4 theta[2] 1.01 2690. 1289.
5 theta[3] 0.998 2119. 1048.
6 theta[4] 0.998 2466. 1449.
7 theta[5] 1.00 2281. 1246.
8 theta[6] 0.999 2408. 1180.
9 theta[7] 1.00 2195. 1335.
10 theta[8] 1.00 2348. 1246.
Following on #18, it might be helpful to have an example dataset with a 2-or-more dimensional variable (maybe a covariance matrix) and 2 or more chains for the purposes of examples and testing. Anyone have a good (and small) example like that on hand?
Really just tibble but let users get regular data frame if they want, e.g. as_draws_df()
just wraps as_draws_tibble()
, etc.
Kind of the inverse operation to subset, we should allow to find draws objects together. This involves (at least) two different kinds of bind operations.
draws_df
format this basically corresponds to a cbind
operation.draws_df
format this basically corresponds to a rbind
operation.I am not sure yet about good names for these two functions. Some ad-hoc ones could be bind_*
where *
could be one of variables, chains, iterations, or draws. What do you think?
For users it might be convenient to have rename
and mutate
functionality to make changes to variables similar to how we do it with the related dplyr functions. To avoid function masking, we may want to name them rename_variables
and mutate_variables
which would also be consistent with our naming conventions.
If we decide to implement such functions, we need to think of a proper backend for non-standard evaluation. We could of course use the one of tidyverse but I am not sure how much additional dependencies this implies. @mjskay you may have more experience with this than I have.
It would be good to have for diagnostics
And make a thin function using subsetting method.
We don’t necessarily need it for the beta release, but just starting an issue to decide what content we should include in a vignette or several. I could imagine one long vignette or perhaps several smaller vignettes focused on separate things e.g. one about the formats and one about summaries and diagnostics, etc.).
Tentative list (add more items by editing this comment):
I don't have a strong opinion about it but it may be sensible to not use GPL3 as this causes a lot of trouble in industry (not necessarily for good reasons but still). GPL2 seems fine as does more permissive licenses as used by most tidyverse packages for instance.
@paul-buerkner It seems that internal functions like .as_draws_list
, .as_draws_array
, .as_draws_df
, etc. (i.e., the ones with names starting with .
), don't seem to be called by any other functions. At least if I search for those function names in the entire package source code the only hits I get are their implementations but nothing else. Are these just older versions that can be safely removed or are they actually used somewhere that I'm just missing?
When working with standard indexing operations it is easy to leave the iteration, chain, etc. indices in weird (most likely unwanted) states. For instance,
# example draws
x <- round(rnorm(200), 2)
x <- array(x, dim = c(10, 4, 5))
dimnames(x) <- list(NULL, NULL, paste0("theta", 1:5))
post <- as_draws_array(x)
(post_sub <- post[c(4, 6), , ])
gives us a valid draws object but with two iterations now named 4 and 6 instead of 1 and 2.
What I am proposing is two define a repair_indices
function (we may name it differently of course),
that repairs the indices to be continuously numbered again. That is,
repair_indices(post_sub)
gives us iteration indices 1 and 2 again (instead of 4 and 6) as previously. This function should also be used automatically in subset
to ensure validitidy and continuous numbering of both input and output. In fact, I would argue, the only index structure that our internal functions should expect and output is the repaired form, but this is my discussion point (1) for which I would like to have your input. I have implemented a prototype of repair_indices
for us to play around with.
Discussion point (2) is how to handle broken orderings. For instance
(post_sub <- post[c(6, 4), , ])
would result in the "wrong" ordering ot the two iterations. The question is now how repair_indices
(and therefore functions which depend on it) should handle those. Shall we have the mapping 6 = 1, and 4 = 2, that is just take the supplied order (not considering iteration names), or shall we have 4 = 1 and 6 = 2, that is reorder samples according to the iteration names. Here I am less certain about what is preferable. The second one may be very surprising to users as functions may "randomly" change the ordering I guess. Any thoughts from your side?
I wanted to gather conversations about a potential rv
-like interface here so as not to derail other conversations (like #4).
I think a solid rv-like interface would be incredibly useful. More specifically, something that:
P()
and E()
Rv already has those down, but would be even better if it:
With all of those requirements in place, I could see tidybayes moving largely towards using tables of these rv-like objects. It would be very useful for a lot of the posterior manipulation/summarization/visualization tasks tidybayes is designed for.
If you all are interested in supporting such a format here, then the question is what's the best way to get there? The options might be:
rv
maintainer and ask if they are willing to let us take it over and make backwards-incompatible changes to it, followed by a new major release deprecating some stuff.rv
code that we want to build on into this package, come up with a new class name (to not clobber "rv") and go from there.(1) Could work if the new maintainer is not planning much with the package and if there aren't a lot of users. Currently the package has ~400 downloads/month and no revdeps on CRAN.
(2) Would be doable depending on license preferences (it is GPL-2). This could also be aided by the fact that rv
looks to have been written by one of Andrew Gelman's former students, Jouni Kerman (@jgabry do you know him?).
I would be willing to float (1) to the current rv
maintainer, Joseph Stachelek --- I've interacted with him once or twice on twitter and github so it wouldn't be a complete cold email (unless either of you know him better). If we'd rather go for (2) or (3) it might be good to reach out to Jouni Kerman to get his thoughts (either on using his code or on things he would have done differently if he wrote the package again).
Now that the repo is public should we start using version numbers and tagging releases? For example, we could do something like 0.0.1
for the alpha release (incrementing the 1 if we make important changes before beta), 0.1.0
for the beta release (same comment about incrementing), and 1.0.0
for the first CRAN release. Thoughts?
To accommodate spelling conventions in different regions of the world dplyr has both summarise()
and summarize()
. I think it would be good to do that in this package too. Thoughts?
I currently think we should have both a subset_draws
and a subset_iterations
methods for consistency with our currenty naming conventions #5. Whay I am unsure about is what these methods should do for specific formats.
For instance, how shall we handle subsetting draws in the draws_array
format. If we took draws
literally there, we would have to insert missing values when the subsetted draws are not symmetric over chains.
For other formats, such as data.frames with .iteration
and .draw
we can easily subset for both, but when we subset via draws and then try to transform into a draws_array
we may end up running to the same problem as above.
Any suggestion on how to handle this?
Im impressed by the progress. Since Im finalizing the posteriordb beta, I realized that it would be good to formalize also how a posterior is represented as json. This would also enable cross-language representations.
Have you thought about this?
I currently represent draws as a named object of numeric arrays. Think a named list of a vector in R, with one vector per parameter.
@paul-buerkner and I were discussing this today. it should be possible to do the conversions between representations without using methods by just checking the input objects, but we could use classes and methods if we want. Both have their appeal. We also have to consider the diagnostics and summaries. Without clases and methods we’d end up doing lots of checking of inputs in many different parts of the package, whereas the methods know the types of their inputs already. Anyway, just something to consider and we can proceed with implementing the internals of the conversion, summary, and diagnostic functions without having deciding this.
I love the name posterior and am strongly in favor of it but I still want to bring up this issue. In the process of the initial discussions, we bascially switched our naming conventions away from posterior_
to draws_
as the draws we deal with may not necessarily come from a posterior distribution. Accordingly, a somewhat more accurate package name would be draws but it is much less catchy and more difficult to remember.
As I said above, I would like to keep the name posterior but make sure we have thought about it at least for a second before openly releasing it.
I would appreciate your thoughts on this! :-)
See types in the print output below.
> print(as_draws_df(eight_schools))
# A tibble: 400 x 13
.chain .iteration .draw mu tau `theta[1]` `theta[2]` `theta[3]` `theta[4]`
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 -4.78 1.62 -1.61 -4.96 -5.41 -2.88
2 1 2 2 6.92 3.40 11.0 9.34 8.60 8.37
...
Is there a reason or just an oversight?
Possibilities:
Questions:
Not sure what would be the best interface for that in subset
but it would definitely be nice to extract variables using regular expressions. We could either have a new argument variable_regex
to specifc the regular expression or a flag (called regex
for instance) to indicate whether variable
should be interepreted as a regular expression. What would you prefer? Or would you prefer yet another approach?
@jgabry can you activate travis for the posterior package? I cannot do it myself since it lives on your github page.
Questions:
This should include at least
Specifically, the following parts need to be documented:
as_draws_{format}
generics/methodssummarise_draws
generic/methodssubset
methodsrepair_draws
generic/methodsorder_draws
generic/methodsextract_one_variable_matrix
generic/methodsposterior-package.R
Since we're using tibbles I guess the names draws_tibble
, draws_tbl
, or draws_tbl_df
would be more consistent with the naming convention of draws_{class}
. What do you think? Personally I like the name draws_df
better (it just looks nicer I think) but perhaps we should change it?
As I've been writing rename_variables()
I've found it's a little awkward to work with draws objects when the default print
output at the console is typically gigantic. This also makes examples a little verbose, as it feels necessary to call summarise_draws()
constantly.
Two thoughts:
print()
for draws objects call summarise_draws()
?print
, summary
, and summarise_draws
). That possibly feels a bit overkill? I can see how they are typically used in different ways, so having them all as aliases is probably fine, but it is worth considering.The "subset" operation is currently a fairly generic "slice" operation, right? In which case, I wonder if it makes sense to allow combined subsetting / slicing operations, like:
subset(posterior, chain = 1, iteration = 1:100, variable = c("x", "y"))
This might be a more straightforward API than a separate function for slicing along each dimension. Although it would have to enforce no slicing along draws and chains/iterations simultaneously (or whatever the solution to #6 is).
The current syntax may lead to constructions like this:
posterior %>%
subset_chain(1) %>%
subset_iteration(1:100) %>%
subset_variable(c("x", "y"))
Which I think is also nice, but may be more verbose than the compound version without additional clarity. subset()
is also already a base-R generic intended for this kind of slicing, I think.
Currently, we use 'draws' for two different things, the object themselves and and draw indices. I think this is acceptable in general and induces not too much confusion. However, some function names are not optimal in that regard. For instance, one would assume that the draws()
function, if it exists, creates a draws
object, not that it returns the draw indices of an existing object.
Accordingly, I think we should at least rethink how we call the draw index extraction function to avoid any uncessary confusion.
I think we need extractor function of iterations, chains, draws, and variables. Two questions:
How do we name them? Simply iterations
, chains
, draws
and variables
?
For the former 3, shall we return a vector of indices, e.g., (1, 2, 3, 4) in case of 4 chains, or simply the number of chains, etc.?
Depending on the answer for (2), we may want to go for niterations
etc. as names to indicate that is is about the number of iterations not the indices themselves. Or should we even support both?
> class(x)
[1] "pdb_gold_standard_draws" "draws_list" "draws"
[4] "list"
> posterior::as_draws_df(x)
Error in class(out) <- class_draws_df() :
attempt to set an attribute on NULL
> xdf <- posterior::as_draws_matrix(x)
Error in class(out) <- class_draws_df() :
attempt to set an attribute on NULL
> packageVersion("posterior")
[1] ‘0.0.1’
We could, for instance, use some posterior draws of the eight-schools example.
This should at least cover all the transformation and post-processing functions.
as_draws_{format}
methodssummarise_draws
and aliasesextract_one_variable_matrix
methodsrename_variables
subset
and thin
methodsrepair_draws
and order_draws
methodsBecause we already have the traditional summary
as an alias for summarise_draws
, I propose using subset_draws
with subset
as an alias. (This is also consistent with names like repair_draws
, order_draws
, example_draws
, etc.).
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.