stan-dev / posteriordb Goto Github PK

View Code? Open in Web Editor NEW

164.0 8.0 26.0 73.18 MB

Database with posteriors of interest for Bayesian inference

R 12.79% Stan 81.13% TeX 5.71% Python 0.38%

bayes bayesian posterior

posteriordb's People

Stargazers

Watchers

posteriordb's Issues

Terminology

I feel there might be some disconnect between how we use terms like posterior, dataset (or just data) and model and how users would use the terms.

Here's some possible descriptions the users and we might use (feel free to suggest improvements)

Users might describe posterior as a posterior distribution for model parameters while we (as in the PDB implementators) use posterior to mean a model, a dataset and a posterior distribution for them. We also give each posterior a name.

Users might describe dataset as a matrix of observations and attributes. We might describe it as information (like citation that introduced the dataset) along with the matrix. We also give each dataset a name.

Users might describe model as a statistical relationship between (some) attributes of the data (which could be specified with a programming language). We might describe it as information (like citation) along with the statistical relationship. We also give each model a name.

Python examples

Model

from posteriordb import PosteriorDatabase
posterior_database = PosteriorDatabase("folder_of_pdb")
posterior_database.model("8_schools_centered")

The users might think this returns the model code for the 8 schools model, however getting the model code requires posterior_database.model("8_schools").code("stan")

Dataset

posterior_database.dataset("8_schools")

The users might think this returns the loaded JSON dataset. That instead requires posterior_database.dataset("8_schools").data()

Posterior

posterior_database.posterior("8_schools-8_schools_centered")

The users might think this returns the samples from the posterior distribution of model parameters, however to do that requires posterior_database.posterior("8_schools-8_schools_centered").gold_standard().posterior_draws() (this part is not implemented yet)

R

The R library doesn't fully suffer from the same issues as get_data(po) returns the loaded JSON and model_code(po, "stan") returns the actual model code.

However this also means that model and dataset are not first-class entities in R. Let's say we want to search for all the hierarchical models and for each model print the references of that model. This will be more difficult if model is not a first-class entity.

What to do?

I don't know if there's a good solution for this mismatch in how we use the terms compared to how users might use them. Ideally we would try to find a set of terms that would have the same meaning for us and the users and have the API use those terms. This might be hard.

What do you think? @paul-buerkner @MansMeg
Also, if you can think of better definitions of these terms both from the users perspective and from our perspective (so the implementors perspective) those are welcome

Add metadata filter functions

Add meta-data filter:
can you filter after meta-data of the posterior? Both for posteriors, data and models.

Include explicit version control in database

Ie. add git hash for the database object in R and python. In this way we always know what version of the DB that was used.

Creata vignette for the R package

Documentation: adding a model

Add a guide that goes through the steps to add a new model to the posterior db

Add contributor and date the model was added

In data and model info, it would be good to have field for contributor and date added, maybe version for model
Contributor and date would help to increase trust that models are verified and track the contributor. We want the models to follow good practices and that means that when the prior recommendations or language changes we need sometimes to update models.

Improving the python API

Here's some potential changes for the python API that I feel could make it better. The R API is not affected.

You can view a README with the new API here: https://github.com/eerolinna/posterior_database/blob/patch-1/python/README.md

Constructing a posterior

Old: po = Posterior(posterior_name, my_pdb)
New: po = my_pdb.posterior(posterior_name)
(the old way will still work)

The same applies for model and dataset, so we have my_pdb.model(model_name) and my_pdb.dataset(data_name)

I feel this can makes it clearer that the posterior comes from the posterior database.

Accessing model code

Old (all are equivalent)

mo = po.model
mo.model_code("stan")
mo.stan_code()
po.model_code("stan")
po.stan_code()

New

mo.code("stan")
mo.stan_code()
po.code("stan")
po.stan_code() # this could also maybe be dropped, but keeping this is fine too

This drops the unnecessary model prefix. po.model_code_file_path("stan") is also shortened to po.code_file("stan")

po.code("stan") maybe should still be po.model_code, or we could just drop it and use po.model.code("stan"). po.stan_code() could perhaps also be dropped, then we'd use po.model.stan_code()

Accessing model information

Old

mo.model_info
po.model_info

New

mo.information
po.model.information

Drops the model prefix and removes the shortened form

The result from calling these is something like

{'keywords': ['bda3_example', 'hiearchical'],
 'description': 'A centered hiearchical model for the 8 schools example of Rubin (1981)',
 'urls': ['http://www.stat.columbia.edu/~gelman/arm/examples/schools'],
 'title': 'A centered hiearchical model for 8 schools',
 'references': ['rubin1981estimation', 'gelman2013bayesian'],
 'added_by': 'Mans Magnusson',
 'added_date': '2019-08-12'}

The slot model_code is dropped from the result as this is just an implementation detail and mo.code_file("stan") already contains this information.

Accessing data file

Old

da = po.dataset
da.data()
po.data()

This one I'm not sure what to do about, I feel it's confusing to have both po.data() (the actual data, in other words the loaded JSON) and po.dataset (which is the PDB concept of a dataset that has a name like "8_schools" and contains both the actual data po.dataset.data() and the information about the dataset po.dataset.information)

Accessing dataset information

Old

da.data_info
po.data_info

New

da.information
po.dataset.information

Same changes as in model information

The result is something like

{'keywords': ['bda3_example'],
 'description': 'A study for the Educational Testing Service to analyze the effects of\nspecial coaching programs on test scores. See Gelman et. al. (2014), Section 5.5 for details.',
 'urls': ['http://www.stat.columbia.edu/~gelman/arm/examples/schools'],
 'title': 'The 8 schools dataset of Rubin (1981)',
 'references': ['rubin1981estimation', 'gelman2013bayesian'],
 'added_by': 'Mans Magnusson',
 'added_date': '2019-08-12'}

The slot data_file is dropped from the result as this is just an implementation detail and da.data_file() already contains this information.

Clean database

Remove parts of the posterior database that is not follow the documented content, i.e. does not have enough documentation.

Posteriors in the database that do not fulfil the requirements can be moved to another folder temporarily.

Add model_names and dataset_names

Currently we have pos <- posterior_names(my_pdb). Should we also have model_names and dataset_names?

Rename stan_data_file_path

Should we just have data_file_path instead of stan_data_file_path? As datasets are probably not going to be framework specific.

Linting should only be done for Python

It is a lot of unnecessary work in r to handle whitespace. Since R doesn't use whitespace in syntax it doesn't matter much, but we could direct the linting tests to the python folder. But now it is just a hassle to get failing test because of this so I disable that test in Travis file.

Change posterior draws to arrays by parameter (named) in gold standard

The array should be named by parameter names.

Adding models of different frameworks needs duplication in many posterior files

Let's say I want to add a PyMC version of the logistic model.

Then I need to modify these posterior files

wells_centered_educ4_interact|logistic.json
wells_centered_educ4|logistic.json
wells_centered|logistic.json
wells_noncentered_log|logistic.json
wells_noncentered|logistic.json

and change

"model": {
    "stan": "content/models/logistic/logistic.stan"
  },

"model": {
    "stan": "content/models/logistic/logistic.stan",
    "pymc": "path/to/model.py"
  },

in all of them.

Can we make it so that this change would need to be made only in one file?

Add a graph over the database in the README

We need to have a graph of the data base (need to add it to the README) to simply explain the database and how it looks Steve may have ideas here.

Rename posterior class

In posteriordb, how do we name the class containing a posterior? Simply "posterior" right? Perhaps we could think of a more specific class name as I fear this could crash with other packages' class names

Documentation: adding a dataset

Add a guide that goes through the steps to add a new dataset to the posterior db

R: make stan_code, dataset etc more generic

Right now we have

po <- posterior("8_schools-8_schools_centered", my_pdb)
dataset(po)
stan_code(po)

It would be nice to also have something like

mo <- model("8_schools_centered", my_pdb)
stan_code(mo)

and

da <- data("8_schools", my_pdb)
dataset(da)

Add future work section on what is next

add python library
add more posteriors
add more gold standards

Consider using a different separator than "|" for posterior names

Browsing the posterior database directory with the command line shell is a bit inconvenient because | is a special character in the shell. This is not a dealbreaker though.

Documentation: adding a posterior

Add a guide that goes through the steps to add a new posterior to the posterior db

Add CI check for generated README

It would be good to make sure in CI that the generated README is up-to-date. So essentially test that

read.file("README.md") == generate_from_Rmd("README.Rmd")

where generate_from_Rmd would be a function that generates markdown file from R-markdown file (I guess a function like this might exist already but I don't know the real name)

Add installing package and pointing the database to github

Now we need to clone the db to use it. It would be better to just install an R package and then as default point the database to the github repository. Then users do not need to download as much (but will only be able to use the latest models.

Python codecov seems to be missing

Looks like the commit that enabled it has mysteriously vanished, I'll see if I can enable it again

Add print methods for new S3 objects

Not important to do it immediately but we still need to think of this.

Add README code to testsuite

Add gittag in database as ISO date

Add contributor for gold standard samples

Why copy code and data to R temp directory?

I'm not too familiar with R temp directory, what is the use case for copying model and data files to it? Is it needed?

I'm asking because python doesn't really have an equivalent concept and it would be great to keep the python and R versions as close to each other as possible.

Renaming data column names in posterior files

If we have this model linreg

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}

and dataset sales that is like

{
  "advertising_spend": ["..."],
  "sales": ["..."]
}

it would be great if in the posterior file we could have something like

{
  "model": "linreg",
  "data": "sales",
  "rename_data": {
    "advertising_spend": "x",
    "sales": "y"
  },
  "...": "..."
}

so that we could form a posterior of linreg and sales

Disabled Python tests + think I removed codecov accidentally

I got a merge conflict with Travis, due to rearranging of the posterior database folder. Now it has been moved but I needed to temporarily disable the Python tests, and I think I accidentally removed codecov for Python as well.

The rearranging of the pdb needs to be fixed in Python.

Documentation: adding model code in another framework for an existing model

Add a guide that goes through the steps to add a new model implementation for an existing model

Commented out Python 3.7

It fails in Travis so I commented it out. I leave this as an issue.

Get test suite coverage 90% +

Include Michaels, Arviz and Bens models

Michael Betancourt and Ben Bales has a large number of models and data that should be included in the database.

Also see: https://github.com/arviz-devs/arviz/blob/master/arviz/data/datasets.py

Add integrity checks

Each model has a distinct stan code file, in other words no two different models have the same stan code file. If two models have the same stan file they are duplicates and one should be removed.
Each dataset has a distinct .json.zip file, for the same reason as above.
Same for model information and dataset information.

I will be adding python tests for these. If you can come with more integrity checks let me know.

Reference format

Currently references in the info files are in a format like this

"references": ["Rubin (1981)", "Gelman et. al. (2014)"]

This format can be ambigious, for example if there are two papers published by Rubin in 1981.

Should we include the paper title too? Or should we use a full bibtex entry?

Better repository structure

I think we should clean up the repo before prototype release, especially since we now have both Python and R data. I have the following suggestions:

Remove contents/datasets-raw? I guess this is just legacy stuff left.
Change "content" to "pdb" and move posterior into "pdb"?

Include additional posterior components of interest

posterior (default)
posterior predictive
prior
prior predictive
elementwise loglikelihoods

The question is how to do this efficiently. Probably by including likelihood functions that take the data and posterior gold standard and produce it.

Setup checklists for how to add things in the DB

What do we need to do to check a gold_standard, posterior content, model content, model code (validated stan priors) e.t.c. What has been validated?

Requirements and rules to add new stuff to the posterior data base. Ie what should be done to add and also what can be tested more or less automatically.

How precise should the submitted posteriors be? What are the precision metrics? (minimum tail >ESS?)

Practically independent draws obtained by long chains with no divergences thinned to desired number of draws

Fix pdb_local()

Now it looks for posterior_database and fails if it doesn't exist.

Add code_file_path

Currently we have stan_code_file_path(po). I guess we should also have code_file_path(po, framework = "stan/pymc/something else") to make it possible to use models of other frameworks. Then stan_code_file_path(po) could be defined as code_file_path(po, framework = "stan")

Model info files for models in many frameworks

Currently model infos are found by replacing the model file extension with .info.json. See for example content/models/8_schools/8_schools_centered.stan and content/models/8_schools/8_schools_centered.info.json

What if we have models also in other frameworks like PyMC and Pyro? Then we might have the model files 8_schools_centered_pymc.py and 8_schools_centered_pyro.py (directories omitted). Do we also need 8_schools_centered_pymc.info.jsonand 8_schools_centered_pyro.info.json? Or can we have just one info file that describes the models of all frameworks?

Add templates and schemas

From README: "templates for different jsons can be found in content/templates and schemas in schemas". They are currently missing

Remove added by and added date

This is information we get from git

Setup gold standards so we can test the gold standard definition automatically

Two approaches to generate gold standards are included now. Stan_sampling and analytically. Stan sampling posteriors can be tested automatically to fulfill the requirements, but that requires that we create a posterior draws an object that includes chain information. The analytical posterior requires code for reproducibility.

Add all slots of relevance to rerun the gold standard and chains to the draws object.

Rename to posterior_db?

posterior_db feels like a more clear name for this repo to me than posteriordb

Consistency between model code and dataset paths in info files

In the model info files we have

"model_code": {
    "stan": "content/models/stan/8_schools_centered.stan"
},

which when concatenated with the path to posterior db gives the full filepath to the stan code file.

The dataset info files have

"data_file": "content/datasets/data/8_schools.json"

which does not give the full filepath when concatenated with the path to the posterior db. This is because a ".zip" needs to be added to the end.

I propose we change the dataset info files to

"data_file": "content/datasets/data/8_schools.json.zip"

Add an example of a gold_standard samples object

I would be nice to have a model with actual posterior samples already present so that we can judge the structure of this part as well

Maybe split gold standards to a separate project?

I've started to think that maybe the gold standards should be in a separate project. This idea is not yet fully formed but I thought it would be good to put it out now. I'll use the name goldstandards for this hypothetical project.

posteriordb provides posterior names that are used as unique identifiers in goldstandards
goldstandards can get greater flexibility in how it is implemented: perhaps a central database and a web server along with client packages would work best for that project instead of the git repo with PRs model that we currently use in posteriordb
Moving to a separate project doesn't mean we can't keep the same API, so gs <- gold_standard(po) could still work (however there are some caveats)
One advantage could also be that we could first focus on getting many posteriors to posteriordb and only later add the gold standards. Of course this doesn't necessarily require two separate projects

Going more general

We dont have to restrict the other project to just gold standards, it could include posterior samples for any method. I'll call this more general project posteriorsamples

Selecting a gold standard

Having the ability to upload multiple posterior samples for a posterior might make it easier to come up with a good gold standard

Contributors could upload several potential gold standards
An expert user could decide which if any of these should be granted the status of an gold standard
To help them decide we could have an user interface that allows them to see marginal density plots, diagnostic values etc.

Developing inference methods

A researcher develops a new variational inference method niceinference and runs it on 50 posteriors from posteriordb. They publish a paper about their method that includes comparisons to gold standards. They also upload the posterior samples to posteriorsamples.
You are working on an improved version of niceinference. You run your method on the same 50 posteriors. You also download the posterior samples of niceinference. Now in your paper you can include both comparisons to the gold standard and between the two methods.

Next steps

So as a reminder, this idea is not yet fully developed. Maybe keeping gold standards under posteriordb is the best idea. Maybe it's not. I might write a follow-up post later if the idea develops further, right now I just wanted to put this out here.

I'll tag you here so you notice this, but commenting right now is not necessary (but if you got some ideas from this of course feel free to comment) @MansMeg @paul-buerkner

Documentation: adding a gold standard

Add a guide that goes through the steps to add a new gold standard to the posterior db