stan-dev / posteriordb
Database with posteriors of interest for Bayesian inference
I feel there might be some disconnect between how we use terms like `posterior`, `dataset` (or just `data`) and `model` and how users would use the terms.

Here are some possible descriptions the users and we might use (feel free to suggest improvements):

Users might describe a `posterior` as a posterior distribution for model parameters, while we (as in the PDB implementers) use `posterior` to mean a model, a dataset and a posterior distribution for them. We also give each posterior a name.

Users might describe a `dataset` as a matrix of observations and attributes. We might describe it as information (like the citation that introduced the dataset) along with the matrix. We also give each dataset a name.

Users might describe a `model` as a statistical relationship between (some) attributes of the data (which could be specified with a programming language). We might describe it as information (like a citation) along with the statistical relationship. We also give each model a name.
```python
from posteriordb import PosteriorDatabase

posterior_database = PosteriorDatabase("folder_of_pdb")
posterior_database.model("8_schools_centered")
```

Users might think this returns the model code for the 8 schools model; however, getting the model code requires `posterior_database.model("8_schools_centered").code("stan")`.

`posterior_database.dataset("8_schools")`

Users might think this returns the loaded JSON dataset. That instead requires `posterior_database.dataset("8_schools").data()`.

`posterior_database.posterior("8_schools-8_schools_centered")`

Users might think this returns the samples from the posterior distribution of model parameters; however, doing that requires `posterior_database.posterior("8_schools-8_schools_centered").gold_standard().posterior_draws()` (this part is not implemented yet).
The R library doesn't fully suffer from the same issues, as `get_data(po)` returns the loaded JSON and `model_code(po, "stan")` returns the actual model code. However, this also means that `model` and `dataset` are not first-class entities in R. Let's say we want to search for all the hierarchical models and, for each model, print its references. This will be more difficult if `model` is not a first-class entity.
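To illustrate why first-class model objects help, here is a hedged sketch of the "find all hierarchical models and print their references" task. The PDB interface used here (`model_names()`, `model(name).information`) is illustrative, modeled on the proposals in this thread, not the actual posteriordb API; the fake classes stand in for a real database.

```python
# Illustrative sketch, not the real posteriordb API: first-class model
# objects make it easy to filter models by keyword and collect references.

class _FakeModel:
    def __init__(self, info):
        self.information = info  # dict with "keywords", "references", ...

class _FakePDB:
    def __init__(self, models):
        self._models = models
    def model_names(self):
        return list(self._models)
    def model(self, name):
        return _FakeModel(self._models[name])

def hierarchical_model_references(pdb):
    """Collect references of every model tagged 'hierarchical'."""
    refs = {}
    for name in pdb.model_names():
        info = pdb.model(name).information
        if "hierarchical" in info.get("keywords", []):
            refs[name] = info.get("references", [])
    return refs

pdb = _FakePDB({
    "8_schools_centered": {"keywords": ["hierarchical"],
                           "references": ["rubin1981estimation"]},
    "logistic": {"keywords": ["regression"], "references": []},
})
print(hierarchical_model_references(pdb))
# -> {'8_schools_centered': ['rubin1981estimation']}
```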
I don't know if there's a good solution for this mismatch in how we use the terms compared to how users might use them. Ideally we would try to find a set of terms that would have the same meaning for us and the users and have the API use those terms. This might be hard.
What do you think? @paul-buerkner @MansMeg
Also, if you can think of better definitions of these terms, both from the users' perspective and from our perspective (so the implementers' perspective), those are welcome.
Add metadata filter: can you filter by the metadata of the posterior? Both for posteriors, data and models.
I.e. add a git hash for the database object in R and Python. This way we always know which version of the DB was used.
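A minimal sketch of what recording the git hash could look like on the Python side, assuming the database folder is a git clone and `git` is on the PATH. `db_version` is an illustrative name, not an existing posteriordb function.

```python
# Hypothetical sketch: record the commit hash of the database checkout so
# an analysis can state exactly which DB version it used.
import subprocess

def db_version(db_path):
    """Return the commit hash (HEAD) of the posterior database checkout."""
    result = subprocess.run(
        ["git", "-C", db_path, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```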
Add a guide that goes through the steps to add a new model to the posterior db
In the data and model info, it would be good to have fields for contributor and date added, and maybe a version for the model.
Contributor and date would help increase trust that models are verified, and track the contributor. We want the models to follow good practices, which means that when prior recommendations or the language change, we sometimes need to update models.
Here are some potential changes for the Python API that I feel could make it better. The R API is not affected.
You can view a README with the new API here: https://github.com/eerolinna/posterior_database/blob/patch-1/python/README.md
Old: `po = Posterior(posterior_name, my_pdb)`
New: `po = my_pdb.posterior(posterior_name)`
(the old way will still work)

The same applies to model and dataset, so we have `my_pdb.model(model_name)` and `my_pdb.dataset(data_name)`.

I feel this makes it clearer that the posterior comes from the posterior database.
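A sketch of how the proposed factory methods could be implemented, keeping the old constructor working. Class names follow the thread; the internals are assumptions, not the actual posteriordb implementation.

```python
# Illustrative sketch: the database object constructs entities, so
# `po = my_pdb.posterior(name)` reads naturally and the old
# `Posterior(name, my_pdb)` style keeps working.

class Posterior:
    def __init__(self, name, db):
        self.name = name
        self.db = db

class PosteriorDatabase:
    def __init__(self, path):
        self.path = path
    def posterior(self, name):
        # Just delegates to the old constructor, so both styles coexist.
        return Posterior(name, self)

pdb = PosteriorDatabase("folder_of_pdb")
po = pdb.posterior("8_schools-8_schools_centered")
```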
Old (all are equivalent):

```python
mo = po.model
mo.model_code("stan")
mo.stan_code()
po.model_code("stan")
po.stan_code()
```

New:

```python
mo.code("stan")
mo.stan_code()
po.code("stan")
po.stan_code()  # this could also maybe be dropped, but keeping it is fine too
```
This drops the unnecessary `model` prefix. `po.model_code_file_path("stan")` is also shortened to `po.code_file("stan")`. `po.code("stan")` maybe should still be `po.model_code`, or we could just drop it and use `po.model.code("stan")`. `po.stan_code()` could perhaps also be dropped; then we'd use `po.model.stan_code()`.
Old:

```python
mo.model_info
po.model_info
```

New:

```python
mo.information
po.model.information
```

This drops the `model` prefix and removes the shortened form.
The result from calling these is something like:

```python
{'keywords': ['bda3_example', 'hiearchical'],
 'description': 'A centered hiearchical model for the 8 schools example of Rubin (1981)',
 'urls': ['http://www.stat.columbia.edu/~gelman/arm/examples/schools'],
 'title': 'A centered hiearchical model for 8 schools',
 'references': ['rubin1981estimation', 'gelman2013bayesian'],
 'added_by': 'Mans Magnusson',
 'added_date': '2019-08-12'}
```
The slot `model_code` is dropped from the result, as this is just an implementation detail and `mo.code_file("stan")` already contains this information.
Old:

```python
da = po.dataset
da.data()
po.data()
```

This one I'm not sure what to do about. I feel it's confusing to have both `po.data()` (the actual data, in other words the loaded JSON) and `po.dataset` (which is the PDB concept of a dataset that has a name like "8_schools" and contains both the actual data, `po.dataset.data()`, and the information about the dataset, `po.dataset.information`).
Old:

```python
da.data_info
po.data_info
```

New:

```python
da.information
po.dataset.information
```

Same changes as in model information.
The result is something like:

```python
{'keywords': ['bda3_example'],
 'description': 'A study for the Educational Testing Service to analyze the effects of\nspecial coaching programs on test scores. See Gelman et. al. (2014), Section 5.5 for details.',
 'urls': ['http://www.stat.columbia.edu/~gelman/arm/examples/schools'],
 'title': 'The 8 schools dataset of Rubin (1981)',
 'references': ['rubin1981estimation', 'gelman2013bayesian'],
 'added_by': 'Mans Magnusson',
 'added_date': '2019-08-12'}
```
The slot `data_file` is dropped from the result, as this is just an implementation detail and `da.data_file()` already contains this information.
Remove parts of the posterior database that do not follow the documented content, i.e. do not have enough documentation.
Posteriors in the database that do not fulfil the requirements can be moved to another folder temporarily.
Currently we have `pos <- posterior_names(my_pdb)`. Should we also have `model_names` and `dataset_names`?
Should we just have `data_file_path` instead of `stan_data_file_path`? Datasets are probably not going to be framework specific.
It's a lot of unnecessary work in R to handle whitespace. Since R doesn't use whitespace in its syntax it doesn't matter much, but we could direct the linting tests to the Python folder. Right now it's just a hassle to get failing tests because of this, so I disabled that test in the Travis file.
The array should be named by parameter names.
Let's say I want to add a PyMC version of the logistic model. Then I need to modify these posterior files and change

```json
"model": {
    "stan": "content/models/logistic/logistic.stan"
},
```

to

```json
"model": {
    "stan": "content/models/logistic/logistic.stan",
    "pymc": "path/to/model.py"
},
```

in all of them.
Can we make it so that this change would need to be made in only one file?
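One way to get the change down to a single file would be for posterior files to reference the model only by name, with the framework-to-file mapping living once in the model's own info file. A hedged sketch of the lookup (the dict literals stand in for parsed info files; `model_code_path` is a hypothetical helper, and the pymc path is illustrative):

```python
# Illustrative sketch: the posterior file stores only the model name; the
# framework -> code-file mapping lives once, in the model info file.
# Adding a PyMC implementation then touches a single file.

MODEL_INFO = {  # stand-in for the parsed model info file
    "logistic": {
        "stan": "content/models/logistic/logistic.stan",
        "pymc": "content/models/logistic/logistic.py",  # hypothetical path
    }
}

POSTERIOR = {"model": "logistic", "data": "some_dataset"}  # stand-in

def model_code_path(posterior, framework):
    """Resolve the code file for a posterior's model in a given framework."""
    return MODEL_INFO[posterior["model"]][framework]

print(model_code_path(POSTERIOR, "pymc"))
```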
We need to have a graph of the database (and add it to the README) to simply explain the database and how it looks. Steve may have ideas here.
In posteriordb, how do we name the class containing a posterior? Simply "posterior", right? Perhaps we could think of a more specific class name, as I fear this could clash with other packages' class names.
Add a guide that goes through the steps to add a new dataset to the posterior db
Right now we have

```r
po <- posterior("8_schools-8_schools_centered", my_pdb)
dataset(po)
stan_code(po)
```

It would be nice to also have something like

```r
mo <- model("8_schools_centered", my_pdb)
stan_code(mo)
```

and

```r
da <- data("8_schools", my_pdb)
dataset(da)
```
Browsing the posterior database directory with the command line shell is a bit inconvenient because `|` is a special character in the shell. This is not a dealbreaker though.
Add a guide that goes through the steps to add a new posterior to the posterior db
It would be good to make sure in CI that the generated README is up to date. So essentially test that

`read.file("README.md") == generate_from_Rmd("README.Rmd")`

where `generate_from_Rmd` would be a function that generates a markdown file from an R Markdown file (I guess a function like this might exist already, but I don't know the real name).
Now we need to clone the DB to use it. It would be better to just install an R package and then, by default, point the database to the GitHub repository. Then users do not need to download as much (but will only be able to use the latest models).
Looks like the commit that enabled it has mysteriously vanished; I'll see if I can enable it again.
Not important to do it immediately but we still need to think of this.
I'm not too familiar with the R temp directory; what is the use case for copying model and data files to it? Is it needed?
I'm asking because python doesn't really have an equivalent concept and it would be great to keep the python and R versions as close to each other as possible.
If we have this model, `linreg`:

```stan
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
```

and a dataset `sales` that is like

```json
{
  "advertising_spend": ["..."],
  "sales": ["..."]
}
```
it would be great if in the posterior file we could have something like

```json
{
  "model": "linreg",
  "data": "sales",
  "rename_data": {
    "advertising_spend": "x",
    "sales": "y"
  },
  "...": "..."
}
```

so that we could form a posterior of `linreg` and `sales`.
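A sketch of how a data loader could apply the proposed `rename_data` mapping so a generic model can be paired with differently-named dataset columns. The field names follow the example above; `apply_rename` itself is a hypothetical helper, and the extra `N` key is added here only to show that unmapped keys pass through.

```python
# Illustrative sketch: apply the proposed rename_data mapping from a
# posterior file, renaming dataset keys to the model's variable names.

def apply_rename(data, rename_data):
    """rename_data maps dataset keys -> model variable names;
    keys without a mapping are kept unchanged."""
    return {rename_data.get(k, k): v for k, v in data.items()}

sales = {"advertising_spend": [1.0, 2.0], "sales": [3.0, 4.0], "N": 2}
renamed = apply_rename(sales, {"advertising_spend": "x", "sales": "y"})
print(renamed)
# -> {'x': [1.0, 2.0], 'y': [3.0, 4.0], 'N': 2}
```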
I got a merge conflict with Travis due to the rearranging of the posterior database folder. Now it has been moved, but I needed to temporarily disable the Python tests, and I think I accidentally removed codecov for Python as well. The rearranging of the pdb needs to be fixed in Python.
Add a guide that goes through the steps to add a new model implementation for an existing model
It fails in Travis so I commented it out. I leave this as an issue.
Michael Betancourt and Ben Bales have a large number of models and data that should be included in the database.
Also see: https://github.com/arviz-devs/arviz/blob/master/arviz/data/datasets.py
`.json.zip` file, for the same reason as above. I will be adding Python tests for these. If you can come up with more integrity checks, let me know.
Currently references in the info files are in a format like this:

`"references": ["Rubin (1981)", "Gelman et. al. (2014)"]`

This format can be ambiguous, for example if there are two papers published by Rubin in 1981. Should we include the paper title too? Or should we use a full BibTeX entry?
I think we should clean up the repo before prototype release, especially since we now have both Python and R data. I have the following suggestions:
- posterior (default)
- posterior predictive
- prior
- prior predictive
- elementwise loglikelihoods
The question is how to do this efficiently. Probably by including likelihood functions that take the data and posterior gold standard and produce it.
What do we need to do to check a gold standard, posterior content, model content, model code (validated Stan priors), etc.? What has been validated?
Requirements and rules for adding new stuff to the posterior database, i.e. what should be done to add something, and also what can be tested more or less automatically.
How precise should the submitted posteriors be? What are the precision metrics? (minimum tail ESS?)
Practically independent draws obtained from long chains with no divergences, thinned to the desired number of draws.
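The "long chain, thinned" idea can be sketched as below. This is a toy illustration of the thinning step only; producing a real gold standard would also require checking ESS and divergences, which this sketch does not do.

```python
# Illustrative sketch: approximate independent draws by taking every
# k-th draw from a long, well-mixed chain (here just a list of values).

def thin(draws, target_n):
    """Keep target_n roughly evenly spaced draws from a long chain."""
    stride = max(1, len(draws) // target_n)
    return draws[::stride][:target_n]

chain = list(range(100000))  # stand-in for a long chain of draws
gold = thin(chain, 1000)     # 1000 draws, 100 apart
```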
Now it looks for `posterior_database` and fails if it doesn't exist.
Currently we have `stan_code_file_path(po)`. I guess we should also have `code_file_path(po, framework = "stan/pymc/something else")` to make it possible to use models from other frameworks. Then `stan_code_file_path(po)` could be defined as `code_file_path(po, framework = "stan")`.
Currently model infos are found by replacing the model file extension with `.info.json`. See for example `content/models/8_schools/8_schools_centered.stan` and `content/models/8_schools/8_schools_centered.info.json`. What if we also have models in other frameworks like PyMC and Pyro? Then we might have the model files `8_schools_centered_pymc.py` and `8_schools_centered_pyro.py` (directories omitted). Do we also need `8_schools_centered_pymc.info.json` and `8_schools_centered_pyro.info.json`? Or can we have just one info file that describes the models of all frameworks?
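One option for the single-info-file approach would be one `.info.json` per model that lists every framework implementation under `model_code`, mirroring the structure that already appears in the model info files. A sketch (the PyMC and Pyro paths are illustrative, not existing files, and the fields are not the current schema):

```json
{
  "name": "8_schools_centered",
  "model_code": {
    "stan": "content/models/8_schools/8_schools_centered.stan",
    "pymc": "content/models/8_schools/8_schools_centered_pymc.py",
    "pyro": "content/models/8_schools/8_schools_centered_pyro.py"
  }
}
```

With this layout, adding a new framework implementation means adding one key to one file rather than creating a new info file per framework.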
From the README: "templates for different jsons can be found in content/templates and schemas in schemas". They are currently missing.
This is information we get from git
Two approaches to generating gold standards are included now: Stan sampling and analytical. Stan-sampled posteriors can be tested automatically to fulfil the requirements, but that requires that we create a posterior draws object that includes chain information. The analytical posterior requires code for reproducibility.
Add all slots of relevance to rerun the gold standard and chains to the draws object.
`posterior_db` feels like a clearer name for this repo to me than `posteriordb`.
In the model info files we have

```json
"model_code": {
    "stan": "content/models/stan/8_schools_centered.stan"
},
```

which, when concatenated with the path to the posterior db, gives the full filepath to the Stan code file. The dataset info files have

`"data_file": "content/datasets/data/8_schools.json"`

which does not give the full filepath when concatenated with the path to the posterior db. This is because a ".zip" needs to be added to the end. I propose we change the dataset info files to

`"data_file": "content/datasets/data/8_schools.json.zip"`
It would be nice to have a model with actual posterior samples already present, so that we can judge the structure of this part as well.
I've started to think that maybe the gold standards should be in a separate project. This idea is not yet fully formed, but I thought it would be good to put it out now. I'll use the name `goldstandards` for this hypothetical project.

`posteriordb` provides posterior names that are used as unique identifiers in `goldstandards`.

`goldstandards` can get greater flexibility in how it is implemented: perhaps a central database and a web server along with client packages would work best for that project, instead of the git-repo-with-PRs model that we currently use in `posteriordb`.

`gs <- gold_standard(po)` could still work (however there are some caveats). This would also let us release `posteriordb` first and only later add the gold standards. Of course this doesn't necessarily require two separate projects. We don't have to restrict the other project to just gold standards; it could include posterior samples for any method. I'll call this more general project `posteriorsamples`.
Having the ability to upload multiple posterior samples for a posterior might make it easier to come up with a good gold standard
Say someone develops a new inference method, `niceinference`, and runs it on 50 posteriors from `posteriordb`. They publish a paper about their method that includes comparisons to gold standards. They also upload the posterior samples to `posteriorsamples`.

Now suppose you are developing a competing method to `niceinference`. You run your method on the same 50 posteriors. You also download the posterior samples of `niceinference`. Now in your paper you can include both comparisons to the gold standard and between the two methods.

So as a reminder, this idea is not yet fully developed. Maybe keeping gold standards under `posteriordb` is the best idea. Maybe it's not. I might write a follow-up post later if the idea develops further; right now I just wanted to put this out here.
I'll tag you here so you notice this, but commenting right now is not necessary (though if you got some ideas from this, of course feel free to comment). @MansMeg @paul-buerkner
Add a guide that goes through the steps to add a new gold standard to the posterior db