redmod-team / profit

Probabilistic Response mOdel Fitting with Interactive Tools

Home Page: https://profit.readthedocs.io

License: MIT License

Python 78.88% Mathematica 4.92% Fortran 6.01% Jupyter Notebook 9.60% CSS 0.04% Makefile 0.15% Julia 0.40%

active-learning gaussian-processes model-emulation polynomial-chaos-expansion reduced-order-models reduced-order-surrogate-model surrogate uncertainty-quantification uq

profit's People

manal44 baptisterubino mkendler krystophny michad1111 rykath squadula dbaroli

profit's Issues

Document finished and open tasks on Laplace approximation and active learning

Implement advanced noise models

Noise covariance matrix
Heteroscedastic noise via linear model and/or another GP
Individual noise for data sources

Implement online updating via eigendecomposition

see https://en.wikipedia.org/wiki/Bunch%E2%80%93Nielsen%E2%80%93Sorensen_formula

Check number of cores

If the option ntask = n is used for parallel computing, check the number of available cores before starting the computation.
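
A minimal sketch of such a check, assuming the requested task count is available as an integer. The helper name `check_ntask` is hypothetical and not part of profit. `os.cpu_count()` is from the Python standard library and reports logical cores, which may be more than a batch scheduler actually grants.

```python
import os


def check_ntask(ntask, available=None):
    """Return ntask if it fits on this machine, otherwise raise ValueError.

    Hypothetical helper, not part of profit.
    """
    if available is None:
        # os.cpu_count() may return None if the count cannot be determined
        available = os.cpu_count() or 1
    if ntask < 1:
        raise ValueError("ntask must be at least 1")
    if ntask > available:
        raise ValueError(
            f"ntask = {ntask} exceeds the {available} available cores"
        )
    return ntask
```

Calling `check_ntask(n)` before starting the runner would stop early with a clear message instead of oversubscribing the machine.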

Consistent definition of sigma_f and sigma_n

In proFit, the hyperparameter vector used is: [ l=length-scale , sigma^2 = (sigma_n/sigma_f)^2 ] in order to normalize.

Adapt the written functions to this definition:

Replace l^2 by l in the functions' arguments.
Add documentation for sigma.
Add an indication about the choice of sigma_f (e.g. sigma_f always equal to 1), since it isn't a parameter of the kernel functions.
Handle possibly different values of sigma_f, given that it becomes an implicit argument of the functions that build the covariance matrices K(X_test,X_test) ; K(X_test,X_training) ; K(X_training,X_training) .

Replace Kathi's sympy kernels by GPy kernels in application

Different functions doing the same task

In the file profit.sur.backend the following functions do the same task:

To return any of the covariance matrices K(X_train,X_train) ; K(X_test,X_test) ; K(X_test,X_train) :

kernels.gp_matrix(x0, x1, a, K)
gp.gp_matrix(x0, x1, a, K)
gp_functions.k(x0, x1, l)

To return the covariance matrix K(X_train,X_train):

kernels.gp_matrix(x0, x1, a, K)
gp.gp_matrix_train(x, a, sigma_n) (the only difference is the Gaussian noise sigma_n added on the diagonal of K(X_train,X_train))

Clean up #!/usr/bin/python headers

Symmetry of the Posterior Covariance Matrix

The posterior covariance matrix (cov_f_star) isn't perfectly symmetric.

There is an error of approximately 1e-14:
The command np.max(cov_f_star-np.transpose(cov_f_star)) returns a value around 1e-14.
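
A deviation of this size is at the level of floating-point round-off. One common workaround, sketched here under the assumption that cov_f_star is a square NumPy array, is to average the matrix with its transpose. The random matrix below is only a stand-in for the real posterior covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
# Stand-in for cov_f_star: symmetric up to a tiny perturbation
cov_f_star = A @ A.T + 1e-14 * rng.normal(size=(5, 5))

# Largest absolute deviation from symmetry
asymmetry = np.max(np.abs(cov_f_star - cov_f_star.T))

# Averaging with the transpose yields an exactly symmetric matrix
cov_sym = 0.5 * (cov_f_star + cov_f_star.T)
```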

Consistent handling of relative paths

In profit.yaml and LocalCommand, trouble can arise with relative paths. The most logical behaviour from the user's point of view would be to relate all occurrences of ../ to the study directory, i.e. replace ../ by ../../../ everywhere (study/run/XX/ instead of study/).

The best place to change this is directly in LocalCommand, since users usually access the Python API from the study directory, which is also where profit.yaml lives.

Implement PC-Kriging with additive GP

After projecting to a low-order spectral basis (PCE for global UQ) one can model the residue by a GP with an additive kernel. This allows for modeling complex behavior and sensitivity analysis (ANOVA / Sobol indices)

Use argparse instead of manual sys.argv handling

Implement Active learning

create a standardized interface between 'run' and surrogates' active learning
create the actual Active Learning process (fills input.txt with the points that contribute the most information)
create test cases and benchmark

Test Surrogates and rewrite Examples

Implement / revise test cases for Custom and GPy.
Also revise the Examples to match with the cleaned up project structure. (also solves #16)

Generating runs based on directory template

Many codes rely on a standardized directory structure for each run. To automatically generate run directories the user provides a template file. Placeholders for input parameters in the template file are automatically replaced by values for a specific run. This feature should be usable for both online and offline runs, as well as for dynamically generated parameter vectors.

Try block Cholesky to invert matrices

Clean interface for sympy kernels with derivatives

Add features to UI

Final name for code

Redmod is too generic and SurUQ sounds too orcish. Instead of Surrogate the word parameter should be in focus. Suggestions:

Paris - Parameter space regression including sensitivities
Paras - Parameter space regression with analysis of sensitivities
Parami - Parameter space regression with analysis ...
Parma - Parameter space
Supar - Surrogates and UQ via Parameter space regression
Hypar - Handling your parameter space regression

Write prototype for autoencoder functionality

Using scikit-learn, then scale up with PyTorch

See e.g. https://i-systems.github.io/teaching/ML/iNotes/15_Autoencoder.html

Explore and leverage parallels to easyVVUQ

Starting and management of runs on cluster
Check out amzn/emukit: A Python-based toolbox of various methods in uncertainty quantification and statistical emulation: multi-fidelity, experimental design, Bayesian optimisation, Bayesian quadrature, etc.

Research options for dimensionality reduction

start with PCA and work towards more generic nonlinear methods (maybe based on local sensitivity analysis)

Make it possible to treat curly braces in simulation input files

Double brackets? Configurable?

input_mode(json)
separator = '{{'

input.json :
{
  'x': {{x}},
  'y': 4
}

config.h
int return_config()
{
  return {{u}}*x;
}


{{
 'x': {x},
 'y': 4
}}
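
One possible reading of the double-bracket idea, sketched with the standard-library `re` module: only `{{name}}` is treated as a placeholder, while single braces pass through untouched. The function name `fill_template` is hypothetical and not part of profit.

```python
import re

# Matches {{name}}, allowing optional whitespace inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_template(text, params):
    """Replace each {{name}} in text with str(params[name])."""
    return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), text)


template = "{\n  'x': {{x}},\n  'y': 4\n}"
filled = fill_template(template, {"x": 1.5})
```

With this convention the JSON and C examples above need no escaping of their own braces. A missing parameter raises `KeyError`, which seems preferable to silently leaving the placeholder in the input file.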

More complete variance estimate

Include

Laplace approximation around MAP values (or multiple peaks) in hyperparameter space
Variance due to (not necessarily simple) linear mean model according to Rasmussen 2.7

Fix the path conventions

In some cases, using a function in proFit requires giving its whole path from the root package (profit.profit. ...) instead of a path starting from the current file.

Update examples and tests

Stitching together data

Implement possibility to shift x-axis of data such that two data sources are stitched together in the optimum way. This will require a hyperparameter that quantifies the relative shift.

Plan work to be done on GPyTorch

Cleanup Config

Bring Config class and user interface in a clear form.

enhance Config options (also solves #29)
resolve path problems (also solves #19, #41)
standardize code formatting
update doc with Config options
make variable functions easily customizable (also solves #21)
optionally include Independent variable in inputs and treat as another parameter
save output as .txt or .hdf5
.py file with dict should also be a valid config file besides .yaml

Support Normal distribution again

Currently, only Uniform and LogUniform are supported after switching configuration backends to yaml.

profit binary not found on Windows/Anaconda

pip install -e . --user moves profit into %APPDATA%, which is usually not in %PATH% on Windows with Anaconda. Documentation should be updated to use pip install -e . on this setup.

Revise SLURM backend for runner

log transform in optimizer

Implement synchronized writing of output.txt

This is related to #7 . Could be done via MPI and/or simpler local solution for the multi-process runner.

Implement and test derivatives in GPy Prod kernels

Parameter scan to compare two codes

A user develops a new numerical method that is faster than existing methods at the same accuracy. He wants to produce plots of accuracy vs computation time for his new code as well as an existing one.

Old code

Input parameters:

relative tolerance, logarithmic from 1e-6 to 1e-12

Output parameters:

computation time
accuracy

New code

Input parameters:

step size, logarithmic from 1e-1 to 1e-3

Output parameters:

computation time
accuracy

It should be possible to plot two outputs against each other here. So one would fit a response model with x = computation time and y = accuracy.
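
The two logarithmic scans described above could be generated with `np.logspace`. This is only a sketch: the number of points per scan (7 and 5) is an assumption, chosen to give one point per decade and two points per decade respectively.

```python
import numpy as np

# Old code: relative tolerance, logarithmic from 1e-6 to 1e-12
rtol = np.logspace(-6, -12, num=7)
# New code: step size, logarithmic from 1e-1 to 1e-3
step = np.logspace(-1, -3, num=5)
```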

Two functions having the same definition

In the file profit.profit.sur.backend the following functions have the same definition:

kernels.gp_matrix(x0, x1, a, K)
gp.gp_matrix(x0, x1, a, K)

Add HDF5 MPI support

When doing distributed runs on the cluster, all output must be written in a concurrency-safe way. HDF5 with MPI communication looks like a reasonable choice.

Integrate tool to explore conditional probability distributions

For the work with Ulrich Callies from HZG a tool was developed to explore conditional distributions with one or more variables fixed in a certain range. Then the marginal distributions of the remaining variables are plotted as histograms and/or with a kernel density estimator. This way a high-dimensional probability distribution can be explored in an intuitive way.
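
The core of such a tool can be sketched in a few lines of NumPy. This is not the tool mentioned above: the function name `conditional_samples` and the Gaussian stand-in data are assumptions for illustration. Samples are filtered to those where the fixed variables fall inside the given ranges, and the remaining variables are then histogrammed.

```python
import numpy as np


def conditional_samples(samples, fixed):
    """Keep rows of samples whose fixed columns lie inside the given ranges.

    samples: array of shape (n_samples, n_vars)
    fixed: dict mapping column index -> (low, high)
    """
    mask = np.ones(len(samples), dtype=bool)
    for col, (low, high) in fixed.items():
        mask &= (samples[:, col] >= low) & (samples[:, col] <= high)
    return samples[mask]


rng = np.random.default_rng(1)
# Correlated 2D Gaussian as stand-in data
samples = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)

# Fix variable 0 to the range [1, 2] and histogram variable 1
subset = conditional_samples(samples, {0: (1.0, 2.0)})
counts, edges = np.histogram(subset[:, 1], bins=20, density=True)
```

Because the two variables are positively correlated, the histogram of variable 1 shifts towards positive values once variable 0 is restricted to [1, 2].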

Automatic runs at specific points

The user wants to specify points where the response should be evaluated. Based on a user-supplied template she tells profit to generate a set of directories and a batch submission script.

Remarks:

A template for a single code run is required as well as one for the submission script, since the queuing system and the specific requirements of the code are not known. One could supply "template templates" for the most common queuing systems. One should not reinvent the wheel by adding a lot of options that SLURM/PBS already supply in their file format that most users know.

Consistent optimization in sigma_f for normalized variant

Dividing K_y by sigma_f^2 introduces an extra additive term -1/2 log(sigma_f^(-2)) in the NLL. This should be either justified or cancelled by adding it again.
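
The bookkeeping can be checked numerically. The sketch below assumes the standard Gaussian-process negative log-likelihood and a hypothetical `nll` helper (not profit's implementation). Writing K_y = sigma_f^2 K_tilde for n training points rescales the quadratic term by sigma_f^(-2) and shifts the log-determinant term by (n/2) log(sigma_f^2) in total, which corresponds to the term quoted above counted once per training point.

```python
import numpy as np


def nll(K, y):
    """Standard GP negative log-likelihood for covariance K and data y."""
    n = len(y)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(K, y)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * y @ alpha + 0.5 * logdet + 0.5 * n * np.log(2 * np.pi)


rng = np.random.default_rng(2)
n, sigma_f = 6, 3.0
x = np.linspace(0, 1, n)
# Normalized covariance (squared-exponential plus a small noise term)
K_tilde = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3**2) + 0.01 * np.eye(n)
K_y = sigma_f**2 * K_tilde
y = rng.normal(size=n)

# Normalized problem: same formula with K_tilde and y / sigma_f
nll_normalized = nll(K_tilde, y / sigma_f)
# Term that has to be added back to recover the full NLL
shift = 0.5 * n * np.log(sigma_f**2)
```

Under these assumptions `nll(K_y, y)` equals `nll_normalized + shift`, so an optimizer that varies sigma_f needs the shift term to see the same objective.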

Use a simple and documented method to build the Covariance Matrices

There needs to be one simple and clearly documented way to build the Covariance Matrices: K(X_test,X_test) ; K(X_test,X_training) ; K(X_training,X_training) .
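
A minimal sketch of what a single entry point could look like, assuming a squared-exponential kernel on 1D inputs with sigma_f = 1. The names `sqexp_cov` and `build_covariances` are hypothetical and do not exist in profit. The noise term sigma2 is only added on the diagonal of the training matrix.

```python
import numpy as np


def sqexp_cov(xa, xb, l):
    """Squared-exponential covariance between 1D point sets xa and xb."""
    d = np.asarray(xa)[:, None] - np.asarray(xb)[None, :]
    return np.exp(-0.5 * (d / l) ** 2)


def build_covariances(x_train, x_test, l, sigma2=0.0):
    """Return K(X_train,X_train) with noise, K(X_test,X_train), K(X_test,X_test)."""
    K_tt = sqexp_cov(x_train, x_train, l) + sigma2 * np.eye(len(x_train))
    K_st = sqexp_cov(x_test, x_train, l)
    K_ss = sqexp_cov(x_test, x_test, l)
    return K_tt, K_st, K_ss


x_train = np.linspace(0.0, 1.0, 4)
x_test = np.array([0.25, 0.75])
K_tt, K_st, K_ss = build_covariances(x_train, x_test, l=0.5, sigma2=1e-2)
```

Keeping one kernel function and one builder means the three matrices always share the same hyperparameters.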

Update documentation to current use cases

Running offline with input/output files

The user would like to run his code independently from suruq. Therefore the user takes the following steps:

Run profit in preprocessing mode to generate input file with a table of input parameters
Run code based on different parameter combinations in input file
Collect results in output file with format readable by profit
Do postprocessing in profit

Interfacing to input/output file should be easy and done by the user. For this purpose a txt and a hdf5 standard format will be supplied.

Document finished and open tasks on product kernels and derivatives

Cleanup Surrogates

Standardize surrogates. For now only Custom and GPy.

set structure in abstract class
implement interfaces to Config, so the surrogate to be used can easily be selected
cleanup Custom surrogate functions and add docstrings
~~make Fortran kernels user friendly~~
provide standard kernels in python
implement methods so that every surrogate has the same ones (e.g. train, add_training_data, predict, plot, etc.) and can be accessed through a standardized interface
implement / revise different calculation methods in backend for Custom surrogate. Make them easily extendable by future developers

Make three-digit run folders standard

Right now, run folders are created as "0, 1, 2, 3, ..., 10, 11, ...". For better sorting in the file manager and console it should be standard to have "000, 001, 002, ..." which supports up to 1000 run folders. More generally one should put a configuration option ndigit in the run section of profit.yaml that defaults to three.
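
A short sketch of the naming scheme, using `str.zfill` from the standard library. The function name `run_dir_name` is hypothetical; `ndigit` mirrors the configuration option proposed above.

```python
def run_dir_name(index, ndigit=3):
    """Zero-padded run folder name, e.g. 7 -> '007' for ndigit=3."""
    return str(index).zfill(ndigit)


names = [run_dir_name(i) for i in (0, 1, 12, 999)]
```

Indices with more digits than `ndigit` are not truncated, so run 1000 would simply become "1000" and only the sort order would degrade.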

Check shift in nll
Avoid negative eigenvalues
Don't do iterative eigendecomposition for too small systems