deel-ai / puncc
Puncc is a Python library for predictive uncertainty quantification using conformal prediction.
Home Page: https://deel-ai.github.io/puncc/
When conformalizing a pretrained model using the API, the current ConformalPredictor class requires a splitter even though X_fit and y_fit are not necessary. In such a situation, the correct behavior would be to pass a None splitter argument to the ConformalPredictor constructor and provide the calibration data in the call to fit, as sketched below. The ConformalPredictor should allow a None splitter only when the train argument is False.
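A sketch of the usage this would enable (hypothetical, not the current behavior; the exact constructor arguments are illustrative):
# Hypothetical usage once a None splitter is accepted with train=False;
# predictor and calibrator stand for already-constructed puncc objects.
cp = ConformalPredictor(
    predictor=predictor,
    calibrator=calibrator,
    splitter=None,  # no splitter needed: the model is pre-trained
    train=False,
)
cp.fit(X_calib=X_cal, y_calib=y_cal)  # only calibration data is required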
After git clone https://github.com/deel-ai/puncc.git and cd puncc, the following command failed:
$ make prepare-dev
python -m venv puncc-dev-env
make: python: Command not found
make: *** [Makefile:17: prepare-dev] Error 127
Cause: on my local machine (Ubuntu via WSL on Windows), I do not have a python executable installed, only python3 and python3.X.
Possible solutions: none trivial, as far as I know. Using python3 is a bit better because it is explicit about what we want, but python3 could still point to an old version (Python <= 3.7).
Remark: I do not know whether having python3 but no python is standard, but since my machine is set up that way (with no major tinkering with Python), others' machines could be too. One possible fix is sketched below.
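A sketch of a Makefile tweak (the variable name is illustrative, and the target body is reconstructed from the error above): let the interpreter be overridden instead of hard-coding python.
PYTHON ?= python3

prepare-dev:
	$(PYTHON) -m venv puncc-dev-env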
Expected: installation of the prepare-dev virtual environment with puncc and the other development packages.
v0.9
- OS: Ubuntu 20.04.5 LTS (Focal Fossa) on Windows Subsystem for Linux (WSL)
To reproduce: within a (Linux) terminal without python, run:
make prepare-dev
I am training my predictor (LinearRegression()) with my own data, and then I want to create a conformal predictor with SplitCP that takes my pre-trained model and does the conformalization. Here is my problem:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from deel.puncc.api.prediction import BasePredictor
from deel.puncc.regression import SplitCP

# Fit, calibration and test sets drawn from the same distribution
X_fit, y_fit = make_regression(n_samples=200, n_features=1, noise=50, random_state=42, bias=200)
X_cal, y_cal = make_regression(n_samples=200, n_features=1, noise=50, random_state=42, bias=200)
X_test, y_test = make_regression(n_samples=100, n_features=1, noise=50, random_state=42, bias=200)

mod = LinearRegression()
mod.fit(X_fit, y_fit)
print(mod.coef_)
> [85.88287056]
So my mod has been trained correctly. I give two different cases:
base = BasePredictor(mod, is_trained=True)
cp = SplitCP(base)
cp.fit(X_calib=X_cal, y_calib=y_cal)
Now, I expect SplitCP to have "learned" that my mod is ready for prediction (via manually setting is_trained=True).
However, the predicted values returned by cp.predict are unexpected, while cp.predictor.predict(...) behaves as expected (it returns the predictions of mod).
Good:
internal_call_pred = cp.predictor.predict(X_test)
print(internal_call_pred[:10])
> [287.12357943 214.6184216 116.30331871 234.13103249 165.98971046
262.76792038 167.34292778 253.7391835 259.67508504 293.32885547]
Bad:
preds, lo, hi = cp.predict(X_test, 0.1)
print(preds[:10])
>[ nan nan -inf nan -inf nan -inf nan nan nan]
Remark: during a previous run, the code cell above returned small values around 0.0, so it may be returning something that is not initialized properly.
On the other hand, this seems to work correctly:
base = BasePredictor(mod, is_trained=True)
cp = SplitCP(base, train=False)
cp.fit(X_calib=X_cal, y_calib=y_cal)
internal_call_pred = cp.predictor.predict(X_test)
print(internal_call_pred[:10])
>[287.12357943 214.6184216 116.30331871 234.13103249 165.98971046
262.76792038 167.34292778 253.7391835 259.67508504 293.32885547]
preds, lo, hi = cp.predict(X_test, 0.1)
print(preds[:10])
>[287.12357943 214.6184216 116.30331871 234.13103249 165.98971046
262.76792038 167.34292778 253.7391835 259.67508504 293.32885547]
Either is_trained in BasePredictor or train in SplitCP is redundant, or the latter ignores the former. cp.predict returns dummy values while the call to the underlying fitted sklearn model via cp.predictor.predict(...) still works as expected.
Would it be possible to avoid hand-wrapping my own pre-trained sklearn model with BasePredictor? That is, could I pass my model directly to SplitCP, for instance?
See below for a minimal example, which raises an error on cp.fit(...): AttributeError: 'LinearRegression' object has no attribute 'is_trained'
from sklearn.linear_model import LinearRegression
from deel.puncc.regression import SplitCP

mod = LinearRegression()
mod.fit(X_fit, y_fit)

cp = SplitCP(mod, train=False)
cp.fit(X_calib=X_cal, y_calib=y_cal)
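A sketch of how SplitCP could support this internally (a hypothetical helper, not the current API; it assumes a bare estimator passed directly is already trained):
from deel.puncc.api.prediction import BasePredictor

def _as_predictor(model):
    # Pass BasePredictor instances through; wrap bare estimators,
    # assuming they are pre-trained (hence is_trained=True).
    if isinstance(model, BasePredictor):
        return model
    return BasePredictor(model, is_trained=True)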
v0.9
- Python version: Python 3.10.12 via colab
The method deel.puncc.api.utils.quantile is to be updated to compute the [...]
NotImplementedError raised when [...]
Prediction (API)
DualPredictor is supposed to be initialized with an is_trained parameter that should be a list of booleans. There are a few issues related to this variable:
1. is_trained is not updated after the fit method completes:
for count, model in enumerate(self.models):
    if not self.is_trained[count]:
        model.fit(X, y, **dictargs[count])
# end of method
Instead, you may want to rewrite it like this:
for count, model in enumerate(self.models):
    if not self.is_trained[count]:
        model.fit(X, y, **dictargs[count])
        self.is_trained[count] = True
# end of method
2. is_trained is part of the condition of an if statement:
...
if self.train:
    if self.splitter is None:
        raise RuntimeError(
            "The splitter argument is None but train is set to "
            + "True. Please provide a correct splitter to train "
            + "the underlying model."
        )
    logger.info(f"Fitting model on fold {i+cached_len}")
    predictor.fit(X_fit, y_fit, **kwargs)  # Fit K-fold predictor
# Make sure that predictor is already trained if train arg is False
elif self.train is False and predictor.is_trained is False:
    raise RuntimeError(
        "'train' argument is set to 'False' but model is not pre-trained"
    )
else:  # Skipping training
    logger.info("Skipping training.")
...
In the elif statement, predictor.is_trained is used as a boolean, but it can in fact be a list if the predictor is an instance of DualPredictor. In that case it evaluates as truthy whenever the list is non-empty, even though a model may still be untrained.
The solution suggested in point 2 is only partial. I think the best way would be to keep the is_trained variable private (rename it to _is_trained) and introduce an is_trained property which behaves as an instance variable but is implemented as a method under the hood. For DualPredictor it could look like this:
@property
def is_trained(self) -> bool:
    return self._is_trained[0] and self._is_trained[1]
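A slightly more general variant, assuming _is_trained holds one boolean per wrapped model, would also cover predictors with more than two models:
@property
def is_trained(self) -> bool:
    # True only once every wrapped model has been fitted
    return all(self._is_trained)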
is_trained is expected to be a boolean.
is_trained is expected to change after the models are trained.
v0.9
I do not have an example of code that actually fails because of this. The test named test_locally_adaptive_cp does exercise the part of the code with the aforementioned elif statement.
For simplicity, the weight normalization in the case of nonexchangeable CP should be done inside the method deel.puncc.api.calibration.BaseCalibrator.calibrate.
Also, the documentation should clarify whether the weights passed as arguments are expected to be normalized or not. A sketch of the standard normalization is given below.
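For reference, a minimal sketch of the normalization used in the nonexchangeable CP literature (Tibshirani et al., 2019), assuming calibrate receives raw weights for the calibration points and the test point:
import numpy as np

def normalize_weights(calib_weights: np.ndarray, test_weight: float) -> np.ndarray:
    # Each calibration weight is divided by the total mass,
    # including the weight attached to the test point.
    return calib_weights / (calib_weights.sum() + test_weight)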
Add a new module deel.puncc.api.corrections and implement the Bonferroni correction (and other multiple-hypothesis-testing corrections). A minimal sketch is given below.
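A minimal sketch of what such a correction could look like (the function name and signature are illustrative):
import numpy as np

def bonferroni(alpha: float, k: int) -> np.ndarray:
    # Split a global risk level alpha evenly across k simultaneous intervals.
    return np.full(k, alpha / k)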
I have two quantile CatBoost regressor models, but crq.predict(X_test, alpha=0.05) throws an error: TypeError: _quantile_dispatcher() got an unexpected keyword argument 'method'
predictor = DualPredictor(models=[reg_low, reg_high])
crq = CQR(predictor)
# fit trains the model and computes the nonconformity scores
crq.fit(X_fit=X_train, y_fit=y_train, X_calib=X_valid, y_calib=y_valid)
y_pred, y_pred_lower, y_pred_upper = crq.predict(X_test, alpha=0.05)
coverage = regression_mean_coverage(y_test, y_pred_lower, y_pred_upper)
width = regression_sharpness(y_pred_lower=y_pred_lower,
y_pred_upper=y_pred_upper)
print(f"Marginal coverage: {np.round(coverage, 2)}")
print(f"Average width: {np.round(width, 2)}")
It should run
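A likely cause, not confirmed in the thread: np.quantile only accepts the method keyword from NumPy 1.22 onward (it was called interpolation before), and older versions raise exactly this TypeError from the dispatcher. A quick check:
import numpy as np
print(np.__version__)  # needs >= 1.22 for np.quantile(..., method="inverted_cdf")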
v0.9
param = {'loss_function': 'Quantile:alpha=0.05',
'learning_rate': 0.4607417710785185,
'l2_leaf_reg': 0.03572230525884548,
'depth': 4,
'boosting_type': 'Plain',
'bootstrap_type': 'MVS',
'min_data_in_leaf': 8}
reg_low = CatBoostRegressor(task_type="GPU", devices='-1', **param)
param = {'loss_function': 'Quantile:alpha=0.95',
'learning_rate': 0.002097382718709981,
'l2_leaf_reg': 0.07411180923916862,
'depth': 1,
'boosting_type': 'Plain',
'bootstrap_type': 'Bayesian',
'min_data_in_leaf': 5,
'bagging_temperature': 9.119533192831474}
reg_high = CatBoostRegressor(task_type="GPU", devices='-1', **param)
predictor = DualPredictor(models=[reg_low, reg_high])
crq = CQR(predictor)
# fit trains the model and computes the nonconformity scores
crq.fit(X_fit=X_train, y_fit=y_train, X_calib=X_valid, y_calib=y_valid)
y_pred, y_pred_lower, y_pred_upper = crq.predict(X_test, alpha=0.05)
coverage = regression_mean_coverage(y_test, y_pred_lower, y_pred_upper)
width = regression_sharpness(y_pred_lower=y_pred_lower,
y_pred_upper=y_pred_upper)
print(f"Marginal coverage: {np.round(coverage, 2)}")
print(f"Average width: {np.round(width, 2)}")
TODO
The old branch https://github.com/deel-ai/puncc/tree/fix-cv-quantile, not merged into main, contains some corrections to the quantile procedure followed during cross-validation-plus. These must be reimplemented in the current main. It is not worth the time to merge this old code; just rewrite and recheck everything for statistical correctness.
Where:
deel/puncc/api/calibration.py
The imprecise code:
y_lo = (-1) * np.quantile(
(-1) * concat_y_lo, 1 - alpha, axis=1, method="inverted_cdf"
)
y_hi = np.quantile(
concat_y_hi, 1 - alpha, axis=1, method="inverted_cdf"
)
The 1 - alpha should be (1 - alpha)(1 + 1/n), where n is the number of training points (only in the case of jackknife+ and CV+!).
Sources: the q+ and q- formulae. Remark: for small values of alpha, outside the admissible range, the authors set the quantile to infinity. We should consider this and decide what to do in our code.
An infinite prediction interval can be useless in practice, especially for a user not acquainted with conformal prediction.
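A sketch of the corrected computation, reusing alpha, n and concat_y_hi from the code above; when the corrected level exceeds 1 (alpha too small), the quantile is set to infinity as in the paper:
import numpy as np

level = (1 - alpha) * (1 + 1 / n)  # jackknife+/CV+ corrected quantile level
if level > 1:
    y_hi = np.full(concat_y_hi.shape[0], np.inf)  # inadmissible alpha: infinite bound
else:
    y_hi = np.quantile(concat_y_hi, level, axis=1, method="inverted_cdf")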
Hi!
Thank you for creating Puncc. I'm trying to use LocallyAdaptiveCP as described here https://deel-ai.github.io/puncc/regression.html#deel.puncc.regression.LocallyAdaptiveCP
import xgboost as xgb
from deel.puncc.api.prediction import MeanVarPredictor
from deel.puncc.regression import LocallyAdaptiveCP

mu_model = xgb.XGBRegressor()
sigma_model = xgb.XGBRegressor()
# Wrap models in a mean/variance predictor
mean_var_predictor = MeanVarPredictor(models=[mu_model, sigma_model])

cp = LocallyAdaptiveCP(mean_var_predictor)
cp.fit(X_fit=X_train, y_fit=y_train, X_calib=X_test, y_calib=y_test)
But I get an error: All MAD predictions should be positive. Any idea what I am missing?
I think the error comes from puncc/deel/puncc/api/nonconformity_scores.py, line 248 at commit 6e0a8f8:
mean_absolute_deviation = absolute_difference(y_pred, y_true)
if np.any(sigma_pred < 0):
    raise RuntimeError("All MAD predictions should be positive.")
return mean_absolute_deviation / (sigma_pred + EPSILON)
But I don't know how to avoid it. Any pointers would be greatly appreciated!
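One possible workaround, not from the thread: constrain the dispersion model so its predictions cannot go negative before LocallyAdaptiveCP sees them, e.g. with a small wrapper (the class name is illustrative):
import numpy as np
import xgboost as xgb

class NonNegativeRegressor:
    # Delegates to an inner regressor and clips its predictions at zero.
    def __init__(self, model):
        self.model = model

    def fit(self, X, y, **kwargs):
        self.model.fit(X, y, **kwargs)
        return self

    def predict(self, X, **kwargs):
        return np.maximum(self.model.predict(X, **kwargs), 0.0)

sigma_model = NonNegativeRegressor(xgb.XGBRegressor())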
While working with conformal classification with JD, we realized that it makes sense to be able to force non-empty prediction sets, at least from an operational point of view.
For example, in classification via softmax, this corresponds to always including the class whose score is highest (even if very low).
In the case of (R)APS, this means bypassing the randomization step.
The RAPS paper explains this phenomenon and details the algorithm with and without the randomization step. [screenshots from the paper omitted]
I reckon we could achieve this simply by adding a flag to the class at instantiation, such as:
aps_cp = RAPS(class_predictor, [...], avoid_empty_sets=True)
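A toy sketch of the idea (a simplified APS set builder without the full randomization logic; all names are illustrative):
import numpy as np

def aps_set(softmax_scores, qhat, avoid_empty_sets=True):
    order = np.argsort(-softmax_scores)     # classes by decreasing score
    cum = np.cumsum(softmax_scores[order])  # cumulative probability mass
    keep = cum <= qhat                      # randomization/trimming would act here
    if avoid_empty_sets:
        keep[0] = True                      # always keep the top-scoring class
    return order[keep]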
How can we get the conformalizing quantile after cp.fit(...)?
Maybe we could get something like:
cp.get_conformalizer(alpha=0.043)
> 3.095
The nonconformity scores should already be stored somewhere in the object after fit; it would boil down to applying the correct (non-trivial?) quantile formula from within puncc to the array of scores, as sketched below.
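A sketch of that computation, assuming the scores are available as a 1-D array (this is the standard split-CP quantile formula):
import numpy as np

def conformalizing_quantile(scores: np.ndarray, alpha: float) -> float:
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n  # finite-sample corrected level
    if level > 1:
        return np.inf  # alpha too small for the calibration set size
    return np.quantile(scores, level, method="inverted_cdf")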
Calibration (API)
In the case of multivariate regression, we may have a vector of alphas of the same length as the number of features of the Y variable. We could modify the alpha_calib_check function so that it takes a float or an np.ndarray alpha as an argument and checks that all coordinates of alpha satisfy the desired constraint, as sketched below.
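A hedged sketch of the extension; the constraint shown, alpha >= 1/(n+1), is the usual admissibility condition, and the real function may check more:
import numpy as np

def alpha_calib_check(alpha, n):
    # Accept a float or an ndarray of per-coordinate alphas.
    alphas = np.atleast_1d(np.asarray(alpha, dtype=float))
    if np.any(alphas < 1.0 / (n + 1)):
        raise ValueError(f"Each alpha must be >= 1/(n+1) with n = {n} calibration points.")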
v0.9
Calibration (API)
The function alpha_calib_check is only designed to work for the usual CP, where all weights are equal; it should be adapted to the case of weighted CP.
Easy solution: call alpha_calib_check only in the case of usual CP, and allow returning infinity as the quantile in the case of weighted CP.
Better solution: modify alpha_calib_check to take the weighted version of CP into account; a sketch of the underlying condition is given below.
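A sketch of the admissibility condition for weighted CP, assuming normalized weights p_i for the calibration points (the test point carries the remaining mass):
import numpy as np

def weighted_quantile_is_finite(p_calib: np.ndarray, alpha: float) -> bool:
    # The conformal quantile stays finite only if the calibration
    # mass alone can reach the 1 - alpha level.
    return (1 - alpha) <= p_calib.sum()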
v0.9
There is nothing to reproduce.
Add nonconformity score and prediction set functions for conformal multivariate regression in their respective modules.
The data check in splitting.IdSplitter is not exhaustive. Also, it should be moved to the utils module.
Update deel.puncc.calibration.BaseCalibrator for multivariate regression:
- Add a method name_placeholder that returns the [...]
- Add a correction function (callable) as an argument to the calibrate method. By default, use Bonferroni.
- The methods fit and calibrate should be compatible with multivariate prediction. Specifically, calibrate should be allowed to run with a vector of alphas.
- The CvPlusCalibrator should support multivariate prediction.
- Make sure backward compatibility is respected: univariate prediction should be a special case. A toy sketch is given after this list.
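A toy sketch of a multivariate-aware calibration (hypothetical names: one conformal quantile per output coordinate, Bonferroni-corrected by default):
import numpy as np

def multivariate_quantiles(scores: np.ndarray, alpha: float) -> np.ndarray:
    # scores: (n_calib, k) nonconformity scores, one column per Y coordinate.
    n, k = scores.shape
    alphas = np.full(k, alpha / k)  # Bonferroni split of the global alpha
    levels = np.minimum(np.ceil((n + 1) * (1 - alphas)) / n, 1.0)
    return np.array([
        np.quantile(scores[:, j], levels[j], method="inverted_cdf")
        for j in range(k)
    ])

With k = 1 this reduces to the univariate split-CP quantile, which preserves backward compatibility.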