
calibration-framework's Introduction

net:cal - Uncertainty Calibration

The net:cal calibration framework is a Python 3 library for measuring and mitigating miscalibration of uncertainty estimates, e.g., by a neural network. For full API reference documentation, visit https://efs-opensource.github.io/calibration-framework.

Copyright © 2019-2023 Ruhr West University of Applied Sciences, Bottrop, Germany AND e:fs TechHub GmbH, Gaimersheim, Germany.

This Source Code Form is subject to the terms of the Apache License 2.0. If a copy of the APL2 was not distributed with this file, You can obtain one at https://www.apache.org/licenses/LICENSE-2.0.txt.




Important: updated references! If you use the net:cal framework (classification or detection) or parts of it for your research, please cite it as follows:

@InProceedings{Kueppers_2020_CVPR_Workshops,
   author = {Küppers, Fabian and Kronenberger, Jan and Shantia, Amirhossein and Haselhoff, Anselm},
   title = {Multivariate Confidence Calibration for Object Detection},
   booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
   month = {June},
   year = {2020}
}

If you use the Bayesian calibration methods with uncertainty, please cite:

@InProceedings{Kueppers_2021_IV,
   author = {Küppers, Fabian and Kronenberger, Jan and Schneider, Jonas and Haselhoff, Anselm},
   title = {Bayesian Confidence Calibration for Epistemic Uncertainty Modelling},
   booktitle = {Proceedings of the IEEE Intelligent Vehicles Symposium (IV)},
   month = {July},
   year = {2021},
}

If you use the regression calibration methods, please cite:

@InProceedings{Kueppers_2022_ECCV_Workshops,
  author    = {Küppers, Fabian and Schneider, Jonas and Haselhoff, Anselm},
  title     = {Parametric and Multivariate Uncertainty Calibration for Regression and Object Detection},
  booktitle = {European Conference on Computer Vision (ECCV) Workshops},
  year      = {2022},
  month     = {October},
  publisher = {Springer},
}


Overview

This framework is designed to calibrate the confidence estimates of classifiers like neural networks. Modern neural networks are likely to be overconfident with their predictions. However, reliable confidence estimates of such classifiers are crucial especially in safety-critical applications.

For example: given 100 predictions, each with a confidence of 80%, the observed accuracy should also be 80% (neither more nor less). This behaviour can be achieved with several calibration methods.
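To make this concrete, here is a minimal NumPy sketch of the reliability check on simulated data; all names and numbers are purely illustrative and not part of the library:

import numpy as np

# hypothetical setup: 100 predictions, each issued with 80% confidence
rng = np.random.default_rng(0)
confidences = np.full(100, 0.8)

# simulate a perfectly calibrated classifier: correct in roughly 80% of the cases
correct = rng.random(100) < 0.8

observed_accuracy = correct.mean()        # should be close to 0.8
average_confidence = confidences.mean()   # exactly 0.8
print("calibration gap:", abs(observed_accuracy - average_confidence))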

Update on version 1.3

TL;DR:

  • Regression calibration methods: train and infer methods to rescale the uncertainty of probabilistic regression models

  • New package: netcal.regression with regression calibration methods:

    • Isotonic Regression (netcal.regression.IsotonicRegression)
    • Variance Scaling (netcal.regression.VarianceScaling)
    • GP-Beta (netcal.regression.GPBeta)
    • GP-Normal (netcal.regression.GPNormal)
    • GP-Cauchy (netcal.regression.GPCauchy)
  • Implemented the netcal.regression.GPNormal method with correlation estimation and recalibration

  • Restructured netcal.metrics package to distinguish between (semantic) confidence calibration in netcal.confidence and regression uncertainty calibration in netcal.regression:

    • Expected Calibration Error (ECE - netcal.confidence.ECE)
    • Maximum Calibration Error (MCE - netcal.confidence.MCE)
    • Average Calibration Error (ACE - netcal.confidence.ACE)
    • Maximum Mean Calibration Error (MMCE - netcal.confidence.MMCE)
    • Negative Log Likelihood (NLL - netcal.regression.NLL)
    • Prediction Interval Coverage Probability (PICP - netcal.regression.PICP)
    • Pinball loss (netcal.regression.PinballLoss)
    • Uncertainty Calibration Error (UCE - netcal.regression.UCE)
    • Expected Normalized Calibration Error (ENCE - netcal.regression.ENCE)
    • Quantile Calibration Error (QCE - netcal.regression.QCE)
  • Added new types of reliability diagrams to visualize regression calibration properties:

    • Reliability Regression diagram to visualize calibration for different quantile levels (preferred - netcal.presentation.ReliabilityRegression)
    • Reliability QCE diagram to visualize the QCE over the stddev (netcal.presentation.ReliabilityQCE)
  • Updated examples

  • Minor bugfixes

  • Use the tikzplotlib library within the netcal.presentation package to enable a direct conversion of matplotlib.Figure objects to Tikz code (e.g., for use in LaTeX figures)

Within this release, we provide a new package netcal.regression to enable recalibration of probabilistic regression tasks. In probabilistic regression, a regression model does not output a single score for each prediction but rather a probability distribution (e.g., Gaussian with mean/variance) that targets the true output score. Similar to (semantic) confidence calibration, regression calibration requires that the estimated uncertainty matches the observed error distribution. There exist several definitions of regression calibration; the provided calibration methods aim to mitigate the respective miscalibration (cf. the README within the netcal.regression package). We distinguish the provided calibration methods into non-parametric and parametric methods. Non-parametric calibration methods take a probability distribution as input and apply recalibration in terms of quantiles on the cumulative distribution function (CDF). This leads to a recalibrated probability distribution that, however, has no analytical representation but is given by certain points defining the CDF. Non-parametric calibration methods are netcal.regression.IsotonicRegression and netcal.regression.GPBeta.

In contrast, parametric calibration methods also take a probability distribution as input and provide a recalibrated distribution that has an analytical expression (e.g., Gaussian). Parametric calibration methods are netcal.regression.VarianceScaling, netcal.regression.GPNormal, and netcal.regression.GPCauchy.

The calibration methods are designed to also work with multiple independent dimensions. The methods netcal.regression.IsotonicRegression and netcal.regression.VarianceScaling apply a recalibration of each dimension independently of the others. In contrast, the GP methods netcal.regression.GPBeta, netcal.regression.GPNormal, and netcal.regression.GPCauchy use a single GP to apply recalibration. Furthermore, the GP-Normal netcal.regression.GPNormal is able to model possible correlations within the training data to transform multiple univariate probability distributions of a single sample into a joint multivariate (normal) distribution with possible correlations. This calibration scheme is denoted as correlation estimation. Additionally, the GP-Normal is also able to take a multivariate (normal) distribution with correlations as input and apply a recalibration of the whole covariance matrix. This is referred to as correlation recalibration.

Besides the recalibration methods, we restructured the netcal.metrics package which now also holds several metrics for regression calibration (cf. netcal.metrics package documentation for detailed information). Finally, we provide several ways to visualize regression miscalibration within the netcal.presentation package.

All plot methods within the netcal.presentation package now support the option "tikz=True", which switches from standard matplotlib.Figure objects to strings with Tikz code. Tikz code can be directly used in LaTeX documents to render images as vector graphics with high quality. Thus, this option helps to improve the quality of your reliability diagrams if you are planning to use this library for any type of publication or document.

Update on version 1.2

TL;DR:

  • Bayesian confidence calibration: train and infer scaling methods using variational inference (VI) and MCMC sampling
  • New metrics: MMCE [13] and PICP [14] (netcal.metrics.MMCE and netcal.metrics.PICP)
  • New regularization methods: MMCE [13] and DCA [15] (netcal.regularization.MMCEPenalty and netcal.regularization.DCAPenalty)
  • Updated examples
  • Switched license from MPL2 to APL2

Now you can also use Bayesian methods to obtain uncertainty within a calibration mapping, mainly in the netcal.scaling package. We adapted Markov-Chain Monte-Carlo sampling (MCMC) as well as Variational Inference (VI) for common calibration methods. It is also easily possible to bring the scaling methods to CUDA in order to speed up the computations. We further provide new metrics to evaluate confidence calibration (MMCE) and to evaluate the quality of prediction intervals (PICP). Finally, we updated our framework with new regularization methods that can be used during model training (MMCE and DCA).

Update on version 1.1

This framework can also be used to calibrate object detection models. It has recently been shown that calibration on object detection also depends on the position and/or scale of a predicted object [12]. We provide calibration methods to perform confidence calibration w.r.t. the additional box regression branch. For this purpose, we extended the commonly used Histogram Binning [3], Logistic Calibration alias Platt scaling [10], and the Beta Calibration method [2] to also include the bounding box information in the calibration mapping. Furthermore, we provide two new methods, called Dependent Logistic Calibration and Dependent Beta Calibration, that are not only able to perform a calibration mapping w.r.t. additional bounding box information but can also model correlations and dependencies between all given quantities [12]. Those methods should be preferred over their counterparts in object detection mode.

The framework is structured as follows:

netcal
  .binning         # binning methods (confidence calibration)
  .scaling         # scaling methods (confidence calibration)
  .regularization  # regularization methods (confidence calibration)
  .presentation    # presentation methods (confidence/regression calibration)
  .metrics         # metrics for measuring miscalibration (confidence/regression calibration)
  .regression      # methods for regression uncertainty calibration (regression calibration)

examples           # example code snippets

Installation

The installation of the calibration suite is quite easy as it is registered in the Python Package Index (PyPI). You can either install this framework using pip:

$ python3 -m pip install netcal

Or invoke the following commands to install the calibration suite from source:

$ git clone https://github.com/EFS-OpenSource/calibration-framework
$ cd calibration-framework
$ python3 -m pip install .

Note: with update 1.3, we switched from setup.py to pyproject.toml according to PEP-518. The setup.py is only for backwards compatibility.

Requirements

According to requirements.txt:

  • numpy>=1.18
  • scipy>=1.4
  • matplotlib>=3.3
  • scikit-learn>=0.24
  • torch>=1.9
  • torchvision>=0.10.0
  • tqdm>=4.40
  • pyro-ppl>=1.8
  • tikzplotlib>=0.9.8
  • tensorboard>=2.2
  • gpytorch>=1.5.1

Calibration Metrics

We further distinguish between confidence calibration, which aims to recalibrate confidence estimates in the [0, 1] interval, and regression uncertainty calibration, which addresses the problem of calibration in probabilistic regression settings.

Confidence Calibration Metrics

The most common metric to determine miscalibration in the scope of classification is the Expected Calibration Error (ECE) [1]. This metric divides the confidence space into several bins and measures the observed accuracy in each bin. The gaps between observed accuracy and average confidence in each bin are summed up and weighted by the number of samples in each bin. The Maximum Calibration Error (MCE) denotes the highest gap over all bins. The Average Calibration Error (ACE) [11] denotes the average miscalibration where each bin gets weighted equally. For object detection, we implemented the Detection Calibration Error (D-ECE) [12], which is the natural extension of the ECE to object detection tasks. The miscalibration is determined w.r.t. the bounding box information provided (e.g., box location and/or scale). For this purpose, all available information gets binned in a multidimensional histogram. The accuracy is then calculated in each bin separately to determine the mean deviation between confidence and accuracy. A minimal sketch of the ECE computation is given after the following list.

  • (Detection) Expected Calibration Error [1], [12] (netcal.metrics.ECE)
  • (Detection) Maximum Calibration Error [1], [12] (netcal.metrics.MCE)
  • (Detection) Average Calibration Error [11], [12] (netcal.metrics.ACE)
  • Maximum Mean Calibration Error (MMCE) [13] (netcal.metrics.MMCE) (no position-dependency)
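For illustration, here is a simplified NumPy sketch of the ECE computation described above. It is not the netcal implementation (use netcal.metrics.ECE in practice); all names are hypothetical.

import numpy as np

def ece_sketch(confidences, correct, n_bins=10):
    # confidences: 1-D array of confidence estimates in [0, 1]
    # correct: 1-D binary array indicating whether each prediction was correct
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lower) & (confidences <= upper)
        if not np.any(in_bin):
            continue
        bin_accuracy = np.mean(correct[in_bin])        # observed accuracy in this bin
        bin_confidence = np.mean(confidences[in_bin])  # average confidence in this bin
        # gap weighted by the relative number of samples in the bin
        ece += np.mean(in_bin) * np.abs(bin_accuracy - bin_confidence)
    return ece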

Regression Calibration Metrics

In regression calibration, the most common metric is the Negative Log Likelihood (NLL), which measures the quality of a predicted probability distribution w.r.t. the ground truth:

  • Negative Log Likelihood (NLL) (netcal.metrics.NLL)

The metrics Pinball Loss, Prediction Interval Coverage Probability (PICP), and Quantile Calibration Error (QCE) evaluate the estimated distributions by means of the predicted quantiles. For example, if a forecaster makes 100 predictions using a probability distribution for each estimate targeting the true ground truth, we can measure the coverage of the ground-truth samples for a certain quantile level (e.g., the 95% quantile). If the fraction of ground-truth samples falling into a certain predicted quantile is above or below the specified quantile level, the forecaster is said to be miscalibrated in terms of quantile calibration (a minimal coverage sketch is given after the following list). Appropriate metrics in this context are

  • Pinball Loss (netcal.metrics.PinballLoss)
  • Prediction Interval Coverage Probability (PICP) [14] (netcal.metrics.PICP)
  • Quantile Calibration Error (QCE) [16] (netcal.metrics.QCE)
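A minimal sketch of this coverage check for Gaussian forecasts, assuming NumPy arrays mean, stddev, and ground_truth of shape (n_samples,) as in the regression examples below (illustrative only; use netcal.metrics.PICP/QCE in practice):

import numpy as np
from scipy.stats import norm

quantile = 0.95  # quantile level to evaluate

# central 95% prediction interval of each Gaussian forecast
lower = norm.ppf(0.5 * (1.0 - quantile), loc=mean, scale=stddev)
upper = norm.ppf(0.5 * (1.0 + quantile), loc=mean, scale=stddev)

# empirical coverage: fraction of ground-truth samples inside the predicted interval;
# for a quantile-calibrated forecaster, this should be close to 0.95
coverage = np.mean((ground_truth >= lower) & (ground_truth <= upper))
print("empirical coverage:", coverage)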

Finally, if we work with normal distributions, we can measure the quality of the predicted variance/stddev estimates. For variance calibration, it is required that the predicted variance matches the observed error variance, which is equivalent to the Mean Squared Error (MSE) of the predictions (see the sketch after the following list). Metrics for variance calibration are

  • Expected Normalized Calibration Error (ENCE) [17] (netcal.metrics.ENCE)
  • Uncertainty Calibration Error (UCE) [18] (netcal.metrics.UCE)
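A minimal sketch of this variance check, again assuming NumPy arrays mean, stddev, and ground_truth of shape (n_samples,) as in the regression examples below:

import numpy as np

# for a variance-calibrated forecaster, the average predicted variance should
# match the mean squared error of the predicted means
mean_predicted_variance = np.mean(stddev ** 2)
mean_squared_error = np.mean((ground_truth - mean) ** 2)

print("mean predicted variance:", mean_predicted_variance)
print("mean squared error:     ", mean_squared_error)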

Methods

We further give an overview of the post-hoc calibration methods for (semantic) confidence calibration as well as of the methods for regression uncertainty calibration.

Confidence Calibration Methods

The post-hoc calibration methods are separated into binning and scaling methods. The binning methods divide the available information into several bins (like ECE or D-ECE) and perform calibration on each bin. The scaling methods scale the confidence estimates or logits directly to calibrated confidence estimates - on detection calibration, this is done w.r.t. the additional regression branch of a network.

Important: if you use the detection mode, you need to specify the flag "detection=True" in the constructor of the respective method (this is not necessary for netcal.scaling.LogisticCalibrationDependent and netcal.scaling.BetaCalibrationDependent).

Most of the calibration methods are designed for binary classification tasks. For binning methods, multi-class calibration is performed in "one vs. all" by default.
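For example, a minimal sketch of multi-class calibration with Histogram Binning, assuming the same confidences/ground_truth arrays as in the classification example below; the one-vs-all handling happens internally:

from netcal.binning import HistogramBinning

hist = HistogramBinning(bins=10)       # 'bins' denotes the number of confidence bins
hist.fit(confidences, ground_truth)    # confidences: (n_samples, n_classes), ground_truth: (n_samples,)
calibrated = hist.transform(confidences)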

Some methods such as "Isotonic Regression" utilize methods from the scikit-learn API [9].

Another group comprises the regularization tools, which are added to the loss during the training of a neural network.

Binning

Implemented binning methods are:

  • Histogram Binning for classification [3], [4] and object detection [12] (netcal.binning.HistogramBinning)
  • Isotonic Regression [4],[5] (netcal.binning.IsotonicRegression)
  • Bayesian Binning into Quantiles (BBQ) [1] (netcal.binning.BBQ)
  • Ensemble of Near Isotonic Regression (ENIR) [6] (netcal.binning.ENIR)

Scaling

Implemented scaling methods are:

  • Logistic Calibration/Platt Scaling for classification [10] and object detection [12] (netcal.scaling.LogisticCalibration)
  • Dependent Logistic Calibration for object detection [12] (netcal.scaling.LogisticCalibrationDependent) - on detection, this method is able to capture correlations between all input quantities and should be preferred over Logistic Calibration for object detection
  • Temperature Scaling for classification [7] and object detection [12] (netcal.scaling.TemperatureScaling)
  • Beta Calibration for classification [2] and object detection [12] (netcal.scaling.BetaCalibration)
  • Dependent Beta Calibration for object detection [12] (netcal.scaling.BetaCalibrationDependent) - on detection, this method is able to capture correlations between all input quantities and should be preferred over Beta Calibration for object detection

New on version 1.2: you can provide a parameter named "method" to the constructor of each scaling method. This parameter can be one of the following:

  • 'mle': use the method feed-forward with maximum likelihood estimates on the calibration parameters (standard)
  • 'momentum': use non-convex momentum optimization (e.g., default on dependent beta calibration)
  • 'mcmc': use Markov-Chain Monte-Carlo sampling to obtain multiple parameter sets in order to quantify uncertainty in the calibration
  • 'variational': use Variational Inference to obtain multiple parameter sets in order to quantify uncertainty in the calibration
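A brief sketch of how the "method" parameter might be passed, assuming the confidences/ground_truth arrays from the classification example below (illustrative only; see the method docstrings for the exact options supported by each class):

from netcal.scaling import BetaCalibration

# standard maximum likelihood estimation of the calibration parameters
beta_mle = BetaCalibration(method='mle')
beta_mle.fit(confidences, ground_truth)
calibrated = beta_mle.transform(confidences)

# Bayesian variant: obtain multiple parameter sets via variational inference
beta_vi = BetaCalibration(method='variational')
beta_vi.fit(confidences, ground_truth)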

Regularization

With some effort, it is also possible to push the model training towards calibrated confidences by regularization. Implemented regularization methods are:

  • Confidence Penalty [8] (netcal.regularization.confidence_penalty and netcal.regularization.ConfidencePenalty - the latter one is a PyTorch implementation that might be used as a regularization term)
  • Maximum Mean Calibration Error (MMCE) [13] (netcal.regularization.MMCEPenalty - PyTorch regularization module)
  • DCA [15] (netcal.regularization.DCAPenalty - PyTorch regularization module)

Regression Calibration Methods

The netcal library provides post-hoc methods to recalibrate the uncertainty of probabilistic regression tasks. We distinguish the calibration methods into non-parametric and parametric methods. Non-parametric calibration methods take a probability distribution as input and apply recalibration in terms of quantiles on the cumulative distribution function (CDF). This leads to a recalibrated probability distribution that, however, has no analytical representation but is given by certain points defining the CDF. In contrast, parametric calibration methods also take a probability distribution as input and provide a recalibrated distribution that has an analytical expression (e.g., Gaussian).

Non-parametric calibration

The common non-parametric recalibration methods use the predicted cumulative distribution functions (CDF) to learn a mapping from the uncalibrated quantiles to the observed quantile coverage. Using a recalibrated CDF, it is possible to derive the respective density functions (PDF) or to extract statistical moments such as mean and variance. Non-parametric calibration methods within the netcal.regression package are

  • Isotonic Regression [19] which applies a (marginal) recalibration of the CDF (netcal.regression.IsotonicRegression)
  • GP-Beta [20] which applies an input-dependent recalibration of the CDF using a Gaussian process for parameter estimation (netcal.regression.GPBeta)

Parametric calibration

The parametric recalibration methods apply a recalibration of the estimated distributions so that the resulting distribution is given in terms of a distribution with an analytical expression (e.g., a Gaussian). These methods are suitable for applications where a parametric distribution is required for subsequent applications, e.g., within Kalman filtering. We implemented the following parametric calibration methods:

  • Variance Scaling [17], [18], which is essentially a temperature scaling for the predicted variance (netcal.regression.VarianceScaling)
  • GP-Normal [16] which applies an input-dependent rescaling of the predicted variance (netcal.regression.GPNormal). Note: this method is also able to capture correlations between multiple input dimensions and can return a joint multivariate normal distribution as calibration output (cf. examples section).
  • GP-Cauchy [16] is similar to GP-Normal but utilizes a Cauchy distribution as calibration output (netcal.regression.GPCauchy)

Visualization

For visualization of miscalibration, one can use Confidence Histograms & Reliability Diagrams for (semantic) confidence calibration as well as for regression uncertainty calibration. Within confidence calibration, these diagrams are similar to the ECE: the output space is divided into equally spaced bins, and the calibration gap between bin accuracy and bin confidence is visualized as a histogram.

For detection calibration, the miscalibration can be visualized either along one additional box quantity (e.g., the x-position of the predictions) or over two additional box quantities in terms of a heatmap.

For regression uncertainty calibration, the reliability diagram shows the relative prediction interval coverage of the ground-truth samples for different quantile levels.

  • Reliability Diagram [1], [12] (netcal.presentation.ReliabilityDiagram)
  • Reliability Diagram for regression calibration (netcal.presentation.ReliabilityRegression)
  • Reliability QCE Diagram [16] shows the Quantile Calibration Error (QCE) for different variance levels (netcal.presentation.ReliabilityQCE)

New on version 1.3: All plot methods within the netcal.presentation package now support the option "tikz=True", which switches from standard matplotlib.Figure objects to strings with Tikz code. Tikz code can be directly used in LaTeX documents to render images as vector graphics with high quality. Thus, this option helps to improve the quality of your reliability diagrams if you are planning to use this library for any type of publication or document.

Examples

The calibration methods work with the predicted confidence estimates of a neural network and, in detection mode, also with the bounding box regression branch.

Classification

This is a basic example which uses softmax predictions of a classification task with 10 classes and the given NumPy arrays:

ground_truth  # this is a NumPy 1-D array with ground truth digits between 0-9 - shape: (n_samples,)
confidences   # this is a NumPy 2-D array with confidence estimates between 0-1 - shape: (n_samples, n_classes)

Post-hoc Calibration for Classification

This is an example for netcal.scaling.TemperatureScaling but also works for every calibration method (mind the different constructor parameters):

import numpy as np
from netcal.scaling import TemperatureScaling

temperature = TemperatureScaling()
temperature.fit(confidences, ground_truth)
calibrated = temperature.transform(confidences)

Measuring Miscalibration for Classification

The miscalibration can be determined with the ECE:

from netcal.metrics import ECE

n_bins = 10

ece = ECE(n_bins)
uncalibrated_score = ece.measure(confidences, ground_truth)
calibrated_score = ece.measure(calibrated, ground_truth)

Visualizing Miscalibration for Classification

The miscalibration can be visualized with a Reliability Diagram:

from netcal.presentation import ReliabilityDiagram

n_bins = 10

diagram = ReliabilityDiagram(n_bins)
diagram.plot(confidences, ground_truth)  # visualize miscalibration of uncalibrated
diagram.plot(calibrated, ground_truth)   # visualize miscalibration of calibrated

# you can also use this method to create a tikz file with tikz code
# that can be directly used within LaTeX documents:
diagram.plot(confidences, ground_truth, tikz=True, filename="diagram.tikz")

Detection (Confidence of Objects)

In this example, we use confidence predictions of an object detection model with the corresponding x-position of the predicted bounding boxes. The ground truth provided to the calibration algorithm denotes whether a bounding box has matched a ground-truth box at a certain IoU with the correct class label.

matched                # binary NumPy 1-D array (0, 1) that indicates if a bounding box has matched a ground truth at a certain IoU with the right label - shape: (n_samples,)
confidences            # NumPy 1-D array with confidence estimates between 0-1 - shape: (n_samples,)
relative_x_position    # NumPy 1-D array with relative center-x position between 0-1 of each prediction - shape: (n_samples,)

Post-hoc Calibration for Detection

This is an example for netcal.scaling.LogisticCalibration and netcal.scaling.LogisticCalibrationDependent but also works for every calibration method (mind the different constructor parameters):

import numpy as np
from netcal.scaling import LogisticCalibration, LogisticCalibrationDependent

input = np.stack((confidences, relative_x_position), axis=1)

lr = LogisticCalibration(detection=True, use_cuda=False)    # flag 'detection=True' is mandatory for this method
lr.fit(input, matched)
calibrated = lr.transform(input)

lr_dependent = LogisticCalibrationDependent(use_cuda=False) # flag 'detection=True' is not necessary as this method is only defined for detection
lr_dependent.fit(input, matched)
calibrated = lr_dependent.transform(input)

Measuring Miscalibration for Detection

The miscalibration can be determined with the D-ECE:

from netcal.metrics import ECE

n_bins = [10, 10]
input_calibrated = np.stack((calibrated, relative_x_position), axis=1)

ece = ECE(n_bins, detection=True)           # flag 'detection=True' is mandatory for this method
uncalibrated_score = ece.measure(input, matched)
calibrated_score = ece.measure(input_calibrated, matched)

Visualizing Miscalibration for Detection

The miscalibration can be visualized with a Reliability Diagram:

from netcal.presentation import ReliabilityDiagram

n_bins = [10, 10]

diagram = ReliabilityDiagram(n_bins, detection=True)    # flag 'detection=True' is mandatory for this method
diagram.plot(input, matched)                # visualize miscalibration of uncalibrated
diagram.plot(input_calibrated, matched)     # visualize miscalibration of calibrated

# you can also use this method to create a tikz file with tikz code
# that can be directly used within LaTeX documents:
diagram.plot(input, matched, tikz=True, filename="diagram.tikz")

Uncertainty in Confidence Calibration

We can also quantify the uncertainty in a calibration mapping if we use a Bayesian view on the calibration models. We can sample multiple parameter sets using MCMC sampling or VI. In this example, we reuse the data of the previous detection example.

matched                # binary NumPy 1-D array (0, 1) that indicates if a bounding box has matched a ground truth at a certain IoU with the right label - shape: (n_samples,)
confidences            # NumPy 1-D array with confidence estimates between 0-1 - shape: (n_samples,)
relative_x_position    # NumPy 1-D array with relative center-x position between 0-1 of each prediction - shape: (n_samples,)

Post-hoc Calibration with Uncertainty

This is an example for netcal.scaling.LogisticCalibration and netcal.scaling.LogisticCalibrationDependent but also works for every calibration method (mind the different constructor parameters):

import numpy as np
from netcal.scaling import LogisticCalibration, LogisticCalibrationDependent

input = np.stack((confidences, relative_x_position), axis=1)

# flag 'detection=True' is mandatory for this method
# use Variational Inference with 2000 optimization steps for creating this calibration mapping
lr = LogisticCalibration(detection=True, method='variational', vi_epochs=2000, use_cuda=False)
lr.fit(input, matched)

# 'num_samples=1000': sample 1000 parameter sets from VI
# thus, 'calibrated' has shape [1000, n_samples]
calibrated = lr.transform(input, num_samples=1000)

# flag 'detection=True' is not necessary as this method is only defined for detection
# this time, use Markov-Chain Monte-Carlo sampling with 250 warm-up steps, 250 parameter samples and one chain
lr_dependent = LogisticCalibrationDependent(method='mcmc',
                                            mcmc_warmup_steps=250, mcmc_steps=250, mcmc_chains=1,
                                            use_cuda=False)
lr_dependent.fit(input, matched)

# 'num_samples=1000': although we have only sampled 250 different parameter sets,
# we can randomly sample 1000 parameter sets from MCMC
calibrated = lr_dependent.transform(input, num_samples=1000)

Measuring Miscalibration with Uncertainty

You can directly pass the output to the D-ECE and PICP instances to measure miscalibration and uncertainty quality:

from netcal.metrics import ECE
from netcal.metrics import PICP

n_bins = 10
ece = ECE(n_bins, detection=True)
picp = PICP(n_bins, detection=True)

# the following function calls are equivalent:
miscalibration = ece.measure(calibrated, matched, uncertainty="mean")
miscalibration = ece.measure(np.mean(calibrated, axis=0), matched)

# now determine uncertainty quality
uncertainty = picp.measure(calibrated, matched, kind="confidence")

print("D-ECE:", miscalibration)
print("PICP:", uncertainty.picp) # prediction coverage probability
print("MPIW:", uncertainty.mpiw) # mean prediction interval width

If we want to measure miscalibration and uncertainty quality by means of the relative x position, we need to broadcast the according information:

# broadcast and stack x information to calibrated information
broadcasted = np.broadcast_to(relative_x_position, calibrated.shape)
calibrated = np.stack((calibrated, broadcasted), axis=2)

n_bins = [10, 10]
ece = ECE(n_bins, detection=True)
picp = PICP(n_bins, detection=True)

# the following function calls are equivalent:
miscalibration = ece.measure(calibrated, matched, uncertainty="mean")
miscalibration = ece.measure(np.mean(calibrated, axis=0), matched)

# now determine uncertainty quality
uncertainty = picp.measure(calibrated, matched, uncertainty="mean")

print("D-ECE:", miscalibration)
print("PICP:", uncertainty.picp) # prediction coverage probability
print("MPIW:", uncertainty.mpiw) # mean prediction interval width

Probabilistic Regression

The following example shows how to use the post-hoc calibration methods for probabilistic regression tasks. Within probabilistic regression, a forecaster (e.g. with Gaussian prior) outputs a mean and a variance targeting the true ground-truth score. Thus, the following information is required to construct the calibration methods:

mean          # NumPy n-D array holding the estimated mean of shape (n, d) with n samples and d dimensions
stddev        # NumPy n-D array holding the estimated stddev (independent) of shape (n, d) with n samples and d dimensions
ground_truth  # NumPy n-D array holding the ground-truth scores of shape (n, d) with n samples and d dimensions

Post-hoc Calibration (Parametric)

This information might result, e.g., from object detection, where the position information of the objects (bounding boxes) is parametrized by normal distributions. We start by using parametric calibration methods such as Variance Scaling:

from netcal.regression import VarianceScaling, GPNormal

# the initialization of the Variance Scaling method is pretty simple
varscaling = VarianceScaling()

# the GP-Normal requires a little bit more parameters to parametrize the underlying GP
gpnormal = GPNormal(
    n_inducing_points=12,    # number of inducing points
    n_random_samples=256,    # random samples used for likelihood
    n_epochs=256,            # optimization epochs
    use_cuda=False,          # can also use CUDA for computations
)

# fit the Variance Scaling
# note that we need to pass the first argument as tuple as the input distributions
# are parametrized by mean and variance
varscaling.fit((mean, stddev), ground_truth)

# fit GP-Normal - similar parameters here!
gpnormal.fit((mean, stddev), ground_truth)

# transform distributions to obtain recalibrated stddevs
stddev_varscaling = varscaling.transform((mean, stddev))  # NumPy array with stddev - has shape (n, d)
stddev_gpnormal = gpnormal.transform((mean, stddev))  # NumPy array with stddev - has shape (n, d)

Post-hoc Calibration (Non-Parametric)

We can also use non-parametric calibration methods. In this case, the calibrated distributions are defined by their density (PDF) and cumulative (CDF) functions:

from netcal.regression import IsotonicRegression, GPBeta

# the initialization of the Isotonic Regression method is pretty simple
isotonic = IsotonicRegression()

# the GP-Beta requires a few more parameters to parametrize the underlying GP
gpbeta = GPBeta(
    n_inducing_points=12,    # number of inducing points
    n_random_samples=256,    # random samples used for likelihood
    n_epochs=256,            # optimization epochs
    use_cuda=False,          # can also use CUDA for computations
)

# fit the Isotonic Regression
# note that we need to pass the first argument as tuple as the input distributions
# are parametrized by mean and variance
isotonic.fit((mean, stddev), ground_truth)

# fit GP-Beta - similar parameters here!
gpbeta.fit((mean, stddev), ground_truth)

# transform distributions to obtain recalibrated distributions
t_isotonic, pdf_isotonic, cdf_isotonic = isotonic.transform((mean, stddev))
t_gpbeta, pdf_gpbeta, cdf_gpbeta = gpbeta.transform((mean, stddev))

# Note: the transformation results are NumPy n-d arrays with shape (t, n, d)
# with t as the number of points that define the PDF/CDF,
# with n as the number of samples, and
# with d as the number of dimensions.

# The resulting variables can be interpreted as follows:
# - t_isotonic/t_gpbeta: x-values of the PDF/CDF with shape (t, n, d)
# - pdf_isotonic/pdf_gpbeta: y-values of the PDF with shape (t, n, d)
# - cdf_isotonic/cdf_gpbeta: y-values of the CDF with shape (t, n, d)

You can visualize the non-parametric distribution of a single sample within a single dimension using Matplotlib:

from matplotlib import pyplot as plt

fig, (ax1, ax2) = plt.subplots(2, 1)

# plot the recalibrated PDF within a single axis after calibration
ax1.plot(
    t_isotonic[:, 0, 0], pdf_isotonic[:, 0, 0],
    t_gpbeta[:, 0, 0], pdf_gpbeta[:, 0, 0],
)

# plot the recalibrated CDF within a single axis after calibration
ax2.plot(
    t_isotonic[:, 0, 0], cdf_isotonic[:, 0, 0],
    t_gpbeta[:, 0, 0], cdf_gpbeta[:, 0, 0],
)

plt.show()

We provide a method to extract the statistical moments expectation and variance from the recalibrated cumulative distribution function (CDF). Note that we advise using one of the parametric calibration methods if you need, e.g., a Gaussian for subsequent applications such as Kalman filtering.

from netcal import cumulative_moments

# extract the expectation (mean) and the variance from the recalibrated CDF
ymean_isotonic, yvar_isotonic = cumulative_moments(t_isotonic, cdf_isotonic)
ymean_gpbeta, yvar_gpbeta = cumulative_moments(t_gpbeta, cdf_gpbeta)

# each of these variables has shape (n, d) and holds the
# mean/variance for each sample and in each dimension

Correlation Estimation and Recalibration

With the GP-Normal netcal.regression.GPNormal, it is also possible to detect possible correlations between multiple input dimensions that have originally been trained/modelled independently from each other:

from netcal.regression import GPNormal

# the GP-Normal requires a little bit more parameters to parametrize the underlying GP
gpnormal = GPNormal(
    n_inducing_points=12,    # number of inducing points
    n_random_samples=256,    # random samples used for likelihood
    n_epochs=256,            # optimization epochs
    use_cuda=False,          # can also use CUDA for computations
    correlations=True,       # enable correlation capturing between the input dimensions
)

# fit GP-Normal
# note that we need to pass the first argument as tuple as the input distributions
# are parametrized by mean and variance
gpnormal.fit((mean, stddev), ground_truth)

# transform distributions to obtain recalibrated covariance matrices
cov = gpnormal.transform((mean, stddev))  # NumPy array with covariance - has shape (n, d, d)

# note: if the input is already given by multivariate normal distributions
# (stddev is a covariance and has shape (n, d, d)), the method works similarly
# and simply applies a covariance recalibration of the input

Measuring Miscalibration for Regression

Measuring miscalibration is as simple as the training of the methods:

import numpy as np
from netcal.metrics import NLL, PinballLoss, QCE

# define the quantile levels that are used to evaluate the pinball loss and the QCE
quantiles = np.linspace(0.1, 0.9, 9)

# initialize NLL, Pinball, and QCE objects
nll = NLL()
pinball = PinballLoss()
qce = QCE(marginal=True)  # if "marginal=False", we can also measure the QCE by means of the predicted variance levels (realized by binning the variance space)

# measure miscalibration with the initialized metrics
# Note: the parameter "reduction" has a major influence to the return shape of the metrics
# see the method docstrings for detailed information
nll.measure((mean, stddev), ground_truth, reduction="mean")
pinball.measure((mean, stddev), ground_truth, q=quantiles, reduction="mean")
qce.measure((mean, stddev), ground_truth, q=quantiles, reduction="mean")

Visualizing Miscalibration for Regression

Example visualization code block using the netcal.presentation.ReliabilityRegression class:

import numpy as np
from netcal.presentation import ReliabilityRegression

# define the quantile levels that are used for the quantile evaluation
quantiles = np.linspace(0.1, 0.9, 9)

# initialize the diagram object
diagram = ReliabilityRegression(quantiles=quantiles)

# visualize miscalibration with the initialized object
diagram.plot((mean, stddev), ground_truth)

# you can also use this method to create a tikz file with tikz code
# that can be directly used within LaTeX documents:
diagram.plot((mean, stddev), ground_truth, tikz=True, filename="diagram.tikz")

References

[1] Naeini, Mahdi Pakdaman, Gregory Cooper, and Milos Hauskrecht: "Obtaining well calibrated probabilities using bayesian binning." Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[2] Kull, Meelis, Telmo Silva Filho, and Peter Flach: "Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers." Artificial Intelligence and Statistics, PMLR 54:623-631, 2017.

[3] Zadrozny, Bianca and Elkan, Charles: "Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers." In ICML, pp. 609–616, 2001.

[4] Zadrozny, Bianca and Elkan, Charles: "Transforming classifier scores into accurate multiclass probability estimates." In KDD, pp. 694–699, 2002.

[5] Ryan J Tibshirani, Holger Hoefling, and Robert Tibshirani: "Nearly-isotonic regression." Technometrics, 53(1):54–61, 2011.

[6] Naeini, Mahdi Pakdaman, and Gregory F. Cooper: "Binary classifier calibration using an ensemble of near isotonic regression models." 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016.

[7] Chuan Guo, Geoff Pleiss, Yu Sun and Kilian Q. Weinberger: "On Calibration of Modern Neural Networks." Proceedings of the 34th International Conference on Machine Learning, 2017.

[8] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L. and Hinton, G.: “Regularizing neural networks by penalizing confident output distributions.” CoRR, 2017.

[9] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E.: "Scikit-learn: Machine Learning in Python." In Journal of Machine Learning Research, volume 12 pp 2825-2830, 2011.

[10] Platt, John: "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods." Advances in large margin classifiers, 10(3): 61–74, 1999.

[11] Neumann, Lukas, Andrew Zisserman, and Andrea Vedaldi: "Relaxed Softmax: Efficient Confidence Auto-Calibration for Safe Pedestrian Detection." Conference on Neural Information Processing Systems (NIPS) Workshop MLITS, 2018.

[12] Fabian Küppers, Jan Kronenberger, Amirhossein Shantia, and Anselm Haselhoff: "Multivariate Confidence Calibration for Object Detection." The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020.

[13] Kumar, Aviral, Sunita Sarawagi, and Ujjwal Jain: "Trainable calibration measures for neural networks from kernel mean embeddings." International Conference on Machine Learning, 2018.

[14] Jiayu Yao, Weiwei Pan, Soumya Ghosh, and Finale Doshi-Velez: "Quality of Uncertainty Quantification for Bayesian Neural Network Inference." Workshop on Uncertainty and Robustness in Deep Learning, ICML, 2019

[15] Liang, Gongbo, et al.: "Improved trainable calibration method for neural networks on medical imaging classification." arXiv preprint arXiv:2009.04057 (2020)

[16] Fabian Küppers, Jonas Schneider, and Anselm Haselhoff: "Parametric and Multivariate Uncertainty Calibration for Regression and Object Detection." In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Springer, October 2022.

[17] Levi, Dan, et al.: "Evaluating and calibrating uncertainty prediction in regression tasks." arXiv preprint arXiv:1905.11659 (2019).

[18] Laves, Max-Heinrich, et al.: "Well-calibrated regression uncertainty in medical imaging with deep learning." Medical Imaging with Deep Learning. PMLR, 2020.

[19] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon: "Accurate uncertainties for deep learning using calibrated regression." International Conference on Machine Learning. PMLR, 2018.

[20] Hao Song, Tom Diethe, Meelis Kull and Peter Flach: "Distribution calibration for regression." International Conference on Machine Learning. PMLR, 2019.

calibration-framework's People

Contributors

byzhang, fabiankueppers, wirthual


calibration-framework's Issues

Basic binary classification case

Hi, I'm having problems understanding what's the proper use of the library for a very simple binary classifier. I have a 1-D array of binary labels {0, 1} and a 1-D array of model predictions with probability values p in range (0, 1). Those values reflect the probability of a positive class.

Plugging those values into e.g. the reliability diagram, I got the following plot:
[reliability diagram screenshot omitted]
Confidence histogram makes sense to me, as most samples are negative and classifier correctly assigns a low probability. But I'm not sure how to interpret the reliability diagram -- what do the dark red bars suggest here? Also, ECE I received is very high (>0.8).

I tried to reverse the probabilities for negative samples, i.e. if a label is 0, then the probability is (1-p). This gives a more justifiable plot:
[screenshot of the adjusted reliability diagram omitted]

Could you confirm that for negative samples the probability should reflect probability of a negative class, not the positive class, even in a binary classification case?

Also, it might be worth clarifying that the confidence estimates for some functions (e.g. Platt's / temperature scaling) are supposed to be in the prediction space and not logit space. After reading official papers and implementations it might be confusing because conversion prediction -> logit is done behind the scenes, and information in docs about this would be helpful.

How to evaluate using D-ECE

Hi, I am trying to evaluate my object detection model.
Should I concatenate the predictions for all images together, or how should I do it?
Is there code with a working example?
Thanks

Pickling objects

I wanted to pickle the LogisticCalibration() class after I had fit it (for later re-use), but I was getting an error related to can't pickle _thread.Rlock objects.

I was able to find a work-around by setting logger in the class to None. Seems a bit hacky, but it did work. Might be something to think about in future releases.

Reliability diagram correctness

Hi, when I tried to plot the reliability diagram for a CIFAR-10 ResNet-110 model, the plot contains blue regions filled for the low-level bins even though there are no probability values present in those bins. Is this something that is default in the code?

Problems measuring miscalibration

I'm trying to do this, as you pointed out:
uncalibrated_score = ece.measure(confidences)

but I'm getting this error:
TypeError: measure() is missing 1 required positional argument: 'y'

confidences is a NumPy object already:
{ndarray: (512, 8)}

### EDIT

I've added the ground truth as you did in one of your examples.
uncalibrated_score = ece.measure(confidences, ground_truth)

Where ground_truth are the encoded labels. Neither confidences nor ground_truth have NaN values, but I'm getting:
TypeError: nan_to_num() got an unexpected keyword argument 'nan'

### EDIT
YOU NEED NUMPY >= 1.17 FOR THIS TO WORK.

Pyro import fails in 1.2.1 netcal.scaling

In 1.2.1 importing netcal.scaling results in the following error:

Traceback (most recent call last):
  File "/path_to_script/ecal.py", line 3, in <module>
    from netcal.scaling import TemperatureScaling, LogisticCalibration
  File "/path_to_miniconda/lib/python3.8/site-packages/netcal/scaling/__init__.py", line 28, in <module>
    from .AbstractLogisticRegression import AbstractLogisticRegression
  File "/path_to_miniconda/lib/python3.8/site-packages/netcal/scaling/AbstractLogisticRegression.py", line 26, in <module>
    import pyro
  File "/path_to_miniconda/lib/python3.8/site-packages/pyro/__init__.py", line 4, in <module>
    import pyro.poutine as poutine
  File "/path_to_miniconda/lib/python3.8/site-packages/pyro/poutine/__init__.py", line 4, in <module>
    from .handlers import (
  File "/path_to_miniconda/lib/python3.8/site-packages/pyro/poutine/handlers.py", line 60, in <module>
    from .collapse_messenger import CollapseMessenger
  File "/path_to_miniconda/lib/python3.8/site-packages/pyro/poutine/collapse_messenger.py", line 7, in <module>
    from pyro.distributions.distribution import COERCIONS
  File "/path_to_miniconda/lib/python3.8/site-packages/pyro/distributions/__init__.py", line 4, in <module>
    import pyro.distributions.torch_patch  # noqa F403
  File "/path_to_miniconda/lib/python3.8/site-packages/pyro/distributions/torch_patch.py", line 87, in <module>
    @patch_dependency("torch.distributions.constraints._CorrCholesky.check")
  File "/path_to_miniconda/lib/python3.8/site-packages/pyro/distributions/torch_patch.py", line 18, in patch_dependency
    module = getattr(module, part)
AttributeError: module 'torch.distributions.constraints' has no attribute '_CorrCholesky'

This is with pytorch 1.7.1, python 3.8, pyro-ppl 1.7.0. Reproduction is as simple as import netcal.scaling.

Getting Error While Installing Netcal (Pyro-ppl library Issue)

File "/workspace/system-paper/main_util.py", line 23, in
from netcal.scaling import TemperatureScaling
File "/opt/conda/lib/python3.6/site-packages/netcal/scaling/init.py", line 28, in
from .AbstractLogisticRegression import AbstractLogisticRegression
File "/opt/conda/lib/python3.6/site-packages/netcal/scaling/AbstractLogisticRegression.py", line 26, in
import pyro
File "/opt/conda/lib/python3.6/site-packages/pyro/init.py", line 4, in
import pyro.poutine as poutine
File "/opt/conda/lib/python3.6/site-packages/pyro/poutine/init.py", line 4, in
from .handlers import (
File "/opt/conda/lib/python3.6/site-packages/pyro/poutine/handlers.py", line 60, in
from .collapse_messenger import CollapseMessenger
File "/opt/conda/lib/python3.6/site-packages/pyro/poutine/collapse_messenger.py", line 7, in
from pyro.distributions.distribution import COERCIONS
File "/opt/conda/lib/python3.6/site-packages/pyro/distributions/init.py", line 4, in
import pyro.distributions.torch_patch # noqa F403
File "/opt/conda/lib/python3.6/site-packages/pyro/distributions/torch_patch.py", line 87, in
@patch_dependency("torch.distributions.constraints._CorrCholesky.check")
File "/opt/conda/lib/python3.6/site-packages/pyro/distributions/torch_patch.py", line 18, in patch_dependency
module = getattr(module, part)
AttributeError: module 'torch.distributions.constraints' has no attribute '_CorrCholesky

NaN outputs

Sometimes I get NaN with the transform function.
In these cases, the below warning is observed when calling fit function:
/usr/local/lib/python3.10/dist-packages/netcal/binning/HistogramBinning.py:280: RuntimeWarning: invalid value encountered in divide
calibrated = np.divide(calibrated, normalizer)

ENIR

The following error sometimes occurs when working with ENIR

ValueError: Array of size zero to minimum decrement operation that has no identity

Error in the code of temperature scaling

Based on the formula for TS, which is softmax(z/T) where T is the temperature. But in the repository, the code represents softmax(z*T), where the weight T is calculated. Can you please confirm this?

Missing sdist URL in pypi

Hi,

thank you for the library! I am using it to compute some calibration metrics like ECE and so on, so it came in really handy.

I have one small request: could you provide a link to the sdist in pypi? The reason I need it is a fairly unusual one - I want to use netcal within a custom pyodide bin in a frontend application and wanted to create a pyodide package with their mkpkg wrapper (see https://pyodide.readthedocs.io/en/latest/new_packages.html). This failed with the error
Exception: No sdist URL found for package netcal (https://pypi.org/project/netcal/).

I can root around it, of course, but it would be an easy fix to make it work :)

Information on relative_x_position variable

Hi and thanks for this great repo,

I'm trying to use the repo to calibrate the confidence scores from a BERT model I fine-tuned. My problem is a binary classification and I want to use Platt Scaling (LogisticCalibration class). I am not sure I understand what the relative_x_position variable refers to? Could you please help me understand this?

Thanks a lot in advance.

Incorrect documentation for ECE usage

In the readme and API reference docs:

from netcal.metrics import ECE

n_bins = 10

ece = ECE(n_bins)
uncalibrated_score = ece.measure(confidences)
calibrated_score = ece.measure(calibrated)

This triggers:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-ea3fcf5ef398> in <module>
      4 
      5 ece = ECE(n_bins)
----> 6 uncalibrated_score = ece.measure(confidences)
      7 calibrated_score = ece.measure(calibrated)
      8 print('uncalibrated_score', uncalibrated_score)

TypeError: measure() missing 1 required positional argument: 'y'

The correct documentation is:

from netcal.metrics import ECE

n_bins = 10

ece = ECE(n_bins)
uncalibrated_score = ece.measure(confidences, ground_truth)
calibrated_score = ece.measure(calibrated, ground_truth)

LogisticCalibration use _inverse_sigmoid

Hi Fabian Küppers
when using LogisticCalibration() for binary classification, why use _inverse_sigmoid(X) rather than X?

version 1.0 netcal:
    # if binary, use sigmoid instead of softmax
    if self.num_classes <= 2 or self.independent_probabilities:
        logit = self._inverse_sigmoid(X) 
    else:
        logit = self._inverse_softmax(X)

    # otherwise, use SciPy optimzation. Usually, this is much faster
    if self.num_classes > 2:
        # convert ground truth to one hot if not binary
        y = self._get_one_hot_encoded_labels(y, self.num_classes)

    # if temperature scaling, fit single parameter
    if self.temperature_only:
        theta_0 = np.array(1.0)

    # else fit bias and weights for each class (one parameter on binary)
    else:
        if self._is_binary_classification():
            theta_0 = np.array([0.0, 1.0])
        else:
            theta_0 = np.concatenate((np.zeros(self.num_classes), np.ones(self.num_classes)))

    # perform minimization of squared loss - invoke SciPy optimization suite
    result = optimize.minimize(fun=self._loss_function, x0=theta_0,
                               args=(logit, y))

Thanks

ReliabilityDiagram.plot() makes duplicate copy of figure

Code to recreate (I ran it in Google Colab):

!pip install netcal
import numpy as np
from netcal.presentation import ReliabilityDiagram

conf = np.random.rand(1000)
ground = np.random.randint(0, 2, 1000)

diag = ReliabilityDiagram(20)
diag.plot(conf, ground)

Results of !pip show netcal:

Name: netcal
Version: 1.3.5
Summary: The net:cal calibration framework is a Python 3 library for measuring and mitigating miscalibration of uncertainty estimates, e.g., by a neural network.
Home-page: 
Author: Fabian Küppers
Author-email: [email protected]
License: Apache-2.0
Location: /usr/local/lib/python3.9/dist-packages
Requires: gpytorch, matplotlib, numpy, pyro-ppl, scikit-learn, scipy, tensorboard, tikzplotlib, torch, torchvision, tqdm
Required-by:

Re-using HistogramBinning

Hi, I noticed the following behavior when using HistogramBinning:

from netcal.binning import HistogramBinning
import numpy as np

labels = np.random.randint(2, size=(100,))
preds = np.random.uniform(size=(100,))

estimator = HistogramBinning()
for i in range(2):
    print(f"Loop {i}")
    estimator.fit(preds, labels)

With the code above, the first loop will run correctly but the second will throw AttributeError: Parameter 'bins' must be int for classification mode. (as this line changes bins from an int to an array).

This can be fixed by re-initializing HistogramBinning every time in the loop, but this error doesn't show up in other estimators so I thought it would be worth bringing up here. Maybe there's a way to avoid this, and if not I'll keep this issue for others that might encounter this problem.

ReliabilityDiagram fails to import because of tikzplotlib

When trying to plot a ReliabilityDiagram I got this traceback:

File "REDACTED", line 31, in
from netcal.presentation import ReliabilityDiagram
File "REDACTED.venv\Lib\site-packages\netcal\presentation_init_.py", line 25, in
from .ReliabilityDiagram import ReliabilityDiagram
File "REDACTED.venv\Lib\site-packages\netcal\presentation\ReliabilityDiagram.py", line 14, in
import tikzplotlib
File "REDACTED.venv\Lib\site-packages\tikzplotlib_init_.py", line 5, in
from ._save import Flavors, get_tikz_code, save
from . import _axes
File "REDACTED.venv\Lib\site-packages\tikzplotlib_axes.py", line 3, in
from matplotlib.backends.backend_pgf import (
ImportError: cannot import name 'common_texification' from 'matplotlib.backends.backend_pgf' (REDACTED.venv\Lib\site-packages\matplotlib\backends\backend_pgf.py)

This seems to me to be caused by this issue in the tikzplotlib library and a quick fix would be to downgrade matplotlib to before 3.8.

LogisticCalibration implementation differences to scikit-learn

Hi @fabiankueppers,

Thanks for creating this great library. It works perfectly for our use case 😊 There's just one thing I'm wondering about:

For LogisticCalibration, the documentation states that it implements Platt scaling. However, I've found that it yields quite different results than when implementing it with the logistic model in sklearn.

So this

from netcal.scaling import LogisticCalibration

LC = LogisticCalibration()
LC.fit(np.array(pred), np.array(labels))
calibrated_prob = LC.transform(np.array(pred))

gives very different results from this:

from sklearn.linear_model import LogisticRegression as LR
lr = LR().fit(np.reshape(pred,(-1,1)), labels)
calibrated_prob = lr.predict_proba(np.reshape(pred,(-1,1)))[:,1]

Are there any intended differences between your implementation and sklearn? Or are we just comparing it wrong?

I've found it challenging to tell by looking at the code alone.

Thanks,
Patrick

Bug in Reliability Diagram

The ReliabilityDiagram creates bins with values which are not in the input.
Code to reproduce it:

import numpy as np
import matplotlib.pyplot as plt
from netcal.presentation import ReliabilityDiagram
# Generate ground-truth and predicted values
y_true = np.random.randint(0, 2, 100).astype(np.float32)
# Generate perfect predictions:
y_pred = y_true.copy() 
n_bins = 10
diagram = ReliabilityDiagram(n_bins)
_ = diagram.plot(y_pred, y_true)

(screenshot: wrong_output)

Question about input range of multivariate confidence calibration

Hello, I would like to ask a question that arose while doing research using the great platform you provide.
This question is about the fit() function of the AbstractCalibration class implemented in netcal/AbstractCalibration.py.
Looking at line 164, regardless of the task (classification or detection), the range of the input X is limited to values between 0 and 1.

If calibration is performed using box parameters as well, elements such as width and length will fall outside this range. Is there a reason it was implemented this way?
Also, if I want to use box parameters, could you recommend how to convert them to that range and calibrate them?
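
In case it is useful, a minimal sketch of what I have in mind for bringing box parameters into [0, 1]: express them relative to the image dimensions before stacking them with the confidences (image size and box values below are made-up placeholders):

import numpy as np

img_w, img_h = 1920, 1080                        # hypothetical image size
widths = np.array([320.0, 128.0, 640.0])         # hypothetical box widths in pixels
heights = np.array([180.0, 256.0, 360.0])        # hypothetical box heights in pixels
confidences = np.array([0.91, 0.55, 0.73])

rel_w = widths / img_w                            # relative width in [0, 1]
rel_h = heights / img_h                           # relative height in [0, 1]
features = np.stack([confidences, rel_w, rel_h], axis=1)
# 'features' could then be passed to a detection-mode calibrator, e.g.
# LogisticCalibration(detection=True), together with the matched/not-matched labels.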

transform single prediction

Hi,
I want to transform my predicted output one sample at a time. However, it throws an error because it squeezes the (1, num_classes)-shaped output to (num_classes,).
Thanks!
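
For now, a possible workaround sketch under the assumption that the squeeze only affects single-row inputs: duplicate the row before transforming and keep the first result (the calibrator and data below are placeholders):

import numpy as np
from netcal.scaling import TemperatureScaling

# placeholder training data for a 5-class problem
train_conf = np.random.dirichlet(np.ones(5), size=200)
train_labels = np.random.randint(0, 5, size=200)

calibrator = TemperatureScaling()
calibrator.fit(train_conf, train_labels)

single_pred = np.random.dirichlet(np.ones(5)).reshape(1, -1)   # one (1, num_classes) prediction
batched = np.repeat(single_pred, 2, axis=0)                    # duplicate the row to avoid the squeeze
calibrated = calibrator.transform(batched)[0]                  # keep only the first row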

Mean Accuracy Threshold

How is the mean accuracy in the ReliabilityDiagram calculated? What threshold is used to select a binary outcome?
Would it be possible to add a parameter to set this threshold?

Thanks

DType Error for LogisticCalibrationDependent

I recently upgraded from netcal version 1.2.1 to 1.3.1, and now I can no longer fit a LogisticCalibrationDependent instance without the following error occurring: RuntimeError: Found dtype Double but expected Float. My code matches the examples in terms of dtypes for the features (np.float32) and the matched vector (np.int32). The exact same code works with version 1.2.1.

The error is being thrown from the following line in AbstractLogisticRegression (line 582 according to pdb):

torch.nn.BCELoss(reduction='mean')(torch.sigmoid(x), y)

I'm using PyTorch version 1.11.0.

SystemError: PY_SSIZE_T_CLEAN macro must be defined for '#' formats

Hello,

I have this error when trying to run the following:
from netcal.presentation import ReliabilityDiagram

SystemError Traceback (most recent call last)
File ~/miniconda3/envs/drain/lib/python3.11/site-packages/IPython/core/formatters.py:340, in BaseFormatter.__call__(self, obj)
338 pass
339 else:
--> 340 return printer(obj)
341 # Finally look for special method names
342 method = get_real_method(obj, self.print_method)

File ~/miniconda3/envs/drain/lib/python3.11/site-packages/IPython/core/pylabtools.py:152, in print_figure(fig, fmt, bbox_inches, base64, **kwargs)
149 from matplotlib.backend_bases import FigureCanvasBase
150 FigureCanvasBase(fig)
--> 152 fig.canvas.print_figure(bytes_io, **kw)
153 data = bytes_io.getvalue()
154 if fmt == 'svg':

File ~/miniconda3/envs/drain/lib/python3.11/site-packages/matplotlib/backend_bases.py:2042, in FigureCanvasBase.print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, bbox_inches, **kwargs)
2036 if bbox_inches:
2037 # call adjust_bbox to save only the given area
2038 if bbox_inches == "tight":
2039 # When bbox_inches == "tight", it saves the figure twice.
2040 # The first save command (to a BytesIO) is just to estimate
2041 # the bounding box of the figure.
-> 2042 result = print_method(
2043 io.BytesIO(),
...
521 cbook.open_file_cm(filename_or_obj, "wb") as fh:
--> 522 _png.write_png(renderer._renderer, fh,
523 self.figure.dpi, metadata=metadata)

SystemError: PY_SSIZE_T_CLEAN macro must be defined for '#' formats

Is it accuracy - or is it the relative frequency of positive examples in the bin?

Dear Fabian,

Thank you for the time you put into this repo and for open sourcing your code!

I have never used netcal before, and so I found myself comparing it to other libraries/pieces of code that do similar things. Concerning the visualisation function(s), specifically netcal.presentation.ReliabilityDiagram, I was wondering: is the quantity you plot on the y axis really the accuracy, or is it the relative frequency of positive examples in each bin (as, from my understanding, it should be in calibration curves)?

Checking the code here, in particular this snippet:

for batch_X, batch_matched, batch_hist, batch_median in zip(X, matched, histograms, median_confidence):
    acc_hist, conf_hist, _, num_samples_hist = batch_hist
    empty_bins, = np.nonzero(num_samples_hist == 0)

    # calculate overall mean accuracy and confidence
    mean_acc.append(np.mean(batch_matched))
    mean_conf.append(np.mean(batch_X))

assuming batch_matched stores the ground-truth labels for each batch, I am pretty confident that this quantity should not be named "accuracy" (still, I confess I have not spent a lot of time trying to understand perfectly what the various functions should return).

I have also tried to compare the results from netcal with scikit-learn's calibration_curve function, whose documentation states it returns "the proportion of samples whose class is the positive class, in each bin (fraction of positives)", and the results look very similar, if not identical, to what I get with netcal.
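
For completeness, a minimal sketch of that comparison (random placeholder data; the bin edges may differ between the two libraries, so the values are only expected to be close):

import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.random.randint(0, 2, size=1000)
y_prob = np.random.uniform(size=1000)

# fraction of positives and mean predicted confidence per bin
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)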

It would be amazing if you could clarify this!

Cheers,
Dennis.

Temperature scaling for Multi-label classification

If we were to use Temperature scaling for Multi-label classification, do we work under the assumption that every class is independent of each other and perform calibration on each of our classes independently?
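
For concreteness, here is a minimal sketch of the per-class approach I have in mind, assuming TemperatureScaling accepts 1-D binary confidences (the data below is a random placeholder):

import numpy as np
from netcal.scaling import TemperatureScaling

n_samples, n_classes = 500, 5
confidences = np.random.uniform(size=(n_samples, n_classes))   # per-class sigmoid outputs
labels = np.random.randint(0, 2, size=(n_samples, n_classes))  # multi-hot ground truth

calibrated = np.empty_like(confidences)
for k in range(n_classes):
    ts = TemperatureScaling()                                   # one calibrator per class
    ts.fit(confidences[:, k], labels[:, k])                     # treat class k as an independent binary problem
    calibrated[:, k] = ts.transform(confidences[:, k])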

ece algorithm has bug using for binary classification

The parameters I pass: X is the predicted probability, y is the label vector.
I found that the result differs from what I get from tfp.stats.expected_calibration_error.
I checked both implementations and found that the per-bin accuracy is calculated differently. I think the code at line 386 of netcal.metrics.Miscalibration.py may be wrong; the line in question is

  • matched = np.array(y)

but even if the true label y is zero, the sample should count as matched as long as the predicted label is also 0.

In your code, when y is one-dimensional, you calculate the accuracy as the proportion of positive samples among all samples.
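
To make the two readings concrete, a minimal sketch with made-up numbers (this is only an illustration of the difference described above, not netcal's actual implementation):

import numpy as np

y_prob = np.array([0.2, 0.4, 0.9, 0.8])       # predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])               # binary ground-truth labels

# Reading A: "matched" as the fraction of positive labels
frac_positives = y_true.mean()

# Reading B: accuracy of the implied binary decision at a 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)
accuracy = (y_pred == y_true).mean()

print(frac_positives, accuracy)               # 0.5 vs. 1.0 for this toy example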

Wrong Identification of Multi-Class Classification in Metrics Calculation

Setup:

  • Using metrics package for classification (e.g., netcal.metrics.ECE)
  • Binary classification (number of distinct ground-truth labels: 2)
  • Input array with shape (n, 2) with n samples and confidence scores for the negative/positive classes, respectively

In this scenario, the metric erroneously identifies the input as multi-class, although it is binary. This results in an error.
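
A minimal reproduction sketch of this setup (random placeholder data; the final call is the one that raises the error):

import numpy as np
from netcal.metrics import ECE

n = 100
labels = np.random.randint(0, 2, size=n)                 # two distinct ground-truth labels
pos = np.random.uniform(size=n)
confidences = np.stack([1.0 - pos, pos], axis=1)         # shape (n, 2): negative/positive class scores

ece = ECE(10)
score = ece.measure(confidences, labels)                 # erroneously treated as multi-class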

ECE measure error - "ValueError: The dimension of bins must be equal to the dimension of the sample x."

Hi,

I'm trying to use the ECE.measure function, in accordance with the example in the readme, but get the following error:
ValueError: The dimension of bins must be equal to the dimension of the sample x.

I'm running this dummy example:

import numpy as np
from netcal.metrics import ECE

ground_truth = np.asarray([1, 1, 0])
confidences = np.asarray([[0.1, 0.8], [0.3, 0.7], [0.2, 0.8]])

n_bins = 10
ece = ECE(n_bins)
uncalibrated_score = ece.measure(confidences, ground_truth)

The function returns a value when the confidences have shape (n_samples,).

Am I doing something wrong?

EDIT: It seems that 3-class classification works, and 2-class classification must be formulated as a single confidence/logit of shape (n_samples,).
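
For reference, a sketch of the single-column workaround implied by the EDIT above (only the positive-class confidences are passed as a 1-D array):

import numpy as np
from netcal.metrics import ECE

ground_truth = np.asarray([1, 1, 0])
confidences = np.asarray([[0.1, 0.8], [0.3, 0.7], [0.2, 0.8]])

ece = ECE(10)
uncalibrated_score = ece.measure(confidences[:, 1], ground_truth)   # 1-D positive-class confidences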

Thanks

Classification example not running

I installed the calibration framework from scratch as described and used a conda environment, but when I try to run the classification examples I get an error. It looks like it is related to ENIR.

Get path of all Near Isotonic Regression models with mPAVA ...
Traceback (most recent call last):
File "/home/labor/calibration-framework/examples/classification/CIFAR.py", line 169, in
cross_validation(model, use_cuda=use_cuda, domain=domain)
File "/home/labor/calibration-framework/examples/classification/CIFAR.py", line 135, in cross_validation
success = cross_validation_5_2(models=models, datafile=datafile, bins=bins, save_models=save_models, domain=domain)
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/netcal/Decorator.py", line 90, in new_f
return f(*args, **kwds)
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/examples/classification/utils.py", line 236, in cross_validation_5_2
instance.fit(build_set_sm, build_set_gt)
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/netcal/Decorator.py", line 62, in new_f
return f(*args, **kwds)
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/netcal/binning/ENIR.py", line 239, in fit
self._multiclass_instances = self._create_one_vs_all_models(X, y, ENIR, self.score_function,
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/netcal/AbstractCalibration.py", line 568, in _create_one_vs_all_models
model.fit(onevsall_confidence, onevsall_ground_truth)
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/netcal/Decorator.py", line 62, in new_f
return f(*args, **kwds)
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/netcal/binning/ENIR.py", line 270, in fit
self._model_scores, self._binning_models = self._elbow(X, y, model_list, self.score_function, alpha=0.001)
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/netcal/Decorator.py", line 35, in new_f
return f(*args, **kwds)
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/netcal/AbstractCalibration.py", line 498, in _elbow
model_scores = self._calc_model_scores(confidences, ground_truth, model_list, score_function)
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/netcal/Decorator.py", line 35, in new_f
return f(*args, **kwds)
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/netcal/AbstractCalibration.py", line 468, in _calc_model_scores
model_scores = np.exp((np.min(score) - score) / 2.)
File "<array_function internals>", line 200, in amin
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 2946, in amin
return _wrapreduction(a, np.minimum, 'min', axis, None, out,
File "/home/labor/miniconda3/envs/cal/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation minimum which has no identity

Used setup:

  • Ubuntu 20.04
  1. conda environment
    conda create --name cal python=3.10
    conda activate cal

  2. Repo clone and install
    git clone https://github.com/EFS-OpenSource/calibration-framework
    cd calibration-framework/
    python3 -m pip install .

  3. Execute examples
    cd examples/classification/
    python CIFAR.py

netcal.binning.BBQ.transform() sometimes returns values that are outside of the [0,1] range

Code to reproduce issue:
# insert here any model to compute the confidence array; I got this error with multiple
# different models on multiple different datasets for binary classification
from netcal.binning import BBQ

bbq_calibration = BBQ()
bbq_calibration.fit(y_conf_cal[:, 1], y_cal)
y_conf_bbq = bbq_calibration.transform(y_conf_cal[:, 1])
Sometimes y_conf_bbq contains values outside [0, 1]. I suspect it is a floating-point error, since the only out-of-range value I observed was 1.0000000000000002; as it occurs relatively rarely, I did not test repeatedly to see whether other anomalous values are possible.
If it is indeed a floating-point error, simply clipping the output should be enough to fix it.
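
A minimal clipping sketch, assuming the overshoot really is just floating-point noise (y_conf_bbq as produced by the snippet above):

import numpy as np

y_conf_bbq = np.clip(y_conf_bbq, 0.0, 1.0)   # clip residual floating-point overshoot back into [0, 1]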

Is classification logit or probability used as input for temperature scaling?

I ran the classification example code for the CIFAR dataset and found that the .npz files store the classification probabilities instead of the classification logits. Does this mean there is a discrepancy between the original temperature scaling algorithm [1] and this implementation? Thanks for your explanation.

[1] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger: "On Calibration of Modern Neural Networks." Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, 2017. https://arxiv.org/abs/1706.04599
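
For what it's worth, a small sketch of why applying a temperature to log-probabilities matches applying it to the logits: the per-sample normalisation constant is rescaled but still cancels in the softmax (numbers below are made up):

import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])                       # hypothetical softmax outputs
T = 1.5                                                    # hypothetical temperature

log_probs = np.log(probs)                                  # equals the logits up to a per-sample constant
scaled = log_probs / T
recalibrated = np.exp(scaled) / np.exp(scaled).sum(axis=1, keepdims=True)
# adding a constant to all logits of a sample does not change the softmax output,
# so this is the same as dividing the original logits by T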

RuntimeError: On detection mode, it is mandatory to provide binary labels y in [0,1].

Hi,
I am passing the input below to LogisticCalibration, but it raises the runtime error above.

import numpy as np
from netcal.scaling import LogisticCalibration

confidence_scores = np.array([0.70745564, 0.71694])
matched = np.array([1, 1])  # both boxes are matched with the ground truths
relative_x_position = np.array([0.7543349742889405, 0.24766819924116135])
input = np.stack((confidence_scores, relative_x_position), axis=1)

lr = LogisticCalibration(detection=True, use_cuda=False)  # flag 'detection=True' is mandatory for this method
lr.fit(input, matched)
calibrated = lr.transform(input)
(Screenshot 2023-11-02 at 4:24:55 PM showing the error)

Thanks

TemperatureScaling().transform() binary-case output

TemperatureScaling().transform() returns the confidences for the second class in the binary classification case. This behavior seems unintuitive; ideally the return shape should be (n, k), but if it has to be (n,), then it should return the confidences of the first class rather than the second.
