aangelopoulos / conformal-prediction

Lightweight, useful implementation of conformal prediction on real data.

Home Page: http://people.eecs.berkeley.edu/~angelopoulos/blog/posts/gentle-intro/

License: MIT License

Python 0.68% Jupyter Notebook 99.18% MATLAB 0.14%
computer-vision conformal conformal-prediction distribution-shift natural-language-processing time-series time-series-prediction uncertainty uncertainty-estimation uncertainty-quantification

conformal-prediction's Introduction

Conformal Prediction

rigorous uncertainty quantification for any machine learning task

This repository is the easiest way to start using conformal prediction (a.k.a. conformal inference) on real data. Each of the notebooks applies conformal prediction to a real prediction problem with a state-of-the-art machine learning model.

No need to download the models or data in order to run conformal prediction

Raw model outputs for several large-scale real-world datasets and a small amount of sample data from each dataset are downloaded automatically by the notebooks. You can develop and test conformal prediction methods entirely in this sandbox, without ever needing to run the original model or download the original data. Open a notebook to see the expected output. You can use these notebooks to experiment with existing methods or as templates to develop your own.

Example notebooks

Notebooks can be run immediately using the provided Google Colab links

Colab links are in the top cell of each notebook

To run these notebooks locally, you just need the correct dependencies installed; then press "run all cells"! The notebooks will automatically download all required data and model outputs. You will need 1.5 GB of free space on your computer for the auto-downloaded data. If you want to see how we generated the precomputed model outputs and data subsamples, see the files in generation-scripts; there is one for each dataset. To create a conda environment with the correct dependencies, run conda env create -f environment.yml. If you still get a dependency error, make sure to activate the conformal environment within the Jupyter notebook.

Citation

This repository is meant to accompany our paper, A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, which contains a detailed explanation of each example along with attributions. If you find this repository useful, in addition to the relevant methods and datasets, please cite:

@article{angelopoulos2021gentle,
  title={A gentle introduction to conformal prediction and distribution-free uncertainty quantification},
  author={Angelopoulos, Anastasios N and Bates, Stephen},
  journal={arXiv preprint arXiv:2107.07511},
  year={2021}
}

Videos

If you're interested in learning about conformal prediction in video form, watch our videos below!

A Tutorial on Conformal Prediction

A Tutorial on Conformal Prediction Part 2: Conditional Coverage

A Tutorial on Conformal Prediction Part 3: Beyond Conformal Prediction

conformal-prediction's People

Contributors: aangelopoulos, harryzhangog, madhav-kanda, stephenbates19

conformal-prediction's Issues

Score function for APS

Hello,

Thank you for providing these notebooks for conformal prediction, they have been immensely helpful.

Reading through the section 2.1 of the paper on "Classification with Adaptive Prediction Sets" and the associated notebook, I had some questions about the scoring function.

Namely, the paper provides the score function

$$s(x,y) = \sum_{j=1}^k \hat{f}(x)_{\pi_j(x)}$$

where $y = \pi_k(x)$. Why are we including $\hat{f}(x)_{y}$ in the sum? Doing so would lead to some possibly problematic scores. Consider, for example, a perfect predictor that assigns all of its mass to the correct label $y$, and a completely incorrect predictor that assigns all of its mass to some incorrect label $\ne y$. Both of these predictors would have the same score of 1. This breaks the assumption that a higher score corresponds to misalignment between the forecaster and the true label.

Investigating this issue further, I tried modifying the score function to greedily include all classes up to, but not including, the true label. Intuitively, a higher score would then correspond to more probability mass assigned to incorrect labels, which is a better estimate of misalignment. Coding this up in the notebook for APS, this little fix increased the coverage slightly, but more importantly it decreased the mean size of the confidence sets to 3.3 (compared to 187.5 in the original notebook). The confidence sets on the ImageNet examples also seem to make more sense upon preliminary inspection. This could possibly address an issue raised previously.
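For concreteness, here is a minimal sketch of the two score variants being compared (my own code, not the notebook's); the only difference is whether the true label's own softmax mass is added:

import numpy as np

# Sketch of the APS-style score; include_true_label=True matches the paper's
# definition, False is the modification described above.
def aps_score(softmax, y, include_true_label=True):
    order = np.argsort(softmax)[::-1]            # classes ranked by predicted probability
    rank_of_y = int(np.where(order == y)[0][0])  # position of the true label in that ranking
    end = rank_of_y + 1 if include_true_label else rank_of_y
    return softmax[order][:end].sum()

# Degenerate case from above: a perfect and a completely wrong predictor both score 1
perfect, wrong = np.zeros(5), np.zeros(5)
perfect[2], wrong[4] = 1.0, 1.0
print(aps_score(perfect, 2), aps_score(wrong, 2))                # 1.0 1.0
print(aps_score(perfect, 2, False), aps_score(wrong, 2, False))  # 0.0 1.0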

Is there a typo/error in the score function of APS that would explain these results?

Thanks in advance!

Predictive uncertainty in weather-time-series-distribution-shift notebook

First of all, I would like to thank all contributors to this repository. Appreciate the great work that goes behind creating and maintaining this repository.

I was looking through the notebook, weather-time-series-distribution-shift.ipynb, and noticed that in the last 4 lines of the second section, we have:
sort_idx = np.argsort(times)
pred_mean = pred_mean[sort_idx]
temperatures = temperatures[sort_idx]
times = times[sort_idx]

Should the sorting also be done for the uncertainty data, i.e., by adding the line pred_uncertainty=pred_uncertainty[sort_idx]?
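For concreteness, the block with the proposed change would read (a sketch reusing the notebook's variable names):

sort_idx = np.argsort(times)
pred_mean = pred_mean[sort_idx]
pred_uncertainty = pred_uncertainty[sort_idx]  # proposed additional line
temperatures = temperatures[sort_idx]
times = times[sort_idx]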

Over-coverage/Coverage violation in APS-Randomized algorithm

Thank you for providing such a valuable resource. I have a few inquiries regarding the APS-Randomized algorithm.
To begin, I'd like to refer to the upper bound result for CP calibration, as stated in Theorem 2.2 of "Distribution-Free Predictive Inference For Regression":
$\mathbb{P}(Y_{test} \in C(X_{test}, U_{test}, \hat{q})) < 1 - \alpha + \frac{1}{n-1}$

Upon running the APS-Randomized algorithm for 100 trials, I observed a mean coverage of approximately 93%, consistent with the empirical coverage in the provided repo example (0.93020408163265). One possible rationale for this deviation: the "split conformal algorithm" in the referenced paper operates with a deterministic model ($\mathcal{A}$), while in APS-Randomized both the generated scores and the threshold are randomized, which may cause potential challenges.
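For reference, my understanding of the randomized score (in the style of Romano et al.'s randomized APS; a sketch, not necessarily the notebook's exact code) is that only a uniform fraction of the true label's softmax mass is added, which is what makes both the calibration scores and the resulting sets random:

import numpy as np

# Sketch: randomized APS-style score for one example, with u ~ Uniform(0, 1)
# drawn independently per example.
def randomized_aps_score(softmax, y, u):
    order = np.argsort(softmax)[::-1]
    rank_of_y = int(np.where(order == y)[0][0])
    mass_above = softmax[order][:rank_of_y].sum()  # mass of classes ranked above the true label
    return mass_above + u * softmax[y]             # randomize the true label's own contribution

# e.g. scores = [randomized_aps_score(f, y, u)
#                for f, y, u in zip(softmaxes, labels, np.random.rand(len(labels)))]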

Moreover, apart from the favorable over-coverage exhibited by this algorithm, its conditional coverage, quantifiable using metrics like SSCV, surpasses that of the APS algorithm outlined in the RAPS paper (which is RAPS with $\lambda = 0$), while maintaining identical set sizes. I'm interested in understanding the underlying rationale behind this algorithm and would appreciate insights into its origins, particularly if it was derived from a specific academic paper.
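(By SSCV I mean the size-stratified coverage violation from the RAPS paper; a rough sketch of how I compute it, with bin edges of my own choosing, is below.)

import numpy as np

# Sketch of size-stratified coverage violation: the worst deviation of empirical
# coverage from 1 - alpha across bins of prediction-set size.
def sscv(pred_sets, labels, alpha, bins=((0, 1), (2, 3), (4, 10), (11, 100), (101, 1000))):
    sizes = np.array([len(s) for s in pred_sets])
    covered = np.array([y in s for s, y in zip(pred_sets, labels)])
    violations = []
    for lo, hi in bins:
        stratum = (sizes >= lo) & (sizes <= hi)
        if stratum.any():
            violations.append(abs(covered[stratum].mean() - (1 - alpha)))
    return max(violations)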
Thank you for your assistance!
Lahav.

Conformal risk control question

Hi, thank you so much for the great work. I have a question regarding the notebook on conformal risk control.

In the notebook, you defined the risk optimization objective as

def lamhat_threshold(lam): return false_negative_rate(cal_sgmd>=lam, cal_gt_masks) - ((n+1)/n*alpha - 1/(n+1))

However, in section 4.3 of the paper, the threshold is defined as $\alpha - \frac{B-\alpha}{n}$, which means the denominator of the subtracted term should be $n$. Why is it $n+1$ in the code? Thanks.
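For reference, spelling out the algebra behind my question: with $B$ the bound on the loss,

$$\alpha - \frac{B-\alpha}{n} = \frac{(n+1)\alpha - B}{n} = \frac{n+1}{n}\alpha - \frac{B}{n},$$

so with $B = 1$ I would expect the code to subtract $\frac{1}{n}$ rather than $\frac{1}{n+1}$.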

[Question] Why the upper bound of the selective risk is non-monotonic in the tutorial of selective classification?

While trying to understand the method for selective classification, I tried to run your code and plot the curves of the selective risk and its upper bound.
This is the code that I added to plot these curves:

import numpy as np
import matplotlib.pyplot as plt

# assumes selective_risk, selective_risk_ub, lambdas, and alpha from the notebook
selective_risk_values = np.array([selective_risk(lam) for lam in lambdas])
selective_risk_values_ub = np.array([selective_risk_ub(lam) for lam in lambdas])
plt.plot(lambdas, selective_risk_values, label='Selective Risk')
plt.plot(lambdas, selective_risk_values_ub, label='Selective Risk upper bound')
plt.axhline(y=alpha, color='red', linestyle='--', label='alpha')
plt.legend()
plt.show()

I was expecting the upper bound to be monotonic and decreasing, but as you can see in the image below, it is not.

From the paper "Gentle Introduction to Conformal Prediction", I assumed that the upper bound should be monotonic, because it was introduced precisely to overcome the fact that the selective risk itself is non-monotonic (section 5.5, Selective classification).

[screenshot: selective risk and its upper bound plotted against lambda]

When testing it with my own dataset, I get an extreme example of this behaviour.

[screenshot: selective risk curves on my own dataset]

Is this expected behaviour, or is the assumption that the upper bound is monotonic wrong?

Thank you!

Improved baseline computation for conformal prediction under distribution shift

First, many thanks for this awesome repo and the tutorial on split conformal prediction. While going through the section on conformal prediction under distribution shift and the corresponding example on weather prediction with time-series distribution shift, I noticed that the naive implementation for determining $\hat{q}$ uses an expanding window: it takes all scores up to time $t$, computes the quantile, and iterates over $t$

naive_qhats = np.array( [np.quantile(scores[:t], np.ceil((t+1)*(1-alpha))/t, interpolation='higher') for t in range(K+1, scores.shape[0]) ] )

But one can think of another approach: use a rolling window of fixed size $K$ (in the example you were using $K=1000$) and compute the quantile on each window. I rewrote and tested the function to support both options below

import numpy as np

def compute_qhats(scores, alpha, wsize, opt='fixed_window'):
    """Compute conformal quantiles with either a rolling (fixed) or an expanding window."""
    qhats = []
    q_levels = []
    K = wsize
    for t in range(K + 1, scores.shape[0]):
        if opt == 'fixed_window':
            start = t - K - 1          # rolling window of the most recent K + 1 scores
            nsamples = K + 1           # = t - start
        elif opt == 'expanding_window':
            start = 0                  # all scores up to time t (the naive implementation)
            nsamples = t
        q_level = np.ceil((nsamples + 1) * (1 - alpha)) / nsamples
        q_levels.append(q_level)
        qhats.append(np.quantile(scores[start:t], q_level, interpolation='higher'))
    return np.array(q_levels), np.array(qhats)

Plot of $\hat{q}$ for the expanding-window approach (i.e. the naive implementation): [screenshot]

vs. the plot of $\hat{q}$ for the rolling-window approach: [screenshot]

If we compare the results to the weighted conformal prediction approach [screenshot], it is very similar to the rolling window, but with the cost of additional computation for finding the infimum of $q$:
$$\hat{q} = \inf\left\{ q : \sum_{i=1}^{n} \tilde{w}_i \mathbb{1}\{s_i \leq q\} \geq 1 - \alpha \right\}$$

which requires finding the roots of the expression above after moving $1-\alpha$ to the left side of the inequality (I am using the generalized expression, but in practice we use the window-based adaptation from section 5.3).
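To make the comparison concrete, here is a small helper sketching how that infimum can be evaluated, assuming the weights $\tilde{w}_i$ are already normalized (my own sketch, not the notebook's code):

import numpy as np

# Sketch: smallest score q whose weighted empirical CDF reaches 1 - alpha.
def weighted_conformal_quantile(scores, w_tilde, alpha):
    order = np.argsort(scores)
    s_sorted = scores[order]
    cdf = np.cumsum(w_tilde[order])          # weighted empirical CDF at each sorted score
    idx = np.searchsorted(cdf, 1 - alpha)    # first index where the CDF reaches 1 - alpha
    # if the total weight never reaches 1 - alpha, the infimum is infinite; fall back to the max score
    return s_sorted[min(idx, len(s_sorted) - 1)]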

Lastly, here is the comparison of coverage over time for the three approaches: [screenshot]

In fact, when computing the overall coverage, the rolling-window version achieves the best coverage, 0.900665 vs. 0.8995545 for the weighted version.

It might be that for this data the constraint of finding the infimum does not add much. But if the argument for weighted conformal prediction is based on weighting the recent observations in a window, then a sensibly defined window should be sufficient to counter the drift, especially since we are not "learning the weights $w$" but rather fixing them to be uniform across the window in both cases (unless I am missing something here :) ).

On a separate note, one minor issue in the code is the size of the window $K$: as written, it translates to $K+1$ observations being used, and in the case of weighted conformal prediction the first observation is omitted when computing the qhats.

Thank you again for your work and effort to make conformal prediction accessible to the masses.

Size of Prediction Sets using APS Different Than Reported in RAPS Paper

Hello,

Thank you so much for providing the conformal prediction tutorial & corresponding notebooks, they are super helpful!

I had a question regarding the size of the prediction sets returned using the APS method. In the implementation provided in the notebooks, the prediction sets are far larger than reported in your paper that introduced RAPS. The notebook implementation returns sets that are on average >200 labels, whereas the paper reports an average set size of 10.4 on ResNet-152.

I have not done extensive evaluation on RAPS, but it seems the notebook implementation also returns slightly larger sets (set size of ~3).

I was wondering if you have any ideas as to what might be causing this discrepancy, and what the best way to replicate the results in the paper might be.

Also, I wasn't sure which repo this issue should be opened in, so apologies if it doesn't fit here. Thanks in advance!
