blent-ai / alepython

Python Accumulated Local Effects package

License: Apache License 2.0

alepython's Introduction

Python Accumulated Local Effects package.

Why ALE?

Explaining model predictions is a common requirement when you deploy a Machine Learning algorithm on a large scale. Many methods help us understand a model; one of these is the Partial Dependence Plot (PDP), which has been widely used for years.

However, PDPs rely on a stringent assumption: features have to be uncorrelated. In real-world scenarios, features are often correlated, whether because some are directly computed from others, or because observed phenomena produce correlated distributions.

Accumulated Local Effects (ALE) plots, first proposed by Apley and Zhu (2016), alleviate this issue by averaging over the conditional distribution of the features instead of over each feature's marginal distribution. This makes them more reliable when handling (even strongly) correlated variables.
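
As a rough illustration of the idea (a sketch in plain Python, not this package's implementation), first-order ALE for one continuous feature can be computed by averaging, within each bin, the prediction change across the bin for the samples that actually fall in it, then accumulating and centering. The names `first_order_ale`, `predict`, `rows` and `edges` are hypothetical, and the centering convention used here (subtracting the count-weighted mean of the per-bin average of the curve) is one common choice, assumed for the example.

```python
from bisect import bisect_left


def first_order_ale(predict, rows, feature, edges):
    """Sketch of first-order ALE for one continuous feature.

    predict: callable taking a dict of feature values and returning a float.
    rows:    list of dicts (the data set).
    edges:   strictly increasing bin edges z_0 < ... < z_K covering the feature.
    Returns the centered ALE value at each bin edge.
    """
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for row in rows:
        # Bin k holds values in (edges[k], edges[k + 1]]; the minimum goes to bin 0.
        k = min(max(bisect_left(edges, row[feature]) - 1, 0), n_bins - 1)
        lower = {**row, feature: edges[k]}
        upper = {**row, feature: edges[k + 1]}
        # Local effect: prediction change across the bin, with the other
        # features held at the values actually observed for this sample.
        sums[k] += predict(upper) - predict(lower)
        counts[k] += 1
    local = [s / c if c else 0.0 for s, c in zip(sums, counts)]

    # Accumulate the average local effects starting from the left edge.
    ale = [0.0]
    for effect in local:
        ale.append(ale[-1] + effect)

    # Center: subtract the count-weighted mean of the per-bin average of the curve.
    total = sum(counts)
    mean = sum((ale[k] + ale[k + 1]) / 2 * counts[k] for k in range(n_bins)) / total
    return [value - mean for value in ale]


# Toy check: two strongly correlated features and a linear model.
rows = [{"x1": i / 10, "x2": i / 10 + 0.05} for i in range(11)]
model = lambda r: 2.0 * r["x1"] + 3.0 * r["x2"]
ale = first_order_ale(model, rows, "x1", [0.0, 0.25, 0.5, 0.75, 1.0])
```

In this toy case the curve rises by 2 per unit of `x1`, the coefficient of `x1` alone, even though `x2` moves together with it in the data.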

This package aims to provide useful and quick access to ALE plots, so that you can easily explain your model through predictions.

For further details about model interpretability and ALE plots, see e.g. Molnar (2020).

Install

ALEPython is supported on Python >= 3.5. You can install the package either via pip:

pip install alepython

directly from source (including requirements):

pip install git+https://github.com/MaximeJumelle/ALEPython.git@dev#egg=alepython

or after cloning (or forking) for development purposes, including test dependencies:

git clone https://github.com/MaximeJumelle/ALEPython.git
pip install -e "ALEPython[test]"

Usage

from alepython import ale_plot
# Plots ALE of feature 'cont' with Monte-Carlo replicas (default : 50).
ale_plot(model, X_train, 'cont', monte_carlo=True)

Highlights

First-order ALE plots of continuous features
Second-order ALE plots of continuous features

Gallery

First-order ALE plots of continuous features

Second-order ALE plots of continuous features

Work In Progress

First-order ALE plots of categorical features
Enhanced visualization of first-order plots
Second-order ALE plots of categorical features
Documentation and API reference
Jupyter Notebook examples
Upload to PyPI
Upload to conda-forge
Use of matplotlib styles or kwargs to allow overriding plotting appearance

If you are interested in the project, I would be happy to collaborate with you since there are still quite a lot of improvements needed.

References

Apley, Daniel W., and Jingyu Zhu. 2016. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models. https://arxiv.org/abs/1612.08468.

Molnar, Christoph. 2020. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.

alepython's Issues

Concatenating the 2 arrays in 1D ALE for Discrete Variables

When you concatenate the two arrays (Delta_plus and Delta_neg), why is the feature value increased by 1 for the Delta_plus array but not for the Delta_neg array?

This means that the concatenated array has a column (FeatureName) that contains the true feature value for the second part of the array (Delta_neg), but the feature value increased by 1 for the first part (Delta_plus).

The code extract is given below:

Delta_plus = y_hat_plus - y_hat[ind_plus]
Delta_neg = y_hat[ind_neg] - y_hat_neg

# compute the mean of the difference per group
delta_df = pd.concat(
    [
        pd.DataFrame(
            {"eff": Delta_plus, feature: groups[feature_codes[ind_plus] + 1]}
        ),
        pd.DataFrame({"eff": Delta_neg, feature: groups[feature_codes[ind_neg]]}),
    ]
)
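
One possible reading of the extract, offered as an assumption rather than a statement of the author's intent: each effect is filed under the upper category of the transition it measures. Delta_plus measures the change from category k to k + 1, so it is labeled k + 1; Delta_neg measures the change from k - 1 to k, so it is labeled k. The toy values below are made up purely to show that bookkeeping.

```python
from collections import defaultdict

# Hypothetical toy data: ordered category codes 0, 1, 2 and made-up effects.
codes_plus = [0, 0, 1]          # samples that can be moved up one category
delta_plus = [1.0, 3.0, 5.0]    # f(x moved to k + 1) - f(x at k)
codes_neg = [1, 2, 2]           # samples that can be moved down one category
delta_neg = [2.0, 4.0, 6.0]     # f(x at k) - f(x moved to k - 1)

by_upper = defaultdict(list)
for k, d in zip(codes_plus, delta_plus):
    by_upper[k + 1].append(d)   # transition k -> k + 1 is filed under k + 1
for k, d in zip(codes_neg, delta_neg):
    by_upper[k].append(d)       # transition k - 1 -> k is filed under k

# Mean effect per transition, keyed by the upper category of that transition.
mean_effect = {k: sum(v) / len(v) for k, v in sorted(by_upper.items())}
```

Under this reading, both halves of the concatenated frame use the same key (the category being stepped into), so the group means combine effects for the same transition.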

KeyError, cannot plot

I created a training set named X to predict age, but I'm getting an error. X matches the model's input. Why is this happening?

import numpy as np
import matplotlib.pyplot as plt
from alepython.ale import ale_plot

feature_to_plot='Age'

ale_fig, _ = ale_plot(model, X.sample(10),'Age', monte_carlo=True)

plt.suptitle(f'ALE Plot for {feature_to_plot}', y=1.02)
plt.xlabel(feature_to_plot)
plt.ylabel('ALE')
plt.show()

KeyError Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
3801 try:
-> 3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:

File /opt/conda/lib/python3.10/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File /opt/conda/lib/python3.10/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Age'

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)
Cell In[94], line 7
3 from alepython.ale import ale_plot
5 feature_to_plot='Age'
----> 7 ale_fig, _ = ale_plot(model, X.sample(10),'Age', monte_carlo=True)
9 plt.suptitle(f'ALE Plot for {feature_to_plot}', y=1.02)
10 plt.xlabel(feature_to_plot)

File /opt/conda/lib/python3.10/site-packages/alepython/ale.py:727, in ale_plot(model, train_set, features, bins, monte_carlo, predictor, features_classes, monte_carlo_rep, monte_carlo_ratio, rugplot_lim)
722 # Make this recursive?
723 if features_classes is None:
724 # The same quantiles cannot be reused here as this could cause
725 # some bins to be empty or contain disproportionate numbers of
726 # samples.
--> 727 mc_ale, mc_quantiles = _first_order_ale_quant(
728 model.predict if predictor is None else predictor,
729 train_set_rep,
730 features[0],
731 bins,
732 )
733 _first_order_quant_plot(
734 ax, mc_quantiles, mc_ale, color="#1f77b4", alpha=0.06
735 )
737 ale, quantiles = _first_order_ale_quant(
738 model.predict if predictor is None else predictor,
739 train_set,
740 features[0],
741 bins,
742 )

File /opt/conda/lib/python3.10/site-packages/alepython/ale.py:359, in _first_order_ale_quant(predictor, train_set, feature, bins)
335 def _first_order_ale_quant(predictor, train_set, feature, bins):
336 """Estimate the first-order ALE function for single continuous feature data.
337
338 Parameters
(...)
357
358 """
--> 359 quantiles, _ = _get_quantiles(train_set, feature, bins)
360 logger.debug("Quantiles: {}.", quantiles)
362 # Define the bins the feature samples fall into. Shift and clip to ensure we are
363 # getting the index of the left bin edge and the smallest sample retains its index
364 # of 0.

File /opt/conda/lib/python3.10/site-packages/alepython/ale.py:328, in _get_quantiles(train_set, feature, bins)
322 if not isinstance(bins, (int, np.integer)):
323 raise ValueError(
324 "Expected integer 'bins', but got type '{}'.".format(type(bins))
325 )
326 quantiles = np.unique(
327 np.quantile(
--> 328 train_set[feature], np.linspace(0, 1, bins + 1), interpolation="lower"
329 )
330 )
331 bins = len(quantiles) - 1
332 return quantiles, bins

File /opt/conda/lib/python3.10/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
3805 if self.columns.nlevels > 1:
3806 return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
3808 if is_integer(indexer):
3809 indexer = [indexer]

File /opt/conda/lib/python3.10/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
-> 3804 raise KeyError(key) from err
3805 except TypeError:
3806 # If we have a listlike key, _check_indexing_error will raise
3807 # InvalidIndexError. Otherwise we fall through and re-raise
3808 # the TypeError.
3809 self._check_indexing_error(key)

KeyError: 'Age'
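
The traceback ends in `train_set[feature]`, so the lookup fails because 'Age' is not among the column labels of the frame that was passed in. A minimal sketch of a pre-flight check is shown below; `check_feature` is a hypothetical helper (not part of alepython) that reports near-miss column names such as a different case or stray whitespace.

```python
from difflib import get_close_matches


def check_feature(columns, feature):
    """Return `feature` if it is a column name, else raise with close candidates."""
    names = [str(c) for c in columns]
    if feature in names:
        return feature
    # Compare case-insensitively and ignoring surrounding whitespace.
    normalized = {n.strip().lower(): n for n in names}
    hints = get_close_matches(feature.strip().lower(), list(normalized), n=3)
    candidates = [normalized[h] for h in hints]
    raise KeyError(f"{feature!r} not in columns; close matches: {candidates}")
```

Calling `check_feature(X.columns, 'Age')` before `ale_plot` would either confirm the name or list the closest labels actually present in X.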

A quick question regarding the code.

In line 123 of ale.py, you are subsetting the training data based on the i-th quantile interval using:

subset = train_set[(quantiles[i - 1] <= train_set[feature]) & (train_set[feature] < quantiles[i])]
Shouldn't it be quantiles[i - 1] >= train_set[feature] instead of quantiles[i - 1] <= train_set[feature]?
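
A small sketch with made-up numbers shows what each form of the condition selects. The edges and values are hypothetical; plain lists stand in for the DataFrame columns.

```python
quantiles = [0.0, 1.0, 2.0, 3.0]      # hypothetical bin edges
values = [0.2, 0.9, 1.0, 1.5, 2.4]    # hypothetical feature values
i = 2                                 # second interval: [1.0, 2.0)

# Condition as written in the question: lower edge <= value < upper edge.
as_written = [v for v in values if quantiles[i - 1] <= v < quantiles[i]]

# Condition as proposed: lower edge >= value (and value < upper edge).
as_proposed = [v for v in values if quantiles[i - 1] >= v and v < quantiles[i]]
```

Here `as_written` keeps the values inside the interval (1.0 and 1.5), while `as_proposed` keeps the values at or below its lower edge (0.2, 0.9 and 1.0).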

installation fail

C:\Users\newmanchiang>pip install alepython
Collecting alepython
Could not find a version that satisfies the requirement alepython (from versions: )
No matching distribution found for alepython

Is this project still maintained?

Thanks

how to install this in spyder

download failed

Hello, thank you for developing this package, but I have been unable to download it (I have tried all three of the ways you provided). Could you please give me some suggestions?

@MaximeJumelle

ERROR: Could not find a version that satisfies the requirement alepython (from versions: none)
ERROR: No matching distribution found for alepython

ALEPython for PyTorch

Is ALEPython available for PyTorch?
Thanks~
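
The traceback quoted in the KeyError issue shows `ale_plot` using `model.predict if predictor is None else predictor`, which suggests that any callable mapping the input table to predictions could be supplied as `predictor`. The adapter below is a sketch under that assumption; `make_predictor` and its three hooks are hypothetical names, and the stand-in "network" is plain Python so the example runs without PyTorch.

```python
def make_predictor(forward, to_input=lambda data: data, to_output=lambda out: out):
    """Wrap a forward function so it can be used where a predict callable is expected.

    All three arguments are hypothetical hooks:
    forward:   the model's forward pass (for example a PyTorch module in eval mode).
    to_input:  converts the incoming data into what `forward` accepts.
    to_output: converts the result back into a plain array-like of predictions.
    """
    def predictor(data):
        return to_output(forward(to_input(data)))
    return predictor


# Stand-in "network": doubles the first entry of each row.
predictor = make_predictor(
    forward=lambda rows: [2.0 * row[0] for row in rows],
    to_input=lambda table: [tuple(r) for r in table],
)
predictions = predictor([[1.0, 5.0], [3.0, 7.0]])
```

With a real PyTorch model, `to_input` would typically build a float tensor from the table and `to_output` would convert the detached result to a NumPy array; whether the package accepts this unchanged has not been verified here.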

What do the black lines mean in second-order ALE?

If I want to delete them, what should I do?
Thank you!

Empty ALE Plot

Hello,

I am working on an XAI research project with the popular Portuguese banking dataset from the UCI ML repository, and I am trying to plot a first-order ALE plot for a single continuous column called pdays.

The distribution of the data is highly skewed: the 25%, 50%, and 75% quantiles all correspond to the same value.

I understand that in this case the number of bins collapses to 1, because the quantiles contain only two unique values. However, I cannot entirely understand why the resulting plot is empty.

Is this due to the subtraction of the mean differences at the last step (centering the mean effects)? Please correct me if I am wrong.
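
The bin collapse described above can be reproduced without the package. The `_get_quantiles` code quoted in the KeyError traceback takes the unique values of evenly spaced "lower" quantiles; the sketch below mimics that in plain Python, assuming the "lower" rule picks `sorted_values[floor(q * (n - 1))]`. The function name and the data are hypothetical, and the sketch only shows how many bins survive, not why the plot comes out empty.

```python
def lower_quantile_edges(values, bins):
    """Unique bin edges from `bins + 1` evenly spaced 'lower' quantiles.

    Assumption: the 'lower' rule picks sorted_values[floor(q * (n - 1))].
    """
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating-point surprises in floor(q * (n - 1)).
    picks = [ordered[(i * (n - 1)) // bins] for i in range(bins + 1)]
    return sorted(set(picks))


# Hypothetical skewed column: 90 % of the samples share one value.
pdays_like = [999] * 90 + list(range(1, 11))
edges = lower_quantile_edges(pdays_like, bins=4)
n_bins = len(edges) - 1
```

With this data, four requested bins reduce to the two edges 1 and 999, that is, a single bin.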

Sorry for the trouble and I really appreciate your guidance on this.

Thank you in advance!

please help,thanks

Hello, author! I don't know what this part of the graph, which I drew with the package you created, represents. Could you please briefly explain its meaning and function? 😊

@MaximeJumelle

Implementation incorrect

The ALE plot should not be anchored at the midpoint of the bucket. This is because the ALE value represents the average change in response from the bottom of the bucket to the top.

As a simple example:

Suppose the buckets, the average local effect in each bucket, and the observation weights (usually the number of observations) are as follows:

bucket	average_local_effect	weight
[0, 1]	22	15
(1, 2]	36	25
(2, 3]	-10	35
(3, 4]	-41	25

Then the (non-centered) ALE will be:

bucket_edge	ALE
0	0
1	22
2	58
3	48
4	7

This is because 22 is the change in prediction between 0 and 1, 36 is the change between 1 and 2, and so on.

We then center with the constant -1.45 (from average_local_effect @ weight / sum(weight)) to get the final ALE:

bucket_edge	ALE
0	-1.45
1	20.55
2	56.55
3	46.55
4	5.55

See the original implementation here.
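
The arithmetic in the example above can be checked with a few lines of plain Python. This only reproduces the numbers as stated in this issue (including its definition of the centering constant); it is not taken from the package or from the referenced implementation.

```python
effects = [22, 36, -10, -41]   # average local effect per bucket
weights = [15, 25, 35, 25]     # observations per bucket

# Accumulate from the lowest bucket edge, where the uncentered ALE is 0.
uncentered = [0]
for effect in effects:
    uncentered.append(uncentered[-1] + effect)

# Centering constant as defined in the example: weighted mean of the local effects.
constant = sum(e * w for e, w in zip(effects, weights)) / sum(weights)

# Final values at the bucket edges 0, 1, 2, 3, 4.
centered = [round(value + constant, 2) for value in uncentered]
```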

long running time of ale_plot

Hi~ Thanks a lot for this package. I have been working on research involving multiple time series, and I built a random forest model of 'y' on different time features. Now I want to inspect the influence of hour (which will also interact with season or month...), but the code has been running for over 20 minutes and still has not finished. Is that common? Is there anything wrong with my logic?

ale_plot(model, X_train, 'hour', monte_carlo=True)

Code for categorical features is broken/incomplete

I believe that code for _first_order_ale_cat is broken or incomplete.