
Ease.ml/Datascope: Guiding your Data-centric Data Iterations, over End-to-end ML pipelines


Developing ML applications is data-centric --- often, the quality of your model is a reflection of the quality of your underlying data. In the era of data-centric AI, the fundamental question becomes:

Which training data example is most important to improve the accuracy/fairness of my ML model?

Once you know these "importances", you can use them to support a range of applications --- cleaning your data and fixing data bugs, data acquisition, data summarization, etc. (e.g., https://arxiv.org/pdf/1911.07128.pdf).

DataScope is a tool for inspecting ML pipelines by measuring how important each training data point is. Its most prominent feature is that it supports not only single ML models but any sklearn Pipeline --- and it is also very fast, up to four orders of magnitude faster than previous approaches. The secret sauce of DataScope is a collection of new results on computing the Shapley value of a specific family of ML models (K-nearest-neighbor classifiers) in PTIME, over relational data provenance. If you want to learn more about how DataScope works, the main reference is https://arxiv.org/abs/2204.11131, and a series of our previous studies on KNN Shapley proxies can be found at https://ease.ml/datascope.

In just seconds, you can get an importance score for each of your training examples and start your data-centric cleaning/debugging iterations!

DataScope is part of the Ease.ML data-centric ML DevOps ecosystem: https://Ease.ML

References

@misc{karlas2022shapley,
  title = {Data Debugging with Shapley Importance over End-to-End Machine Learning Pipelines},
  author = {Karlaš, Bojan and Dao, David and Interlandi, Matteo and Li, Bo and Schelter, Sebastian and Wu, Wentao and Zhang, Ce},
  year = {2022},
  publisher = {arXiv},
  url = {https://arxiv.org/abs/2204.11131},
}

Quick Start

Install by running:

pip install datascope

We can compute Shapley importance scores for a given scikit-learn pipeline (the variable pipeline below) using a training dataset (X_train, y_train) and a validation dataset (X_val, y_val) as follows:

from datascope.importance.common import SklearnModelAccuracy
from datascope.importance.shapley import ShapleyImportance

# Measure utility as the validation accuracy of the pipeline.
utility = SklearnModelAccuracy(pipeline)
# Compute Shapley values using the fast KNN-based ("neighbor") method.
importance = ShapleyImportance(method="neighbor", utility=utility)
importances = importance.fit(X_train, y_train).score(X_val, y_val)

The variable importances contains the Shapley values of all data examples in (X_train, y_train), computed using the nearest-neighbor method (i.e. "neighbor").
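For a self-contained illustration, here is a minimal sketch with synthetic data and a placeholder pipeline (the dataset, the StandardScaler/LogisticRegression pipeline, and the assumption that importances behaves like a 1-D array of per-example scores are ours, not part of the documented API):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from datascope.importance.common import SklearnModelAccuracy
from datascope.importance.shapley import ShapleyImportance

# Placeholder data: 100 training and 20 validation examples, 5 features, 2 classes.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)
X_val, y_val = rng.normal(size=(20, 5)), rng.integers(0, 2, size=20)

# Any sklearn Pipeline should work; this one is just an example.
pipeline = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])

utility = SklearnModelAccuracy(pipeline)
importance = ShapleyImportance(method="neighbor", utility=utility)
importances = importance.fit(X_train, y_train).score(X_val, y_val)

# Assuming a 1-D array of per-example scores, the lowest-scoring examples
# are the first candidates for inspection and repair.
print(np.argsort(importances)[:10])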

For a more complete example workflow, see the demo Colab notebook.

Why datascope?

Shapley values help you find faulty data examples much faster than random inspection. For example, say you are given a dataset with 50% of its labels corrupted, and you want to repair them one by one. Which one should you select first?

Example data repair workflow using datascope

In the above figure, we run different methods for prioritizing which data examples should get repaired (random selection, and various methods based on Shapley importance). After each repair, we measure the accuracy of an XGBoost model. The left figure shows that every importance-based method beats random selection. Furthermore, with the KNN method (i.e. the "neighbor" method), we reach peak performance after repairing only 50% of the labels.

ease.ml/datascope speeds up data debugging by allowing you to focus on the most important data examples first
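As a sketch of this repair loop (a minimal outline under stated assumptions: importances comes from the Quick Start, repair_label is a hypothetical placeholder for however you obtain the correct label, and the sklearn pipeline stands in for the XGBoost model in the figure):

import numpy as np

# Repair examples in order of importance, most harmful (lowest score) first.
order = np.argsort(importances)

for step, i in enumerate(order, start=1):
    y_train[i] = repair_label(i)  # hypothetical repair step (e.g., human annotator)
    if step % 100 == 0:
        # Periodically retrain on the partially repaired data and
        # track validation accuracy as repairs accumulate.
        pipeline.fit(X_train, y_train)
        print(step, pipeline.score(X_val, y_val))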

Looking at speed (right figure), we compare three methods: the "neighbor" method, and the "montecarlo" method with 10 and 100 iterations. Our KNN-based importance computation is orders of magnitude faster than the state-of-the-art Monte Carlo method.

The "neighbor" method in ease.ml/datascope can compute importances in seconds for datasets of several thousand examples


Issues

Error after Pip Installation

I am trying to install datascope via pip in a Conda env running on Windows. The pip installation completes without errors, but upon importing datascope I get:

from datascope.importance.common import SklearnModelAccuracy
from datascope.importance.shapley import ShapleyImportance

Traceback (most recent call last):

Input In [2] in <cell line: 2>
from datascope.importance.shapley import ShapleyImportance

File C:\Anaconda3\envs\datashapley\lib\site-packages\datascope\importance\shapley.py:22 in <module>
from .shapley_cy import compute_all_importances_cy

ModuleNotFoundError: No module named 'datascope.importance.shapley_cy'

Any advice?

compatibility with older numpy version

Hi,

I am trying to use your DataScope package, but due to my project's setup I am required to use numpy version 1.19.2 --- is there any way to make that work? I have tried a couple of approaches, but the shapley_cy file keeps producing the following error:
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject

thank you for your help!
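A possible workaround (an untested suggestion, not an official fix): this error usually means the prebuilt wheel's compiled shapley_cy extension was built against a newer numpy than the one installed. Forcing pip to rebuild datascope from source against the numpy already in your environment may resolve the binary mismatch:

pip install --force-reinstall --no-binary datascope datascope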

"compute_all_importances_cy" data type mismatch

Thanks for this great library. I followed the instructions in the readme.md file and ran the setup.

I get the following error when trying to test on the notebook "DataScope-Demo-1.ipynb":

...\datascope\importance\shapley.py in compute_shapley_1nn_mapfork(distances, utilities, provenance, units, world, null_scores, simple_provenance)
    219     n_test = distances.shape[1]
    220     null_scores = null_scores if null_scores is not None else np.zeros((1, n_test))
--> 221     all_importances = compute_all_importances_cy(unit_distances, unit_utilities, null_scores)
    222 
    223     # Aggregate results.

datascope/importance/shapley_cy.pyx in datascope.importance.shapley_cy.compute_all_importances_cy()

ValueError: Buffer dtype mismatch, expected 'int_t' but got 'long long'

It seems there is a type issue when calling the compute_all_importances_cy function: the Cython buffer expects one integer type ('int_t') but receives another ('long long').

I tried to modify compute_all_importances_cy in shapley_cy.pyx but had no luck fixing this bug.
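For context, this kind of mismatch typically shows up on Windows: in numpy versions before 2.0, np.int_ (the type behind Cython's np.int_t) is a 32-bit C long there, while int64 arrays map to C 'long long'. A small diagnostic sketch (not a fix):

import numpy as np

# On Windows with numpy < 2.0, np.int_ is a 32-bit C long, so a Cython
# buffer typed as np.int_t rejects int64 ('long long') arrays there.
print(np.dtype(np.int_).itemsize)   # 4 on Windows, 8 on Linux/macOS (numpy < 2.0)
print(np.dtype(np.int64).itemsize)  # always 8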

Is regression supported?

Looking at the code, my understanding is that only classification is possible (via SklearnModelAccuracy).
I tried using SklearnModelUtility with MSE as the metric, but this leads to a NotImplementedError.

Thanks!

Skorch compatibility

Hi, as far as I understand, DataScope is compatible with any scikit-learn pipeline. I'm using PyTorch and skorch (a library that wraps PyTorch) to make my classifier scikit-learn compatible.

I'm currently getting the following error when trying to compute the score:

ValueError                                Traceback (most recent call last)
[<ipython-input-49-2e03ddd68d36>](https://localhost:8080/#) in <module>()
----> 1 importances.score(test_data, test_labels)

3 frames
[/usr/local/lib/python3.7/dist-packages/datascope-0.0.3-py3.7-linux-x86_64.egg/datascope/importance/importance.py](https://localhost:8080/#) in score(self, X, y, **kwargs)
     38         if isinstance(y, DataFrame):
     39             y = y.values
---> 40         return self._score(X, y, **kwargs)

[/usr/local/lib/python3.7/dist-packages/datascope-0.0.3-py3.7-linux-x86_64.egg/datascope/importance/shapley.py](https://localhost:8080/#) in _score(self, X, y, **kwargs)
    285         units = np.delete(units, np.where(units == -1))
    286         world = kwargs.get("world", np.zeros_like(units, dtype=int))
--> 287         return self._shapley(self.X, self.y, X, y, self.provenance, units, world)
    288 
    289     def _shapley(

[/usr/local/lib/python3.7/dist-packages/datascope-0.0.3-py3.7-linux-x86_64.egg/datascope/importance/shapley.py](https://localhost:8080/#) in _shapley(self, X, y, X_test, y_test, provenance, units, world)

    314             )
    315         elif self.method == ImportanceMethod.NEIGHBOR:
--> 316             return self._shapley_neighbor(X, y, X_test, y_test, provenance, units, world, self.nn_k, self.nn_distance)
    317         else:
    318             raise ValueError("Unknown method '%s'." % self.method)

[/usr/local/lib/python3.7/dist-packages/datascope-0.0.3-py3.7-linux-x86_64.egg/datascope/importance/shapley.py](https://localhost:8080/#) in _shapley_neighbor(self, X, y, X_test, y_test, provenance, units, world, k, distance)
    507             assert isinstance(X_test, spmatrix)
    508             X_test = X_test.todense()
--> 509         distances = distance(X, X_test)
    510 
    511         # Compute the utilitiy values between training and test labels.

sklearn/metrics/_dist_metrics.pyx in sklearn.metrics._dist_metrics.DistanceMetric.pairwise()

ValueError: Buffer has wrong number of dimensions (expected 2, got 4)

Here's a snippet of my code:

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

net = reset_model(seed = 0) # gives scikit-learn compatible skorch model

pipeline = Pipeline([("model", net)])

pipeline.fit(train_dataset, train_labels)
y_pred = pipeline.predict(test_dataset)

plot_loss(net)
accuracy_dirty = accuracy_score(y_pred, test_labels)
print("Pipeline accuracy in the beginning:", accuracy_dirty)

The above works fine, and I'm able to compute the accuracy of my baseline model.

However, when trying to run importances.score(test_data, test_labels) I'm getting the error mentioned above.

from datascope.importance.common import SklearnModelAccuracy
from datascope.importance.shapley import ShapleyImportance

net = reset_model(seed = 0)
pipeline = Pipeline([("model", net)])

utility = SklearnModelAccuracy(pipeline)
importance = ShapleyImportance(method="neighbor", utility=utility)
importances = importance.fit(train_data, train_labels)
importances.score(test_data, test_labels)

Here's the shape of my data:

train_data.shape, train_labels.shape
((2067, 3, 224, 224), (2067,))

test_data.shape, test_labels.shape
((813, 3, 224, 224), (813,))

Would be happy if someone could point me in the right direction! Not sure whether this error is skorch-related or whether images are simply not supported yet? Thanks :)
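A possible direction (an assumption based on the traceback, not a confirmed fix): the "neighbor" method computes pairwise distances over the raw inputs, and sklearn's DistanceMetric.pairwise in the traceback expects 2-D arrays, while the image tensors here are 4-D. Flattening each image to a vector before handing the data to DataScope might avoid the dimension error:

# Hypothetical workaround: flatten (N, 3, 224, 224) image tensors into
# (N, 3*224*224) matrices so pairwise distance computation sees 2-D input.
train_data_2d = train_data.reshape(len(train_data), -1)
test_data_2d = test_data.reshape(len(test_data), -1)

importances = importance.fit(train_data_2d, train_labels)
importances.score(test_data_2d, test_labels)

Note that the skorch module would then need to reshape its input back to image form internally, so this is only a sketch of where the shape mismatch arises.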
