easeml / datascope
Measuring data importance over ML pipelines using the Shapley value.
Home Page: https://ease.ml/datascope
License: MIT License
I am trying to install datascope via pip in a Conda env running on Windows.
The pip installation completes without errors, but upon importing datascope I get:
from datascope.importance.common import SklearnModelAccuracy
from datascope.importance.shapley import ShapleyImportance
Traceback (most recent call last):
Input In [2] in <cell line: 2>
from datascope.importance.shapley import ShapleyImportance
File C:\Anaconda3\envs\datashapley\lib\site-packages\datascope\importance\shapley.py:22 in <module>
from .shapley_cy import compute_all_importances_cy
ModuleNotFoundError: No module named 'datascope.importance.shapley_cy'
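To narrow it down, here is a generic (not datascope-specific) sketch for checking whether a compiled extension module can be located at all; on an affected install it prints False for shapley_cy, which suggests the wheel shipped without the compiled Cython extension for this platform:

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if the (possibly compiled) module can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # A missing parent package also means the module is unavailable.
        return False

print(has_module("datascope.importance.shapley_cy"))
```

If it prints False, rebuilding the package from source with build tools available might help; that is a guess based on the missing compiled module, not a confirmed fix.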
Any advice?
Hi,
I am trying to use your Datascope package, but due to my project's setup I am required to use numpy version 1.19.2. Is there any way to make that work? I have tried a couple of approaches, but the shapley_cy module keeps raising the following error:
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject
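This error typically means the prebuilt shapley_cy extension was compiled against a newer NumPy C API than the pinned 1.19.2 runtime. A minimal sketch of the compatibility rule (the 1.20.0 build-time version here is an assumption for illustration, not something confirmed from the wheel):

```python
def parse_version(v: str) -> tuple:
    """Split a dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in v.split("."))

def abi_compatible(runtime: str, built_against: str) -> bool:
    # A Cython extension compiled against NumPy's C API works with the
    # same or a newer runtime NumPy, but not with an older one.
    return parse_version(runtime) >= parse_version(built_against)

print(abi_compatible("1.19.2", "1.20.0"))  # → False: matches the crash above
print(abi_compatible("1.21.0", "1.20.0"))  # → True
```

If building datascope from source in an environment where numpy 1.19.2 is already installed is an option, the extension would then be compiled against the pinned version; again, a suggestion rather than a verified fix.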
thank you for your help!
Hi, as far as I understand, Datascope is compatible with any scikit-learn pipeline. I'm using PyTorch and skorch (a library that wraps PyTorch) to make my classifier scikit-learn compatible.
I'm currently getting the following error when trying to compute the score:
ValueError Traceback (most recent call last)
[<ipython-input-49-2e03ddd68d36>](https://localhost:8080/#) in <module>()
----> 1 importances.score(test_data, test_labels)
3 frames
[/usr/local/lib/python3.7/dist-packages/datascope-0.0.3-py3.7-linux-x86_64.egg/datascope/importance/importance.py](https://localhost:8080/#) in score(self, X, y, **kwargs)
38 if isinstance(y, DataFrame):
39 y = y.values
---> 40 return self._score(X, y, **kwargs)
[/usr/local/lib/python3.7/dist-packages/datascope-0.0.3-py3.7-linux-x86_64.egg/datascope/importance/shapley.py](https://localhost:8080/#) in _score(self, X, y, **kwargs)
285 units = np.delete(units, np.where(units == -1))
286 world = kwargs.get("world", np.zeros_like(units, dtype=int))
--> 287 return self._shapley(self.X, self.y, X, y, self.provenance, units, world)
288
289 def _shapley(
[/usr/local/lib/python3.7/dist-packages/datascope-0.0.3-py3.7-linux-x86_64.egg/datascope/importance/shapley.py](https://localhost:8080/#) in _shapley(self, X, y, X_test, y_test, provenance, units, world)
314 )
315 elif self.method == ImportanceMethod.NEIGHBOR:
--> 316 return self._shapley_neighbor(X, y, X_test, y_test, provenance, units, world, self.nn_k, self.nn_distance)
317 else:
318 raise ValueError("Unknown method '%s'." % self.method)
[/usr/local/lib/python3.7/dist-packages/datascope-0.0.3-py3.7-linux-x86_64.egg/datascope/importance/shapley.py](https://localhost:8080/#) in _shapley_neighbor(self, X, y, X_test, y_test, provenance, units, world, k, distance)
507 assert isinstance(X_test, spmatrix)
508 X_test = X_test.todense()
--> 509 distances = distance(X, X_test)
510
511 # Compute the utilitiy values between training and test labels.
sklearn/metrics/_dist_metrics.pyx in sklearn.metrics._dist_metrics.DistanceMetric.pairwise()
ValueError: Buffer has wrong number of dimensions (expected 2, got 4)
Here's a snippet of my code:
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
net = reset_model(seed = 0) # gives scikit-learn compatible skorch model
pipeline = Pipeline([("model", net)])
pipeline.fit(train_dataset, train_labels)
y_pred = pipeline.predict(test_dataset)
plot_loss(net)
accuracy_dirty = accuracy_score(y_pred, test_labels)
print("Pipeline accuracy in the beginning:", accuracy_dirty)
The above works fine, and I'm able to compute the accuracy of my baseline model.
However, when trying to run importances.score(test_data, test_labels), I'm getting the error mentioned above.
from datascope.importance.common import SklearnModelAccuracy
from datascope.importance.shapley import ShapleyImportance
net = reset_model(seed = 0)
pipeline = Pipeline([("model", net)])
utility = SklearnModelAccuracy(pipeline)
importance = ShapleyImportance(method="neighbor", utility=utility)
importances = importance.fit(train_data, train_labels)
importances.score(test_data, test_labels)
Here's the shape of my data:
train_data.shape, train_labels.shape
((2067, 3, 224, 224), (2067,))
test_data.shape, test_labels.shape
((813, 3, 224, 224), (813,))
Would be happy if someone could point me in the right direction! Not sure if this error is skorch-related or if images are not supported yet? Thanks :)
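The traceback ends in scikit-learn's pairwise distance code, which expects 2-D (n_samples, n_features) input, while the data here is a 4-D image tensor; that matches the "expected 2, got 4" message. A sketch of flattening each image into one feature vector before fitting (stand-in arrays with the shapes reported above, not the actual dataset):

```python
import numpy as np

# Stand-ins with the shapes from the report: (N, channels, height, width).
train_data = np.zeros((2067, 3, 224, 224), dtype=np.float32)
test_data = np.zeros((813, 3, 224, 224), dtype=np.float32)

# Flatten each (3, 224, 224) image into one row of 3*224*224 features so
# the pairwise distance computation receives the 2-D layout it expects.
train_2d = train_data.reshape(len(train_data), -1)
test_2d = test_data.reshape(len(test_data), -1)

print(train_2d.shape)  # (2067, 150528)
print(test_2d.shape)   # (813, 150528)
```

The skorch model would then also need to accept flattened input (e.g. by reshaping back inside the network's forward pass); whether the neighbor method is meaningful on raw pixel distances is a separate question.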
Thanks for this great library. I followed the instructions in the README.md file and ran the setup.
I get the following error when trying to test on the notebook "DataScope-Demo-1.ipynb":
...\datascope\importance\shapley.py in compute_shapley_1nn_mapfork(distances, utilities, provenance, units, world, null_scores, simple_provenance)
219 n_test = distances.shape[1]
220 null_scores = null_scores if null_scores is not None else np.zeros((1, n_test))
--> 221 all_importances = compute_all_importances_cy(unit_distances, unit_utilities, null_scores)
222
223 # Aggregate results.
datascope/importance/shapley_cy.pyx in datascope.importance.shapley_cy.compute_all_importances_cy()
ValueError: Buffer dtype mismatch, expected 'int_t' but got 'long long'
It seems there is a type issue when calling the compute_all_importances_cy function: the Cython buffer expects the platform integer type ('int_t') but receives a 64-bit 'long long'.
I tried to modify compute_all_importances_cy in shapley_cy.pyx, but I had no luck fixing this bug.
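This looks like the classic Windows dtype mismatch: NumPy's default np.int_ maps to a 32-bit C long on Windows, while arrays created with dtype=np.int64 carry C 'long long', which a Cython buffer typed 'int_t' then rejects. A hedged sketch of the usual workaround, casting the integer inputs before they reach the extension (the provenance name is just an illustrative stand-in for whichever array triggers the error):

```python
import numpy as np

# An int64 array like those that trigger the 'long long' buffer error on
# Windows (on Linux the cast below is a no-op, since np.int_ is 64-bit).
provenance = np.arange(6, dtype=np.int64)

# Cast to the platform default integer, which is what a Cython buffer
# declared with np.int_t expects.
provenance_fixed = provenance.astype(np.int_)

print(provenance_fixed.dtype == np.int_)  # True on any platform
```

If the offending arrays are created inside datascope itself, the call sites (or the .pyx buffer declarations, e.g. switching to np.int64_t) would need patching and a rebuild; this snippet only illustrates the mismatch.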
Looking at the code, my understanding is that only classification is possible (SklearnModelAccuracy).
I tried using SklearnModelUtility with MSE as the metric, but this leads to a NotImplementedError.
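Until regression utilities are supported, one possible workaround (my own sketch, not a documented datascope feature) is to discretize the continuous target into classes, so that a classifier plus SklearnModelAccuracy can stand in, roughly, for a regressor plus MSE:

```python
import numpy as np

# A small continuous regression target (illustrative values only).
y = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.05])

# Bin it into quartile-based classes; each label is the index of the
# quartile the value falls into (0..3).
bins = np.quantile(y, [0.25, 0.5, 0.75])
y_class = np.digitize(y, bins)

print(y_class.tolist())  # [0, 2, 1, 3, 3, 0]
```

The resulting importances are only a coarse proxy for the regression setting, since accuracy on binned labels ignores how far predictions miss within a bin.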
Thanks!