sdv-dev / sdmetrics
Metrics to evaluate quality and efficacy of synthetic datasets.
Home Page: https://docs.sdv.dev/sdmetrics
License: MIT License
A few of the metrics in this library return a single number that is actually a combination of several values. For example, CSTest applies the single column CSTest metric to all the discrete columns found in the table and then returns the average of all the scores obtained.
A singular metric that summarizes multiple scores may not be very useful for debugging or auditing in greater detail.
What if, for each metric class, we added a get_details
method that returns the intermediary results used to generate the summarized metric?
Eg.:
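The following is a purely hypothetical illustration of the idea; the get_details name, the toy metric, and its return shape are assumptions, not an existing SDMetrics API:

from scipy.stats import ks_2samp

class ToyColumnAverageMetric:
    # Toy illustration only: a metric whose summary score is the average of
    # per-column scores, with a companion method that exposes those
    # per-column scores for debugging or auditing.

    @classmethod
    def get_details(cls, real_data, synthetic_data):
        # Intermediate per-column scores that the summary is built from.
        return {
            column: 1 - ks_2samp(real_data[column], synthetic_data[column]).statistic
            for column in real_data.select_dtypes('number').columns
        }

    @classmethod
    def compute(cls, real_data, synthetic_data):
        details = cls.get_details(real_data, synthetic_data)
        return sum(details.values()) / len(details)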
Columns of type id should be dropped when computing metrics. ID columns are not synthetically generated, so they should not contribute to the metric calculation. Currently they are being classified as categorical, which is incorrect.
DetectionMetric should use the provided metadata to drop id columns from the real and synthetic data when computing the metric. This logic should also be applied to any other relevant metrics.
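A minimal sketch of what this could look like, assuming the metadata uses the fields/type structure shown elsewhere in these issues (the helper name is hypothetical):

def drop_id_columns(real_data, synthetic_data, metadata):
    # Find the columns that the metadata marks as type 'id' and drop them from
    # both tables before handing the data to a detection metric.
    id_columns = [
        name for name, field in metadata['fields'].items()
        if field['type'] == 'id'
    ]
    real_data = real_data.drop(columns=id_columns, errors='ignore')
    synthetic_data = synthetic_data.drop(columns=id_columns, errors='ignore')
    return real_data, synthetic_data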
If I have a table with some missing values, I want to synthesize data with missing values too -- ideally in the same ratio. I'm curious whether CSTest
is an appropriate signal of this? If it isn't, should we modify it to be?
Details: From the API reference
This function applies the single column CSTest metric to all the discrete columns found in the table and then returns the average of all the scores obtained.
I know that the SDV internally creates a new, discrete binary column representing whether a column is null. But I don't know if this column is used in the CSTest computation, because it's dropped before returning the synthetic data.
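This is not an existing SDMetrics metric, but a small sketch of the kind of signal described above: comparing the share of missing values per column between real and synthetic data (all names here are hypothetical):

import pandas as pd

def missing_value_ratio_similarity(real_data, synthetic_data):
    # Per-column score in [0, 1]; 1.0 means the real and synthetic columns
    # have exactly the same proportion of missing values.
    real_ratio = real_data.isna().mean()
    synthetic_ratio = synthetic_data.isna().mean()
    return 1 - (real_ratio - synthetic_ratio).abs()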
Update KSTest to include some new features for usability and evaluation:
- Rename it to KSComplement to be more descriptive.
- Support columns of type 'datetime'. If there are datetime columns, convert them to numerical using pandas built-in functionality (NOT RDTs).
- Keep NaN values instead of filling them with 0s. By converting to 0s right now, we are changing the distributions.

NaN values should be supported by numerical privacy metrics, but currently it raises ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The code below reproduces this issue:
import pandas as pd
from sdmetrics.single_table.privacy import NumericalLR
data = pd.DataFrame({
'key': [1, 2, None],
'sensitive': [1, 2, 3]
})
privacy_metric = NumericalLR.compute(
data,
data,
key_fields=['key'],
sensitive_fields=['sensitive']
)
print(privacy_metric) # this will print nan
When the primary_key is set, the generated data index restarts from zero.
As a consequence, detection metrics can trivially detect generated instances by setting a threshold on the primary_key.
I will propose a patch to remove primary_key columns, if set, from these tests.
SDMetrics package should now support Python 3.9
Metrics TSFClassifierEfficacy and TSFCDetection crash when we try to compute the metrics of time series with non-fixed length. That is because sktime.classification.compose.TimeSeriesForestClassifier does not handle time series of variable length. We get the following error:
ValueError: Tabularization failed, it's possible that not all series were of equal length
We could either pad or truncate the time series to handle this situation (a padding sketch follows the traceback below).
import pandas as pd
from sdmetrics.timeseries import TSFCDetection, TSFClassifierEfficacy
real = pd.DataFrame({
"seq_index": [1, 1, 2, 2, 2],
"dim_0": [0, 0, 0, 0, 1],
"dim_1": [3, 4, 3, 3, 3]
})
synth = pd.DataFrame({
"seq_index": [1, 1, 2, 2, 2],
"dim_0": [1, 1, 0, 0, 1],
"dim_1": [4, 4, 3, 3, 3]
})
TSFClassifierEfficacy.compute(real, synth, entity_columns=['seq_index'], target='dim_0')
full trace
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-29d0a52ef600> in <module>()
----> 1 TSFClassifierEfficacy.compute(real, synth, entity_columns=['seq_index'], target='dim_0')
/usr/local/lib/python3.7/dist-packages/sdmetrics/timeseries/efficacy/base.py in compute(cls, real_data, synthetic_data, metadata, entity_columns, target)
107 real_data, synthetic_data, metadata, entity_columns, target)
108
--> 109 return cls._compute_score(real_data, synthetic_data, entity_columns, target)
/usr/local/lib/python3.7/dist-packages/sdmetrics/timeseries/efficacy/base.py in _compute_score(cls, real_data, synthetic_data, entity_columns, target)
78
79 real_acc = cls._scorer(real_x_train, real_x_test, real_y_train, real_y_test)
---> 80 synt_acc = cls._scorer(synt_x, real_x_test, synt_y, real_y_test)
81
82 return synt_acc / real_acc
/usr/local/lib/python3.7/dist-packages/sdmetrics/timeseries/ml_scorers.py in tsf_classifier(X_train, X_test, y_train, y_test)
19 ]
20 clf = Pipeline(steps)
---> 21 clf.fit(X_train, y_train)
22 return clf.score(X_test, y_test)
23
/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
339 """
340 fit_params_steps = self._check_fit_params(**fit_params)
--> 341 Xt = self._fit(X, y, **fit_params_steps)
342 with _print_elapsed_time('Pipeline',
343 self._log_message(len(self.steps) - 1)):
/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
305 message_clsname='Pipeline',
306 message=self._log_message(step_idx),
--> 307 **fit_params_steps[name])
308 # Replace the transformer of the step with the fitted
309 # transformer. This is necessary when loading the transformer
/usr/local/lib/python3.7/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
752 with _print_elapsed_time(message_clsname, message):
753 if hasattr(transformer, 'fit_transform'):
--> 754 res = transformer.fit_transform(X, y, **fit_params)
755 else:
756 res = transformer.fit(X, y, **fit_params).transform(X)
/usr/local/lib/python3.7/dist-packages/sktime/transformations/base.py in fit_transform(self, Z, X)
89 else:
90 # Fit method of arity 2 (supervised transformation)
---> 91 return self.fit(Z, X).transform(Z)
92
93 # def inverse_transform(self, Z, X=None):
/usr/local/lib/python3.7/dist-packages/sktime/transformations/panel/compose.py in transform(self, X, y)
217 # them into a single column
218 if isinstance(X, pd.DataFrame):
--> 219 Xt = from_nested_to_2d_array(X)
220 else:
221 Xt = from_3d_numpy_to_2d_array(X)
/usr/local/lib/python3.7/dist-packages/sktime/utils/data_processing.py in from_nested_to_2d_array(X, return_numpy)
178 if Xt.ndim != 2:
179 raise ValueError(
--> 180 "Tabularization failed, it's possible that not "
181 "all series were of equal length"
182 )
ValueError: Tabularization failed, it's possible that not all series were of equal length
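As a rough illustration of the padding option mentioned above (not the library's implementation; the strategy of repeating the last row is an assumption):

import pandas as pd

def pad_sequences(data, entity_columns):
    # Pad every entity's sequence up to the longest sequence length by
    # repeating its last row, so that tabularization does not fail.
    max_length = data.groupby(entity_columns).size().max()
    padded_groups = []
    for _, group in data.groupby(entity_columns):
        missing = max_length - len(group)
        if missing > 0:
            filler = pd.concat([group.iloc[[-1]]] * missing, ignore_index=True)
            group = pd.concat([group, filler], ignore_index=True)
        padded_groups.append(group)
    return pd.concat(padded_groups, ignore_index=True)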
The latest pomegranate
version, 0.14.7
, does not work well with numpy<1.22
because of incompatibilities between the Python code and the compiled C backend.
When installed, SDMetrics currently ends up using numpy==1.21.5
because of third party restrictions (numba
is not compatible with numpy~=1.22
yet), which means that SDMetrics crashes when pomegranate
is used with the following error:
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
An example of this can be seen here: https://github.com/sdv-dev/SDMetrics/runs/4871113960?check_suite_focus=true
We can fix this by temporarily capping pomegranate
on <0.14.7
until numba
extends support to the latest numpy version.
Currently our code is being validated only by 'vanilla' flake8 and just a few plugins. We would like to increase the code style checks by adding more add-ons that follow our code style and our standards.
Also, we would like to ensure that our docstrings are properly written and follow the rest of our format.
We have performed this task already on RDT, more precisely on the following issue:
sdv-dev/RDT#248 (comment)
We need to add the pydocstyle plugin with the following lines on our setup.cfg file, as we are following the google convention.
[pydocstyle]
convention = google
add-ignore = D107, D407, D417
Flake8 comes with a lot of different add-ons that we can use to adapt it to our code style and checks. Here is a list of plugins that I found to be interesting for us:
- flake8-builtins - Check for python builtins being used as variables or parameters.
- flake8-comprehensions - Helps you write better list/set/dict comprehensions.
- flake8-debugger - Debug statement checker.
- flake8-variables-names - Extension that helps to make more readable variable names.
- Dlint - Tool for encouraging best coding practices and helping ensure Python code is secure.
- flake8-mock - Provides checking of mock non-existent methods.
- flake8-fixme - Check for FIXME, TODO and other temporary developer notes.
- flake8-eradicate - Plugin to find commented out or dead code.
- flake8-mutable - Extension for mutable default arguments.
- flake8-print - Check for print statements in python files.
- flake8-pytest-style - Plugin checking common style issues or inconsistencies.
- flake8-quotes - Extension for checking quotes in python.
- flake8-multiline-containers - Plugin to ensure a consistent format for multiline containers.
- pandas-vet - Plugin that provides opinionated linting for pandas code.
- pep8-naming - Check the PEP-8 naming conventions.
- flake8-expression-complexity - Plugin to validate expression complexity.
- flake8-sfs - String formatting.

The sdv.evaluate call sometimes fails with a ValueError: Input contains infinity or a value too large for dtype('float64').
The exception is raised in pipeline fit
step inside sdmetrics/detection/tabular/logistic.py
.
We should review if we can prevent this, or at least capture it and return a 0.
This is the full traceback:
------------------
from sdv.evaluation import evaluate
evaluate(new_data, data)
------------------
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-349ebfb54984> in <module>
1 from sdv.evaluation import evaluate
2
----> 3 evaluate(new_data, data)
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdv/evaluation.py in evaluate(synthetic_data, real_data, metadata, root_path, table_name, metrics, get_report, aggregate)
152 computed = {}
153 for metric in metrics:
--> 154 computed[metric] = METRICS[metric](synth, real, metadata, details=get_report)
155
156 if get_report:
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdv/evaluation.py in _logistic_detection(synthetic, real, metadata, details)
98
99 def _logistic_detection(synthetic, real, metadata=None, details=False):
--> 100 return _tabular_metric(LogisticDetector(), synthetic, real, metadata, details)
101
102
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdv/evaluation.py in _tabular_metric(sdmetric, synthetic, real, metadata, details)
86 return list(metrics)
87
---> 88 return np.mean([metric.value for metric in metrics])
89
90
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdv/evaluation.py in <listcomp>(.0)
86 return list(metrics)
87
---> 88 return np.mean([metric.value for metric in metrics])
89
90
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdmetrics/detection/tabular/base.py in metrics(self, metadata, real_tables, synthetic_tables)
48 Metric: The next metric.
49 """
---> 50 yield from self._single_table_detection(metadata, real_tables, synthetic_tables)
51 yield from self._parent_child_detection(metadata, real_tables, synthetic_tables)
52
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdmetrics/detection/tabular/base.py in _single_table_detection(self, metadata, real_tables, synthetic_tables)
57 auroc = self._compute_auroc(
58 real_tables[table_name][table_fields],
---> 59 synthetic_tables[table_name][table_fields])
60
61 yield Metric(
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdmetrics/detection/tabular/base.py in _compute_auroc(self, real_table, synthetic_table)
123 kf = StratifiedKFold(n_splits=3, shuffle=True)
124 for train_index, test_index in kf.split(X, y):
--> 125 self.fit(X[train_index], y[train_index])
126 y_pred = self.predict_proba(X[test_index])
127 auroc = roc_auc_score(y[test_index], y_pred)
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdmetrics/detection/tabular/logistic.py in fit(self, X, y)
22 ('classifier', LogisticRegression(solver="lbfgs")),
23 ])
---> 24 self.model.fit(X, y)
25
26 def predict_proba(self, X):
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
328 """
329 fit_params_steps = self._check_fit_params(**fit_params)
--> 330 Xt = self._fit(X, y, **fit_params_steps)
331 with _print_elapsed_time('Pipeline',
332 self._log_message(len(self.steps) - 1)):
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
294 message_clsname='Pipeline',
295 message=self._log_message(step_idx),
--> 296 **fit_params_steps[name])
297 # Replace the transformer of the step with the fitted
298 # transformer. This is necessary when loading the transformer
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
738 with _print_elapsed_time(message_clsname, message):
739 if hasattr(transformer, 'fit_transform'):
--> 740 res = transformer.fit_transform(X, y, **fit_params)
741 else:
742 res = transformer.fit(X, y, **fit_params).transform(X)
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
691 else:
692 # fit method of arity 2 (supervised transformation)
--> 693 return self.fit(X, y, **fit_params).transform(X)
694
695
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
1200 X = self._validate_data(X, accept_sparse='csc', estimator=self,
1201 dtype=FLOAT_DTYPES,
-> 1202 force_all_finite='allow-nan')
1203
1204 q_min, q_max = self.quantile_range
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
418 f"requires y to be passed, but the target y is None."
419 )
--> 420 X = check_array(X, **check_params)
421 out = X
422 else:
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
643 if force_all_finite:
644 _assert_all_finite(array,
--> 645 allow_nan=force_all_finite == 'allow-nan')
646
647 if ensure_min_samples > 0:
~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
97 msg_err.format
98 (type_err,
---> 99 msg_dtype if msg_dtype is not None else X.dtype)
100 )
101 # for object dtype data, we only check for NaNs (GH-13254)
ValueError: Input contains infinity or a value too large for dtype('float64').
When using sdmetrics.compute_metrics, we get None for metrics that were not able to resolve their keyword arguments. In most cases, we expect the user to pass a dictionary of metrics to compute_metrics, but if the signatures of these metrics differ, we get an error that we catch and then store None in its place.
For example, take the following two time series metrics TSFCDetection
and TSFClassifierEfficacy
. The first one is expected to be called with
TSFCDetection.compute(data, sampled, entity_columns=entity_columns)
and the second one with an additional argument target
TSFClassifierEfficacy.compute(data, sampled, entity_columns=entity_columns, target=target)
compute_metrics
will pass target
to both classes which TSFCDetection
cannot handle, thus crashing.
import pandas as pd
import sdmetrics
from sdmetrics.timeseries import TSFCDetection, TSFClassifierEfficacy
length = 10
real = pd.DataFrame({
"seq_index": [1, 2, 3] * length,
"dim_0": [0, 0, 0] * length,
"dim_1": [4, 4, 4] * length
})
synth = pd.DataFrame({
"seq_index": [1, 2, 3] * length,
"dim_0": [1, 0, 0] * length,
"dim_1": [4, 4, 3] * length
})
metrics = {
'TSFClassifierEfficacy': TSFClassifierEfficacy,
'TSFCDetection': TSFCDetection
}
sdmetrics.compute_metrics(metrics, real, synth, entity_columns=['seq_index'], target='dim_0')
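One possible way to avoid this, sketched below, is to filter the keyword arguments against each metric's compute signature. This is not how compute_metrics currently behaves; the helper name is hypothetical:

import inspect

def call_with_supported_kwargs(metric, real_data, synthetic_data, **kwargs):
    # Only forward the keyword arguments that this metric's compute method
    # actually declares, so extra arguments like `target` do not crash it.
    accepted = inspect.signature(metric.compute).parameters
    supported = {key: value for key, value in kwargs.items() if key in accepted}
    return metric.compute(real_data, synthetic_data, **supported)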
Currently metrics return one score which can be defined in an arbitrary range and can follow either a MAXIMIZATION or MINIMIZATION goal.
We could add a normalize method to all the base classes (with specific overrides for the individual metrics that require it) which gets the raw score as input and returns it normalized between 0 and 1, always using a MAXIMIZATION goal.
raw_score = WhateverMetric.compute(real, synthetic, metadata)
normalized = WhateverMetric.normalize(raw_score)
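A minimal sketch of what such a normalize method could do, assuming each metric declares its bounds and goal (the parameter names here are assumptions):

def normalize(raw_score, min_value, max_value, goal='MAXIMIZE'):
    # Map the raw score into [0, 1] and flip it if the metric is a
    # MINIMIZATION metric, so that higher is always better.
    score = (raw_score - min_value) / (max_value - min_value)
    return score if goal == 'MAXIMIZE' else 1 - score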
To implement several privacy metrics for single_tables.
As a user, I want only the relevant errors surfaced to me and expected behavior to be suppressed.
For now, focus on the KS Test metric (see #129 and #130 for more details)
Add a MetricComputationError to be used when there is a mathematical error when computing the metric (eg. when calling scipy or dividing by zero). For the tabular and relational tests' compute method:

If none of the columns have a compatible data type, return None and throw a warning. This is not an error; the metric is simply undefined.
>>> InvertedKSTest.compute(real_data, synthetic_data, metadata)
Warning: Incompatible data types. The InvertedKSTest is only defined for column types ['datetime', 'numerical']. None were found in the data.
None

If the metric cannot be computed for any column, raise a MetricComputationError.
>>> InvertedKSTest.compute(real_data, synthetic_data, metadata)
MetricComputationError: <message>

If only some columns raise a MetricComputationError, show a warning but keep going with the other columns.
>>> InvertedKSTest.compute(real_data, synthetic_data, metadata)
Warning: InvertedKSTest returned a MetricComputationError for column 'user_age'. Skipping this column.
Warning: InvertedKSTest returned a MetricComputationError for column 'weight'. Skipping this column.
0.699382
My data is a sequence of a single type with no entity_columns set. As described in the user guide, model training works fine, however, I'm unable to run the detection metrics to check the "goodness" of fit. The error I get is ValueError: No group keys passed!
Looking into the sources, the problem is indeed there, as the code relies on the entity_columns variable:
SDMetrics/sdmetrics/timeseries/detection.py
Lines 37 to 49 in dd00b17
I'm wondering whether a simple change can fix the problem correctly (see below). Could you please confirm if this is the right way of thinking?
This is the change in _build_x:
def _build_x(data, transformer, entity_columns):
X = pd.DataFrame()
if entity_columns:
for entity_id, entity_data in data.groupby(entity_columns):
# code as in the original detection.py L41-47...
else:
entity_data = transformer.transform(data)
entity_data = pd.Series({
column: entity_data[column].values
for column in entity_data.columns
})
X = pd.DataFrame([entity_data])
return X
and one more in the compute method, with the line to be changed to:
if entity_columns:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, stratify=y)
else:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)
A simple example based on the user guide:
from sdv.demo import load_timeseries_demo
from sdv.timeseries import PAR
from sdv.metrics.timeseries import LSTMDetection
data = load_timeseries_demo()
no_context = data[['Symbol', 'Date', 'Open', 'Close', 'Volume']].copy()
real_data = no_context[no_context.Symbol == 'TSLA'].copy()
del real_data['Symbol']
sequence_index = 'Date'
model = PAR(sequence_index = sequence_index)
print('Fitting model...')
model.fit(real_data)
print('Sampling data...')
synth_data = model.sample()
print('Running evaluation...')
val = LSTMDetection.compute(real_data, synth_data, metadata={
'sequence_index': sequence_index,
'fields': {
'Date': {'type': 'datetime'},
'Open': {'type': 'numerical', 'subtype': 'float'},
'Close': {'type': 'numerical', 'subtype': 'float'},
'Volume': {'type': 'numerical', 'subtype': 'integer'}
}})
Once this is run, the error traceback is as follows:
val = LSTMDetection.compute(real_data, synth_data, metadata={
File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\sdmetrics\timeseries\detection.py", line 85, in compute
real_x = cls._build_x(real_data, transformer, entity_columns)
File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\sdmetrics\timeseries\detection.py", line 40, in _build_x
for entity_id, entity_data in data.groupby(entity_columns):
File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\frame.py", line 6515, in groupby
return DataFrameGroupBy(
File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\groupby\groupby.py", line 525, in __init__
grouper, exclusions, obj = get_grouper(
File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\groupby\grouper.py", line 821, in get_grouper
raise ValueError("No group keys passed!")
ValueError: No group keys passed!
The code crashes when there are fewer members of a class than the requested number of splits for the dataset. Having a try/except inside the single table detection metrics base and returning NaN could solve the issue.
ValueError: n_splits=3 cannot be greater than the number of members in each class.
Hi all!
I am one of the smartnoise-sdk maintainers (part of the OpenDP collaboration). Specifically, I work on differentially private (DP) data synthesizers.
It would be nice if SDMetrics had some more methods geared towards DP synthesizers! (specifically, methods from https://arxiv.org/pdf/2004.07740.pdf, https://arxiv.org/pdf/1604.06651.pdf and https://arxiv.org/pdf/1806.11345.pdf)
SDMetrics will be able to produce pMSE (Snoke et al) and Wasserstein randomization test (Arnold et al) scores for single_table synthetic data (under privacy). (Potentially, also SRA scores (Jordon et al), although this is not as high priority, and may require too much support code to be feasible.)
Here, we have some light implementations of the aforementioned methods. Though we use them to evaluate DP synthetic data, these metrics would also work for general purpose synthetic data (pMSE and Wasserstein essentially fit the interface described by the single_table metrics as is).
Reasoning for transition: The SDMetrics package is far more mature and well supported than our DP synthetic data gym, and so we would like to be able to use SDMetrics instead of our gym for smartnoise synthesizer evaluations. Metric parity would be nice before that transition, and so we hope that we can contribute at least pMSE, hopefully Wasserstein, and perhaps SRA, to the SDMetrics package.
I'm adding this issue to gather feedback, before I begin this effort in earnest! Would these metrics be welcome in SDMetrics? Are there concerns/limitations I should be aware of?
The latest SDMetrics version (which is installed by default when installing SDV) has copulas requirements that are incompatible with downstream SDV.
On a fresh virtual environment, install pip-tools
.
Place the following on a file named requirements.in
sdv
#sdmetrics==0.4.1
Type the following commands
pip install -r requirements.in
pip-compile requirements.in
pip-compile
reports:
Could not find a version that matches copulas<0.7,<0.8,>=0.6.1,>=0.7.0 (from sdv==0.14.1->-r requirements.txt (line 1))
Tried: 0.0.0, 0.0.0, 0.1.0, 0.1.0, 0.1.1, 0.1.1, 0.2.0, 0.2.0, 0.2.1, 0.2.1, 0.2.3, 0.2.3, 0.2.4, 0.2.4, 0.2.5, 0.2.5, 0.3.0, 0.3.0, 0.3.2, 0.3.2, 0.3.3, 0.3.3, 0.4.0, 0.4.0, 0.5.0, 0.5.0, 0.5.1, 0.5.1, 0.6.0, 0.6.0, 0.6.1, 0.6.1, 0.7.0, 0.7.0
Skipped pre-versions: 0.3.0.dev0, 0.3.0.dev0, 0.3.2.dev1, 0.3.2.dev1, 0.3.3.dev0, 0.3.3.dev0, 0.4.0.dev0, 0.4.0.dev0, 0.5.0.dev0, 0.5.0.dev0, 0.5.0.dev1, 0.5.0.dev1, 0.5.1.dev0, 0.5.1.dev0, 0.5.1.dev1, 0.5.1.dev1, 0.5.2.dev0, 0.5.2.dev0, 0.5.2.dev1, 0.5.2.dev1, 0.6.0.dev0, 0.6.0.dev0, 0.6.1.dev0, 0.6.1.dev0, 0.7.0.dev0, 0.7.0.dev0
There are incompatible versions in the resolved dependencies:
copulas<0.8,>=0.7.0 (from sdmetrics==0.4.2->sdv==0.14.1->-r requirements.txt (line 1))
copulas<0.7,>=0.6.1 (from sdv==0.14.1->-r requirements.txt (line 1))
From the setup.py
of both projects, we can verify the above requirements.
pip install works correctly, but we get the following (snippet):
Collecting llvmlite<0.39,>=0.38.0rc1
Using cached llvmlite-0.38.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Collecting charset-normalizer~=2.0.0; python_version >= "3"
Using cached charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Collecting idna<4,>=2.5; python_version >= "3"
Using cached idna-3.3-py3-none-any.whl (61 kB)
Collecting certifi>=2017.4.17
Using cached certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
Collecting urllib3<1.27,>=1.21.1
Using cached urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
ERROR: numba 0.55.1 has requirement numpy<1.22,>=1.18, but you'll have numpy 1.22.3 which is incompatible.
ERROR: rdt 0.6.4 has requirement scipy<1.8,>=1.5.4, but you'll have scipy 1.8.0 which is incompatible.
ERROR: sdmetrics 0.4.2 has requirement copulas<0.8,>=0.7.0, but you'll have copulas 0.6.1 which is incompatible.
Installing collected packages: tqdm, typing-extensions, torch, numpy, six, python-dateutil, pytz, pandas, deepecho, scipy, threadpoolctl, joblib, scikit-learn, llvmlite, numba, pyts, pyyaml, psutil, rdt, fonttools, cycler, pyparsing, packaging, pillow, kiwisolver, matplotlib, copulas, sdmetrics, charset-normalizer, idna, certifi, urllib3, requests, torchvision, ctgan, graphviz, text-unidecode, Faker, sdv
When uncommenting sdmetrics
from requirements.in
, both commands run "correctly".
Furthermore, when pip-compile
and pip
have cached sdmetrics==0.4.1
, they both select that version instead and no error is shown.
The following file never compiles:
sdv==0.14.1
sdmetrics==0.4.2
I don't know what the appropriate solution to something like this would be. I'm not a library developer.
Currently, compute_metrics does not capture which metrics are erroring out and what errors are being thrown. Instead, metrics that error out are reported as NaNs and ultimately ignored. As a result, end users have little information about what metrics are erroring out and why. Capturing this information could be useful for users of the library (such as sdv.evaluate) when debugging usage.
We could add an error
column to the final scores DataFrame.
The pomegranate
dependency has very narrow sub-dependency ranges, which makes it easily incompatible with other libraries.
This provokes problems such as this one when using SDV on Google Colab.
To avoid this, we can make pomegranate an optional dependency by using a lazy import inside the bayesian_network.py module, which would raise an ImportError inviting the user to install pomegranate if it is not found.
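A sketch of the lazy-import pattern described above (the exact message and function name are assumptions):

def import_pomegranate():
    # Import pomegranate only when a metric that needs it is actually used,
    # and raise a helpful ImportError otherwise.
    try:
        import pomegranate
    except ImportError as error:
        raise ImportError(
            'This metric requires pomegranate, which is an optional dependency. '
            'Please install it with `pip install pomegranate`.'
        ) from error
    return pomegranate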
The GMLogLikelihood metric was added to cover the metrics that existed in the original SDGym iteration, which used the GM Log Likelihood metric over datasets that were simulated using Gaussian Mixtures.
However, even though the implementation was optimized and improved to make the output as stable and meaningful as possible, the scores produced when this metric is run on datasets that were not simulated from GMs tend to be very noisy and may produce inconsistent results between runs. As a consequence, the ranking-based integration test fails randomly.
We may want to remove this metric from the ranking test and have a separate one which is tested using GM simulated data, and also add a disclaimer in the documentation indicating what this metric is best suited for.
The Numerical Privacy Metrics throw an error whenever the target columns (sensitive_fields) contain missing values.
Go through the User Guide to import & load data. Then, scroll down to the Privacy Metrics section.
The following code should work as-is according to the user guide.
NumericalLR.compute( real_data, synthetic_data,
key_fields=['second_perc', 'mba_perc', 'degree_perc'],
sensitive_fields=['salary'])
However, when I try to run this, I get an error from sklearn
because the salary
column contains NaN
values:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Note: The same error is thrown when any of the key_fields contain missing values too. Eg. if I switch around salary and degree_perc in the above example.
This used to work, so either this was a recent change in SDV or in sklearn. What were we doing before? Were we dropping the NaN values, filling them or imputing them?
Also, maybe it's ok if it crashes at first. Maybe the user can re-run with a flag for handling the missing values.
We are calling pydocstyle for rdt, when it should be called for SDMetrics:
Line 116 in 18baa70
The current CI workflows and local test and lint commands do not catch dependency incompatibilities.
For example, installing the repository for development on the v0.6.0-dev branch results in these errors:
ERROR: numba 0.54.1 has requirement numpy<1.21,>=1.17, but you'll have numpy 1.21.3 which is incompatible.
ERROR: sphinx-rtd-theme 0.5.2 has requirement docutils<0.17, but you'll have docutils 0.18 which is incompatible.
ERROR: autopep8 1.6.0 has requirement pycodestyle>=2.8.0, but you'll have pycodestyle 2.7.0 which is incompatible.
A pip check command should be made part of the local and CI tests to make sure that our dependency tree is always clean.
As a user, I want to get a per-column breakdown of KS Scores when using this test on a tabular or multi-table dataset.
See #129 for background.
Create a new method compute_breakdown
that has the same arguments as compute
but returns a per-column breakdown instead of a summarized score.
Requirements:
- Return None when the data type is not compatible.
Tabular:
from sdv.metrics.tabular import KSComplement
KSComplement.compute_breakdown(real_data, synthetic_data, metadata)
{
'age': 0.545656,
'weight': 2343434,
'gender': None, # the data type is categorical, which is not compatible
'gpa': Error # this is a valid type but there was an error running it
}
Relational: Returned object is in the nested form, with the table names at the top
from sdv.metrics.relational import KSComplement
KSComplement.compute_breakdown(real_data, synthetic_data, metadata)
{
'users': {
'age': 0.545656,
'weight': 2343434,
'gender': None, # the data type is categorical, which is not compatible
'gpa': Error # this is a valid type but there was an error running it
},
'transactions': {
'transaction_id': None,
'purchase_amt': 0.988191
}
}
Eventually, we'll want to do the same thing for other metrics that are actually summaries of multiple scores.
The LogisticParentChildDetection metric crashes with a KeyError if the names of the primary_key and foreign_key are different and there is another field on either of the tables that is named like the key on the other table.
For example, the parent table has the field id as its primary key, and a child table contains both id as its own primary key and parent_id as the foreign key to the parent. When this happens, the id fields end up converted to id_x and id_y during the merge, and then the del statements after that fail.
In [1]: import pandas as pd
In [2]: parent = pd.DataFrame({'id': [1, 2, 3, 4]})
In [3]: child = pd.DataFrame({'id': [1, 2, 3, 4], 'parent_id': [1, 2, 3, 4]})
In [4]: foreign_keys = [('parent', 'id', 'child', 'parent_id')]
In [5]: data = {'parent': parent, 'child': child}
In [6]: from sdmetrics.multi_table import LogisticParentChildDetection
In [7]: LogisticParentChildDetection.compute(data, data, foreign_keys=foreign_keys)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/.virtualenvs/SDMetrics/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2894 try:
-> 2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'id'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-7-91c73836e519> in <module>
----> 1 LogisticParentChildDetection.compute(data, data, foreign_keys=foreign_keys)
~/Projects/MIT/SDMetrics/sdmetrics/multi_table/detection/parent_child.py in compute(cls, real_data, synthetic_data, metadata, foreign_keys)
104 scores = []
105 for foreign_key in foreign_keys:
--> 106 real = cls._denormalize(real_data, foreign_key)
107 synth = cls._denormalize(synthetic_data, foreign_key)
108 scores.append(cls.single_table_metric.compute(real, synth))
~/Projects/MIT/SDMetrics/sdmetrics/multi_table/detection/parent_child.py in _denormalize(data, foreign_key)
61 )
62
---> 63 del flat[parent_key]
64 if child_key != parent_key:
65 del flat[child_key]
~/.virtualenvs/SDMetrics/lib/python3.8/site-packages/pandas/core/generic.py in __delitem__(self, key)
3709 # there was no match, this call should raise the appropriate
3710 # exception:
-> 3711 loc = self.axes[-1].get_loc(key)
3712 self._mgr.idelete(loc)
3713
~/.virtualenvs/SDMetrics/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:
-> 2897 raise KeyError(key) from err
2898
2899 if tolerance is not None:
KeyError: 'id'
The Privacy Metrics assume an adversarial attack model where a user with access to a few key_fields
might be able to predict sensitive_fields
.
I understand that we need to fit different models based on whether the sensitive_fields
are categorical vs. numeric. However, it is expected that all the key_fields
are also of the same type. Does this need to be the case? What if I think some categorical columns might be crucial in leaking numeric data (and vice versa)?
Depending on the type of the sensitive_fields, it would be nice to convert the input columns so that they are compatible with the tests:
- If the sensitive_fields are numeric, then we can convert categorical key_fields to numeric, similar to how we do it in KSTestExtended.
- If the sensitive_fields are categorical, then it may be possible to bin the key_fields.
The latest versions of the libraries pandas, sktime and pomegranate are not supported by sdmetrics:

| Library | Upper bound (unsupported) | Latest release |
|---|---|---|
| pandas | 1.1.5 | 1.3.1 |
| pomegranate | 0.14.2 | 0.14.5 |
| sktime | 0.6 | 0.7.0 |
We should investigate why and update the code if necessary to support them.
The workflow tests.yml is referencing the action actions/checkout using the reference v1. However, this reference is missing the commit a6747255bd19d7a757dbdda8c654a9f84db19839, which may contain a fix for some vulnerability.
The fix missing from this action version could be related to:
(1) a CVE fix
(2) an upgrade of a vulnerable dependency
(3) a fix to a secret leak, and others.
Please consider updating the reference to the action.
Right now the metrics are computed based on real data vs synthetic data for ML efficacy. While this information is perfect to gauge whether a good enough model could be fit, it would also be interesting to learn how much performance we lose because of the synthesization.
I'm not sure how to best integrate it with the other metrics. Maybe as additional return values? To stay backwards compatible, returning them could be conditioned on the caller adding the arg compute_relative_performance=True to compute?
synth_f1 = BinaryDecisionTreeClassifier.compute(data, new_data, target='placed')
vs
synth_f1, real_f1, rel_perf = \
BinaryDecisionTreeClassifier.compute(data, new_data, target='placed',
compute_relative_performance=True)
Anyhow the result I'd like to see is something like this:
from sdv.demo import load_tabular_demo
from sdv.metrics.tabular import BinaryDecisionTreeClassifier, BinaryAdaBoostClassifier, BinaryMLPClassifier
from sdv.tabular import CopulaGAN
data = load_tabular_demo('student_placements')
model = CopulaGAN()
model.fit(data)
new_data = model.sample(200)
for clf in [BinaryDecisionTreeClassifier, BinaryAdaBoostClassifier, BinaryMLPClassifier]:
r_f1 = clf.compute(data, data, target='placed')
s_f1 = clf.compute(data, new_data, target='placed')
print(f'{clf.__name__:30s} real f1: {r_f1:5.4f} synth f1: {s_f1:5.4f} performance: {s_f1/r_f1:5.2f}')
Resulting in:
BinaryDecisionTreeClassifier real f1: 1.0000 synth f1: 0.5391 performance: 0.54
BinaryAdaBoostClassifier real f1: 1.0000 synth f1: 0.6296 performance: 0.63
BinaryMLPClassifier real f1: 1.0000 synth f1: 0.5693 performance: 0.57
I am a bit at a loss here as to whether it is ok to compare both models so directly, as the SDV generation process may produce NaNs and infinities that are silently replaced in the evaluation code, but may still have an impact.
The relational KSTest
is supposed to run the KSTest
on all numerical columns in all tables and return the average score.
However, this test crashes if it encounters a table that has no numerical columns. I expect this test to succeed as long as there is at least 1 numerical column in any of the tables.
Use the relational demo dataset and pass it in with the metadata.
from sdv.metrics.demos import load_multi_table_demo
from sdv.metrics.relational import KSTest
real_data, synthetic_data, metadata = load_multi_table_demo()
KSTest.compute(real_data, synthetic_data, metadata)
Output:
/usr/local/lib/python3.7/dist-packages/sdmetrics/single_table/base.py in _select_fields(cls, metadata, types)
78
79 if len(fields) == 0:
---> 80 raise IncomputableMetricError(f'Cannot find fields of types {types}')
81
82 return fields
IncomputableMetricError: Cannot find fields of types ('numerical',)
I believe this is happening because table sessions
has no numerical columns. Interestingly, it does work if I exclude the metadata
object -- because then it starts assuming that the id
field is a numerical column.
KSTest.compute(real_data, synthetic_data)
0.8555555555555556
The latest version of scipy causes the following error:
AttributeError: 'str' object has no attribute 'decode' #981
Downgrading to a previous version fixes the issue, as suggested here.
When attempting to evaluate data that contains PII
fields, this fails because the fake data didn't contain a given record.
Using the SDV
tabular demo for PII
:
from sdv.demo import load_tabular_demo
from sdv.tabular import GaussianCopula
data_pii = load_tabular_demo('student_placements_pii')
model = GaussianCopula(
primary_key='student_id',
anonymize_fields={
'address': 'address'
}
)
model.fit(data_pii)
new_data_pii = model.sample(200)
from sdv.metrics.tabular import KSTestExtended
KSTestExtended.compute(data_pii, new_data_pii)
This will end up producing the following error:
~/.virtualenvs/SDV/lib/python3.7/site-packages/rdt/transformers/categorical.py in _get_value(self, category)
111 category = np.nan
112
--> 113 mean, std = self.intervals[category][2:]
114
115 if self.fuzzy:
KeyError: 'USS Fowler\nFPO AA 99303'
Where the KeyError
will change depending on the data that you may have on the real dataset.
Simply drop all the PII
fields that are within the real_data
and the synthetic_data
in order to evaluate with this metric.
Here is a working solution for this demo:
ks_data_pii = data_pii.drop('address', axis=1)
ks_new_data_pii = new_data_pii.drop('address', axis=1)
KSTestExtended.compute(ks_data_pii, ks_new_data_pii)
Having an easy way of measuring the privacy of synthesized data would be very useful for users of the tool. It could be added on top of the existing evaluation metrics sdv-dev/SDV#52.
An easy way to measure it would be to calculate the average euclidean distance to the closest neighbour between real and synthetic data, as was used in the TableGAN paper. However, that would apply only to numerical data, which in the case of SDV is sometimes not enough.
It could also be implemented in a way that lets the user specify which fields to use in the evaluation; for example, some more sensitive fields should be taken into account while others can be ignored.
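A small sketch of the distance-to-closest-neighbour idea over numerical fields (all names are hypothetical; this is not an SDMetrics implementation):

import numpy as np
from scipy.spatial.distance import cdist

def mean_distance_to_closest_real_record(real_data, synthetic_data, columns):
    # Average Euclidean distance from each synthetic row to its closest real
    # row, computed over the selected numerical columns. Larger values suggest
    # the synthetic rows sit further away from any real individual.
    real_values = real_data[columns].to_numpy(dtype=float)
    synthetic_values = synthetic_data[columns].to_numpy(dtype=float)
    distances = cdist(synthetic_values, real_values)
    return float(distances.min(axis=1).mean())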
The current README doesn't show the latest output. More specifically, the command sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata) currently doesn't print the same as what the README shows (e.g. the current code produces a column named error containing None values, which the README doesn't have, as well as other changes).
The current class organization includes some metrics which are usable on their own, like BNLikelihood, but which have subclasses and are therefore not picked up by the get_subclasses method. As a result, BNLikelihood is not available in the sdv.evaluation.evaluate function: sdv-dev/SDV#327
The class organization and subclass selection should be reviewed to ensure that all the usable metrics are properly selected by the get_subclasses method.
As an end user, we are finding it difficult to interpret the results of SDMetrics. The documentation for SDMetrics could be more elaborate, so that users can easily interpret the quality of the generated data. There are some parameters in the report which are hard to guess.
More comprehensive documentation is needed for SDMetrics.
The CategoricalSVM
class should be added to the imports in the sdmetrics/single_table/privacy/__init__.py
and sdmetrics/single_table/__init__.py
files. It should also be added to the readme.
sktime==0.5.2
is already released but SDMetrics only supports 'sktime>=0.4,<0.5'
.
On top of that, only sktime>0.5
versions are available in Conda, so SDMetrics can only support conda installation if we upgrade to the latest sktime versions.
Current time series metrics in SDMetrics are detection/classifier based. It would be beneficial to have a metric that assesses the similarity between the synthetic time series and the original one. An example of such a metric would be something that compares the autocorrelation of the original time series with the autocorrelation of the synthetic one.
In this case, the sampled sequences do not preserve the correlation of the time series with itself.
Since the most important values in autocorrelation are the ones at low lags, we can take the maximum as a "metric" of how well the AC is preserved. Other ideas for a metric to assess the seasonality/periodicity of the signal can be constructed around the FFT of the two signals.
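A minimal sketch of such a comparison, assuming we look at the first few lags and score via the maximum absolute difference (the function name and scoring choice are assumptions):

import numpy as np
import pandas as pd

def autocorrelation_similarity(real_series, synthetic_series, max_lag=10):
    # Compare autocorrelation at low lags and report 1 minus the maximum
    # absolute difference, so 1.0 means the autocorrelations match exactly.
    real_ac = np.array([real_series.autocorr(lag) for lag in range(1, max_lag + 1)])
    synthetic_ac = np.array([synthetic_series.autocorr(lag) for lag in range(1, max_lag + 1)])
    return 1 - float(np.nanmax(np.abs(real_ac - synthetic_ac)))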
The logistic detection metric is calculated as 1 minus the average ROC AUC score across all validation splits.
A ROC AUC of 0.5 means that the classifier is making predictions at random because the data is indistinguishable.
If the goal is to determine how similar the synthetic and original data are, shouldn't the goal of this metric be to get as near to 0.5 as possible (meaning it can't distinguish between the two sets), rather than to maximize this metric?
Please help me clarify this.
Thank you
To compare a single column of data (synthetic vs. real), the SDV offers KSTest
for numerical values and CSTest
for categorical.
While KSTest
is easy to understand & reason about, CSTest
is not: It's returning a p-value, which is based on a confidence interval and subject to change if there are a different # of rows (even if the categories are in the same proportions) .
I would like another, easy-to-understand metric for comparing 2 categorical distributions. It should generally tell me if the frequencies of the categories are similar between the real vs. synthetic data.
Should we add a TVD
metric to compute the distance between 2 categorical columns? Per this tutorial the metric definition would be:
1/2 * SUM{across all categories}( abs(synthetic_frequency - real_frequency) )
It seems to me that TVD for categorical variables is a similar & useful complement to the KSTest. It's computing the absolute value of differences between the two distributions, and it's also naturally bounded between 0 and 1.
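A short sketch of how that TVD definition could be computed for a single pair of categorical columns (the function name is hypothetical):

import pandas as pd

def total_variation_distance(real_column, synthetic_column):
    # Compare the category frequencies of the two columns:
    # TVD = 1/2 * sum over categories of |synthetic_frequency - real_frequency|.
    real_freq = real_column.value_counts(normalize=True)
    synthetic_freq = synthetic_column.value_counts(normalize=True)
    categories = real_freq.index.union(synthetic_freq.index)
    real_freq = real_freq.reindex(categories, fill_value=0)
    synthetic_freq = synthetic_freq.reindex(categories, fill_value=0)
    return 0.5 * float((synthetic_freq - real_freq).abs().sum())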
I noticed that when I use the SDMetrics
version for CSTest
for a single column, it is providing higher pvalues than if I directly use scipy.stats.chisquare
.
Upon closer examination, I think this is because CSTest
is normalizing the frequencies (so they add up to 1) before calling chisquare
. Why is this done? In fact, if I read the scipy docs, it provides a note that makes me believe it's expecting the total counts, not the normalized frequency:
This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5.
Concrete example below:
import pandas as pd
from scipy.stats import chisquare
from sdmetrics.single_table import CSTest

expected = ['A']*10 + ['B']*10 + ['C']*10
observed = ['A']*15 + ['B']*15
# returns p value 0.7788
CSTest.compute(pd.DataFrame(data=expected), pd.DataFrame(data=observed))
# using normalized frequencies, pvalue is the same
chisquare([0.5, 0.5, 0], [0.33333333, 0.333333333, 0.33333333])
# using actual frequencies, pvalue is much lower at 0.00055
chisquare([15, 15, 0], [10, 10, 10])
If this is to be an indication of whether the synthesized data properly fits the real data, shouldn't we stop normalizing this way? As we synthesize more data points, we should expect more & more confidence about whether the synthesized data matches the real data?
Currently SDMetrics only provides the two-sample KS test to compare numerical values. We should consider adding other tests as an optional parameter, so the user can choose a test which better matches their understanding of their data.
Additionally, we should explore substituting the KS test with the Anderson-Darling test as the default, since it is a more powerful test overall. Such a change would require experimentation to show that the AD test indeed outperforms the KS test in most use cases.
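For reference, a minimal sketch of a two-sample Anderson-Darling comparison using scipy (this is only an illustration of the test, not an SDMetrics metric):

from scipy.stats import anderson_ksamp

def anderson_darling_two_sample(real_column, synthetic_column):
    # anderson_ksamp compares k samples; with two samples it acts as a
    # two-sample AD test and returns a statistic plus an approximate
    # significance level (scipy floors it at 0.1% and caps it at 25%).
    result = anderson_ksamp([real_column.dropna(), synthetic_column.dropna()])
    return result.statistic, result.significance_level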
The LSTM classifier doesn't support variable length time series unless they are sorted from longest to shortest. Since we don't need ONNX compatibility, we can remove this restriction:
RuntimeError: lengths array must be sorted in decreasing order when enforce_sorted is True. You can pass enforce_sorted=False to pack_padded_sequence and/or pack_sequence to sidestep this requirement if you do not need ONNX exportability.
And vice-versa. Currently, if the wrong data type is passed, it will simply return nan. It should raise an error instead.
Below is code to reproduce this phenomenon:
import pandas as pd
from sdmetrics.single_table.privacy import CategoricalCAP
data = pd.DataFrame({ # data containing only numerical values
'key': [1.4, 10.12, 3.4],
'sensitive': [10.9, 9.8, 8.8]
})
score = CategoricalCAP.compute( # privacy metric that's supposed to only work with categorical values
data,
data,
key_fields=['key'],
sensitive_fields=['sensitive']
)
print(score) # this will print `nan`
Following up on issue #102, sktime does not support Python 3.9, so we would like to use pyts instead.