
SDMetrics

An open source project from the Data to AI Lab (DAI-Lab) at MIT.


Metrics for Synthetic Data Generation Projects

Overview

The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after. It includes a variety of metrics such as:

  • Statistical metrics which use statistical tests to compare the distributions of the real and synthetic datasets.
  • Detection metrics which use machine learning to try to distinguish between real and synthetic data.
  • Descriptive metrics which compute descriptive statistics on the real and synthetic datasets independently and then compare the values.

Install

Requirements

SDMetrics has been developed and tested on Python 3.5, 3.6, 3.7 and 3.8

Although it is not strictly required, using a virtualenv is highly recommended in order to avoid interfering with other software installed on the system where SDMetrics is run.

Install with pip

The easiest and recommended way to install SDMetrics is using pip:

pip install sdmetrics

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project please read the Contributing Guide.

Basic Usage

Let's run the demo code from SDV to generate a simple synthetic dataset:

from sdv import load_demo, SDV

# Load the demo metadata and a dictionary of real tables
metadata, real_tables = load_demo(metadata=True)

# Fit the default SDV model to the real data
sdv = SDV()
sdv.fit(metadata, real_tables)

# Sample a dictionary of synthetic tables
synthetic_tables = sdv.sample_all(20)

Now that we have a synthetic dataset, we can evaluate it using SDMetrics by calling the evaluate function which returns an instance of MetricsReport with the default metrics:

from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)

Examining Metrics

This report object makes it easy to examine the metrics at different levels of granularity. For example, the overall method returns a single scalar value which acts as a composite score combining all of the metrics. This score can be passed to an optimization routine (e.g., to tune the hyperparameters of a model) and minimized in order to obtain higher quality synthetic data.

print(report.overall())
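
If you want to use this score inside an optimization loop, the following is a minimal sketch that reuses the demo model from above. It only varies the number of sampled rows as a stand-in for whatever hyperparameter you actually want to tune, and it relies on the fact, described above, that a lower overall score corresponds to higher quality synthetic data.

from sdv import load_demo, SDV
from sdmetrics import evaluate

metadata, real_tables = load_demo(metadata=True)

sdv = SDV()
sdv.fit(metadata, real_tables)

best_score = None
best_n_rows = None
for n_rows in (10, 20, 50):
    # Sample a candidate synthetic dataset and score it against the real data
    synthetic_tables = sdv.sample_all(n_rows)
    score = evaluate(metadata, real_tables, synthetic_tables).overall()

    # Keep the candidate with the lowest (best) composite score
    if best_score is None or score < best_score:
        best_score, best_n_rows = score, n_rows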

In addition, the report provides a highlights method which identifies the worst performing metrics. This provides useful hints to help users identify where their synthetic data falls short (i.e. which tables/columns/relationships are not being modeled properly).

print(report.highlights())

Visualizing Metrics

Finally, the report object provides a visualize method which generates a figure showing some of the key metrics.

figure = report.visualize()
figure.savefig("sdmetrics-report.png")

Advanced Usage

Specifying Metrics

Instead of running all the default metrics, you can specify exactly what metrics you want to run by creating an empty MetricsReport and adding the metrics yourself. For example, the following code only computes the machine learning detection-based metrics.

from sdmetrics import detection
from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(detection.metrics(metadata, real_tables, synthetic_tables))
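
You can also combine several metric families in the same report by calling add_metrics once per family. As a sketch, assuming that sdmetrics.statistical exposes the same metrics(metadata, real_tables, synthetic_tables) entry point as sdmetrics.detection:

from sdmetrics import detection, statistical
from sdmetrics.report import MetricsReport

report = MetricsReport()

# Machine learning detection-based metrics
report.add_metrics(detection.metrics(metadata, real_tables, synthetic_tables))

# Statistical test-based metrics (assuming the same metrics() interface)
report.add_metrics(statistical.metrics(metadata, real_tables, synthetic_tables))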

Creating Metrics

Suppose you want to add some new metrics to this library. To do this, you simply need to write a function which yields instances of the Metric object:

from sdmetrics.report import Metric

def my_custom_metrics(metadata, real_tables, synthetic_tables):
    name = "abs-diff-in-number-of-rows"

    for table_name in metadata.get_tables():

        # Absolute difference in number of rows
        nb_real_rows = len(real_tables[table_name])
        nb_synthetic_rows = len(synthetic_tables[table_name])
        value = float(abs(nb_real_rows - nb_synthetic_rows))

        # Specify some useful tags for the user
        tags = {
            "priority:high",
            "table:%s" % table_name,
        }

        yield Metric(name, value, tags)

To attach your metrics to a MetricsReport object, you can use the add_metrics method and provide your custom metrics iterator:

from sdmetrics.report import MetricsReport

report = MetricsReport()
report.add_metrics(my_custom_metrics(metadata, real_tables, synthetic_tables))

See sdmetrics.detection, sdmetrics.efficacy, and sdmetrics.statistical for more examples of how to implement metrics.

Filtering Metrics

The MetricsReport object includes a details method which returns all of the metrics that were computed.

from sdmetrics import evaluate

report = evaluate(metadata, real_tables, synthetic_tables)
report.details()

To filter these metrics, you can provide a filter function. For example, to see only the metrics that are associated with the users table, you can run:

def my_custom_filter(metric):
    return "table:users" in metric.tags

report.details(my_custom_filter)

Examples of standard tags implemented by the built-in metrics are shown below.

Tag              | Description
-----------------|------------
priority:high    | Tells the user to pay extra attention to this metric. It typically indicates that the objects being evaluated by the metric are unusually bad (i.e. the synthetic values look very different from the real values).
table:TABLE_NAME | Indicates that the metric involves the table specified by TABLE_NAME.
column:COL_NAME  | Indicates that the metric involves the column specified by COL_NAME. If the column names are not unique across the entire database, then it needs to be combined with the table:TABLE_NAME tag to uniquely identify a specific column.
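
Filters can also combine several of these standard tags. As an illustrative sketch (reusing the report created above), a filter that keeps only high priority metrics for the users table could look like this:

def high_priority_users_filter(metric):
    # Keep only metrics that are both high priority and related to the users table
    return "priority:high" in metric.tags and "table:users" in metric.tags

report.details(high_priority_users_filter)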

As this library matures, we will define additional standard tags and/or promote them to first class attributes.

What's next?

For more details about SDMetrics and all its possibilities and features, please check the documentation site.
