
metasyn-disclosure-control's Introduction

Metasyn disclosure control

Project status: WIP – initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

A privacy plugin for metasyn, based on statistical disclosure control (SDC) rules of thumb from published SDC guidance.

While producing synthetic data with metasyn is already a great first step towards protecting privacy, it does not by itself adhere to official standards. For example, fitting a uniform distribution discloses the lowest and highest values in the dataset, which can be a privacy issue with sensitive data. This plugin addresses these kinds of problems.
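To make the uniform-distribution example concrete (a toy sketch with made-up numbers, not metasyn code): the maximum-likelihood fit of a uniform distribution is simply the observed minimum and maximum, so both extreme values of the real data end up verbatim in the fitted model.

import numpy as np

ages = np.array([23, 31, 17, 89, 45])  # toy "sensitive" column

# The maximum-likelihood uniform fit stores exactly the sample minimum and
# maximum, so the youngest and oldest real records would be disclosed.
low, high = ages.min(), ages.max()
print(low, high)  # 17 89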

The disclosure control plugin is currently a work in progress. Especially in light of this, we disclaim any responsibility for consequences of using this plugin.

Installing the plugin

The easiest way to install the plugin is by using pip.

In your terminal (with the Python environment you want to install into activated), run:

pip install git+https://github.com/sodascience/metasyn-disclosure-control.git

Alternatively, to install it from within a Jupyter notebook, run:

!pip install git+https://github.com/sodascience/metasyn-disclosure-control.git

Usage

Basic usage for our built-in titanic dataset is as follows:

from metasyn import MetaFrame, demo_dataframe
from metasyncontrib.disclosure import DisclosurePrivacy
from metasyncontrib.disclosure.string import DisclosureFaker

# Load the built-in titanic demo dataset
df = demo_dataframe("titanic")

# Per-variable specifications: mark PassengerId as unique and synthesize
# Name with a disclosure-safe faker distribution
spec = [
    {"name": "PassengerId", "distribution": {"unique": True}},
    {"name": "Name", "distribution": DisclosureFaker("name")},
]

# Fit the metadata with the disclosure-control provider and privacy rules
mf = MetaFrame.fit_dataframe(
    df=df,
    dist_providers=["builtin", "metasyn-disclosure"],
    privacy=DisclosurePrivacy(),
    var_specs=spec,
)

# Generate five synthetic rows
mf.synthesize(5)
shape: (5, 13)
┌─────────────┬────────────────────┬────────┬──────┬───┬────────────┬────────────┬─────────────────────┬────────┐
│ PassengerId ┆ Name               ┆ Sex    ┆ Age  ┆ … ┆ Birthday   ┆ Board time ┆ Married since       ┆ all_NA │
│ ---         ┆ ---                ┆ ---    ┆ ---  ┆   ┆ ---        ┆ ---        ┆ ---                 ┆ ---    │
│ i64         ┆ str                ┆ cat    ┆ i64  ┆   ┆ date       ┆ time       ┆ datetime[μs]        ┆ f32    │
╞═════════════╪════════════════════╪════════╪══════╪═══╪════════════╪════════════╪═════════════════════╪════════╡
│ 0           ┆ Benjamin Cox       ┆ female ┆ 27   ┆ … ┆ 1931-12-01 ┆ 14:33:06   ┆ 2022-07-30 02:16:37 ┆ null   │
│ 1           ┆ Mr. David Robinson ┆ female ┆ null ┆ … ┆ 1906-02-18 ┆ null       ┆ 2022-08-03 13:09:19 ┆ null   │
│ 2           ┆ Randy Mosley       ┆ male   ┆ 24   ┆ … ┆ 1933-01-06 ┆ 15:52:54   ┆ 2022-07-18 18:52:05 ┆ null   │
│ 3           ┆ Vincent Maddox     ┆ female ┆ 24   ┆ … ┆ 1937-02-10 ┆ 16:58:30   ┆ 2022-07-23 20:29:49 ┆ null   │
│ 4           ┆ Kristin Holland    ┆ male   ┆ 17   ┆ … ┆ 1939-12-09 ┆ 18:07:45   ┆ 2022-08-05 02:41:51 ┆ null   │
└─────────────┴────────────────────┴────────┴──────┴───┴────────────┴────────────┴─────────────────────┴────────┘

Implementation details

The rules of thumb, roughly, are:

  • at least 10 units
  • at least 10 degrees of freedom
  • no group disclosure
  • no dominance

For most distributions, we implemented micro-aggregation. This technique pre-averages a sorted version of the data, which is then supplied to the original fitting mechanism. The idea is that the rules of thumb are enforced during this pre-averaging step, so the fitting method itself doesn't need to do anything in particular. Statistically, this discards more information than is strictly necessary, but it should ensure the safety of the data.
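As an illustration of the idea only (the helper below is a hypothetical sketch, not the plugin's actual implementation, and the partition size of 11 is an assumption): sort the values, split them into blocks of roughly the partition size, and replace every value by its block mean before fitting.

import numpy as np

def micro_aggregate_sketch(values, partition_size=11):
    # Hypothetical micro-aggregation: replace each value by the mean of
    # its sorted partition, so individual extremes are never exposed.
    values = np.sort(np.asarray(values, dtype=float))
    n_partitions = max(len(values) // partition_size, 1)
    aggregated = np.empty_like(values)
    for block in np.array_split(np.arange(len(values)), n_partitions):
        aggregated[block] = values[block].mean()
    return aggregated

# The distribution is then fitted on the pre-averaged values, so for example
# the true minimum and maximum of the original column are never disclosed.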

Contributing

You can contribute to this metasyn plugin by giving feedback in the "Issues" tab, or by creating a pull request.

To create a pull request:

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contact

This is a project by the ODISSEI Social Data Science (SoDa) team. Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact Raoul Schram or Erik-Jan van Kesteren.

metasyn-disclosure-control's Issues

Categorical distribution if largest label >90%

So even if by chance we have a share >0.9, we will return the default one here?
Actually, we should return the default one, but with the correct labels.


Are you sure about this? This might be worth checking with someone, I think. The two cases:

  • Only one label left (because of the counts > 11 rule) -> probably not allowed to disclose the label?
  • Multiple labels left -> if we set it to the default distribution, we leak the information that one of them has >90%; is that okay? We could also generate a new distribution with counts > 11, so that they become indistinguishable. Outside information might still recover the fact that it was random, though, and running it twice would reveal it as well, of course.
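For context, a toy sketch of the check under discussion (the helper name and the 0.9 threshold are illustrative assumptions, not the plugin's code):

from collections import Counter

def largest_label_share(labels):
    # Hypothetical helper: share of the most frequent label
    counts = Counter(labels)
    return max(counts.values()) / sum(counts.values())

labels = ["third class"] * 95 + ["first class"] * 5
if largest_label_share(labels) > 0.9:
    # Fitting the observed frequencies would reveal that one group dominates;
    # the open question above is what to return instead, and with which labels.
    pass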

Does partition_size always get passed correctly?

@qubixes I managed to create a reprex, let's discuss this!

It seems that, when set from the TOML, partition_size does not necessarily get imposed properly; see the reprex below:

import tempfile

import polars as pl
from metasyn import MetaFrame
from metasyn.config import MetaConfig

config_toml_11 = """
dist_providers = ["builtin", "metasyn-disclosure"]

[privacy]
name = "disclosure"
parameters.partition_size = 11
"""
config_11 = tempfile.NamedTemporaryFile("w", suffix=".toml", delete=False)
config_11.write(config_toml_11)
config_11.close()

config_toml_55 = """
dist_providers = ["builtin", "metasyn-disclosure"]

[privacy]
name = "disclosure"
parameters.partition_size = 55
"""
config_55 = tempfile.NamedTemporaryFile("w", suffix=".toml", delete=False)
config_55.write(config_toml_55)
config_55.close()

df = pl.DataFrame({"x1": range(30)})
mf_no = MetaFrame.fit_dataframe(df)
mf_11 = MetaFrame.fit_dataframe(df, var_specs=MetaConfig.from_toml(config_11.name))
mf_55 = MetaFrame.fit_dataframe(df, var_specs=MetaConfig.from_toml(config_55.name))

print(mf_no)
print(mf_11)
print(mf_55) # ??? why does 55 work & give same result as 11?? it should not!

Even though this info is correct:

mf_55[0].creation_method
{'created_by': 'metasyn', 'privacy': {'name': 'disclosure', 'parameters': {'partition_size': 55}}}

Bug in compute_dominance

Sorry, no reprex, but here is the traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\erikj\surfdrive\SoDa\projects\synthetic_data\metasynth-YOUth-pilot\synthesize.py", line 41, in metaframe_from_filename
    return MetaFrame.fit_dataframe(df=df_real, meta_config=config) #, dist_providers="metasynth-disclosure", privacy=DisclosurePrivacy())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\erikj\AppData\Roaming\Python\Python311\site-packages\metasyn\metaframe.py", line 148, in fit_dataframe
    var = MetaVar.fit(
          ^^^^^^^^^^^^
  File "C:\Users\erikj\AppData\Roaming\Python\Python311\site-packages\metasyn\var.py", line 183, in fit
    distribution = provider_list.fit(series, var_type, dist_spec, privacy)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\erikj\AppData\Roaming\Python\Python311\site-packages\metasyn\provider.py", line 226, in fit
    return self._find_best_fit(series, var_type, unique, privacy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\erikj\AppData\Roaming\Python\Python311\site-packages\metasyn\provider.py", line 278, in _find_best_fit
    dist_instances = [d.fit(series, **privacy.fit_kwargs) for d in dist_list]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\erikj\AppData\Roaming\Python\Python311\site-packages\metasyn\provider.py", line 278, in <listcomp>
    dist_instances = [d.fit(series, **privacy.fit_kwargs) for d in dist_list]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\erikj\surfdrive\SoDa\projects\synthetic_data\metasyn-disclosure-control\metasyncontrib\disclosure\numerical.py", line 16, in fit
    sub_series = micro_aggregate(pl_series, n_avg)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\erikj\surfdrive\SoDa\projects\synthetic_data\metasyn-disclosure-control\metasyncontrib\disclosure\utils.py", line 83, in micro_aggregate
    sub_values, dominance = _create_subsample(values, *cur_settings)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\erikj\surfdrive\SoDa\projects\synthetic_data\metasyn-disclosure-control\metasyncontrib\disclosure\utils.py", line 49, in _create_subsample
    dominance = max(_compute_dominance(block_values, reverse=False),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\erikj\surfdrive\SoDa\projects\synthetic_data\metasyn-disclosure-control\metasyncontrib\disclosure\utils.py", line 24, in _compute_dominance
    return np.max(dominance)
           ^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in amax
  File "C:\Users\erikj\AppData\Roaming\Python\Python311\site-packages\numpy\core\fromnumeric.py", line 2820, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\erikj\AppData\Roaming\Python\Python311\site-packages\numpy\core\fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: zero-size array to reduction operation maximum which has no identity
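For reference, the final error is NumPy's standard complaint when a maximum reduction is applied to an empty array; a minimal reproduction of just that failure (independent of the plugin's code path) is:

import numpy as np

# np.max has no identity element, so reducing a zero-size array raises
# "zero-size array to reduction operation maximum which has no identity".
np.max(np.array([]))

So the underlying question is presumably how _compute_dominance ends up being called with an empty block of values.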
