sodascience / metasyn
Transparent generation of synthetic tabular data with privacy guarantees
Home Page: https://metasyn.readthedocs.io
License: MIT License
Some thought needs to be put into how this interacts with the plugin system for privacy packages.
Currently, we infer whether a given variable is unique or not. This should be a user choice instead.
Add provenance, timestamp
bla_distribution = RegexDistribution([(r"[ABCDEF]",1),(r"[0-9]",3)])
A dataframe read from CSV has several empty columns. When generating a meta dataset from the df, all empty columns get an empty name. The column names are imported correctly, since I can address them when passing a spec to MetaDataset.from_dataframe(), yet they fail to show up in the new meta dataset.
When calling synthesize(), a variable of type int64 with prop_missing > 0 (details below) sometimes gives:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

presumably on the missing fields (this doesn't happen with prop_missing=0).

{
  "name": "Last page",
  "type": "discrete",
  "dtype": "int64",
  "prop_missing": 0.1,
  "distribution": {
    "name": "DiscreteUniformDistribution",
    "parameters": { "low": 6, "high": 7 }
  }
},
Options:
generate a schema so that correctness can be checked based on the serialized metadata only
Date now has begin_time and end_time as parameters. Let's just make them start and end? It's a bit confusing otherwise for dates.
RegexDistribution(r"[ABCDEF]\d{2,3,5}") generates

ValueError: Failed to determine regex from '{2,3,5}'

while RegexDistribution([(r"[ABCDEF]",1),(r"\d{2,3,5}",1)]) passes without errors and is transformed to [('[ABCDEF]', 1), ('\d{1,1}', 1)]. How should we handle this?
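For reference, `{2,3,5}` is not valid repetition-quantifier syntax; Python's re module falls back to treating the braces as literal characters rather than raising, which may be why the list form slips through:

```python
import re

# {2,3,5} is not a valid quantifier, so re treats the braces literally:
assert re.fullmatch(r"\d{2,3,5}", "7{2,3,5}") is not None
assert re.fullmatch(r"\d{2,3,5}", "735") is None

# A well-formed range quantifier behaves as expected:
assert re.fullmatch(r"\d{2,3}", "735") is not None
```

So rejecting the pattern with a clear error (as the string form does) is arguably the more consistent behavior.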
MetaDataset.from_dataframe(df, primary_key = "person_id")
We only need it for type checking, so we should make pandas optional.
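A common pattern for making pandas a type-checking-only dependency is `typing.TYPE_CHECKING` plus duck typing at runtime (a sketch, not the actual metasynth code; `column_names` is a hypothetical helper):

```python
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    import pandas as pd  # imported by type checkers only, never at runtime

def column_names(df: "pd.DataFrame") -> List[str]:
    """Duck-typed: works for any object exposing a .columns attribute."""
    return list(df.columns)
```

At runtime this never imports pandas, so the package stays importable without it.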
We need to think about detecting whether a string is structured (e.g., an ID) or unstructured (free text). Generation of these then needs to be done with smart underlying distributions, e.g., a regex for structured strings:

\w{4}\d{4}\s\w{2,5}

matches

sdfa1231 ll
ewfa1444 lb
ewfa1444 lbb66
to be continued...
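The example above checks out with Python's re (a quick verification, not metasynth code):

```python
import re

pattern = re.compile(r"\w{4}\d{4}\s\w{2,5}")

# All three structured examples from the issue match:
for structured in ["sdfa1231 ll", "ewfa1444 lb", "ewfa1444 lbb66"]:
    assert pattern.fullmatch(structured)

# Free text does not fit the structured pattern:
assert pattern.fullmatch("some free text") is None
```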
Instead of
MetaSynth.from_dataframe(df, unique={"PassengerId": True}, distribution={"PassengerId": UniqueKeyDistribution})
flip it so that we have a single column-spec argument:
generation_spec = {
"PassengerId": {"unique": True, "distribution": UniqueKeyDistribution}
}
MetaSynth.from_dataframe(df, spec=generation_spec)
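A sketch of how the proposed spec argument could be resolved per column (the default keys and `resolve_spec` helper are assumptions for illustration):

```python
def resolve_spec(columns, spec):
    """Merge per-column overrides from `spec` with defaults."""
    defaults = {"unique": None, "distribution": None}
    return {col: {**defaults, **spec.get(col, {})} for col in columns}

generation_spec = {"PassengerId": {"unique": True}}
resolved = resolve_spec(["PassengerId", "Age"], generation_spec)
# resolved["PassengerId"] -> {"unique": True, "distribution": None}
```

Grouping all per-column options under one argument also leaves room for future keys without growing the from_dataframe signature.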
RegexDistribution([(r"[ABCDEF]",2),(r"\d",3)])
in a smart way; this needs to be accessible with many distributions.
{'name': 'all_NA', 'description': None, 'type': 'string', 'dtype': 'Utf8', 'prop_missing': 1.0, 'distribution': '\\d{3,4}'}
One thing that I feel is not quite satisfactory is the multiple-inheritance structure in some places of the code. I have read some articles, and perhaps we can do better with decorators.
def distribution(implements=None, provenance=None, var_type=None, is_unique=None):
    """Class decorator that attaches distribution metadata as class attributes."""
    def _wrap(cls):
        if implements is not None:
            cls.implements = implements
        # if provenance is not None:
        #     cls.provenance = provenance
        if var_type is not None:
            cls.var_type = var_type
        if is_unique is not None:
            cls.is_unique = is_unique
        return cls
    return _wrap
This would be the decorator that you would put in front of the new implementation:
import metasynth as ms
@ms.distribution(implements="core.uniform", var_type="continuous", is_unique=False)
class UniformDistribution(BaseDistribution):
    ...
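A self-contained check of the pattern (BaseDistribution is stubbed in here, since this is only a sketch of the proposal):

```python
def distribution(implements=None, var_type=None, is_unique=None):
    """Class decorator attaching distribution metadata as class attributes."""
    def _wrap(cls):
        if implements is not None:
            cls.implements = implements
        if var_type is not None:
            cls.var_type = var_type
        if is_unique is not None:
            cls.is_unique = is_unique
        return cls
    return _wrap

class BaseDistribution:  # stub standing in for the real base class
    pass

@distribution(implements="core.uniform", var_type="continuous", is_unique=False)
class UniformDistribution(BaseDistribution):
    pass

assert UniformDistribution.implements == "core.uniform"
assert UniformDistribution.var_type == "continuous"
assert UniformDistribution.is_unique is False
```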
Advantages: no more need for the ContinuousDistribution, BaseDisclosure, etc. classes.
Disadvantages:
I feel we should refactor our categorical (multinoulli) distribution to be more in line with common practice: a CategoricalDistribution under DiscreteDistribution (we should get rid of the existing CategoricalDistribution at this level), with labels (what is now called categories) and probabilities (what is now counts). I'm happy to discuss the implications here, also of making both Category and integer-valued columns DiscreteDistribution.
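A hedged sketch of the proposed shape, with a constructor from the current counts; only the names labels/probabilities come from the proposal, the class bodies are assumptions:

```python
import random

class DiscreteDistribution:
    """Stub base class for the sketch."""

class CategoricalDistribution(DiscreteDistribution):
    def __init__(self, labels, probabilities):
        assert abs(sum(probabilities) - 1.0) < 1e-9, "probabilities must sum to 1"
        self.labels = list(labels)
        self.probabilities = list(probabilities)

    @classmethod
    def from_counts(cls, labels, counts):
        """Bridge from the current representation: normalize counts."""
        total = sum(counts)
        return cls(labels, [c / total for c in counts])

    def draw(self):
        return random.choices(self.labels, weights=self.probabilities, k=1)[0]

dist = CategoricalDistribution.from_counts(["a", "b"], [3, 1])
# dist.probabilities -> [0.75, 0.25]
```

Storing probabilities rather than counts matches the usual parameterization of the multinoulli distribution and avoids leaking raw frequencies into the metadata.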
Something weird is happening with the from_dataframe method in MetaDataset: the start/end parameters of the distributions have inconsistent types.
We should not have multiple columns with the same name. Let's find ways of enforcing this.
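Enforcing this could be as simple as a duplicate check at construction time (a sketch; `duplicate_columns` is a hypothetical helper):

```python
from collections import Counter

def duplicate_columns(columns):
    """Return column names that occur more than once, in first-seen order."""
    return [name for name, count in Counter(columns).items() if count > 1]

assert duplicate_columns(["id", "age", "id"]) == ["id"]
assert duplicate_columns(["id", "age"]) == []
```

Raising early with the offending names gives a much clearer error than whatever fails downstream.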
If the original data has no microseconds, we currently still generate microseconds in the synthesized data.
The package is called metasynth, but the repo is called meta-synth. Let's keep metasynth?
It would be nice to be able to constrain the distributions that are being chosen for specific variables.
e.g., for income variable if I know it's lognormal, only try lognormal
var = MetaVar.detect(my_series)
var.fit(dist="lognormal")  # it would be nice to have smart matching of the name here
MetaDataset.from_dataframe(df, dist = {"income": "lognormal"})
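"Smart matching" of the name could normalize user input before the registry lookup; a sketch (the registry contents and `match_distribution` helper are made up for illustration):

```python
registry = {
    "lognormal": "LogNormalDistribution",
    "normal": "NormalDistribution",
    "uniform": "UniformDistribution",
}

def match_distribution(name: str) -> str:
    """Normalize a user-supplied name and look it up in the registry."""
    key = name.strip().lower().replace("-", "").replace("_", "")
    if key in registry:
        return registry[key]
    raise ValueError(f"No distribution matching {name!r}")

assert match_distribution("log_normal") == "LogNormalDistribution"
assert match_distribution(" Uniform ") == "UniformDistribution"
```

This way "lognormal", "log-normal", and "log_normal" all resolve to the same distribution.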
When doing MetaDataset.from_json(file_path), the variable descriptions seem to get left out of the imported dataset.
also in tests
When running MetaDataset.from_json() I get the following deprecation warning:
C:\Users\erikj\AppData\Roaming\Python\Python311\site-packages\metasynth\dataset.py:234: DeprecationWarning: read_text is deprecated. Use files() instead. Refer to https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy for migration advice.
  schema = json.loads(read_text("metasynth.schema", "generative_metadata_format.json"))
C:\Program Files\Python311\Lib\importlib\resources\_legacy.py:80: DeprecationWarning: open_text is deprecated. Use files() instead. Refer to https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy for migration advice.
  with open_text(package, resource, encoding, errors) as fp
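The migration suggested by the warning is `importlib.resources.files()` (Python 3.9+). For metasyn the call would presumably become something like the commented version below; the runnable demonstration uses a stdlib package so the snippet works anywhere:

```python
# For metasyn, the deprecated call would become (sketch):
#
#   from importlib.resources import files
#   schema = json.loads(
#       files("metasynth.schema")
#       .joinpath("generative_metadata_format.json")
#       .read_text(encoding="utf-8")
#   )
#
# Demonstrated on a stdlib package resource:
from importlib.resources import files

text = files("encodings").joinpath("aliases.py").read_text(encoding="utf-8")
assert "aliases" in text
```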
Would be nice to have a prettier print method for MetaDataset!
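A minimal idea for a prettier string representation (the attribute names n_rows/variables are assumptions about the internals, not the real MetaDataset API):

```python
class MetaDataset:
    def __init__(self, variables, n_rows):
        self.variables = variables  # list of (name, var_type) pairs
        self.n_rows = n_rows

    def __str__(self):
        header = f"MetaDataset ({self.n_rows} rows, {len(self.variables)} variables)"
        rows = [f"  - {name}: {var_type}" for name, var_type in self.variables]
        return "\n".join([header, *rows])

print(MetaDataset([("PassengerId", "discrete"), ("Name", "string")], 891))
```

One header line plus one indented line per variable keeps the output readable even for wide datasets.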