sodascience / metasyn
Transparent generation of synthetic tabular data with privacy guarantees
Home Page: https://metasyn.readthedocs.io
License: MIT License
Some thought needs to be put into how this interacts with the plugin system for privacy packages.
Currently, we infer whether a given variable is unique or not. This should be a user choice instead.
Add provenance, timestamp
bla_distribution = RegexDistribution([(r"[ABCDEF]",1),(r"[0-9]",3)])
A dataframe read from CSV has several empty columns. When generating a meta dataset from the df, all empty columns get an empty name. The column names are imported correctly, since I can address them when passing a spec to MetaDataset.from_dataframe(), yet they fail to show up in the new meta dataset.
When calling synthesize(), a variable of type int64 with prop_missing > 0 (details below) sometimes gives:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

presumably on the missing fields (this doesn't happen with prop_missing=0).

{
  "name": "Last page",
  "type": "discrete",
  "dtype": "int64",
  "prop_missing": 0.1,
  "distribution": {
    "name": "DiscreteUniformDistribution",
    "parameters": { "low": 6, "high": 7 }
  }
},
Options:
generate a schema so that correctness can be checked based on the serialized metadata only
Date now has begin_time and end_time as parameters. Let's just make them start and end? It's a bit confusing otherwise for dates.
RegexDistribution(r"[ABCDEF]\d{2,3,5}") generates

ValueError: Failed to determine regex from '{2,3,5}'

while RegexDistribution([(r"[ABCDEF]",1),(r"\d{2,3,5}",1)]) passes without errors and is transformed to [('[ABCDEF]', 1), ('\d{1,1}', 1)]. How should we handle this?
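For reference, `{2,3,5}` is not valid repetition-quantifier syntax; Python's re module falls back to treating the braces as literal characters rather than raising, which may be why the list form slips through:

```python
import re

# {2,3,5} is not a valid quantifier, so re treats the braces literally:
assert re.fullmatch(r"\d{2,3,5}", "7{2,3,5}") is not None
assert re.fullmatch(r"\d{2,3,5}", "735") is None

# A well-formed range quantifier behaves as expected:
assert re.fullmatch(r"\d{2,3}", "735") is not None
```

So rejecting the pattern with a clear error (as the string form does) is arguably the more consistent behavior.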
MetaDataset.from_dataframe(df, primary_key = "person_id")
We only need it for type checking, so we should make pandas optional.
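A common pattern for making pandas a type-checking-only dependency is `typing.TYPE_CHECKING` plus duck typing at runtime (a sketch, not the actual metasynth code; `column_names` is a hypothetical helper):

```python
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    import pandas as pd  # imported by type checkers only, never at runtime

def column_names(df: "pd.DataFrame") -> List[str]:
    """Duck-typed: works for any object exposing a .columns attribute."""
    return list(df.columns)
```

At runtime this never imports pandas, so the package stays importable without it.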
We need to think about detecting whether a string is structured (e.g., an ID) or unstructured (free text). Generation of these then needs to be done with smart underlying distributions, e.g., a regex for structured strings:

\w{4}\d{4}\s\w{2,5}

matches

sdfa1231 ll
ewfa1444 lb
ewfa1444 lbb66
to be continued...
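The example above checks out with Python's re (a quick verification, not metasynth code):

```python
import re

pattern = re.compile(r"\w{4}\d{4}\s\w{2,5}")

# All three structured examples from the issue match:
for structured in ["sdfa1231 ll", "ewfa1444 lb", "ewfa1444 lbb66"]:
    assert pattern.fullmatch(structured)

# Free text does not fit the structured pattern:
assert pattern.fullmatch("some free text") is None
```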
Instead of
MetaSynth.from_dataframe(df, unique={"PassengerId": True}, distribution={"PassengerId": UniqueKeyDistribution})
flip it so that we have a single column-spec argument:
generation_spec = {
"PassengerId": {"unique": True, "distribution": UniqueKeyDistribution}
}
MetaSynth.from_dataframe(df, spec=generation_spec)
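A sketch of how the proposed spec argument could be resolved per column (the default keys and `resolve_spec` helper are assumptions for illustration):

```python
def resolve_spec(columns, spec):
    """Merge per-column overrides from `spec` with defaults."""
    defaults = {"unique": None, "distribution": None}
    return {col: {**defaults, **spec.get(col, {})} for col in columns}

generation_spec = {"PassengerId": {"unique": True}}
resolved = resolve_spec(["PassengerId", "Age"], generation_spec)
# resolved["PassengerId"] -> {"unique": True, "distribution": None}
```

Grouping all per-column options under one argument also leaves room for future keys without growing the from_dataframe signature.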
RegexDistribution([(r"[ABCDEF]",2),(r"\d",3)])
in a smart way; this needs to be accessible with many distributions.
{'name': 'all_NA', 'description': None, 'type': 'string', 'dtype': 'Utf8', 'prop_missing': 1.0, 'distribution': '\\d{3,4}'}
One thing that I feel is not quite satisfactory is the multiple-inheritance structure in some places of the code. I have read some articles, and perhaps we can do better with decorators.
def distribution(implements=None, provenance=None, var_type=None, is_unique=None):
    """Class decorator that attaches distribution metadata as class attributes."""
    def _wrap(cls):
        if implements is not None:
            cls.implements = implements
        # if provenance is not None:
        #     cls.provenance = provenance
        if var_type is not None:
            cls.var_type = var_type
        if is_unique is not None:
            cls.is_unique = is_unique
        return cls
    return _wrap
This would be the decorator that you would put in front of the new implementation:
import metasynth as ms
@ms.distribution(implements="core.uniform", var_type="continuous", is_unique=False)
class UniformDistribution(BaseDistribution):
    ...
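A self-contained check of the pattern (BaseDistribution is stubbed in here, since this is only a sketch of the proposal):

```python
def distribution(implements=None, var_type=None, is_unique=None):
    """Class decorator attaching distribution metadata as class attributes."""
    def _wrap(cls):
        if implements is not None:
            cls.implements = implements
        if var_type is not None:
            cls.var_type = var_type
        if is_unique is not None:
            cls.is_unique = is_unique
        return cls
    return _wrap

class BaseDistribution:  # stub standing in for the real base class
    pass

@distribution(implements="core.uniform", var_type="continuous", is_unique=False)
class UniformDistribution(BaseDistribution):
    pass

assert UniformDistribution.implements == "core.uniform"
assert UniformDistribution.var_type == "continuous"
assert UniformDistribution.is_unique is False
```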
Advantages: no more need for the ContinuousDistribution, BaseDisclosure, etc. classes.
Disadvantages:
I feel we should refactor our categorical (multinoulli) distribution to be more in line with common practice: a CategoricalDistribution under DiscreteDistribution (we should get rid of the existing CategoricalDistribution at this level), with labels (what is now called categories) and probabilities (what is now counts). I'm happy to discuss the implications here, also of making both Category and integer-valued columns DiscreteDistribution.
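A hedged sketch of the proposed shape, with a constructor from the current counts; only the names labels/probabilities come from the proposal, the class bodies are assumptions:

```python
import random

class DiscreteDistribution:
    """Stub base class for the sketch."""

class CategoricalDistribution(DiscreteDistribution):
    def __init__(self, labels, probabilities):
        assert abs(sum(probabilities) - 1.0) < 1e-9, "probabilities must sum to 1"
        self.labels = list(labels)
        self.probabilities = list(probabilities)

    @classmethod
    def from_counts(cls, labels, counts):
        """Bridge from the current representation: normalize counts."""
        total = sum(counts)
        return cls(labels, [c / total for c in counts])

    def draw(self):
        return random.choices(self.labels, weights=self.probabilities, k=1)[0]

dist = CategoricalDistribution.from_counts(["a", "b"], [3, 1])
# dist.probabilities -> [0.75, 0.25]
```

Storing probabilities rather than counts matches the usual parameterization of the multinoulli distribution and avoids leaking raw frequencies into the metadata.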
Something weird is happening with the from_dataframe method in MetaDataset: the start/end parameters of the distributions have inconsistent types.
We should not have multiple columns with the same name. Let's find ways of enforcing this.
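Enforcing this could be as simple as a duplicate check at construction time (a sketch; `duplicate_columns` is a hypothetical helper):

```python
from collections import Counter

def duplicate_columns(columns):
    """Return column names that occur more than once, in first-seen order."""
    return [name for name, count in Counter(columns).items() if count > 1]

assert duplicate_columns(["id", "age", "id"]) == ["id"]
assert duplicate_columns(["id", "age"]) == []
```

Raising early with the offending names gives a much clearer error than whatever fails downstream.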
If the original data has no microseconds, we currently still generate microseconds in the synthesized data.
The package is called metasynth, but the repo is called meta-synth. Let's keep metasynth?
It would be nice to be able to constrain the distributions that are being chosen for specific variables.
e.g., for income variable if I know it's lognormal, only try lognormal
var = MetaVar.detect(my_series)
var.fit(dist="lognormal")  # it would be nice to have smart matching of the name here
MetaDataset.from_dataframe(df, dist = {"income": "lognormal"})
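"Smart matching" of the name could normalize user input before the registry lookup; a sketch (the registry contents and `match_distribution` helper are made up for illustration):

```python
registry = {
    "lognormal": "LogNormalDistribution",
    "normal": "NormalDistribution",
    "uniform": "UniformDistribution",
}

def match_distribution(name: str) -> str:
    """Normalize a user-supplied name and look it up in the registry."""
    key = name.strip().lower().replace("-", "").replace("_", "")
    if key in registry:
        return registry[key]
    raise ValueError(f"No distribution matching {name!r}")

assert match_distribution("log_normal") == "LogNormalDistribution"
assert match_distribution(" Uniform ") == "UniformDistribution"
```

This way "lognormal", "log-normal", and "log_normal" all resolve to the same distribution.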
When doing MetaDataset.from_json(file_path), the variable descriptions seem to get left out of the imported dataset.
also in tests
When running MetaDataset.from_json() I get the following deprecation warning:
C:\Users\erikj\AppData\Roaming\Python\Python311\site-packages\metasynth\dataset.py:234: DeprecationWarning: read_text is deprecated. Use files() instead. Refer to https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy for migration advice.
  schema = json.loads(read_text("metasynth.schema", "generative_metadata_format.json"))
C:\Program Files\Python311\Lib\importlib\resources\_legacy.py:80: DeprecationWarning: open_text is deprecated. Use files() instead. Refer to https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy for migration advice.
  with open_text(package, resource, encoding, errors) as fp
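The migration suggested by the warning is `importlib.resources.files()` (Python 3.9+). For metasyn the call would presumably become something like the commented version below; the runnable demonstration uses a stdlib package so the snippet works anywhere:

```python
# For metasyn, the deprecated call would become (sketch):
#
#   from importlib.resources import files
#   schema = json.loads(
#       files("metasynth.schema")
#       .joinpath("generative_metadata_format.json")
#       .read_text(encoding="utf-8")
#   )
#
# Demonstrated on a stdlib package resource:
from importlib.resources import files

text = files("encodings").joinpath("aliases.py").read_text(encoding="utf-8")
assert "aliases" in text
```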
Would be nice to have a prettier print method for MetaDataset!
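A minimal idea for a prettier string representation (the attribute names n_rows/variables are assumptions about the internals, not the real MetaDataset API):

```python
class MetaDataset:
    def __init__(self, variables, n_rows):
        self.variables = variables  # list of (name, var_type) pairs
        self.n_rows = n_rows

    def __str__(self):
        header = f"MetaDataset ({self.n_rows} rows, {len(self.variables)} variables)"
        rows = [f"  - {name}: {var_type}" for name, var_type in self.variables]
        return "\n".join([header, *rows])

print(MetaDataset([("PassengerId", "discrete"), ("Name", "string")], 891))
```

One header line plus one indented line per variable keeps the output readable even for wide datasets.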