metasyn's People

Contributors

qubixes, samuwhale, vankesteren

metasyn's Issues

Do not automatically infer uniqueness

Currently, we infer automatically whether a variable is unique. This should be a user choice.

  • Do not create unique variables automatically
  • (optional) Display message for detected potential key / id / unique variables
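The optional second point could look roughly like the sketch below. It assumes columns are available as plain Python lists with None for missing values, and `report_unique_candidates` is a hypothetical helper name, not an existing function in the package. It only reports candidates and leaves the decision to the user.

```python
import warnings


def report_unique_candidates(columns):
    """Return names of columns whose non-missing values are all distinct.

    `columns` maps a column name to a list of values (None = missing).
    Hypothetical helper: it only warns, it never marks a column as unique.
    """
    candidates = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        # An entirely empty column is not a meaningful key candidate.
        if present and len(set(present)) == len(present):
            candidates.append(name)
            warnings.warn(
                f"Column '{name}' looks like a key/ID; "
                "set unique=True explicitly if that is intended."
            )
    return candidates


data = {"PassengerId": [1, 2, 3, 4], "Age": [22, 38, 22, None]}
candidates = report_unique_candidates(data)
```

Here only `PassengerId` is reported, because `Age` contains a repeated value.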

Empty columns lose name in meta dataset

A dataframe read from CSV has several empty columns. When generating a meta dataset from the dataframe, all empty columns get an empty name. The column names are imported, since I can address them when passing a spec to MetaDataset.from_dataframe(), yet they do not show up in the new meta dataset.

int64 with prop_missing > 0 gives error on empty values when synthesizing

When calling synthesize(), a variable of type int64 with prop_missing > 0 (details below) sometimes raises:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

presumably on the missing fields (doesn't happen with prop_missing=0).

{
  "name": "Last page",
  "type": "discrete",
  "dtype": "int64",
  "prop_missing": 0.1,
  "distribution": {
    "name": "DiscreteUniformDistribution",
    "parameters": {
      "low": 6,
      "high": 7
    }
  }
}
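One possible guard is sketched below. It assumes that missing values are represented as None before the integer conversion and that `high` is inclusive; both are assumptions, and `draw_with_missing` is a hypothetical function, not the package's implementation. The point is only that int() is never called on a missing value.

```python
import random


def draw_with_missing(low, high, prop_missing, n, seed=0):
    """Draw n integers in [low, high], leaving a share of them missing.

    Hypothetical sketch: missing values stay None and are never passed
    to int(), which is what raises the TypeError for NoneType.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        if rng.random() < prop_missing:
            values.append(None)  # keep the missing marker as-is
        else:
            values.append(int(rng.randint(low, high)))  # inclusive bounds (assumed)
    return values


sample = draw_with_missing(low=6, high=7, prop_missing=0.1, n=1000)
```

Note that an int64 column cannot hold None directly, so a real fix would also have to choose a nullable integer representation for the output.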

schema

Generate a schema so that correctness can be checked from the serialized metadata alone.

Structured vs. unstructured StringVar

We need to think about detecting whether a string is structured (e.g., an ID) or unstructured (free text). Generation of these then needs to be done with smart underlying distributions.

e.g., regex for structured

\w{4}\d{4}\s\w{2,5}
matches

sdfa1231 ll
ewfa1444 lb
ewfa1444 lbb66
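The regex above can be checked with Python's standard `re` module. This is only a quick verification sketch; the free-text counterexample is added for illustration and is not from the issue.

```python
import re

# Pattern from the issue: 4 word chars, 4 digits, whitespace, 2-5 word chars.
STRUCTURED = re.compile(r"\w{4}\d{4}\s\w{2,5}")

examples = ["sdfa1231 ll", "ewfa1444 lb", "ewfa1444 lbb66", "free text here"]

# fullmatch requires the whole string to follow the pattern.
results = {s: bool(STRUCTURED.fullmatch(s)) for s in examples}
```

The three examples from the issue match in full, while the free-text string does not.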

Column specification in user interface

Instead of

MetaSynth.from_dataframe(df, unique={"PassengerId": True}, distribution={"PassengerId": UniqueKeyDistribution})

flip it so that we have a single column-specification (spec) argument:

generation_spec = {
  "PassengerId": {"unique": True, "distribution": UniqueKeyDistribution}
}

MetaSynth.from_dataframe(df, spec=generation_spec)
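To illustrate the flip, the sketch below converts the old per-option dictionaries into the proposed per-column layout. `merge_column_options` is a hypothetical helper, and the distribution is given as a string placeholder so the example runs without the package.

```python
def merge_column_options(**per_option):
    """Turn {option: {column: value}} into {column: {option: value}}."""
    spec = {}
    for option, by_column in per_option.items():
        for column, value in by_column.items():
            # Create the per-column entry on first use, then add the option.
            spec.setdefault(column, {})[option] = value
    return spec


generation_spec = merge_column_options(
    unique={"PassengerId": True},
    distribution={"PassengerId": "UniqueKeyDistribution"},  # placeholder for the class
)
```

A helper like this could also ease a transition period in which both call styles are accepted.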

Idea: remove some inheritance in favor of decorators

One thing that I feel is not quite satisfactory is the multiple inheritance structure in some places of the code. I have read some articles, and perhaps we can do it better with decorators.

def distribution(implements=None, provenance=None, var_type=None, is_unique=None):
    def _wrap(cls):
        if implements is not None:
            cls.implements = implements
#        if provenance is not None:
#            cls.provenance = provenance
        if var_type is not None:
            cls.var_type = var_type
        if is_unique is not None:
            cls.is_unique = is_unique
        return cls
    return _wrap

This would be the decorator that you would put in front of the new implementation:

import metasynth as ms

@ms.distribution(implements="core.uniform", var_type="continuous", is_unique=False)
class UniformDistribution(BaseDistribution):
    ...
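A self-contained version of the idea is sketched below, including the registration mentioned under the advantages. The `REGISTRY` dictionary and the stand-in `BaseDistribution` with default attributes are assumptions made for illustration; they are not the package's actual classes.

```python
REGISTRY = {}  # hypothetical: maps "implements" names to classes


def distribution(implements=None, var_type=None, is_unique=None):
    """Class decorator that sets metadata attributes and registers the class."""
    def _wrap(cls):
        options = {"implements": implements, "var_type": var_type, "is_unique": is_unique}
        for attr, value in options.items():
            if value is not None:
                setattr(cls, attr, value)  # set on the subclass only
        if implements is not None:
            REGISTRY[implements] = cls
        return cls
    return _wrap


class BaseDistribution:
    """Stand-in base class with defaults (assumed for this sketch)."""
    implements = "unknown"
    var_type = "unknown"
    is_unique = False


@distribution(implements="core.uniform", var_type="continuous")
class UniformDistribution(BaseDistribution):
    def __init__(self, low, high):
        self.low = low
        self.high = high
```

Because the attributes are set on the decorated class, the defaults on `BaseDistribution` stay untouched, and options that are left out (here `is_unique`) fall back to the inherited value.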

Advantages:

  • No multiple inheritance, where the order of the classes matters.
  • Can remove the ContinuousDistribution, BaseDisclosure, etc. classes.
  • Possibly easier to understand for developers (perhaps not for beginners?)
  • More flexible? I could see a world where the decorator might also register distributions for example.
  • Inheritance becomes more optional.

Disadvantages:

  • Decorators might be something users don't know about.
  • Costs time to implement.

Rename + reparameterize categorical distribution

I feel we should refactor our categorical (multinoulli) distribution to be more in line with common practice.

https://github.com/sodascience/meta-synth/blob/4c895cd926217d35532b871117047d9a8affdffa/metasynth/distribution/categorical.py#L12-L20

Changes to be made:

  • it should be called CategoricalDistribution
  • it is theoretically a subclass of DiscreteDistribution (we should get rid of the existing CategoricalDistribution at this level)
  • its parameters are labels (what is now called categories) and probabilities (what is now counts).
  • This should also go in the JSON schema

I'm happy to discuss here the implications, also of making both Category and integer-valued columns DiscreteDistribution.
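As a sketch of the reparameterization, the function below converts counts into labels and probabilities. It assumes the current counts are available as a mapping from category to count, which may not match the existing attribute layout; `to_labels_and_probabilities` is a hypothetical name.

```python
def to_labels_and_probabilities(counts):
    """Convert {category: count} into parallel lists of labels and probabilities."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    labels = list(counts)
    probabilities = [counts[label] / total for label in labels]
    return labels, probabilities


labels, probabilities = to_labels_and_probabilities({"a": 2, "b": 1, "c": 1})
```

With these inputs the labels are a, b, c with probabilities 0.5, 0.25 and 0.25.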

Investigate some issues with datetime

Something weird is happening with the from_dataframe method in MetaDataset: the start and end parameters of the datetime distributions are typed inconsistently.

rename repository?

package is called metasynth
repo is called meta-synth

let's keep metasynth?

Interface for selecting specific distributions

It would be nice to be able to constrain the distributions that are being chosen for specific variables.

E.g., if I know that an income variable is lognormal, only try lognormal:

var = MetaVar.detect(my_series)
var.fit(dist="lognormal")  # smart matching of distribution names would be nice

MetaDataset.from_dataframe(df, dist={"income": "lognormal"})
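One way the name matching could work is sketched below. The list of available distribution names and the function `match_distribution` are hypothetical; the sketch simply normalizes names (case, underscores, an optional "Distribution" suffix) and requires exactly one match.

```python
# Hypothetical class names, used only for this illustration.
AVAILABLE = ["UniformDistribution", "NormalDistribution", "LogNormalDistribution"]


def _normalize(name):
    """Lowercase, drop separators, and strip a trailing 'distribution'."""
    key = name.lower().replace("_", "").replace(" ", "")
    suffix = "distribution"
    return key[: -len(suffix)] if key.endswith(suffix) else key


def match_distribution(requested, available=AVAILABLE):
    """Return the single available name matching the user's string."""
    wanted = _normalize(requested)
    matches = [name for name in available if _normalize(name) == wanted]
    if len(matches) != 1:
        raise ValueError(f"no unique distribution matches {requested!r}")
    return matches[0]


chosen = match_distribution("lognormal")
```

Matching on the full normalized name avoids "normal" accidentally selecting the lognormal distribution.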

Deprecation warning for from_json

When running MetaDataset.from_json() I get the following deprecation warning:

C:\Users\erikj\AppData\Roaming\Python\Python311\site-packages\metasynth\dataset.py:234: DeprecationWarning: read_text is deprecated. Use files() instead. Refer to https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy for migration advice.
  schema = json.loads(read_text("metasynth.schema", "generative_metadata_format.json"))
c:\Program Files\Python311\Lib\importlib\resources\_legacy.py:80: DeprecationWarning: open_text is deprecated. Use files() instead. Refer to https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy for migration advice.
  with open_text(package, resource, encoding, errors) as fp
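The warning points to `importlib.resources.files()` (available in the standard library since Python 3.9) as the replacement. A possible migration is sketched below; `load_packaged_json` is a hypothetical wrapper, and the package and resource names in the comment are taken from the traceback above.

```python
import json
from importlib.resources import files


def load_packaged_json(package, resource):
    """Read and parse a JSON resource shipped inside an importable package."""
    text = files(package).joinpath(resource).read_text(encoding="utf-8")
    return json.loads(text)


# In dataset.py, the deprecated read_text call could presumably become:
# schema = load_packaged_json("metasynth.schema", "generative_metadata_format.json")
```

If older Python versions must be supported, the `importlib_resources` backport offers the same `files()` interface.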
