TGAN: A Tabular Data Synthesizer

TGAN is a tabular data synthesizer. It generates fully synthetic data from real data. Currently, TGAN can generate numerical columns and categorical columns. This software can also run a random search over TGAN hyperparameters on multiple datasets using multiple GPUs.

Citation

If you use TGAN, please cite the following work:

Lei Xu, Kalyan Veeramachaneni. 2018. Synthesizing Tabular Data using Generative Adversarial Networks.

@article{xu2018synthesizing,
  title={Synthesizing Tabular Data using Generative Adversarial Networks},
  author={Xu, Lei and Veeramachaneni, Kalyan},
  journal={arXiv preprint arXiv:1811.11264},
  year={2018}
}

Quick Start

Requirements

  • pandas
  • numpy
  • scikit-learn
  • tensorflow-gpu
  • tensorpack
> pip3 install pandas numpy scikit-learn tensorflow-gpu tensorpack

Run Demo

This demo shows how to generate synthetic versions of the census and covertype datasets. The generated synthetic datasets are stored in expdir/census and expdir/covertype, while the GAN models are stored in train_log.

> # Download data
> mkdir data
> wget -O data/census-train.csv https://s3.amazonaws.com/hdi-demos/tgan-demo/census-train.csv
> wget -O data/covertype-train.csv https://s3.amazonaws.com/hdi-demos/tgan-demo/covertype-train.csv
> python3 src/launcher.py demo_config.json

This demo takes around 20 hours on our server, which has two GTX 1080 GPUs.

How It Works

Datasets

The input to this software is a CSV file and a JSON config.

  • The CSV file should not have a header or an index. It should contain only numerical columns and categorical columns, and it should not contain any missing values.
  • The JSON file specifies a list of experiments. Each experiment includes
    • name: the name of the experiment. A folder with this name is created.
    • num_random_search: the number of iterations of random hyperparameter search.
    • train_csv: the path to the training CSV file.
    • continuous_cols: a list of the indexes of the numerical columns. (Indexes start from 0.)
    • epoch: the number of epochs to train the model.
    • steps_per_epoch: the number of optimization steps in each epoch.
    • output_epoch: the number of stored models to evaluate. It must satisfy output_epoch <= min(epoch, 5).
    • sample_rows: the number of rows the synthesizer generates during evaluation.

Example JSON

[{
    "name": "census",
    "num_random_search": 10,
    "train_csv": "data/census-train.csv",
    "continuous_cols": [0, 5, 16, 17, 18, 29, 38],
    "epoch": 5,
    "steps_per_epoch": 10000,
    "output_epoch": 3,
    "sample_rows": 10000
}, ...]
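
Before launching a long run, it can help to sanity-check a config against the field list above. The following is a minimal sketch, not part of TGAN: the helper names validate_experiment and validate_config_text are hypothetical, and the checks cover only the rules stated in this README (all eight fields present, output_epoch <= min(epoch, 5), non-negative column indexes).

```python
import json

# Field names taken from the list in this README.
REQUIRED_KEYS = {
    "name", "num_random_search", "train_csv", "continuous_cols",
    "epoch", "steps_per_epoch", "output_epoch", "sample_rows",
}


def validate_experiment(exp):
    """Return a list of problems found in one experiment dict (empty if none)."""
    missing = REQUIRED_KEYS - exp.keys()
    if missing:
        return ["missing keys: " + ", ".join(sorted(missing))]
    problems = []
    # Constraint stated in the README: output_epoch <= min(epoch, 5).
    if exp["output_epoch"] > min(exp["epoch"], 5):
        problems.append("output_epoch must be <= min(epoch, 5)")
    # Column indexes start from 0, so negative or non-integer values are suspicious.
    if any(not isinstance(i, int) or i < 0 for i in exp["continuous_cols"]):
        problems.append("continuous_cols must hold non-negative integer indexes")
    return problems


def validate_config_text(text):
    """Parse a JSON config and map each experiment name to its problems."""
    experiments = json.loads(text)
    return {exp.get("name", "<unnamed>"): validate_experiment(exp)
            for exp in experiments}


example = """[{
    "name": "census",
    "num_random_search": 10,
    "train_csv": "data/census-train.csv",
    "continuous_cols": [0, 5, 16, 17, 18, 29, 38],
    "epoch": 5,
    "steps_per_epoch": 10000,
    "output_epoch": 3,
    "sample_rows": 10000
}]"""

report = validate_config_text(example)
print(report)
```

An empty problem list for every experiment means the config passes these basic checks. It does not guarantee that the launcher accepts the file.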

Training and Automated Evaluation

Split Training Data

We split the training data into two parts: expdir/{name}/data_I.csv contains 80% of the data and expdir/{name}/data_II.csv contains the remaining 20%. We use data_I to train the GAN and data_II to evaluate the model.
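
The 80/20 split can be sketched as follows. This is an illustration only: the README does not say whether the rows are shuffled or how any shuffling is seeded, so the shuffle and the seed below are assumptions, and split_rows is a hypothetical helper rather than a function from this repository.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and split it into (data_I, data_II)."""
    shuffled = list(rows)
    # Assumption: rows are shuffled with a fixed seed before splitting.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten toy rows standing in for the lines of a training CSV file.
rows = [[i, i % 3] for i in range(10)]
data_I, data_II = split_rows(rows)
print(len(data_I), len(data_II))
```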

Random Search

All tunable hyperparameters are listed in tunable_variables (src/TGAN_synthesize.py:23). In each iteration of random search, we randomly select a value for each tunable variable. We then train the model and generate synthetic data using the last output_epoch stored models.
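
One iteration of random search, as described above, draws one value per tunable variable. The sketch below shows the idea under stated assumptions: the variable names and candidate values in search_space are made up for illustration (the real list is tunable_variables in src/TGAN_synthesize.py), and uniform sampling from a list of candidates is an assumption about how values are drawn.

```python
import random

# Hypothetical search space; the real one is tunable_variables in
# src/TGAN_synthesize.py and may use different names and values.
search_space = {
    "batch_size": [50, 100, 200],
    "learning_rate": [0.0002, 0.0005, 0.001],
    "num_layers": [1, 2, 3],
}


def sample_params(space, rng):
    """Pick one candidate value uniformly at random for every variable."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)
num_random_search = 10  # same meaning as the config field of that name
trials = [sample_params(search_space, rng) for _ in range(num_random_search)]
for iteration, params in enumerate(trials):
    print(iteration, params)
```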

Evaluation

We train a decision tree classifier (with max depth 20) on the synthetic dataset and compute the accuracy of that model on data_II. Users can pick the best hyperparameters for a dataset by reading expdir/{name}/exp-result.csv.
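
Picking the best run from exp-result.csv amounts to finding the largest accuracy in the table. The sketch below assumes the file has no header and no index column, with one row per random search iteration and one column per evaluated model. That layout is an assumption, best_run is a hypothetical helper, and the sample numbers are invented.

```python
import csv
import io


def best_run(csv_text):
    """Return (accuracy, row_index, column_index) of the highest accuracy."""
    best = None
    for row_index, row in enumerate(csv.reader(io.StringIO(csv_text))):
        for column_index, cell in enumerate(row):
            accuracy = float(cell)
            if best is None or accuracy > best[0]:
                best = (accuracy, row_index, column_index)
    return best


# Invented accuracies: 2 search iterations, 3 evaluated models each.
sample = "0.71,0.74,0.73\n0.69,0.78,0.77\n"
accuracy, iteration, column = best_run(sample)
print(accuracy, iteration, column)
```

The row index identifies the random search iteration, whose hyperparameters are recorded in exp-params.json.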

Outputs

expdir/{name}

  • data_I.csv, data_II.csv: the split data.
  • exp-params.json: the hyperparameters selected in random search.
  • exp-result.csv: has num_random_search rows and output_epoch columns, showing the classification accuracy of the default classifier trained on the synthetic data and tested on data_II.
  • train.npz: data_I.csv converted to an npz file.
  • synthetic{iter}_{epoch}.csv/npz: the synthetic data. iter is the random search iteration id and epoch is the training epoch.

train_log/TGAN_synthsizer:{name}-{iter}

  • This folder contains the training log and the last 5 models.

TODOs

  • Select evaluation metrics for different experiments, e.g. F1, accuracy, AUC, etc.
  • Select the default classifier for evaluation.
  • Support regression.
