DAGAN

DAGAN is a framework for adaptive data augmentation in supervised learning over missing data.

An example is shown in the following figure. Suppose that a hospital trains a classifier that predicts cardiovascular disease (cardio) for patients from a labeled source dataset Ds. Ds contains examination features, such as cholesterol (chol) and glucose (gluc); patient-reported features, such as smoking (smoke) and alcohol intake (alcohol); and patient demographics, such as age. Ds has missing values in the attributes smoke and alcohol, possibly because some patients do not want to report their habits. However, when the model is deployed in a production environment for prediction, the missing pattern of the unlabeled target data Dt may be different, as shown in Figure (b). Such a noise shift can have many causes. For example, the model may be deployed on another patient cohort, or even in another hospital, where patients have missing values in examination features rather than in smoking or alcohol habits. Not surprisingly, model performance often degrades significantly under this noise shift in the target data.
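The noise shift described above can be made visible by comparing per-column missing rates in the source and target data. The following is a minimal sketch in plain Python. The rows are invented toy values (not taken from the paper's datasets), `None` is assumed to mark a missing value, and `missing_rates` is a hypothetical helper, not part of the DAGAN code.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def missing_rates(rows: List[Row], columns: List[str]) -> Dict[str, float]:
    """Return, for each column, the fraction of rows where it is None."""
    n = len(rows)
    return {c: sum(r[c] is None for r in rows) / n for c in columns}


columns = ["chol", "gluc", "smoke", "alcohol", "age"]

# Toy source data (made up): gaps in the patient-reported features.
source = [
    {"chol": 1, "gluc": 1, "smoke": None, "alcohol": 0, "age": 52},
    {"chol": 2, "gluc": 1, "smoke": 0, "alcohol": None, "age": 61},
    {"chol": 1, "gluc": 3, "smoke": None, "alcohol": None, "age": 47},
    {"chol": 3, "gluc": 2, "smoke": 1, "alcohol": 0, "age": 58},
]

# Toy target data (made up): gaps in the examination features instead.
target = [
    {"chol": None, "gluc": 1, "smoke": 0, "alcohol": 0, "age": 49},
    {"chol": 2, "gluc": None, "smoke": 1, "alcohol": 0, "age": 63},
    {"chol": None, "gluc": None, "smoke": 0, "alcohol": 1, "age": 55},
    {"chol": 1, "gluc": 2, "smoke": 0, "alcohol": 0, "age": 44},
]

src_rates = missing_rates(source, columns)
tgt_rates = missing_rates(target, columns)
for c in columns:
    print(f"{c:8s} source={src_rates[c]:.2f} target={tgt_rates[c]:.2f}")
```

Columns whose rates differ sharply between the two datasets (here smoke and alcohol versus chol and gluc) are where a model trained on Ds is likely to meet unfamiliar missing patterns.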

[Figure]

As shown in the following figure, consider a binary classification problem that assigns data points to two groups, a true-group and a false-group. We have a labeled source Ds, where the true (resp. false) data points lie in the area of “+”, the green circles (resp. “−”, the pink triangles), and unlabeled target data Ut. The original model 𝑓 may produce both false positives (two red triangles) and false negatives (two light green circles). To tackle this problem, DAGAN extracts noise patterns from the target data Ut and adapts the source data with the extracted target noise patterns, while still preserving the supervision signals in the source. Retraining on the adapted data then yields a model 𝑓+ that serves the target better.

[Figure]
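To make the idea of "adapting the source with target noise patterns" concrete, here is a deliberately naive stand-in in plain Python. It is not DAGAN's method: DAGAN learns the adaptation with a generator network, whereas this sketch simply copies a randomly chosen target mask row onto each source row. The function name `transplant_masks`, the convention that `True` means "missing", and the toy values are all assumptions for illustration.

```python
import random
from typing import List, Optional, Sequence


def transplant_masks(
    source_rows: Sequence[Sequence[Optional[float]]],
    target_masks: Sequence[Sequence[bool]],
    seed: int = 0,
) -> List[List[Optional[float]]]:
    """Blank out source features using randomly sampled target mask rows.

    A mask entry of True means "missing" (an assumed convention).
    Labels are kept in a separate list and are never masked, so the
    supervision signal of the source data is preserved.
    """
    rng = random.Random(seed)
    adapted = []
    for row in source_rows:
        mask = rng.choice(target_masks)
        adapted.append([None if m else v for v, m in zip(row, mask)])
    return adapted


# Toy, fully observed source features (chol, gluc, smoke) and labels.
source_rows = [[1, 1, 0], [2, 1, 1], [3, 2, 0], [1, 3, 1]]
source_labels = [0, 1, 1, 0]

# Toy target masks: examination features (first two columns) go missing.
target_masks = [
    [True, False, False],
    [False, True, False],
    [True, True, False],
]

adapted_rows = transplant_masks(source_rows, target_masks, seed=0)
for features, label in zip(adapted_rows, source_labels):
    print(features, "->", label)
```

A classifier retrained on `adapted_rows` with the unchanged `source_labels` would at least see the target's missing patterns during training, which is the intuition behind retraining 𝑓 into 𝑓+.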

Paper and Data

For more details, please refer to our paper Adaptive Data Augmentation for Supervised Learning over Missing Data. Public datasets used in the paper can be downloaded from the datasets page.

Quick Start

Step1: Requirements

Before running the code, make sure your Python version is above 3.6. Then install the necessary packages with:

pip3 install -r requirements.txt

Step2: Parameters

You need to write a .json file as the configuration. It should include the following keys:

  • name: required, name of the output file
  • source: required, path of the source data
  • target: required, path of the target data
  • target_mask: required, path of the target mask matrix
  • gen_model: required, neural network of the generator
  • normalize_cols: required, indices of the numerical attributes normalized by simple-normalization
  • gmm_cols: required, indices of the numerical attributes normalized by GMM-normalization
  • one-hot_cols: required, indices of the categorical attributes encoded by one-hot encoding
  • ordinal_cols: required, indices of the categorical attributes encoded by ordinal encoding
  • epochs: required, number of training epochs
  • steps_per_epoch: required, steps per epoch
  • rand_search: required, whether to search hyper-parameters randomly, yes or no
  • param: required if rand_search is 'no', hyper-parameter of the neural network
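The key list above can be checked mechanically before training. The sketch below builds a configuration as a Python dictionary, verifies that the required keys are present, and prints it as JSON. Only the key names come from this README. Every value is a placeholder, and the value types (lists of integer column indices, a string for gen_model) are guesses; the files in "code/params" are the authority on the real format. `missing_keys` is a hypothetical helper, not part of the DAGAN code.

```python
import json

# Keys listed as "required" in this README.
REQUIRED_KEYS = [
    "name", "source", "target", "target_mask", "gen_model",
    "normalize_cols", "gmm_cols", "one-hot_cols", "ordinal_cols",
    "epochs", "steps_per_epoch", "rand_search",
]


def missing_keys(config: dict) -> list:
    """Return the required keys that are absent from the configuration."""
    missing = [k for k in REQUIRED_KEYS if k not in config]
    # "param" is only required when random search is switched off.
    if config.get("rand_search") == "no" and "param" not in config:
        missing.append("param")
    return missing


# All values below are hypothetical placeholders.
config = {
    "name": "my-experiment",
    "source": "dataset/my_source.csv",
    "target": "dataset/my_target.csv",
    "target_mask": "dataset/my_target_mask.csv",
    "gen_model": "LSTM",      # assumed value; check code/params
    "normalize_cols": [0],    # assumed to be a list of column indices
    "gmm_cols": [1, 2],
    "one-hot_cols": [3],
    "ordinal_cols": [4],
    "epochs": 10,
    "steps_per_epoch": 100,
    "rand_search": "yes",     # with "yes", "param" can be omitted
}

problems = missing_keys(config)
if problems:
    raise ValueError(f"missing configuration keys: {problems}")
print(json.dumps(config, indent=2))
```

Setting rand_search to "no" without adding a param entry makes `missing_keys` report "param", mirroring the rule in the last bullet.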

The folder "code/params" contains examples. You can run the code with those parameter files directly, or write your own parameter file to train on a new dataset.

Note that the parameters in "code/params" were tuned on a GeForce RTX 3090 with CUDA version 11.1.

Step3: Run

Run the code with the following command:

python code/train.py [parameter file] [gpu_id]

An example command:

python code/train.py code/params/param-eyestate-MNAR 0

Step4: Evaluation

Run the evaluation with the following command:

python code/evaluate.py --train=[training file path] --test=[test file] \
                        --label_col=[name of label column] --output=[output filename] \
                        --device=[gpu id]

An example command:

python code/evaluate.py --train=expdir/ipums-LSTM/ --test=dataset/ipums/ipums_test.csv \
                        --label_col=movedin --output=ipums_result \
                        --device=0
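When evaluating several experiments, it can be convenient to assemble the command from Python. The sketch below only builds and prints the argument list; it uses the five flags shown above and nothing else. `build_evaluate_command` is a hypothetical helper, not part of the DAGAN code, and the script path assumes you run it from the repository root.

```python
import shlex
from typing import List


def build_evaluate_command(
    train: str, test: str, label_col: str, output: str, device: int
) -> List[str]:
    """Assemble the evaluate.py command line as an argument list."""
    return [
        "python",
        "code/evaluate.py",
        f"--train={train}",
        f"--test={test}",
        f"--label_col={label_col}",
        f"--output={output}",
        f"--device={device}",
    ]


cmd = build_evaluate_command(
    train="expdir/ipums-LSTM/",
    test="dataset/ipums/ipums_test.csv",
    label_col="movedin",
    output="ipums_result",
    device=0,
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))

# To actually run it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.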

The Team

DAGAN was developed by Renmin University of China PhD student Tongyu Liu and graduate student Yinqing Luo, under the supervision of Professor Ju Fan and Professor Xiaoyong Du.
