
gain's Introduction

Codebase for "Generative Adversarial Imputation Networks (GAIN)"

Authors: Jinsung Yoon, James Jordon, Mihaela van der Schaar

Paper: Jinsung Yoon, James Jordon, Mihaela van der Schaar, "GAIN: Missing Data Imputation using Generative Adversarial Nets," International Conference on Machine Learning (ICML), 2018.

Paper Link: http://proceedings.mlr.press/v80/yoon18a/yoon18a.pdf

Contact: [email protected]

This directory contains an implementation of the GAIN framework for imputation on two UCI datasets (letter and spam).

To run the pipeline for training and evaluation of the GAIN framework, simply run python3 main_letter_spam.py.

Note that any model architecture, such as a multi-layer perceptron or CNN, can be used as the generator and discriminator.

Command inputs:

  • data_name: letter or spam
  • miss_rate: probability of missing components
  • batch_size: number of samples per mini-batch
  • hint_rate: probability used when sampling hint vectors
  • alpha: hyperparameter weighting the generator's reconstruction (MSE) loss
  • iterations: number of training iterations

Example command

$ python3 main_letter_spam.py --data_name spam --miss_rate 0.2 \
  --batch_size 128 --hint_rate 0.9 --alpha 100 --iterations 10000

Outputs

  • imputed_data_x: imputed data
  • rmse: Root Mean Squared Error
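
For programmatic use, a minimal sketch of calling the imputation step directly (assuming data_loader and a utils.rmse_loss helper with the interfaces suggested by main_letter_spam.py; gain_parameters uses the same keys as the command-line flags):

# A sketch, not verbatim repository code: imputes the 'spam' dataset in-process.
from data_loader import data_loader   # assumed to return (ori_data_x, miss_data_x, data_m)
from gain import gain                 # gain(miss_data_x, gain_parameters) as called in main_letter_spam.py
from utils import rmse_loss           # assumed evaluation helper

ori_data_x, miss_data_x, data_m = data_loader('spam', 0.2)

gain_parameters = {'batch_size': 128,
                   'hint_rate': 0.9,
                   'alpha': 100,
                   'iterations': 10000}

imputed_data_x = gain(miss_data_x, gain_parameters)
rmse = rmse_loss(ori_data_x, imputed_data_x, data_m)
print('RMSE: ' + str(rmse))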


gain's Issues

about minibatch

Hi, thanks for open-sourcing such an innovative idea! However, I have a question: in the code of gain.py, lines 159 to 161, the minibatch sampling is random, which does not guarantee that all the data is traversed. Why is training not designed to go through all batches in order, so as to complete a full epoch?
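
For illustration, the epoch-style traversal I have in mind might look like this (a sketch with stand-in arrays, not code from gain.py):

# Sketch: shuffle once per epoch, then visit every row exactly once.
import numpy as np

norm_data_x = np.random.rand(1000, 20)            # stand-in for the normalized data matrix
data_m = (np.random.rand(1000, 20) > 0.2) * 1.0   # stand-in for the mask matrix
batch_size, num_epochs = 128, 5

no = norm_data_x.shape[0]
for epoch in range(num_epochs):
    perm = np.random.permutation(no)
    for start in range(0, no, batch_size):
        idx = perm[start:start + batch_size]
        X_mb, M_mb = norm_data_x[idx, :], data_m[idx, :]
        # ... run one training step on (X_mb, M_mb) here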

No train test split?

Dear Yoon,
I was reading through the GAIN code, and it seems that it does not perform a train-test split. Is there a reason why there is no separation?

I might have overlooked something too; in that case, can you point out where you have done it?

alpha

Hello!
In your paper, in Supplementary Materials Section 3, you wrote that you chose alpha from [0.1, 0.5, 1, 2, 10].
In the code you set a default value of 100 for alpha.
Is this a typo?
Thanks!
Mela

Using GAIN in inductive mode

Can this package somehow be used in inductive mode, i.e. can it be trained on one dataset and then applied to another dataset (test data) without needing to train on the test data? That would be equivalent to the sklearn style of calling fit_transform() on the training data and then only transform() on the test data.
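
For reference, the sklearn pattern I am describing, shown with SimpleImputer as a stand-in rather than GAIN:

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 4.0], [5.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)   # fit on training data only
X_test_imputed = imputer.transform(X_test)         # reuse the fitted statistics on test data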

Unnecessarily repeated Z sampling

In these two lines, New_X_mb is computed using random noise Z, and it is then sent to the generator and discriminator with Z sampled again, so Z is re-sampled.

I don't think there's an issue other than the unnecessary re-sampling of Z, because the values will all get substituted.

New_X_mb = M_mb * X_mb + (1-M_mb) * Z_mb # Missing Data Introduce

New_X_mb = M_mb * X_mb + (1-M_mb) * Z_mb
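
A sketch of sampling Z only once and reusing the result (variable names and the noise range are assumptions, not the repository's exact code):

# Sketch: sample Z once, build New_X_mb once, reuse it for both the D and G steps.
import numpy as np

mb_size, Dim = 128, 784
X_mb = np.random.rand(mb_size, Dim)
M_mb = (np.random.rand(mb_size, Dim) > 0.2) * 1.0
Z_mb = np.random.uniform(0., 0.01, size=[mb_size, Dim])   # assumed noise range
New_X_mb = M_mb * X_mb + (1 - M_mb) * Z_mb                # missing entries filled with noise once
# ... feed the same New_X_mb to both the discriminator and generator training steps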

Model for the MNIST dataset

Hello, thank you for your work and for publishing the code!

In the supplementary materials to the paper, you mention using the GAIN model for MNIST data imputation. Is the code for this task available anywhere?

Best regards!

The test loss didn't converge while running GAIN_Letter.py

Hi! I read your paper this week. I ran GAIN_Letter.py, recorded the train and test loss, and plotted all the values, but the test loss didn't converge.
[plot of train and test loss]
I didn't edit any code except the number of iterations. There was just one warning message. So why didn't the test loss converge?

Here are the outputs:
WARNING:tensorflow:From C:\Users\Michael\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-05-11 23:51:08.686711: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2

0%| | 0/500000 [00:00<?, ?it/s]Iter: 0
Train_loss: 0.2544
Test_loss: 0.2435

0%| | 1/500000 [00:00<37:37:01, 3.69it/s]
0%| | 25/500000 [00:00<26:30:21, 5.24it/s]
0%| | 51/500000 [00:00<18:43:05, 7.42it/s]
0%| | 77/500000 [00:00<13:16:01, 10.47it/s]Iter: 100
Train_loss: 0.1807
Test_loss: 0.183

0%| | 103/500000 [00:00<9:27:10, 14.69it/s]
0%| | 129/500000 [00:00<6:46:53, 20.47it/s]
0%| | 155/500000 [00:00<4:54:36, 28.28it/s]
0%| | 182/500000 [00:00<3:35:50, 38.60it/s]Iter: 200
Train_loss: 0.1521
Test_loss: 0.1615

0%| | 208/500000 [00:01<2:41:04, 51.72it/s]
0%| | 234/500000 [00:01<2:02:26, 68.03it/s]
0%| | 261/500000 [00:01<1:35:19, 87.37it/s]
0%| | 287/500000 [00:01<1:16:25, 108.98it/s]Iter: 300
Train_loss: 0.1367
Test_loss: 0.1388

0%| | 313/500000 [00:01<1:03:40, 130.79it/s]
0%| | 340/500000 [00:01<54:11, 153.68it/s]
0%| | 366/500000 [00:01<47:37, 174.84it/s]
0%| | 393/500000 [00:01<42:57, 193.85it/s]Iter: 400
Train_loss: 0.1231
Test_loss: 0.1307
...

My dataset is 203454 KB and I can't get the imputed dataset after filling. Is it because my dataset is too big? It gives some errors.

The error is as follows; can you help me? Thanks very much!
Resource exhausted: OOM when allocating tensor with shape[12241,16996] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1320, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1408, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[12241,16996] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node concat}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Sigmoid/_29]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main_letter_spam.py", line 96, in
imputed_data, rmse = main(args)
File "main_letter_spam.py", line 45, in main
imputed_data_x = gain(miss_data_x, gain_parameters)
File "/lxt/gain/GAIN/gain.py", line 169, in gain
imputed_data = sess.run([G_sample], feed_dict = {X: X_mb, M: M_mb})[0]
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 930, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[12241,16996] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node concat (defined at /lxt/gain/GAIN/gain.py:94) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Sigmoid/_29]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Errors may have originated from an input operation.
Input Source operations connected to node concat:
Placeholder (defined at /lxt/gain/GAIN/gain.py:59)
Placeholder_1 (defined at /lxt/gain/GAIN/gain.py:61)

Original stack trace for 'concat':
File "main_letter_spam.py", line 96, in
imputed_data, rmse = main(args)
File "main_letter_spam.py", line 45, in main
imputed_data_x = gain(miss_data_x, gain_parameters)
File "/lxt/gain/GAIN/gain.py", line 113, in gain
G_sample = generator(X, M)
File "/lxt/gain/GAIN/gain.py", line 94, in generator
inputs = tf.concat(values = [x, m], axis = 1)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py", line 1271, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1217, in concat_v2
"ConcatV2", values=values, axis=axis, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 800, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3479, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1961, in init

Congeniality of gain

Hi, I am really interested in the congeniality of GAIN that you mention in Section 6.5 of your paper. Would you mind sharing the code for this part with me? My email address is [email protected]. Thanks for your time.

Hyperparameters training

Hello, how can I use GAIN with the TensorBoard HParams dashboard to find the best hyperparameters for a specific dataset?

original data

In the paper, the original data have missing values, and the data matrix has the missing values set to zero. There is also a random matrix that has the available values set to zero and the missing values set to Gaussian noise. The mask matrix indicates missing values with 0 and available values with 1. However, I am not able to map these onto the values in the data loader. My understanding of data_loader.py is
data_m -> mask matrix
but I do not understand the other two.
It looks like miss_data_x is neither the random matrix nor the data matrix, but contains np.nan for the missing values. Please correct me if I am wrong.
Also, I am confused about whether data_x contains missing values. In the paper the original data has its missing values as 'X'; what is X? In data_x, is it np.nan?

Please help

RMSE is not stable

Each time I run the code, the value of RMSE changes. Why is it not stable?

Why G_loss doesn’t converge much?

Hi Jinsung,

Thank you for your fabulous paper. It gave me some inspiration for gap-filling satellite image time series.

I just have one question. When I implemented your model and tried to train on my own data, which is 1000 by 1000 pixel images with 78 bands, I found that G_loss didn't converge much. Usually it is around 13 to 14, which, in my experience, is quite a large number. Did you encounter this situation when you developed your model?

Cheers,
Steve

why are MSE_test_loss and MSE_train_loss in different types?

Hi, I noticed that in your code,
MSE_train_loss = tf.reduce_mean((M * X - M * G_sample)**2) / tf.reduce_mean(M)
MSE_test_loss = tf.reduce_mean(((1-M) * X - (1-M) * G_sample)**2) / tf.reduce_mean(1-M)
My question is:
why is there "1-M" in the test loss but "M" in the train loss?

Thanks!

Is the adversarial loss actually helping?

Hi Yoon,

I was wondering how the results in Table 1 were obtained. I have been playing around with the code and, to me, it is not clear that the adversarial loss is helping (as reported in the results section, specifically Table 1).

For example, when I run the code for the SPAM dataset (default implementation and hyperparameters) the RMSE score is ~0.053. However, when I set the adversarial loss to 0 by modifying the following line in gain.py:

G_loss_temp = -tf.reduce_mean((1-M) * tf.log(D_prob + 1e-8)) * 0.

the RMSE score is also ~0.053. Am I missing anything? I am observing something similar for another dataset. I am considering using GAIN for a project and would greatly appreciate an explanation of how the results of Table 1 were obtained.

How to decide Missingness Mechanism

Hello Jsyoon,

Your work is really nice, and thank you for sharing the code.
The question I have difficulty finding an answer to is this: how can I choose the missingness mechanism?
I want to create missingness mechanisms of type MCAR, MAR, and MNAR, but I could not find this in the code. Can we create other missingness mechanisms using the function you wrote to generate missing data? Or how did you do this?

Thank you.

Using GAIN to impute missing cytokine data

Dear Yoon

I found this newly developed tool for imputing missing data using a GAN. It seems that your tool is significantly better than traditional methods. We have some cytokine data for cancer patients; however, about 20% of the data is missing because the values are below the detection limit.

Hence, I would like to ask whether it is possible to use your tool to impute missing data with this type of feature.

The missing values are those below the smallest observed value in the vector of each cytokine.

Thank you very much.

Batch size for general dataset

Hi, while reading your fabulous paper, I came up with a few questions.

Q1

For the MNIST dataset, we can use mini-batches.
But for a general, matrix-shaped dataset, how did you design the mini-batch?
I am just guessing that a mini-batch might be a set of selected rows,
i.e. X.shape=[n,p] => X_mini_batch.shape=[k,p] (k<n)
Am I guessing right?
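
A sketch of the row-wise sampling I am guessing at (with stand-in data):

# Sample k rows out of n; each mini-batch keeps all p columns.
import numpy as np

n, p = 1000, 20
X = np.random.rand(n, p)           # stand-in for the full data matrix
k = 128
idx = np.random.choice(n, size=k, replace=False)
X_mini_batch = X[idx, :]           # shape [k, p]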

Q2

Is it possible that G actually sees the full dataset?
I mean, since you are masking the dataset randomly at each epoch, if we run 10000 epochs in total, G might have seen the whole dataset without masking.
(By a simple calculation, I found that G has seen 99.99% of the unmasked training dataset.)
I think for each data point X_i in the dataset X, we should fix the mask to examine the model performance more precisely.
(And this might also be the case in the real world.)
I'm a bit confused; am I wrong?
Did you also use this random masking in your other experiments in the paper?

Q3

I think you are using the dataset matrix in vectorized form.
For the MNIST dataset, that might be okay since all variables in the data mean the same thing.
But for a general dataset such as cancer data, each column has a different meaning (and a different type).
I think simply vectorizing these datasets might cause a significant loss of information.
Was that okay for you, or did you modify G and D for those datasets?

I'm sorry for asking so many questions; there is a lot I still don't know.
Since this is a GitHub issue I wrote in English, but if you can read Korean, feel free to answer in Korean (in fact, I would prefer that).
I really enjoyed reading the paper!

Differences with the paper

Hello!
Thank you for your great paper and this code.

I was wondering why some hyperparameters are different from the ones in the paper.
For example, you use ReLU in the code but tanh in the paper. And you choose alpha = 100 by default in the code, but alpha = 0.1, 0.5, 1, 2, or 10 in the paper.
And the hint mechanism is completely different (do the mathematical results still hold?).
What should be followed to reproduce the results of the paper?

MinMax Normalization?

Hello. I found that your work is really superb.
I am now trying to apply GAIN to my research. During the process, I became curious about how you normalized the data. As far as I know, MinMax normalization is formulated as below.

(x- min)/(max-min)

But I found that your normalization code is written as (x - min)/max.
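
To make the difference concrete, a small sketch comparing the two formulas (not the repository's exact normalization code):

import numpy as np

x = np.array([2.0, 5.0, 10.0])
min_val, max_val = x.min(), x.max()

minmax  = (x - min_val) / (max_val - min_val)  # standard MinMax: [0.0, 0.375, 1.0]
variant = (x - min_val) / max_val              # (x - min)/max as I read the code: [0.0, 0.3, 0.8]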

Is there something I am missing in your code? Or can we normalize the data by just dividing by the maximum value?

I will look forward to your reply.

Thank you.

How to deal with categorical variables in this method

Hi Jinsung,

Thanks for developing this method! It's very cool and useful.

In the supplementary material, I saw that you showed the imputation performance for categorical variables. How did you deal with categorical variables in this method?

I tried transforming the dataset with one-hot encoding and sending it to the network without any modification, but the imputation result (false prediction rate) is very similar to a random guess.

Hao

RMSE

Why did you use a custom formula to calculate RMSE?
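
For context, my reading is that the formula computes RMSE only over the originally missing entries, roughly like the sketch below (not verbatim repository code):

import numpy as np

def rmse_missing_only(ori_data, imputed_data, data_m):
    # data_m: 1 for observed entries, 0 for missing entries
    squared_error = np.sum(((1 - data_m) * ori_data - (1 - data_m) * imputed_data) ** 2)
    return np.sqrt(squared_error / float(np.sum(1 - data_m)))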

Why isn't the loss calculated only with the b_i = 0 values of the hints?

Hello,

First of all, congratulations on your fantastic paper. I have been reading it and working with your repository; however, I have a question and would appreciate it if you could answer it.

I know a couple of changes exist between the implementation described in the original paper and the one in this repository. I have also checked some closed issues, such as #2, where the new implementation of the hint matrix is described.

In your original article, when you describe how the algorithm works in Section 5, the discriminator loss is only calculated with the b_i = 0 entries of each sample, that is, the positions where there isn't a hint. In the same paragraph, it is also noted that if you train with all the values, the discriminator will overfit to the hint matrix.

Despite this, when I check your code, I have the impression that you calculate D_loss and G_loss with all the values of the hint matrix (b_i = 0 and b_i = 1) in lines 136-139.

I want to ask whether this change is due to the difference in the definition of the hint matrix, and why this new way of calculating the loss doesn't end with the discriminator overfitting to the hint matrix.

Thank you very much for your attention!

Changing only missing values? and scoring?

Hi There,

First off, thank you for your amazing paper and software. I've been using it to learn and it's really cool.

I had a few questions:

  1. Is there a way to have it change only the missing values and not the other parts? I am importing my pandas DataFrame and filled the missing fields with 0; I also tried np.nan, with the same effect. (See the sketch after this list.)
  2. I assume I have to play with the models (G/D) to find the right settings, but I'm not sure how I can get feedback on their progress. Is there a way to get a score for each iteration so I can understand where to make adjustments?
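
A sketch of what I mean in question 1, i.e. keeping the observed entries fixed and changing only the gaps (the 9.9 values stand in for whatever the imputer produces):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})
observed = df.notna().to_numpy()                     # True where a value was actually present
imputed = np.full(df.shape, 9.9)                     # pretend the imputer changed every entry
result = np.where(observed, df.to_numpy(), imputed)  # observed values kept, only the gaps change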

I'm new to GANs and your tool, so I apologize if these are basic questions.

Thanks again for the great tool and I'm really enjoying it.

Hint mechanism different in the paper?

In the paper, the hint mechanism is generated by first selecting one feature per row, recorded in a vector called k.

Then a matrix b with the same size as the mask m is created, with a zero per row at the position given by k and the rest set to one.

Then the hint is created with the equation:

h = b * m + 0.5 * (1.0 - b)

Which means that the hint is almost a copy of m but has exactly one 0.5 per row.

In your implementation, the hint is created by removing ones of the mask m with a probability of 0.1 (i.e., keeping them with a probability of 0.9). There are no 0.5 values in the hint.

# 1. Mini batch size
mb_size = 128
# 2. Missing rate
p_miss = 0.5
# 3. Hint rate
p_hint = 0.9
# 4. Loss Hyperparameters
alpha = 10
# 5. Input Dim (Fixed)
Dim = 784
def sample_M(m, n, p):
    # returns an m x n binary matrix whose entries are 1 with probability 1-p
    A = np.random.uniform(0., 1., size = [m, n])
    B = A > p
    C = 1.*B
    return C
M_mb = sample_M(mb_size, Dim, p_miss)
H_mb1 = sample_M(mb_size, Dim, 1-p_hint)
H_mb = M_mb * H_mb1
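
For comparison with the code above, a sketch of the paper-style hint with exactly one 0.5 per row, following the equation h = b * m + 0.5 * (1.0 - b) (not code from this repository):

import numpy as np

def sample_hint_paper(M_mb):
    # B has exactly one 0 per row (at position k) and ones elsewhere, so
    # H = B * M + 0.5 * (1 - B) reveals m everywhere except one 0.5 entry per row.
    rows, dim = M_mb.shape
    k = np.random.randint(0, dim, size=rows)
    B = np.ones((rows, dim))
    B[np.arange(rows), k] = 0.0
    return B * M_mb + 0.5 * (1.0 - B)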

Am I understanding something wrong?
Thank you.

Training Query

Hi,
I've been implementing your code, and it's been really simple to work with thanks to the great structure. Thank you!
I have a question about the GAIN_v2.py script: I noticed that you're training the first and third layers of the generator, but all three layers of the discriminator. What was the rationale behind not training this central layer? I couldn't find a reference to it in the paper, but I may have missed it.
Thanks! Robyn

Bug introduced in new version of the code?

Greetings!

I noticed that in the newest version of the code, in the training phase only the data containing missing values is given to the gain module. So the X inside the gain module is already the data with entries missing according to M, and the loss function can't compare the predictions with the original values.

No split into training and testing sets?

Hello, I see that your code does not split the dataset into training and testing sets. You trained on all the data and then tested on the entire dataset. Isn't this data leakage?

Departures from paper wrt discriminator loss, generator first layer and activation functions?

Hello,

I have a few comments if I may.

In your code, at line 157 we see that
D_loss1 = -tf.reduce_mean(M * tf.log(D_prob + 1e-8) + (1-M) * tf.log(1. - D_prob + 1e-8)) * 2

Where does this "*2" come from? (I cannot see it in and/or derive it by the paper)

Also, at line 124 we see that
inputs = tf.concat(axis = 1, values = [inp,m]) # Mask + Data Concatenate

which is also not clarified/stated in the paper; it seems that the generator's input is substantially more informative (also, the input and output dimensions are not the same).

Finally, in the appendix you say that the activation functions at the internal layers are tanh, though in your code you use ReLU; is there a particular reason why you do that?

Many thanks,
Maria Skoularidou

Hint matrix

In gain.py, line 165-166,
H_mb_temp = binary_sampler(hint_rate, batch_size, dim)
H_mb = M_mb * H_mb_temp

To my understanding, H_mb_temp is the binary B matrix from the paper for the minibatch, and M_mb is the binary mask matrix for the minibatch.
Then H_mb is binary.
But H_mb should have some 0.5 entries.
Also, according to the paper, it seems it should be H_mb = M_mb * H_mb_temp + 0.5 * (1 - H_mb_temp).
Is there something I don't understand?
Thank you!

Baselines calculated only on test set (?)

Hi,

By reading the paper, I think the baselines (like MICE, MissForest, etc.) are calculated only on the test set. On the other hand, GAIN learns a model from the larger training set and then predicts on the test set.

What are your thoughts on that subtle difference?

How to get a complete data set after filling?

After reading the article and your code, I think it is wonderful. But I have a question: how do I use the trained GAN to fill in my missing dataset? And how can I export the completed dataset?

What is the advantage of adversarial Loss?

Thank you very much for your work. But after going through your paper, I cannot get a sense of how the adversarial loss contributes to missing data imputation. I have tried your provided example using only MSE_loss to train the generator. It seems that the test MSE with only MSE_loss (line 162: G_loss = MSE_train_loss) and with MSE_loss + adversarial loss (line 162: G_loss = G_loss1 + alpha * MSE_train_loss) are quite similar. Could you kindly explain more about how the adversarial loss contributes to imputation, and maybe give some other examples? Thank you very much!
