ConcreteDropout's People

Contributors

jgamper, joeyearsley, yaringal

ConcreteDropout's Issues

tensorflow 2 version

I am trying to port the Keras implementation of concrete dropout at https://github.com/yaringal/ConcreteDropout/blob/master/concrete-dropout-keras.ipynb to TensorFlow 2. This is mostly straightforward, since TF 2 has most of the Keras API built in. However, the custom losses are being cleared before fitting.

After the model is defined, and before compiling it, I can see that the losses for each concrete dropout layer have been added to model.losses by the line self.layer.add_loss(regularizer), which runs when the layers are built:

>>> print(model.losses)
[<tf.Tensor: id=64, shape=(), dtype=float32, numpy=-8.4521576e-05>, <tf.Tensor: id=168, shape=(), dtype=float32, numpy=-0.000650166>, <tf.Tensor: id=272, shape=(), dtype=float32, numpy=-0.000650166>, <tf.Tensor: id=376, shape=(), dtype=float32, numpy=-0.000650166>, <tf.Tensor: id=479, shape=(), dtype=float32, numpy=-0.000650166>]

After compilation, however, model.losses becomes an empty list, and the assertion assert len(model.losses) == 5 fails. If I ignore the assertion, the fact that the layer losses are being neglected shows up during training as the warning WARNING:tensorflow:Gradients do not exist for variables ['concrete_dropout/p_logit:0', 'concrete_dropout_1/p_logit:0', 'concrete_dropout_2/p_logit:0', 'concrete_dropout_3/p_logit:0', 'concrete_dropout_4/p_logit:0'] when minimizing the loss.

After digging into the compilation code at https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/python/keras/engine/training.py#L184, I believe the problematic lines are:

    # Clear any `_eager_losses` that was added.
    self._clear_losses()

Does anyone know why this is done, and how to add custom losses to TensorFlow 2 models in an analogous way?

I posted this as a question on Stack Overflow: https://stackoverflow.com/questions/61164373/loss-added-to-custom-layer-in-tensorflow-2-is-cleared-when-compiling
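For reference, here is a minimal sketch (not the repository's code; class and variable names are illustrative) of the pattern that keeps such losses tracked in TF 2: compute the regularizer inside call() and register it with self.add_loss() on every forward pass, instead of adding an eager tensor once at build time. The dropout mask itself is omitted.

import numpy as np
import tensorflow as tf

class ConcreteDropoutTF2(tf.keras.layers.Wrapper):
    """Illustrative wrapper; only the loss-registration pattern matters here."""

    def __init__(self, layer, weight_regularizer=1e-6, dropout_regularizer=1e-5, **kwargs):
        super().__init__(layer, **kwargs)
        self.weight_regularizer = weight_regularizer
        self.dropout_regularizer = dropout_regularizer

    def build(self, input_shape=None):
        if not self.layer.built:
            self.layer.build(input_shape)
            self.layer.built = True
        self.p_logit = self.add_weight(
            name='p_logit', shape=(), trainable=True,
            initializer=tf.keras.initializers.RandomUniform(-2.0, 0.0))
        self.input_dim = int(np.prod(input_shape[1:]))
        super().build(input_shape)

    def call(self, inputs, training=None):
        p = tf.sigmoid(self.p_logit)
        # Rebuilding the regularizer here means Keras keeps tracking it after
        # compile(), instead of clearing a stale eager tensor added at build time.
        kernel_reg = (self.weight_regularizer *
                      tf.reduce_sum(tf.square(self.layer.kernel)) / (1.0 - p))
        dropout_reg = p * tf.math.log(p) + (1.0 - p) * tf.math.log(1.0 - p)
        dropout_reg *= self.dropout_regularizer * self.input_dim
        self.add_loss(kernel_reg + dropout_reg)
        # ... apply the concrete dropout mask to `inputs` here, as in the notebook ...
        return self.layer(inputs)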

Inconsistency between your code and the paper

I read your paper, Concrete Dropout, and found an inconsistency between the code and the paper.
According to Eq. (3) of the paper, the regularizer of the kernel matrix should be proportional to 1 - p,
but in the code it is inversely proportional to 1 - p:

kernel_regularizer = self.weight_regularizer * K.sum(K.square(weight)) / (1. - self.p)

I am not sure whether I am misunderstanding the paper or the code.

How to adapt heteroscedastic loss for a classification problem?

How could we adapt this for a classification problem using cross-entropy?

def heteroscedastic_mse(true, pred):
    # D is the output dimensionality; K is the Keras backend
    mean = pred[:, :D]
    log_var = pred[:, D:]
    precision = K.exp(-log_var)
    return K.sum(precision * (true - mean)**2. + log_var, -1)

I found a post suggesting a Monte Carlo simulation; however, accuracy gets stuck at a very low value and won't go up:

def heteroscedastic_categorical_crossentropy(true, pred):
    mean = pred[:, :D]
    log_var = pred[:, D:]

    log_std = K.sqrt(log_var)

    # variance depressor
    logvar_dep = K.exp(log_var) - K.ones_like(log_var)

    # undistorted loss
    undistorted_loss = K.categorical_crossentropy(mean, true, from_logits=True)

    # apply Monte Carlo simulation
    T = 100
    iterable = K.variable(np.ones(T))
    dist = distributions.Normal(loc=K.zeros_like(log_std), scale=log_std)
    monte_carlo_results = K.map_fn(
        gaussian_categorical_crossentropy(true, mean, dist, undistorted_loss, D),
        iterable, name='monte_carlo_results')

    var_loss = K.mean(monte_carlo_results, axis=0) * undistorted_loss

    return var_loss + undistorted_loss + K.sum(logvar_dep, -1)

where gaussian_categorical_crossentropy is defined by:

def gaussian_categorical_crossentropy(true, pred, dist, undistorted_loss, num_classes):
    def map_fn(i):
        std_samples = dist.sample(1)
        distorted_loss = K.categorical_crossentropy(pred + std_samples[0], true,
                                                    from_logits=True)
        diff = undistorted_loss - distorted_loss
        return -K.elu(diff)
    return map_fn

The source of the last code:
https://github.com/kyle-dorman/bayesian-neural-network-blogpost

Thanks in advance!
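For comparison, a simpler Monte Carlo variant is sketched below (illustrative only, not code from this repository). It assumes D is the number of classes and that the second output head predicts the log-variance of the logits; it samples distorted logits from N(mean, exp(0.5 * log_var)) and averages the resulting cross-entropies.

from keras import backend as K

def mc_heteroscedastic_categorical_crossentropy(true, pred, T=25):
    mean = pred[:, :D]       # predicted logits
    log_var = pred[:, D:]    # predicted log-variance of the logits
    std = K.exp(0.5 * log_var)
    mc_losses = []
    for _ in range(T):
        eps = K.random_normal(shape=K.shape(mean))
        distorted_logits = mean + std * eps
        # Keras 2.x backend takes the target first, then the logits.
        mc_losses.append(K.categorical_crossentropy(true, distorted_logits,
                                                    from_logits=True))
    # Average the loss over the T sampled corruptions of the logits.
    return K.mean(K.stack(mc_losses, axis=0), axis=0)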

'LSTM' object has no attribute 'kernel'

Hi,

since updating Keras from 2.0.4 to 2.2.0, I am getting

line 76, in build
weight = self.layer.kernel
AttributeError: 'LSTM' object has no attribute 'kernel'

when trying to use the ConcreteDropout class (copied from your notebook) with LSTM/SimpleRNN, which was not the case before; normal Dense() layers work fine. I am trying to fix it myself, but maybe you already have a solution for this.

Thanks a lot,
Sarem
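A defensive lookup along these lines may help (a sketch only, intended for the wrapper's build(); in Keras >= 2.1 recurrent layers keep their weights on an inner cell object, so self.layer.kernel no longer exists for LSTM/SimpleRNN):

from keras import backend as K

if hasattr(self.layer, 'kernel'):              # Dense, Conv2D, ...
    weights = [self.layer.kernel]
elif hasattr(self.layer, 'cell'):              # LSTM, GRU, SimpleRNN
    weights = [self.layer.cell.kernel, self.layer.cell.recurrent_kernel]
else:
    raise ValueError('Wrapped layer has no kernel to regularize')
kernel_regularizer = self.weight_regularizer * sum(
    K.sum(K.square(w)) for w in weights) / (1. - self.p)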

PyTorch version: small error?

Dear Yarin,

Fascinating research, which I am now trying to use in my own work. I believe there is a small error in the fit_model function of the PyTorch version concerning how the batches are computed; it affects execution speed even though it is probably benign to the results.

I believe this code corrects the error:

...
for i in range(self.nb_epoch):
    for batch in range(int(np.ceil(self.X.shape[0] / self.batch_size))):
        _x = self.X[self.batch_size * batch : self.batch_size * (batch + 1)]
        _y = self.Y[self.batch_size * batch : self.batch_size * (batch + 1)]
        x = torch.FloatTensor(_x).cuda()  # 32-bit floating point
        y = torch.FloatTensor(_y).cuda()
        mean, log_var, regularization = self.model(x)  # forward pass
        loss = heteroscedastic_loss(y, mean, log_var) + regularization
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
...

Kind regards,

Hans

dropout_regularizer

In the paper, the entropy of a Bernoulli random variable is
H(p) := -p * log(p) - (1-p) * log(1-p)

But in the code, dropout_regularizer is computed by
dropout_regularizer = self.p * K.log(self.p)
dropout_regularizer += (1. - self.p) * K.log(1. - self.p)
dropout_regularizer *= self.dropout_regularizer * input_dim

Could you please explain the meaning of the line dropout_regularizer *= self.dropout_regularizer * input_dim? I cannot find the corresponding equation in the paper.

Thanks for your kind help in advance.
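As a small numeric sketch (values below are made up) of how the code relates to the entropy in the paper: the first two lines compute -H(p), and the last line scales it by the input dimensionality K (input_dim) and the constant folded into dropout_regularizer.

import numpy as np

p, input_dim, dropout_regularizer = 0.1, 512, 1e-4           # illustrative values
entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)           # H(p) as in the paper
term_in_code = (p * np.log(p) + (1 - p) * np.log(1 - p)) * dropout_regularizer * input_dim
assert np.isclose(term_in_code, -dropout_regularizer * input_dim * entropy)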

value to set for weight_regularizer & dropout_regularizer when dataset size is unknown

Hi,

To calculate the values of weight_regularizer and dropout_regularizer, the dataset size N is used. What if we don't know the dataset size in advance, as in lifelong reinforcement learning, where re-training happens periodically as new data comes in?

I am not able to figure out how to use this concrete dropout implementation in such cases.

Can you please suggest how to handle such situations with concrete dropout? I would really appreciate it.

Thanks in Advance
Nikhil
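One illustrative option (a sketch only, not an official recommendation) is to recompute the two constants from the amount of data seen so far before each periodic retraining round, since both depend on N only through 1/N; replay_buffer below is a hypothetical container of the data collected so far.

def regularizer_constants(n_seen, length_scale=1e-4, tau=1.0):
    wd = length_scale ** 2 / (tau * n_seen)   # weight_regularizer
    dd = 2.0 / (tau * n_seen)                 # dropout_regularizer (2/N as in the examples)
    return wd, dd

wd, dd = regularizer_constants(n_seen=len(replay_buffer))    # rebuild the model with these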

Contradiction between eqn. (3) and dropout_regularizer? Weight matrix shape vs input shape.

Eq. (3) in the paper states that the dropout probability should be regularized by multiplying the entropy of p by the dimensionality K of the weight matrix.
However, in the code we multiply the entropy term by input_dim = np.prod(input_shape[1:]), which seems to just return the overall dimensionality of the input (and the code in the arXiv PDF used input_shape[-1]). For convolutional layers, for example, shouldn't we set K = input_volume_depth * filter_xdim * filter_ydim * output_volume_depth? I would appreciate it, @yaringal, if you could help me clear up my confusion.
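For reference, a small sketch of the two choices of K being compared (shapes assume channels_last; values are illustrative):

import numpy as np

input_shape = (None, 32, 32, 64)            # e.g. a conv layer's input
k_per_unit    = np.prod(input_shape[1:])    # 32 * 32 * 64: one Bernoulli per input unit
k_per_channel = input_shape[-1]             # 64: one Bernoulli per channel (spatial variant)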

Confusion about initialization in bigger nets

Hey,
I'm trying out concrete dropout with bigger nets (namely DenseNet121 and ResNet18) and for that tried to port the Keras implementation for spatial concrete dropout to PyTorch.
Since it works for DenseNet121 (the model converges) but strangely not for ResNet18, I was wondering whether the initialization I used was wrong.
For both weight_regularizer and dropout_regularizer I used the initialization given in the MNIST example of the spatial concrete dropout Keras implementation (both divided by the training dataset length). However, looking at the paper, you seem to have used 0.01 x N x H x W for the dropout regularizer with bigger models, and this multiplication would lead to a much larger factor than the 2. / N specified in the example.
Which initialization is right?
I would greatly appreciate if you could clear up my confusion!
Cheers!

weight_regularizer / dropout_regularizer missing factor 2

The current implementation of the weight/dropout regularizers does not seem to be correct for the cross-entropy loss.
If I understand it correctly, the factor of 2 in the length-scale relation is removed in this code (you set \lambda = l**2 * (1 - p) / (\tau * N)).

This is correct for the MSE loss, but the form of the cross-entropy loss is exactly the same as the negative log-likelihood version of the Euclidean loss, so I think the factor of 2 cannot be removed in this case.

Thanks.

p higher than 0.5 after training...

Quick and hopefully not stupid question:
I'm trying to use the ConcreteDropout class to train a convnet (classifying images into 12 classes). The first strange thing I observed is that the first three convolutional layers usually have higher dropout probabilities than the dense layers after them (independent of N), but what actually worries me is that sometimes the probabilities are higher than 0.5; see the sample output below:

print(np.array([K.eval(layer.p) for layer in model.layers if hasattr(layer, "p")]))
[0.59234613 0.4666404  0.2114246  0.10445894 0.10087071]

Full model structure:

import numpy as np
from keras import backend as K, optimizers
from keras.callbacks import History
from keras.layers import Convolution2D, Dense, Flatten, MaxPooling2D
from keras.models import Sequential
# ConcreteDropout is the wrapper class copied from the notebook

N = len(train_images)
l = 1e-5  # length-scale parameter (tau, the model precision, is 1 for classification)
wd = l**2. / N  # this will be the l2 weight regularizer
dd = 1. / N  # this will regularize dropout (depends only on dataset size)

K.clear_session()
model = Sequential()
model.add(ConcreteDropout(Convolution2D(24, (11, 11), strides=(4, 4),
                                        padding="same", activation="relu",
                                        kernel_initializer="he_uniform", bias_initializer="zeros",
                                        data_format="channels_last"),
                          weight_regularizer=wd, dropout_regularizer=dd,
                          input_shape=(256, 256, 1)))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding="valid", data_format="channels_last"))

model.add(ConcreteDropout(Convolution2D(96, (5, 5),
                                        padding="same", activation="relu",
                                        kernel_initializer="he_uniform", bias_initializer="zeros",
                                        data_format="channels_last"),
                          weight_regularizer=wd, dropout_regularizer=dd))
model.add(MaxPooling2D(pool_size=(3, 3), padding="valid", data_format="channels_last"))

model.add(ConcreteDropout(Convolution2D(96, (3, 3),
                                        padding="same", activation="relu",
                                        kernel_initializer="he_uniform", bias_initializer="zeros",
                                        data_format="channels_last"),
                          weight_regularizer=wd, dropout_regularizer=dd))
model.add(MaxPooling2D(pool_size=(3, 3), padding="valid", data_format="channels_last"))

model.add(Flatten())
model.add(ConcreteDropout(Dense(512, activation="relu",
                                kernel_initializer="he_uniform", bias_initializer="zeros"),
                          weight_regularizer=wd, dropout_regularizer=dd))

model.add(ConcreteDropout(Dense(512, activation="relu",
                                kernel_initializer="he_uniform", bias_initializer="zeros"),
                          weight_regularizer=wd, dropout_regularizer=dd))

model.add(Dense(12, activation="softmax",
                kernel_initializer="he_uniform", bias_initializer="zeros"))

opt = optimizers.SGD(lr=0.005, momentum=0.9, nesterov=True)

model.compile(loss="categorical_crossentropy",
              optimizer=opt,
              metrics=["categorical_accuracy"])

history = History()

@yaringal @joeyearsley
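As an aside, tracking the learned probabilities per epoch, rather than only at the end, can show whether they drift above 0.5 early on or only at convergence. A minimal sketch using the standard Keras callback API (the class name is made up):

import numpy as np
from keras import backend as K
from keras.callbacks import Callback

class DropoutPMonitor(Callback):
    def on_epoch_end(self, epoch, logs=None):
        # collect the current dropout probability of every ConcreteDropout layer
        ps = [K.eval(layer.p) for layer in self.model.layers if hasattr(layer, 'p')]
        print('epoch %d, learned dropout p per layer: %s' % (epoch, np.round(ps, 3)))

# model.fit(..., callbacks=[history, DropoutPMonitor()])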

Biases in weight_regularizer?

First of all, great work.
In your thesis, the "Dropout as a Bayesian Approximation..." paper, and the "Concrete Dropout" article, @yaringal, you seem to apply the dropout distribution only to the weights and not the biases, which leads to a p-dependent regularization term that only includes the weight matrices.

However, in the PyTorch implementation (I didn't check the other ones) the regularization term sums the squares of layer.parameters(), which collects the biases as well. This leads to a p-dependent regularization term for the biases, which is probably not what you want once you start optimizing p. Is this a bug, or am I missing something?
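A sketch of the restriction being described (illustrative only; it assumes the wrapper exposes the wrapped module as self.layer, the constant as self.weight_regularizer, and the current drop probability as self.p): sum only the non-bias parameters, so the bias penalty does not pick up the 1/(1 - p) factor.

import torch

sum_of_squares = 0.0
for name, param in self.layer.named_parameters():
    if 'bias' not in name:                   # skip biases
        sum_of_squares = sum_of_squares + torch.sum(torch.pow(param, 2))
weights_regularizer = self.weight_regularizer * sum_of_squares / (1.0 - self.p)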
