I've been trying to reproduce your results on MNIST, CIFAR10 and ImageNet, but I can't quite get the same accuracies as in your paper. I tend to trust your results, so I think it's more likely a code or hyperparameter problem on my end.
I had no problem on MNIST, I got 99.40%.
On CIFAR10 I ran the exact command supposed to yield 90.35% accuracy:
python main.py --net-type 'noiseresnet18' --dataset-test 'CIFAR10' --dataset-train 'CIFAR10' --nfilters 128 --batch-size 10 --learning-rate 1e-4 --first_filter_size 3 --level 0.1 --optim-method Adam --nepochs 450
I obtained 89.38%. The same command with 256 filters instead of 128 gets me to 90.76%. That's acceptable compared to 90.35%, since I don't know the variance of PNNs, but you also say that after rerunning the experiments following MK's remark, the 94% from your paper more or less holds. My results are quite far from that, in between MK's and the paper's.
ImageNet is more problematic. The code in this repo doesn't seem to be actually intended to run on ImageNet, so I needed to fix a few things first. I don't think it matters, but I'll list the modifications here so you have all the information. I added this in main.py between lines 123 and 125:
```python
elif self.dataset_train_name.startswith("ImageNet"):
    self.nclasses = 1000
    self.input_size = 224
    if self.filter_size < 7:
        self.avgpool = 14  # TODO
    elif self.filter_size == 7:
        self.avgpool = 7
```
This just copies what was above, changing `nclasses` and `input_size`, but maybe I should have changed `avgpool` as well. There was also a small change needed because some arguments for creating the ImageNet dataloaders are not parsed, namely `input_filename_train` and `input_filename_test`.
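For reference, here is a minimal sketch of how those two options could be parsed (the `--dashed-name` flag spelling and defaults are my guesses, not the repo's actual convention):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flags mirroring the unparsed dataloader arguments;
# argparse maps '--input-filename-train' to args.input_filename_train.
parser.add_argument('--input-filename-train', type=str, default=None,
                    help='path to the ImageNet training file list')
parser.add_argument('--input-filename-test', type=str, default=None,
                    help='path to the ImageNet validation file list')

args = parser.parse_args(['--input-filename-train', 'train.txt',
                          '--input-filename-test', 'val.txt'])
print(args.input_filename_train)  # train.txt
```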
I also had to change line 494 in models.py from `pool = 1` to `pool = 7`, because otherwise there was a dimension mismatch at the linear layer. A pooling with kernel size 1 and stride 1 is a no-op, so I guess it was a typo. Finally, I had to change `NoiseLayer` to the following so I could run on two GPUs (there was a device problem with your implementation):
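For what it's worth, the kernel size of 7 matches the usual ResNet-18 arithmetic for 224x224 inputs (a sketch of my reasoning, assuming the standard stride schedule):

```python
# Standard ResNet-18 downsampling for a 224x224 input:
# stride-2 7x7 conv, stride-2 max pool, then four residual stages
# (the first at stride 1, the last three at stride 2).
size = 224
for stride in [2, 2, 1, 2, 2, 2]:  # conv1, maxpool, layer1..layer4
    size //= stride
print(size)  # 7 -> so the final average pool needs kernel size 7
```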
```python
class NoiseLayer(nn.Module):
    def __init__(self, in_planes, out_planes, level, act):
        super(NoiseLayer, self).__init__()
        self.register_parameter('noise', None)
        self.level = level
        self.layers = nn.Sequential(
            act,
            nn.BatchNorm2d(in_planes),  # TODO paper does not use it!
            nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=1),
        )

    def set_noise(self, input):
        noise = input.new(input.data[0].size()).uniform_()
        noise = self.level * (2 * noise - 1)
        self.noise = nn.Parameter(noise)

    def forward(self, x):
        if self.noise is None:
            self.set_noise(x)
        y = torch.add(x, self.noise)
        return self.layers(y)  # input, perturb, relu, batchnorm, conv1x1
```
I tried 3 different settings, always with noiseresnet18, Adam with learning rate 1e-4, a first filter size of 7, and a noise level of 0.1.
First was 64 filters and batch size 128, I got 39%.
Second was 128 filters and batch size 128, I got 52%.
Now I'm running 256 filters with batch size 64 (which, as you can imagine, is slow). It's not finished yet, but at epoch 24 the validation accuracy is 47%, about 10% (in absolute terms) behind your best model in Figure 5 of your paper at the same epoch. The curve looks like it will top out around 60%. Also, I doubt you put as many as 256 filters in the first layer of your model (it should be 64 for a proper ResNet-18, but tell me if not), so there is probably something wrong with what I'm doing.
I suspect the problem is with the number of masks, but there is something I don't understand about it. From your paper I gather that there is an intermediate representation of m channels, with m a multiple of p (the number of input channels) and m/p the number of masks per input channel, but a few things seem incoherent to me in that respect.
First, equation 2 only works with m = p. Shouldn't the index on x be fixed, indicating the input channel, so that we apply m masks to the same input channel? Otherwise it doesn't really match the multiple different "views" of a single channel that was already in LBCNNs. In that case you would also need that index on nu to get m x p perturbation masks, as in part 4.1. Moreover, it would mean m and m/p are conflated: m is the number of masks per input channel in equation 2, but becomes the total number of channels of the intermediate representation in part 4.1. Following the first interpretation, the intermediate representation would actually be of size m x p and you would have m x p x q learnable parameters; otherwise the output would be too big by a factor of p. The parameter saving would then be h x w / m, which is the same as p x h x w / m if we replace m/p by m. Also, your implementation only allows m = p, unlike MK's.

So what does N masks (e.g. 256) mean exactly in your paper? While reading, I assumed it was a fixed number throughout all layers, but then the last block(s) would have less than one mask per input channel, which doesn't make sense to me. Is it instead the number for the first perturbative layer, meaning m = 4p for N = 256? In that case, what code did you actually run, if the one in this repo only allows m = p? Or is it simply the number of filters, with m = p throughout the network? Then the difference in performance between my model and yours would come from other hyperparameters.
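To make the two readings concrete, here is a small NumPy sketch of how I understand them (purely my interpretation, not code from the repo; all shapes and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, h, w = 4, 8, 16, 16           # input channels, output channels, spatial size
x = rng.standard_normal((p, h, w))  # one input sample

relu = lambda t: np.maximum(t, 0)

# Interpretation A (what the repo allows): m = p, one mask per input channel.
# The intermediate representation has p channels; the 1x1 conv has p*q weights.
masks_a = 0.1 * (2 * rng.random((p, h, w)) - 1)
inter_a = relu(x + masks_a)                        # (p, h, w)
w_a = rng.standard_normal((q, p))
out_a = np.einsum('qp,phw->qhw', w_a, inter_a)     # (q, h, w)

# Interpretation B: m/p masks per input channel, so the intermediate
# representation has m = (m/p) * p channels and the 1x1 conv has m*q weights.
mp = 4                                             # masks per channel
masks_b = 0.1 * (2 * rng.random((p, mp, h, w)) - 1)
inter_b = relu(x[:, None] + masks_b).reshape(p * mp, h, w)  # (m, h, w)
w_b = rng.standard_normal((q, p * mp))
out_b = np.einsum('qm,mhw->qhw', w_b, inter_b)     # (q, h, w)

print(out_a.shape, out_b.shape)  # both (8, 16, 16)
```

Both produce a q-channel output, but B uses p times more masks and linear weights, which is exactly the ambiguity I'm asking about.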
Sorry, I didn't realize before starting this issue that my understanding was so fuzzy, and it got longer than expected. I hope you'll find the time to answer my questions, which in brief are:
- could you clarify how exactly the masks are added to the input and combined together? Which interpretation is the right one?
- what are the new accuracies you get after rerunning the experiments following MK's post?
- did you use this code to run your experiments? If yes (which doesn't seem to be the case for ImageNet), which command should I use to reproduce your results?