I've been trying to reproduce your results on MNIST, CIFAR10 and ImageNet, but I can't quite get the same accuracies as in your paper. I tend to trust your results, so I think it's more likely a code or hyperparameter problem on my end.
I had no problem on MNIST, I got 99.40%.
On CIFAR10 I ran the exact command supposed to yield 90.35% accuracy:
python main.py --net-type 'noiseresnet18' --dataset-test 'CIFAR10' --dataset-train 'CIFAR10' --nfilters 128 --batch-size 10 --learning-rate 1e-4 --first_filter_size 3 --level 0.1 --optim-method Adam --nepochs 450
I obtained 89.38%. The same command with 256 filters instead of 128 gets me to 90.76%. That's acceptable compared to 90.35%, since I don't know the variance of PNNs, but you also say that after rerunning the experiments following MK's remark, the 94% from your paper more or less holds. My results are quite far from that, in between MK's and the paper's.
ImageNet is more problematic. The code in this repo doesn't seem to be actually intended to run on ImageNet, so I needed to fix a few things first. I don't think it matters, but I'll list the modifications here so you have all the information. I added this in main.py between lines 123 and 125:
```python
elif self.dataset_train_name.startswith("ImageNet"):
    self.nclasses = 1000
    self.input_size = 224
    if self.filter_size < 7:
        self.avgpool = 14  # TODO
    elif self.filter_size == 7:
        self.avgpool = 7
```
This just copies what was above, changing `nclasses` and `input_size`, but maybe I should have changed `avgpool` as well. There was also a small change needed because some arguments for creating the ImageNet dataloaders are not parsed, namely `input_filename_train` and `input_filename_test`.
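For reference, here is a minimal sketch of how those two options could be parsed (the `--dashed-name` flag spelling and defaults are my guesses, not the repo's actual convention):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flags mirroring the unparsed dataloader arguments;
# argparse maps '--input-filename-train' to args.input_filename_train.
parser.add_argument('--input-filename-train', type=str, default=None,
                    help='path to the ImageNet training file list')
parser.add_argument('--input-filename-test', type=str, default=None,
                    help='path to the ImageNet validation file list')

args = parser.parse_args(['--input-filename-train', 'train.txt',
                          '--input-filename-test', 'val.txt'])
print(args.input_filename_train)  # train.txt
```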
I also had to change line 494 in models.py from `pool = 1` to `pool = 7`, because otherwise there was a dimension mismatch at the linear layer. A pooling with kernel size 1 and stride 1 is a no-op, so I guess it was a typo. Finally, I had to change `NoiseLayer` to the following so I could run on two GPUs (there was a device problem with your implementation):
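For what it's worth, the kernel size of 7 matches the usual ResNet-18 arithmetic for 224x224 inputs (a sketch of my reasoning, assuming the standard stride schedule):

```python
# Standard ResNet-18 downsampling for a 224x224 input:
# stride-2 7x7 conv, stride-2 max pool, then four residual stages
# (the first at stride 1, the last three at stride 2).
size = 224
for stride in [2, 2, 1, 2, 2, 2]:  # conv1, maxpool, layer1..layer4
    size //= stride
print(size)  # 7 -> so the final average pool needs kernel size 7
```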
```python
class NoiseLayer(nn.Module):
    def __init__(self, in_planes, out_planes, level, act):
        super(NoiseLayer, self).__init__()
        self.register_parameter('noise', None)
        self.level = level
        self.layers = nn.Sequential(
            act,
            nn.BatchNorm2d(in_planes),  # TODO paper does not use it!
            nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=1),
        )

    def set_noise(self, input):
        noise = input.new(input.data[0].size()).uniform_()
        noise = self.level * (2 * noise - 1)
        self.noise = nn.Parameter(noise)

    def forward(self, x):
        if self.noise is None:
            self.set_noise(x)
        y = torch.add(x, self.noise)
        return self.layers(y)  # input, perturb, relu, batchnorm, conv1x1
```
I tried 3 different settings, always with noiseresnet18, Adam with learning rate 1e-4, a first filter size of 7, and a noise level of 0.1.
First was 64 filters and batch size 128, I got 39%.
Second was 128 filters and batch size 128, I got 52%.
Now I'm running 256 filters with batch size 64 (which, as you can imagine, is slow). It's not finished yet, but at epoch 24 the validation accuracy is 47%, about 10% (in absolute terms) behind your best model in Figure 5 of your paper at the same epoch. The curve looks like it will top out around 60%. Also, I doubt you put as many as 256 filters in the first layer of your model (it should be 64 for a proper ResNet-18, but tell me if not), so there is probably something wrong with what I'm doing.
I suspect the problem is with the number of masks, but there is something I don't understand about it. From your paper I gather that there is an intermediate representation of m channels, with m a multiple of p (the number of input channels) and m/p the number of masks per input channel, but a few things seem incoherent to me in that respect.
First, equation 2 only works with m = p. Shouldn't the index on x be fixed, indicating the input channel, so that we apply m masks to the same input channel? Otherwise it doesn't really match the multiple different "views" of a single channel that was already in LBCNNs. In that case you would also need that index on nu to get m x p perturbation masks, as in part 4.1. Moreover, it would mean m and m/p are conflated: m is the number of masks per input channel in equation 2, but becomes the total number of channels of the intermediate representation in part 4.1. Following the first interpretation, the intermediate representation would actually be of size m x p and you would have m x p x q learnable parameters; otherwise the output would be too big by a factor of p. The parameter saving would then be h x w / m, which is the same as p x h x w / m if we replace m/p by m. Also, your implementation only allows m = p, unlike MK's.

So what does N masks (e.g. 256) mean exactly in your paper? While reading, I assumed it was a fixed number throughout all layers, but then the last block(s) would have less than one mask per input channel, which doesn't make sense to me. Is it instead the number for the first perturbative layer, meaning m = 4p for N = 256? In that case, what code did you actually run, if the one in this repo only allows m = p? Or is it simply the number of filters, with m = p throughout the network? Then the difference in performance between my model and yours would come from other hyperparameters.
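To make the two readings concrete, here is a small NumPy sketch of how I understand them (purely my interpretation, not code from the repo; all shapes and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, h, w = 4, 8, 16, 16           # input channels, output channels, spatial size
x = rng.standard_normal((p, h, w))  # one input sample

relu = lambda t: np.maximum(t, 0)

# Interpretation A (what the repo allows): m = p, one mask per input channel.
# The intermediate representation has p channels; the 1x1 conv has p*q weights.
masks_a = 0.1 * (2 * rng.random((p, h, w)) - 1)
inter_a = relu(x + masks_a)                        # (p, h, w)
w_a = rng.standard_normal((q, p))
out_a = np.einsum('qp,phw->qhw', w_a, inter_a)     # (q, h, w)

# Interpretation B: m/p masks per input channel, so the intermediate
# representation has m = (m/p) * p channels and the 1x1 conv has m*q weights.
mp = 4                                             # masks per channel
masks_b = 0.1 * (2 * rng.random((p, mp, h, w)) - 1)
inter_b = relu(x[:, None] + masks_b).reshape(p * mp, h, w)  # (m, h, w)
w_b = rng.standard_normal((q, p * mp))
out_b = np.einsum('qm,mhw->qhw', w_b, inter_b)     # (q, h, w)

print(out_a.shape, out_b.shape)  # both (8, 16, 16)
```

Both produce a q-channel output, but B uses p times more masks and linear weights, which is exactly the ambiguity I'm asking about.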
Sorry, I didn't realize before starting this issue that my understanding was so fuzzy, and it got longer than expected. I hope you'll find the time to answer my questions, which in brief are:
- could you clarify how exactly the masks are added to the input and combined together? Which interpretation is the right one?
- what are the new accuracies you get after rerunning the experiments following MK's post?
- did you use this code to run your experiments? If yes (which doesn't seem to be the case for ImageNet), which command should I use to reproduce your results?