The coatnet-pytorch from chinhsuanwu

Hi,
I noticed that the value of self.relative_bias_table is always all 0, then the following:
relative_bias = self.relative_bias_table.gather(
0, self.relative_index.repeat(1, self.heads))
is actually meaningless (it is all 0)?
Thanks!

About the # params

Hey, I tried with your implementation, and I found the calculated #param is a little bit different from the paper, and I am curious about the reason, could you please help me out?

Take coatnet_0 for example, the calculated result is 17789624 ( 17789624 / 2^20 = 16.97), and the reported #param of the paper is 25M

Thanks in advance

Love from China

very good work!

About CoAtNet-6 and CoAtNet7

Hello, first off, really appreciate your work! Now, how can I use coatnet-6 and coatnet-7? Is it adding a sequential of a 'C' block followed by a 'T' block at s3?

Models seem not converging.

Hi,

I tried to train CoAtNet_0 with tiny image net from cs231n (200 classes). Seems the model does not converge.

Could it be that the implementation is not 100% correct? For example, the positional embedding indexing part.
I went through the code and I think other components should be correct.

Except for the pos embedding indexing, I'm not good enough to comprehend it. Do you have a reference for the implementation of the positional embedding indexing part?

Model Pre-Trained CoAtNet

Are there any pre-trained models for CoAtNet? (e.g: ImageNet [1k, 21k], COCO, ...)

When will you push your entire code?

About the stochastic depth

Hi, I can't found the code about stochastic depth in your implementation.

And I add the stochastic depth code and train a CoAtNet-Tiny on ImageNet 1k, but got 79.27%@top1.

Have you reproduce the results reported by the paper?

Inconsistencies in MBConv, with corrected code provided

I've found the MBConv to have some computational inconsistencies. The following corrected code works, where I've changed the stride of the projection operation (self.proj) and moved it out of the if downsample statement. Further, the squeeze and excite block has been appropriately initialized (I've added my squeeze and excite block too here for completeness). I've also added the channel projection operation on the downsample is false branch of MBConv forward method:

class SqueezeAndExcite(nn.Module):
    def __init__(self, in_channels, expansion=0.25): # keep the reduction fixed
        super().__init__()
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(in_channels, int(in_channels * expansion)),
            nn.GELU(),
            nn.Linear(int(in_channels * expansion), in_channels),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avgpool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        return x * y.expand_as(x)

class MBConv(nn.Module):
    def __init__(self, inp, oup, expansion, downsample):
        super().__init__()
        self.downsample = downsample
        stride = 1 if not downsample else 2
        hidden_dim = int(expansion * inp)

        if self.downsample:
            self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.proj = nn.Conv2d(inp, oup, kernel_size=1, stride=1, padding=0, bias=False)

        if expansion == 1:
            self.conv = nn.Sequential(
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=stride, 
                          padding=1, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.GELU(),
                nn.Conv2d(hidden_dim, oup, kernel_size=1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(oup)
            )
        else:
            self.conv = nn.Sequential(
                nn.Conv2d(inp, hidden_dim, kernel_size=1, stride=stride, padding=0, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.GELU(),
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=1, 
                          padding=1, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.GELU(),
                SqueezeAndExcite(hidden_dim, expansion=0.25),
                nn.Conv2d(hidden_dim, oup, kernel_size=1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(oup)
            )

        self.conv = PreNorm(norm=nn.BatchNorm2d, model=self.conv, dimension=inp)

    def forward(self, x):
        if self.downsample:
            return self.proj(self.pool(x)) + self.conv(x)
        else:
            return self.proj(x) + self.conv(x)

dots + relative positon bias having issue with dimension mismatch in Transformer Block!!!

I try to run the CoAtNet wihtout modifying anything,

def coatnet_0(): num_blocks = [1, 1, 1, 1, 1] # L channels = [64, 96, 192, 384, 768] # D return CoAtNet((50, 50), 3, num_blocks, channels, num_classes=3)

img = torch.randn(1, 3, 50, 50)

net = coatnet_0() out = net(img) print(out.shape)

The error is that the dot is computed from downsampled image,
and the relative bias position is calculated from original image, i do not think that atteniton mechanism is miscalcuting the dots and relative position bias.
i think the issue is from how attention is iplement in Transformer Block, somehow the attention in transformer block is not utilizing downsampled image for relative position bias, instead it is calculated based on original image.
how can i solve this issue?

RuntimeError Traceback (most recent call last)
Cell In[15], line 1
----> 1 out = net(img)
2 print(out.shape)