In the original paper, the authors state that " we remove the whole head (two l


Model Architecture For Fine-tuning about vit-pytorch HOT 11 CLOSED

jeonsworld commented on June 6, 2024

Model Architecture For Fine-tuning

from vit-pytorch.

Comments (11)

jeonsworld commented on June 6, 2024

Hi,
You can check the zero-initialized linear layer in the following code. A zero-initialized linear layer is also used in Big Transfer (BiT); I am sharing one more related link.
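
As a minimal sketch of what a zero-initialized classification head looks like in PyTorch (the sizes below are illustrative assumptions, not values taken from this repository):

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions): ViT-Base hidden size and a 10-class target task.
hidden_size, num_classes = 768, 10

# A single linear layer used as the fine-tuning head.
head = nn.Linear(hidden_size, num_classes)

# Zero-initialize both weight and bias, so every logit starts at exactly 0.
nn.init.zeros_(head.weight)
nn.init.zeros_(head.bias)

# Stand-in for the [class] token representations of a batch of 4 images.
cls_token = torch.randn(4, hidden_size)
logits = head(cls_token)
probs = logits.softmax(dim=-1)  # uniform over classes at initialization
```

With this initialization the head predicts a uniform distribution at the start of fine-tuning. Because the head weights are zero, the first backward pass sends no gradient into the encoder, so the head is the first part to move.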

chaoyanghe commented on June 6, 2024

OK. But where do you remove the two linear layers and replace them with a single layer? I only see that you zero-initialize a single linear layer.

chaoyanghe commented on June 6, 2024

Another issue is that you use a Conv2d layer to extract features in the embedding layer. Is this the "hybrid architecture" described in the original paper? Could you also support a version that does not rely on a CNN?

jeonsworld commented on June 6, 2024

As far as I know, the two linear layers in the paper are the MLP head whose activation function is tanh. You can check that part from the following link (pre_logits and head).
Additionally, in this repository, we create a single linear layer whose output size matches the number of target classes, and we do not load the pre-trained weights of pre_logits and head.
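
A rough sketch of that idea, assuming a toy model rather than the actual classes in this repository: `TinyClassifier` and its sizes are hypothetical, and only the `pre_logits` and `head` names follow the comment above.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a ViT: an encoder followed by a classification head."""

    def __init__(self, hidden_size, num_classes, pretraining_head):
        super().__init__()
        self.encoder = nn.Linear(hidden_size, hidden_size)  # placeholder for the Transformer
        self.pretraining_head = pretraining_head
        if pretraining_head:
            # Pre-training head: hidden layer with tanh, then the output layer.
            self.pre_logits = nn.Linear(hidden_size, hidden_size)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.encoder(x)
        if self.pretraining_head:
            x = torch.tanh(self.pre_logits(x))
        return self.head(x)

pretrained = TinyClassifier(16, 1000, pretraining_head=True)
finetune = TinyClassifier(16, 10, pretraining_head=False)

# The new single-layer head starts from zeros instead of pre-trained weights.
nn.init.zeros_(finetune.head.weight)
nn.init.zeros_(finetune.head.bias)

# Copy everything except the pre-training head (pre_logits and head).
encoder_weights = {
    name: tensor
    for name, tensor in pretrained.state_dict().items()
    if not name.startswith(("pre_logits.", "head."))
}
result = finetune.load_state_dict(encoder_weights, strict=False)
```

Here `result.missing_keys` lists only the new head's parameters, which shows that the encoder weights were transferred and the head was left at its zero initialization.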

jeonsworld commented on June 6, 2024

As far as I know, the hybrid model replaces Conv2d with a different backbone. In a simple implementation, you can remove Conv2d and use the BiT backbone for this part. Please refer to timm for a detailed implementation of the hybrid model.

ViT uses Conv2d for patch embedding, and I think not using a CNN at all would be a big challenge.

chaoyanghe commented on June 6, 2024

Please check Equation 1 in the original paper. That's the non-hybrid version.

jeonsworld commented on June 6, 2024

The current implementation is the same as Equation (1).
Please refer to the following link for the part that implements Equation (1).
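
For reference, Equation (1) of the ViT paper builds the input sequence by linearly projecting each flattened patch, prepending a class token, and adding position embeddings. In the paper's notation (P is the patch size, C the number of channels, N the number of patches, D the hidden size):

```latex
\mathbf{z}_0 = \left[\mathbf{x}_\text{class};\; \mathbf{x}_p^1\mathbf{E};\; \mathbf{x}_p^2\mathbf{E};\; \cdots;\; \mathbf{x}_p^N\mathbf{E}\right] + \mathbf{E}_{pos},
\qquad \mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D},\quad \mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D}
```

The projection E is the part the discussion is about: it is a single linear map applied independently to every patch.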

chaoyanghe commented on June 6, 2024

Line 144 x = self.patch_embeddings(x)

Is self.patch_embeddings a Conv2d?

jeonsworld commented on June 6, 2024

That's correct!
Conv2d's kernel_size is the patch size.

chaoyanghe commented on June 6, 2024

@jeonsworld https://github.com/google-research/vision_transformer/blob/0040316f123353eaba186e7be914e58e656cc120/vit_jax/models.py#L215

Okay, now I get your point. Look at this comment: the ViT authors say that s2d (space-to-depth) + embedding is the same as a single Conv operation. I think this is the key. So, in essence, ViT is also a CNN-based model. We cannot say that we should give up on CNNs.
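
The equivalence can be checked numerically. The sketch below is not code from either repository, and the tensor sizes are arbitrary assumptions. It compares a Conv2d whose kernel size and stride both equal the patch size against cutting the image into patches, flattening them, and applying a linear layer with the same weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary illustrative sizes: batch, channels, height, width, patch size, hidden size.
B, C, H, W, P, D = 2, 3, 32, 32, 16, 8
x = torch.randn(B, C, H, W)

# Path 1: patch embedding as a convolution with kernel_size = stride = patch size.
conv = nn.Conv2d(C, D, kernel_size=P, stride=P)
via_conv = conv(x).flatten(2).transpose(1, 2)  # (B, N, D), with N = (H/P) * (W/P)

# Path 2: space-to-depth (cut into non-overlapping PxP patches and flatten each one) ...
patches = x.unfold(2, P, P).unfold(3, P, P)  # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, N, C*P*P)

# ... followed by a linear projection that reuses the convolution's parameters.
linear = nn.Linear(C * P * P, D)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(D, -1))
    linear.bias.copy_(conv.bias)
via_linear = linear(patches)

# The two paths agree up to floating-point rounding.
same = torch.allclose(via_conv, via_linear, atol=1e-4)
print(same, (via_conv - via_linear).abs().max().item())
```

Because the patches do not overlap, this convolution never mixes information across patch boundaries. It is a per-patch linear projection written as a Conv2d, which is why it matches Equation (1) and is different from a multi-layer CNN backbone such as the one in the hybrid model.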

jeonsworld commented on June 6, 2024

Thanks for the nice comment. I think this gave me a better insight.
