Comments (11)

jeonsworld commented on June 6, 2024

Hi,
You can see the zero-initialized linear layer in the following code. A zero-initialized linear layer is also used in Big Transfer (BiT), and here is one more related link.
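A minimal PyTorch sketch of a zero-initialized classification head, assuming hypothetical sizes (hidden_size=768, num_classes=10); it is not the repository's code. With weight and bias set to zero, every logit starts at exactly 0, so the initial softmax is uniform over the classes. The gradient with respect to the weight still depends on the input features, so the head is not stuck at zero and learns during fine-tuning.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
hidden_size, num_classes = 768, 10

head = nn.Linear(hidden_size, num_classes)
# Zero-initialize weight and bias: the new head starts out predicting all-zero logits.
nn.init.zeros_(head.weight)
nn.init.zeros_(head.bias)

cls_features = torch.randn(4, hidden_size)  # stand-in for the [class] token outputs
logits = head(cls_features)
probs = logits.softmax(dim=-1)  # uniform: every entry equals 1 / num_classes
print(logits.abs().sum().item())  # 0.0
```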

from vit-pytorch.

chaoyanghe commented on June 6, 2024

OK. But where do you remove the two linear layers and replace them with a single layer? I only see that you zero-initialized a single linear layer.

chaoyanghe commented on June 6, 2024

Another issue is that you use a Conv2d layer to extract features in the embedding layer. Is this the "hybrid architecture" described in the original paper? Could you also support a version that does not rely on a CNN?

jeonsworld commented on June 6, 2024

As far as I know, the two linear layers in the paper form the MLP head, whose activation function is tanh. You can check that part at the following link (pre_logits and head).
In this repository, we instead create a single linear layer sized to the number of target classes and do not load the pre_logits and head weights.
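A minimal sketch of the two head configurations discussed here, assuming hypothetical sizes (D=768, 1000 pre-training classes, 10 fine-tuning classes). The comments only mirror the names pre_logits and head; this is not the code of this repository or of vit_jax.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: hidden width D, pre-training classes, fine-tuning classes.
D, PRETRAIN_CLASSES, FINETUNE_CLASSES = 768, 1000, 10

# Pre-training: an MLP head with one hidden layer.
pretrain_head = nn.Sequential(
    nn.Linear(D, D),                 # pre_logits
    nn.Tanh(),                       # tanh activation
    nn.Linear(D, PRETRAIN_CLASSES),  # head
)

# Fine-tuning: both layers are dropped and replaced by one fresh Linear layer
# sized to the target classes; no checkpoint weights are loaded into it.
finetune_head = nn.Linear(D, FINETUNE_CLASSES)

x = torch.randn(2, D)  # stand-in for the [class] token representation
print(pretrain_head(x).shape)  # torch.Size([2, 1000])
print(finetune_head(x).shape)  # torch.Size([2, 10])
```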

jeonsworld commented on June 6, 2024

As far as I know, the hybrid model replaces the Conv2d patch embedding with a different backbone. A simple implementation can remove the Conv2d and use the BiT backbone for this part. Please refer to timm for a detailed implementation of the hybrid model.
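A rough sketch of a hybrid embedding, assuming a hypothetical two-layer toy CNN in place of a BiT/ResNet backbone; it is not timm's implementation. It follows the paper's special case of 1x1 patches: the CNN feature map is projected per position to the Transformer width and its spatial dimensions are flattened into a token sequence.

```python
import torch
import torch.nn as nn


class HybridEmbedding(nn.Module):
    """CNN feature map -> 1x1 projection -> (B, N, D) token sequence."""

    def __init__(self, backbone: nn.Module, feature_channels: int, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        # 1x1 "patches" on the feature map: a per-position linear projection.
        self.proj = nn.Conv2d(feature_channels, hidden_size, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                   # (B, C_f, H', W')
        tokens = self.proj(feats)                  # (B, D, H', W')
        return tokens.flatten(2).transpose(1, 2)   # (B, H'*W', D)


# Hypothetical toy backbone standing in for BiT; it downsamples 224 -> 56 -> 14.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),
    nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=3, stride=4, padding=1),
    nn.ReLU(),
)
embed = HybridEmbedding(toy_backbone, feature_channels=256, hidden_size=768)
out = embed(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 196, 768])
```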

ViT uses a Conv2d for patch embedding, and I think not using a CNN at all is a big challenge.

chaoyanghe commented on June 6, 2024

Please check Equation 1 in the original paper. That's the non-hybrid version.
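For reference, Equation (1) of the ViT paper defines the Transformer input. The image is reshaped into N = HW/P^2 flattened patches x_p^i of length P^2·C, E is the shared patch embedding projection, x_class is the learnable [class] token, and E_pos is the position embedding:

```latex
\mathbf{z}_0 = [\mathbf{x}_\text{class};\, \mathbf{x}_p^1\mathbf{E};\, \mathbf{x}_p^2\mathbf{E};\, \cdots;\, \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{pos},
\qquad \mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D},\quad \mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D}
```

Because the same matrix E multiplies every non-overlapping P×P patch, this projection can equivalently be computed as a convolution with kernel size P and stride P.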

jeonsworld commented on June 6, 2024

The current implementation is the same as Equation (1).
Please refer to the following link for the part about Equation (1).
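A minimal sketch of an embedding module that follows Equation (1), assuming ViT-B/16-like hypothetical defaults (224×224 input, P=16, D=768). Only the attribute name patch_embeddings mirrors the line quoted from the repository; the class name and the zero initialization of the parameters are simplifications for illustration, not the repository's code.

```python
import torch
import torch.nn as nn


class PatchEmbeddings(nn.Module):
    """Patch projection + [class] token + position embeddings, as in Equation (1)."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, hidden_size=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # kernel_size == stride == P: the same E is applied to every non-overlapping patch.
        self.patch_embeddings = nn.Conv2d(in_channels, hidden_size,
                                          kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))  # x_class
        self.position_embeddings = nn.Parameter(
            torch.zeros(1, n_patches + 1, hidden_size))               # E_pos

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.shape[0]
        x = self.patch_embeddings(x)             # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D)
        cls = self.cls_token.expand(b, -1, -1)   # (B, 1, D)
        x = torch.cat((cls, x), dim=1)           # (B, N + 1, D)
        return x + self.position_embeddings      # z_0


emb = PatchEmbeddings()
z0 = emb(torch.randn(2, 3, 224, 224))
print(z0.shape)  # torch.Size([2, 197, 768])
```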

chaoyanghe commented on June 6, 2024

Line 144: x = self.patch_embeddings(x)

Is self.patch_embeddings a Conv2d?

jeonsworld commented on June 6, 2024

That's correct!
The Conv2d's kernel_size and stride are both set to the patch size.

chaoyanghe commented on June 6, 2024

@jeonsworld https://github.com/google-research/vision_transformer/blob/0040316f123353eaba186e7be914e58e656cc120/vit_jax/models.py#L215

Okay, now I get your point. Look at this comment: the ViT authors note that s2d (space-to-depth) + embedding is the same as a single Conv operation. I think this is the key. So, in essence, ViT is also a CNN-based model, and we cannot say we should give up CNNs.
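A small numerical check of that equivalence, assuming hypothetical toy sizes. Space-to-depth is done here with Tensor.unfold (a swapped-in PyTorch equivalent, not the vit_jax code): the image is cut into non-overlapping P×P patches, each patch is flattened, and a Linear layer sharing the conv's parameters is applied. The result matches a Conv2d with kernel_size = stride = P.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, C, H, W, P, D = 2, 3, 32, 32, 8, 16  # hypothetical small sizes

conv = nn.Conv2d(C, D, kernel_size=P, stride=P)

# A Linear layer sharing the conv's parameters:
# conv.weight is (D, C, P, P); flattening it gives the (D, C*P*P) weight of the Linear.
linear = nn.Linear(C * P * P, D)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(D, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(B, C, H, W)

# Space-to-depth: non-overlapping P x P patches, each flattened in (C, P, P) order
# so it lines up with the flattened conv weight.
patches = x.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)   # (B, H/P, W/P, C, P, P)
patches = patches.reshape(B, -1, C * P * P)   # (B, N, C*P*P)

tokens_linear = linear(patches)                   # (B, N, D)
tokens_conv = conv(x).flatten(2).transpose(1, 2)  # (B, N, D)

print(torch.allclose(tokens_linear, tokens_conv, atol=1e-5))  # True
```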

jeonsworld commented on June 6, 2024

Thanks for the nice comment. I think this gave me a better insight.
