
Comments (3)

rinongal commented on July 26, 2024

Every place where we use CLIP, we use the same weighted combination of the two models, yes. In practice, for many of our results (as you saw in the supp table), we set the weight of one of the models to 0, which effectively means we used just one model.

The ViT-B/32 model has a larger patch size, so it focuses less on local content and more on global attributes like style. The ViT-B/16 model helps somewhat when you want to improve smaller-scale attributes like shape. There's also a ViT-L/14 model, but it almost always makes the results worse :) You can add it to help improve identity preservation, but you'll probably want to give it a low weight.
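To make the weighting concrete, here is a minimal sketch of how a weighted combination of CLIP models could enter a directional loss. The weight dictionary, function names, and preprocessing assumptions are illustrative rather than the repository's exact code; images are assumed to already be resized and normalized for CLIP (224×224).

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative per-model weights; setting a weight to 0.0 effectively
# drops that model from every loss computation, as described above.
model_weights = {"ViT-B/32": 1.0, "ViT-B/16": 0.0}
clip_models = {name: clip.load(name, device=device)[0] for name in model_weights}

def directional_loss(model, img_src, img_gen, text_src, text_tgt):
    """1 - cosine similarity between the image edit direction and the
    text edit direction in a single CLIP model's embedding space."""
    tokens = clip.tokenize([text_src, text_tgt]).to(device)
    text_feat = model.encode_text(tokens).float()
    img_feat = model.encode_image(torch.cat([img_src, img_gen])).float()
    text_dir = text_feat[1] - text_feat[0]
    img_dir = img_feat[1] - img_feat[0]
    return 1.0 - F.cosine_similarity(img_dir, text_dir, dim=-1)

def combined_loss(img_src, img_gen, text_src, text_tgt):
    """Weighted sum of the per-model directional losses."""
    total = 0.0
    for name, weight in model_weights.items():
        if weight == 0.0:  # weight 0 => the model contributes nothing
            continue
        total = total + weight * directional_loss(
            clip_models[name], img_src, img_gen, text_src, text_tgt)
    return total
```

With a single nonzero weight, the sum collapses to one model's loss, matching the single-model case discussed above.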

from stylegan-nada.

rinongal commented on July 26, 2024

The entire pipeline could have been implemented using any of the available CLIP models, or a mix thereof. Setting the weight of the ViT-B/16 CLIP model to 0.0 just means that it did not contribute to any of the loss / layer selection calculations. The other CLIP model (ViT-B/32) would still be used, and you could simply rank the layers according to its output (rather than a weighted sum of the outputs of several CLIP models).

The instances where adaptive layer selection is off are the instances where the number of selected layers is the same as the number of W-code inputs for the model (e.g. 18 for the 1024x1024 FFHQ model).
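For intuition, here is a rough sketch of the layer-ranking step, under the assumption that layers are scored by how far a short CLIP-guided latent optimization (driven by the weighted loss above) moves each W+ code; the function name and tensor shapes are hypothetical.

```python
import torch

def select_trainable_layers(w_before, w_after, k):
    """Rank W+ layers by how much a short CLIP-guided latent optimization
    moved them, and return the indices of the k most-changed layers.

    w_before, w_after: [batch, n_layers, 512] W+ codes before / after the
    optimization. If k == n_layers (e.g. 18 for the 1024x1024 FFHQ
    generator), every layer is selected and adaptive selection is
    effectively off.
    """
    per_layer_change = (w_after - w_before).norm(dim=-1).mean(dim=0)  # [n_layers]
    return torch.topk(per_layer_change, k).indices
```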

It does make sense to use values other than 1.0 and 0.0. Each CLIP model leads to different visual effects. Models with smaller patch sizes (16, 14) lead to better identity preservation, while a larger patch size (32) typically gives a better representation of styles. You can use values between 0.0 and 1.0 to interpolate between these preferences and decide how much importance you want to place on each model.

If you're only using one CLIP model, then you are correct that you may as well just use 1.0 or 0.0 and play with the scale of the loss instead.
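As a hypothetical illustration of the last two points (the dictionary keys and the lambda_clip name are placeholders, not the repository's actual option names):

```python
# Fractional weights interpolate between the models' preferences, e.g.
# mostly style (ViT-B/32) with some shape/identity preservation (ViT-B/16):
model_weights = {"ViT-B/32": 0.7, "ViT-B/16": 0.3}

# With a single model, only the product weight * loss_scale matters,
# so a weight of 1.0 plus a tuned overall loss scale is sufficient:
model_weights = {"ViT-B/32": 1.0}
lambda_clip = 2.0  # overall scale of the CLIP loss term
```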

from stylegan-nada.

lebeli commented on July 26, 2024

I see, there was a misunderstanding on my part. So for both the global and the directional loss, you use the same two CLIP models (ViT-B/32 and ViT-B/16)? And for both losses, do you sum the individual CLIP losses from ViT-B/32 and ViT-B/16?

Edit: One last question. Does the big CLIP model focus more on global features and the smaller one more on local features? Or what is the difference?

from stylegan-nada.
