
Comments (3)

EmilienDupont avatar EmilienDupont commented on May 27, 2024 2

This is a very interesting question. The answer is that I indeed did try both. In the code (as you noticed) I ended up using the KL between two discrete probability distributions, as opposed to two concrete/Gumbel-Softmax distributions. This is also what was used in the Gumbel-Softmax paper; see, e.g., page 6:

> We use a learned categorical prior rather than a Gumbel-Softmax prior in the training objective. Thus, the minimization objective during training is no longer a variational bound if the samples are not discrete. In practice, we find that optimizing this objective in combination with temperature annealing still minimizes actual variational bounds on validation and test sets.

So as the paper notes, it works in practice. However, as you say, it is more theoretically appealing to use the KL between two concrete distributions (as the objective is then an ELBO). I thought that this was especially important in the case of JointVAE since the discrete KL term carries even more weight than in a regular VAE model. I was worried that approximating the KL term with discrete instead of concrete distributions could hinder discrete disentanglement. So I experimented quite a lot with it.

The main problem is that there is no closed form expression for the KL between two concrete distributions. However you play around with it, you will end up with an expectation you cannot compute analytically, so you will have to resort to some form of approximation. I tried various MC estimates of this concrete KL term, but found that all of them had high variance and (experimentally) did not lead to good results. I also tried to directly estimate the gradient through various methods but with no success either. So I ended up using the discrete KL term which yields good results experimentally. However, it would be interesting to explore this further. It might be possible to come up with a low variance estimate of the concrete KL term and its gradients, in which case it would probably make sense to use that instead of the discrete KL term.
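For reference, the discrete KL term that ended up being used has a simple closed form when the prior is uniform over the categories. A minimal NumPy sketch (the function names and the choice of a uniform prior are my own assumptions for illustration, not the exact repo code):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discrete_kl_to_uniform(alphas, eps=1e-12):
    """KL(q || uniform) for a categorical q with probabilities `alphas`.

    Closed form: KL = log K + sum_i q_i log q_i, where K is the number
    of categories. This is the discrete KL term discussed above,
    evaluated on the softmax probabilities themselves rather than on
    concrete/Gumbel-Softmax samples, so no MC estimate is needed.
    """
    k = alphas.shape[-1]
    return np.log(k) + np.sum(alphas * np.log(alphas + eps), axis=-1)

# A uniform q gives KL = 0; a peaked q gives KL close to log K.
q_uniform = np.ones(10) / 10
q_peaked = softmax(np.array([10.0] + [0.0] * 9))
```

The contrast with the concrete KL is exactly the point of the discussion: the expression above is exact and cheap, whereas the KL between two concrete distributions has no closed form.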

Hope this answers your question!

from joint-vae.

sootlasten avatar sootlasten commented on May 27, 2024

Very interesting, thank you for the thorough response. I'm experimenting with explicitly minimizing the total correlation between the latents using the density-ratio trick, and it seems to have a completely different behavior from beta-VAE when it comes to the categorical code, as it does not encode the digit information into the gumbel-softmax variable. I was thinking that it might be due to the specifics of the KL relaxation, I'll keep on investigating.

May I ask a related follow-up question about your code? In the encoder, you pass the logits for the Gumbel-Softmax through a softmax function, yielding normalized alphas, and then use these normalized alphas to sample directly from the Gumbel-Softmax as per equation 10 in the Concrete paper. However, notice that your alphas are then constrained to lie in the range (0, 1), as opposed to the more relaxed range (0, \infty), which could be achieved by treating the linear output of the encoder as \log \alpha directly. What was the reasoning behind this approach?


EmilienDupont avatar EmilienDupont commented on May 27, 2024

Your project seems cool, sounds like you are doing something along the lines of a FactorVAE with discrete codes? I initially bumped into the same problem you mention when I trained beta-VAE models with discrete and continuous codes. When training a regular beta-VAE with a joint gumbel-softmax and gaussian latent, the model tends to ignore the discrete dimensions. I believe this would also happen when training a beta-VAE model which maximizes total correlation. One of the keys to getting the model to use the discrete codes is to build a loss that forces the model to encode information in both the discrete and continuous dimensions. This is what eq. 7 ensures (in the paper this repo is based on) and is a crucial part of building a model that can encode both discrete and continuous information about the data. Using a similar approach may be useful in your case too. Either way, it sounds very cool, do keep me updated if you get good results!
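The capacity-controlled objective being referred to (eq. 7 of the JointVAE paper) pushes each KL term towards a target capacity that grows during training, rather than simply penalizing it. A rough sketch of the idea, assuming a linear capacity schedule (function names and the schedule are my assumptions for illustration):

```python
import numpy as np

def capacity_loss(kl_cont, kl_disc, c_cont, c_disc, gamma=30.0):
    """Capacity-controlled KL penalty in the spirit of eq. 7.

    Each KL term is pulled towards its own target capacity C with an
    absolute-value penalty, which forces the model to place a non-zero
    amount of information in BOTH the continuous and discrete latents
    instead of collapsing the discrete channel.
    """
    return gamma * abs(kl_cont - c_cont) + gamma * abs(kl_disc - c_disc)

def linear_capacity(step, num_steps, c_max):
    """Linearly anneal a capacity from 0 to c_max over training."""
    return min(c_max, c_max * step / num_steps)
```

When the KL terms match their capacities the penalty vanishes, so a KL of zero (an ignored channel) is actively penalized once the capacity has grown past zero.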

To answer your question about the code, the only reason I did that was for convenience. When debugging it was often more intuitive to directly look at values in the (0, 1) range. However, there is no important reason for this and it should work the same without normalizing.
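The claim that normalization makes no difference can be checked directly: normalizing the alphas subtracts a constant (the log normalizer) from every \log \alpha, and softmax is invariant to constant shifts of its input, so the Gumbel-Softmax samples are identical. A small NumPy sketch (function names are mine, not the repo's):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax_sample(log_alpha, gumbel_noise, temperature):
    """Eq. 10 of the Concrete paper: softmax((log alpha + g) / tau)."""
    return softmax((log_alpha + gumbel_noise) / temperature)

rng = np.random.default_rng(0)
logits = rng.normal(size=5)
g = -np.log(-np.log(rng.uniform(size=5)))  # Gumbel(0, 1) noise

# Option 1: treat the raw linear output of the encoder as log alpha.
sample_raw = gumbel_softmax_sample(logits, g, temperature=0.67)

# Option 2: normalize first, so alpha lies in (0, 1), then take logs.
log_alpha_norm = np.log(softmax(logits))
sample_norm = gumbel_softmax_sample(log_alpha_norm, g, temperature=0.67)

# Normalization shifts every log alpha by the same constant, and
# softmax is shift-invariant, so the two samples coincide.
```

This holds for any temperature, which matches the observation that the model should behave the same with or without the normalization.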

