
Comments (3)

EmilienDupont avatar EmilienDupont commented on May 27, 2024 2

This is a very interesting question. The answer is that I indeed did try both. In the code (as you noticed) I ended up using the KL between two discrete probability distributions, as opposed to two concrete/Gumbel-Softmax distributions. This is also what was used in the Gumbel-Softmax paper; see, e.g., page 6:

> We use a learned categorical prior rather than a Gumbel-Softmax prior in the training objective. Thus, the minimization objective during training is no longer a variational bound if the samples are not discrete. In practice, we find that optimizing this objective in combination with temperature annealing still minimizes actual variational bounds on validation and test sets.

So as the paper notes, it works in practice. However, as you say, it is more theoretically appealing to use the KL between two concrete distributions (as the objective is then an ELBO). I thought that this was especially important in the case of JointVAE since the discrete KL term carries even more weight than in a regular VAE model. I was worried that approximating the KL term with discrete instead of concrete distributions could hinder discrete disentanglement. So I experimented quite a lot with it.

The main problem is that there is no closed form expression for the KL between two concrete distributions. However you play around with it, you will end up with an expectation you cannot compute analytically, so you will have to resort to some form of approximation. I tried various MC estimates of this concrete KL term, but found that all of them had high variance and (experimentally) did not lead to good results. I also tried to directly estimate the gradient through various methods but with no success either. So I ended up using the discrete KL term which yields good results experimentally. However, it would be interesting to explore this further. It might be possible to come up with a low variance estimate of the concrete KL term and its gradients, in which case it would probably make sense to use that instead of the discrete KL term.
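For reference, the discrete KL term that ended up being used has a simple closed form when the prior is uniform over the categories. A minimal NumPy sketch (the function names and the choice of a uniform prior are my own assumptions for illustration, not the exact repo code):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discrete_kl_to_uniform(alphas, eps=1e-12):
    """KL(q || uniform) for a categorical q with probabilities `alphas`.

    Closed form: KL = log K + sum_i q_i log q_i, where K is the number
    of categories. This is the discrete KL term discussed above,
    evaluated on the softmax probabilities themselves rather than on
    concrete/Gumbel-Softmax samples, so no MC estimate is needed.
    """
    k = alphas.shape[-1]
    return np.log(k) + np.sum(alphas * np.log(alphas + eps), axis=-1)

# A uniform q gives KL = 0; a peaked q gives KL close to log K.
q_uniform = np.ones(10) / 10
q_peaked = softmax(np.array([10.0] + [0.0] * 9))
```

The contrast with the concrete KL is exactly the point of the discussion: the expression above is exact and cheap, whereas the KL between two concrete distributions has no closed form.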

Hope this answers your question!

from joint-vae.

sootlasten avatar sootlasten commented on May 27, 2024

Very interesting, thank you for the thorough response. I'm experimenting with explicitly minimizing the total correlation between the latents using the density-ratio trick, and it seems to have a completely different behavior from beta-VAE when it comes to the categorical code, as it does not encode the digit information into the gumbel-softmax variable. I was thinking that it might be due to the specifics of the KL relaxation, I'll keep on investigating.

May I ask a related follow-up question about your code? In the encoder, you pass the logits for the Gumbel-Softmax through a softmax function, yielding normalized alphas, and then use these normalized alphas to sample directly from the Gumbel-Softmax as per equation 10 in the Concrete paper. However, notice that your alphas are then constrained to lie in the range (0, 1), as opposed to the more relaxed range (0, \infty), which could be achieved by treating the linear output of the encoder as \log \alpha directly. What was the reasoning behind this approach?


EmilienDupont avatar EmilienDupont commented on May 27, 2024

Your project seems cool, sounds like you are doing something along the lines of a FactorVAE with discrete codes? I initially bumped into the same problem you mention when I trained beta-VAE models with discrete and continuous codes. When training a regular beta-VAE with a joint gumbel-softmax and gaussian latent, the model tends to ignore the discrete dimensions. I believe this would also happen when training a beta-VAE model which maximizes total correlation. One of the keys to getting the model to use the discrete codes is to build a loss that forces the model to encode information in both the discrete and continuous dimensions. This is what eq. 7 ensures (in the paper this repo is based on) and is a crucial part of building a model that can encode both discrete and continuous information about the data. Using a similar approach may be useful in your case too. Either way, it sounds very cool, do keep me updated if you get good results!
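The capacity-controlled objective being referred to (eq. 7 of the JointVAE paper) pushes each KL term towards a target capacity that grows during training, rather than simply penalizing it. A rough sketch of the idea, assuming a linear capacity schedule (function names and the schedule are my assumptions for illustration):

```python
import numpy as np

def capacity_loss(kl_cont, kl_disc, c_cont, c_disc, gamma=30.0):
    """Capacity-controlled KL penalty in the spirit of eq. 7.

    Each KL term is pulled towards its own target capacity C with an
    absolute-value penalty, which forces the model to place a non-zero
    amount of information in BOTH the continuous and discrete latents
    instead of collapsing the discrete channel.
    """
    return gamma * abs(kl_cont - c_cont) + gamma * abs(kl_disc - c_disc)

def linear_capacity(step, num_steps, c_max):
    """Linearly anneal a capacity from 0 to c_max over training."""
    return min(c_max, c_max * step / num_steps)
```

When the KL terms match their capacities the penalty vanishes, so a KL of zero (an ignored channel) is actively penalized once the capacity has grown past zero.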

To answer your question about the code, the only reason I did that was for convenience. When debugging it was often more intuitive to directly look at values in the (0, 1) range. However, there is no important reason for this and it should work the same without normalizing.
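The claim that normalization makes no difference can be checked directly: normalizing the alphas subtracts a constant (the log normalizer) from every \log \alpha, and softmax is invariant to constant shifts of its input, so the Gumbel-Softmax samples are identical. A small NumPy sketch (function names are mine, not the repo's):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax_sample(log_alpha, gumbel_noise, temperature):
    """Eq. 10 of the Concrete paper: softmax((log alpha + g) / tau)."""
    return softmax((log_alpha + gumbel_noise) / temperature)

rng = np.random.default_rng(0)
logits = rng.normal(size=5)
g = -np.log(-np.log(rng.uniform(size=5)))  # Gumbel(0, 1) noise

# Option 1: treat the raw linear output of the encoder as log alpha.
sample_raw = gumbel_softmax_sample(logits, g, temperature=0.67)

# Option 2: normalize first, so alpha lies in (0, 1), then take logs.
log_alpha_norm = np.log(softmax(logits))
sample_norm = gumbel_softmax_sample(log_alpha_norm, g, temperature=0.67)

# Normalization shifts every log alpha by the same constant, and
# softmax is shift-invariant, so the two samples coincide.
```

This holds for any temperature, which matches the observation that the model should behave the same with or without the normalization.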

