Comments (16)
@BridgetteSong You're right: the posterior is Gaussian, and the prior is the product of a Gaussian and the Jacobian determinant.
Let me explain the KL loss in detail. For brevity, and without loss of generality, I'll assume the channel dimension of the latent variables is one.
The KL divergence is the mean of the difference of log probabilities, as follows:
- mean(log(q(z|x))) - mean(log(p(z|c))), where z ~ q(z|x)
As q(z|x) is Gaussian, we can calculate the mean of log(q(z|x)) in closed form; it is the negative entropy of a Gaussian (see https://en.wikipedia.org/wiki/Normal_distribution):
- mean of log(q(z|x)) = negative entropy of q(z|x) = -logs_q - 0.5 - 0.5 * log(2*pi)
On the other hand, the mean of log(p(z|c)) has no closed-form solution, so we have to calculate log(p(z|c)) for each sampled z and then average:
- log(p(z|c)) = log(N(f(z)|m_p, logs_p)) + logdet(df/dz), where f is a normalizing flow.
As we constrain the normalizing flow of the prior distribution to be volume-preserving, using shift-only (i.e., mean-only) operations in its coupling layers, the Jacobian determinant of the prior flow is one (see
Line 449 in 2e561ba
and
Lines 179 to 209 in 2e561ba
):
- log(p(z|c)) = log(N(f(z)|m_p, logs_p)) + 0 = -logs_p - 0.5 * log(2*pi) - 0.5 * exp(-2 * logs_p) * (f(z) - m_p) ** 2
Then, kl, the average of (negative entropy of q(z|x) - log(p(z|c))), is:
- kl = logs_p - logs_q - 0.5 + 0.5 * exp(-2 * logs_p) * (f(z) - m_p) ** 2, where f(z) is z_p in our code.
This is the explanation of the KL loss (Lines 57 to 60 in 2e561ba).
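Putting the pieces together, here is a minimal sketch of that computation in PyTorch (the argument names follow the snippets quoted later in this thread; the masking and reduction details are my assumption rather than the exact repository code):

```python
import torch

def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
    # Per-element term derived above, for a single sample z_p = f(z), z ~ q(z|x):
    # kl = logs_p - logs_q - 0.5 + 0.5 * exp(-2*logs_p) * (z_p - m_p)**2
    kl = logs_p - logs_q - 0.5
    kl += 0.5 * torch.exp(-2.0 * logs_p) * (z_p - m_p) ** 2
    # Average over the valid (unmasked) positions only.
    return torch.sum(kl * z_mask) / torch.sum(z_mask)
```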
from vits.
Good point! When the channel dimension of the latent variables is one, the prior is Gaussian whenever the Jacobian determinant is constant.
However, when the channel dimension of the latent variables exceeds one, this is no longer true.
For example, let (x1, x2) ~ N((0, 0), I), then transform it into (y1, y2) = (x1, cos(x1) + x2).
Because of the non-linear transformation, the joint distribution of (y1, y2) is not Gaussian.
However, the Jacobian determinant is still one, as the first-order derivatives are: dy1/dx1 = 1, dy1/dx2 = 0, dy2/dx1 = -sin(x1), dy2/dx2 = 1.
The normalizing flow of the prior likewise provides non-linear transformations through neural networks while maintaining a constant Jacobian determinant, resulting in a non-Gaussian prior distribution. If the prior's normalizing flow only allowed linear transformations, or if the channel dimension of the latent variables were one, you could use the KL divergence between two Gaussians. In general, however, you cannot.
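As a quick numerical check of that example, here is a small simulation sketch (my own illustration with NumPy, not part of the repository) showing a volume-preserving transform that destroys joint Gaussianity:

```python
import numpy as np

# (x1, x2) ~ N((0, 0), I), then (y1, y2) = (x1, cos(x1) + x2).
# The Jacobian is [[1, 0], [-sin(x1), 1]], so its determinant is 1 everywhere.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(1_000_000)
x2 = rng.standard_normal(1_000_000)
y1, y2 = x1, np.cos(x1) + x2

# If (y1, y2) were jointly Gaussian, zero correlation would imply independence.
# Here y1 and y2 are (nearly) uncorrelated, yet y2 clearly depends on y1 via cos(y1).
print(np.corrcoef(y1, y2)[0, 1])          # ~ 0
print(np.corrcoef(np.cos(y1), y2)[0, 1])  # clearly non-zero
```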
from vits.
Hi @BridgetteSong. Yes, the (closed-form) KL divergence between two Gaussians is different from our KL loss. That's because we use the KL divergence between a Gaussian and a normalizing flow rather than between two Gaussians, so there is no closed-form KL as there is for two Gaussians. Equation 4 of our paper shows that the prior distribution is not Gaussian.
If you're not familiar with normalizing flows, or don't know how to calculate their log-likelihood (which is needed for calculating the KL), it would be better to look at these blog posts first: nf1 and nf2. They are great, illustrative posts about normalizing flows and include model implementations.
from vits.
Hi, I'm a bit confused by "The KL divergence is the mean of the difference of log probabilities, as follows: mean(log(q(z|x))) - mean(log(p(z|c))), where z ~ q(z|x)". Doesn't the KL divergence need integration? You directly set kl = mean(log(q(z|x))) - mean(log(p(z|c))); is this an approximate formula? In that case, isn't it more convenient to calculate the negative log-likelihood based on z_p (obtained from the flow transformation), m_p, and logs_p?
@yanggeng1995 I will give some additional supplements:
- KL_loss = ∫ q(z|x) * (log(q(z|x)) - log(p(z|c))) dz = ∫ q(z|x) * log(q(z|x)) dz - ∫ q(z|x) * log(p(z|c)) dz
- As q(z|x) is Gaussian, ∫ q(z|x) * log(q(z|x)) dz = -logs_q - 0.5 - 0.5 * log(2*pi).
- We can't compute ∫ q(z|x) * log(p(z|c)) dz directly, so we approximate it by sampling: we draw some z values and average them. In VAE code, sampling a single z is usually enough, so ∫ q(z|x) * log(p(z|c)) dz ≈ mean(log(p(z|c))) = log(p(z|c)) (see the sketch below).
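To see why a single sample per data point is usually fine once you average over a batch, here is a small illustrative sketch (my own example, not repository code) comparing the single-sample Monte Carlo estimate with a closed-form KL in a case where the closed form exists:

```python
import torch

# Two 1-D Gaussians, so the exact KL is available for comparison.
m_q, logs_q = torch.tensor(0.3), torch.tensor(-0.2)
m_p, logs_p = torch.tensor(0.0), torch.tensor(0.1)
q = torch.distributions.Normal(m_q, logs_q.exp())
p = torch.distributions.Normal(m_p, logs_p.exp())

z = q.sample((100_000,))                        # one z per "data point"
mc_kl = (q.log_prob(z) - p.log_prob(z)).mean()  # average of single-sample estimates
exact_kl = torch.distributions.kl_divergence(q, p)
print(mc_kl.item(), exact_kl.item())            # the two values nearly coincide
```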
from vits.
Thank you very much for your patience and detailed answer; I got it.
from vits.
Thank you very much again. I totally understand now; I learned a lot from your detailed answer.
from vits.
Hi @jaywalnut310, thanks for your detailed answer. Very helpful! I'd like to ask two more questions, and I'd appreciate it if I could have your answers.
As we constrain the normalizing flow of the prior distribution to be volume-preserving, using shift-only (i.e., mean-only) operations in its coupling layers, the Jacobian determinant of the prior flow is one (see ...)
You mentioned that you set up the normalizing flow to be volume-preserving. Did this choice benefit the model? In my understanding, it could be replaced by a more complex, non-volume-preserving flow.
The KL divergence is the mean of the difference of log probabilities, as follows:
mean(log(q(z|x))) - mean(log(p(z|c))), where z ~ q(z|x)
As far as I know, the KL divergence lies in the range [0, +inf). But according to your formula, couldn't its value be negative? (ref: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8)
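A small sketch of what I mean (my own toy example, not repository code): the true KL is non-negative, but an individual single-sample term can dip below zero when the two distributions are close.

```python
import torch

q = torch.distributions.Normal(0.0, 1.0)
p = torch.distributions.Normal(0.05, 1.0)      # prior very close to the posterior

z = q.sample((100_000,))
per_sample = q.log_prob(z) - p.log_prob(z)     # single-sample KL estimates
print(per_sample.mean().item())                # >= 0 up to Monte Carlo noise
print((per_sample < 0).float().mean().item())  # yet a sizable fraction is negative
```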
from vits.
@BridgetteSong Thanks for your answer. There is another question, why not calculate the negative log-likelihood of Gaussian distribution based on z_p, m_p and logs_p, isn't it more convenient?
from vits.
@yanggeng1995 It is very convenient to compute ∫ q(z|x) * log(q(z|x)) dz, since q(z|x) is Gaussian. And ∫ q(z|x) * log(p(z|c)) dz is also convenient to compute once you understand the sampling approximation: ∫ q(z|x) * log(p(z|c)) dz ≈ log(p(z|c)).
p(z|c) is the product of a Gaussian and the Jacobian determinant. To compute log(p(z|c)), we first sample z from the posterior, then get z_p by passing z through the normalizing flow, and finally use z_p to compute the log-likelihood under the prior Gaussian N(m_p, logs_p).
So log(p(z|c)) = logdet(df/dz) + log(N(z_p|m_p, logs_p)) = 0 - logs_p - 0.5 * log(2*pi) - 0.5 * exp(-2 * logs_p) * (z_p - m_p) ** 2.
I think it would also be valid to directly use kl_loss ≈ log(q(z|x)) - log(p(z|c)), with log(q(z|x)) = -logs_q - 0.5 * log(2*pi) - 0.5 * exp(-2 * logs_q) * (z - m_q) ** 2 where z ~ N(m_q, logs_q), and log(p(z|c)) as above (see the sketch below). I just think the author's method is more concise and more accurate.
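Here is a small side-by-side sketch of the two variants (my own illustration, not repository code; the arguments are assumed to be elementwise tensors, e.g. of shape [batch, channels, time]):

```python
import math
import torch

LOG_2PI = math.log(2.0 * math.pi)

def kl_closed_form_q(z_p, logs_q, m_p, logs_p):
    # Author's variant: exact negative entropy for the q-term,
    # single-sample estimate for the p-term (z_p = f(z), z ~ q(z|x)).
    return logs_p - logs_q - 0.5 + 0.5 * torch.exp(-2.0 * logs_p) * (z_p - m_p) ** 2

def kl_sampled_q(z, z_p, m_q, logs_q, m_p, logs_p):
    # Alternative: both terms estimated from the same sample z.
    log_q = -logs_q - 0.5 * LOG_2PI - 0.5 * torch.exp(-2.0 * logs_q) * (z - m_q) ** 2
    log_p = -logs_p - 0.5 * LOG_2PI - 0.5 * torch.exp(-2.0 * logs_p) * (z_p - m_p) ** 2
    return log_q - log_p
```

Both are unbiased single-sample estimators of the same KL; the first has lower variance because the q-term is computed in closed form, which is presumably what "more accurate" refers to above.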
from vits.
Thank you for your reply.
According to my understanding, the posterior distribution is Gaussian, and the prior distribution is the product of the prior Gaussian and the absolute value of the Jacobian determinant (Equation 4). So the KL loss would be the following:
1. q(z|x) = torch.distributions.normal.Normal(m_q, exp(logs_q))
2. p(z|c) = torch.distributions.normal.Normal(m_p, exp(logs_p)) * torch.abs(jacobian determinant)
3. kl_loss = torch.distributions.kl.kl_divergence(q(z|x), p(z|c))
Is my understanding right? And is this kl_loss equal to your KL loss?
I would appreciate it if you could give a detailed explanation.
from vits.
BTW, since the prior is the product of a Gaussian and the Jacobian determinant, and considering the properties of the Gaussian distribution (if X ~ N(u, σ**2), then aX + b ~ N(au + b, (aσ)**2)), the prior is always a Gaussian distribution when the Jacobian determinant is a constant. So can we calculate the KL divergence as the two-Gaussian KL divergence mentioned above, or use the torch API to get it directly, like this (see the sketch below)?
kl_loss = torch.distributions.kl.kl_divergence(q(z|x), p(z|c))
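For reference, here is a small illustration (my own sketch, not repository code) of what that closed-form two-Gaussian KL computes. With an identity (or, more generally, affine) flow its value agrees, in expectation, with the per-sample expression derived earlier in the thread; with the non-linear shift-only coupling flow the prior is no longer Gaussian, so no such closed form applies:

```python
import torch

m_q, logs_q = torch.tensor(0.7), torch.tensor(-0.3)
m_p, logs_p = torch.tensor(0.0), torch.tensor(0.2)
q = torch.distributions.Normal(m_q, logs_q.exp())
p = torch.distributions.Normal(m_p, logs_p.exp())

# Closed-form KL between the two Gaussians, via the torch API.
closed_form = torch.distributions.kl.kl_divergence(q, p)

# Expectation of the per-sample expression with an identity flow (z_p = z):
# it converges to the same value.
z = q.sample((200_000,))
per_sample = logs_p - logs_q - 0.5 + 0.5 * torch.exp(-2.0 * logs_p) * (z - m_p) ** 2
print(closed_form.item(), per_sample.mean().item())
```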
from vits.
How about taking the absolute value to prevent the KL loss from being negative?
--- a/losses.py
+++ b/losses.py
@@ -54,7 +54,7 @@ def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
logs_p = logs_p.float()
z_mask = z_mask.float()
- kl = logs_p - logs_q - 0.5
+ kl = torch.abs(logs_p - logs_q - 0.5)
from vits.
How about taking the absolute value to prevent the KL loss from being negative?
Hi, does this work?
from vits.
@980202006 It will not work. Usually, the KL loss will not be negative if your inputs and network are right. When kl_loss < 0, it means your prior distribution is almost the same as the posterior distribution, so the posterior fails to learn a sufficiently complicated distribution.
When kl_loss < 0, the first thing you should do is check your inputs and network. If you must add a constraint to the loss formula, apply it to the whole expression, not just the first term, like this:
kl = logs_p - logs_q - 0.5
kl += 0.5 * ((z_p - m_p)**2) * torch.exp(-2. * logs_p)
kl = torch.clamp(kl, min=0.0)
But usually you don't need this constraint: if your KL loss goes negative, it means training has gone wrong, and adding the constraint won't give you correct results.
from vits.
@BridgetteSong Hi, I want to know why mean(log(p(z|c))) = log(p(z|c)). Why is sampling just one z enough?
from vits.