Comments (16)
@BridgetteSong You're right: the posterior is Gaussian, and the prior is the product of a Gaussian and the Jacobian determinant.
Let me explain the KL loss in detail. For brevity, and without loss of generality, I'll assume the channel dimension of the latent variables is one.
The KL divergence is the mean of the difference of log probabilities, as follows:
- mean(log(q(z|x))) - mean(log(p(z|c))), where z ~ q(z|x)
As q(z|x) is Gaussian, we can calculate the mean of log(q(z|x)) in closed form; it is the negative entropy of a Gaussian (see https://en.wikipedia.org/wiki/Normal_distribution):
- mean of log(q(z|x)) = negative entropy of q(z|x) = -logs_q - 0.5 - 0.5 * log(2*pi)
On the other hand, the mean of log(p(z|c)) has no closed-form solution, so we have to calculate log(p(z|c)) for each sampled z and then average:
- log(p(z|c)) = log(N(f(z)|m_p, logs_p)) + logdet(df/dz), where f is a normalizing flow.
As we constrain the normalizing flow of the prior distribution to be volume-preserving, using shift-only (i.e., mean-only) operations in its coupling layers, the Jacobian determinant of the prior flow is one (see
Line 449 in 2e561ba
and
Lines 179 to 209 in 2e561ba
):
- log(p(z|c)) = log(N(f(z)|m_p, logs_p)) + 0 = -logs_p - 0.5 * log(2*pi) - 0.5 * exp(-2 * logs_p) * (f(z) - m_p) ** 2
Then, kl, the average of (negative entropy of q(z|x) - log(p(z|c))), is:
- kl = logs_p - logs_q - 0.5 + 0.5 * exp(-2 * logs_p) * (f(z) - m_p) ** 2, where f(z) is z_p in our code.
This is the explanation of the KL loss (Lines 57 to 60 in 2e561ba).
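Putting the pieces together, here is a minimal sketch of that computation in PyTorch (the argument names follow the snippets quoted later in this thread; the masking and reduction details are my assumption rather than the exact repository code):

```python
import torch

def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
    # Per-element term derived above, for a single sample z_p = f(z), z ~ q(z|x):
    # kl = logs_p - logs_q - 0.5 + 0.5 * exp(-2*logs_p) * (z_p - m_p)**2
    kl = logs_p - logs_q - 0.5
    kl += 0.5 * torch.exp(-2.0 * logs_p) * (z_p - m_p) ** 2
    # Average over the valid (unmasked) positions only.
    return torch.sum(kl * z_mask) / torch.sum(z_mask)
```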
from vits.
Good point! When the channel dimension of the latent variables is one, the prior is Gaussian whenever the Jacobian determinant is constant.
However, when the channel dimension of the latent variables exceeds one, this is no longer true.
For example, let (x1, x2) ~ N((0, 0), I), then transform it into (y1, y2) = (x1, cos(x1) + x2).
Because of the non-linear transformation, the joint distribution of (y1, y2) is not Gaussian.
However, the Jacobian determinant is still one, as the first-order derivatives are: dy1/dx1 = 1, dy1/dx2 = 0, dy2/dx1 = -sin(x1), dy2/dx2 = 1.
The normalizing flow of the prior likewise provides non-linear transformations through neural networks while maintaining a constant Jacobian determinant, resulting in a non-Gaussian prior distribution. If the prior's normalizing flow only allowed linear transformations, or if the channel dimension of the latent variables were one, you could use the KL divergence between two Gaussians. In general, however, you cannot.
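As a quick numerical check of that example, here is a small simulation sketch (my own illustration with NumPy, not part of the repository) showing a volume-preserving transform that destroys joint Gaussianity:

```python
import numpy as np

# (x1, x2) ~ N((0, 0), I), then (y1, y2) = (x1, cos(x1) + x2).
# The Jacobian is [[1, 0], [-sin(x1), 1]], so its determinant is 1 everywhere.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(1_000_000)
x2 = rng.standard_normal(1_000_000)
y1, y2 = x1, np.cos(x1) + x2

# If (y1, y2) were jointly Gaussian, zero correlation would imply independence.
# Here y1 and y2 are (nearly) uncorrelated, yet y2 clearly depends on y1 via cos(y1).
print(np.corrcoef(y1, y2)[0, 1])          # ~ 0
print(np.corrcoef(np.cos(y1), y2)[0, 1])  # clearly non-zero
```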
from vits.
Hi @BridgetteSong. Yes, the (closed-form) KL divergence between two Gaussians is different from our KL loss. That's because we use the KL divergence between a Gaussian and a normalizing flow rather than between two Gaussians, so there is no closed-form KL as there is for two Gaussians. Equation 4 of our paper shows that the prior distribution is not Gaussian.
If you're not familiar with normalizing flows, or don't know how to calculate their log-likelihood (which is needed for calculating the KL), it would be better to look at these blog posts first: nf1 and nf2. They are great, illustrative posts about normalizing flows and include model implementations.
from vits.
Hi, I'm a bit confused by "The KL divergence is the mean of the difference of log probabilities, as follows: mean(log(q(z|x))) - mean(log(p(z|c))), where z ~ q(z|x)". Doesn't the KL divergence need integration? You directly set kl = mean(log(q(z|x))) - mean(log(p(z|c))); is this an approximate formula? In that case, isn't it more convenient to calculate the negative log-likelihood based on z_p (obtained from the flow transformation), m_p, and logs_p?
@yanggeng1995 I will give some additional supplements:
- KL_loss = ∫ q(z|x) * (log(q(z|x)) - log(p(z|c))) dz = ∫ q(z|x) * log(q(z|x)) dz - ∫ q(z|x) * log(p(z|c)) dz
- As q(z|x) is Gaussian, ∫ q(z|x) * log(q(z|x)) dz = -logs_q - 0.5 - 0.5 * log(2*pi).
- We can't compute ∫ q(z|x) * log(p(z|c)) dz directly, so we approximate it by sampling: we draw some z values and average them. In VAE code, sampling a single z is usually enough, so ∫ q(z|x) * log(p(z|c)) dz ≈ mean(log(p(z|c))) = log(p(z|c)) (see the sketch below).
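To see why a single sample per data point is usually fine once you average over a batch, here is a small illustrative sketch (my own example, not repository code) comparing the single-sample Monte Carlo estimate with a closed-form KL in a case where the closed form exists:

```python
import torch

# Two 1-D Gaussians, so the exact KL is available for comparison.
m_q, logs_q = torch.tensor(0.3), torch.tensor(-0.2)
m_p, logs_p = torch.tensor(0.0), torch.tensor(0.1)
q = torch.distributions.Normal(m_q, logs_q.exp())
p = torch.distributions.Normal(m_p, logs_p.exp())

z = q.sample((100_000,))                        # one z per "data point"
mc_kl = (q.log_prob(z) - p.log_prob(z)).mean()  # average of single-sample estimates
exact_kl = torch.distributions.kl_divergence(q, p)
print(mc_kl.item(), exact_kl.item())            # the two values nearly coincide
```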
from vits.
Thank you very much for your patience and detailed answer; I got it.
from vits.
Thank you very much again. I totally understand now; I learned a lot from your detailed answer.
from vits.
Hi @jaywalnut310, thanks for your detailed answer. Very helpful! I'd like to ask two more questions, and I'd appreciate it if I could have your answers.
As we constrain the normalizing flow of the prior distribution to be volume-preserving, using shift-only (i.e., mean-only) operations in its coupling layers, the Jacobian determinant of the prior flow is one (see ...)
You mentioned that you set up the normalizing flow to be volume-preserving. Did this choice benefit the model? In my understanding, it could be replaced by a more complex, non-volume-preserving flow.
The KL divergence is the mean of the difference of log probabilities, as follows:
mean(log(q(z|x))) - mean(log(p(z|c))), where z ~ q(z|x)
As far as I know, the KL divergence lies in the range [0, +inf). But according to your formula, couldn't its value be negative? (ref: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-understanding-kl-divergence-2b382ca2b2a8)
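A small sketch of what I mean (my own toy example, not repository code): the true KL is non-negative, but an individual single-sample term can dip below zero when the two distributions are close.

```python
import torch

q = torch.distributions.Normal(0.0, 1.0)
p = torch.distributions.Normal(0.05, 1.0)      # prior very close to the posterior

z = q.sample((100_000,))
per_sample = q.log_prob(z) - p.log_prob(z)     # single-sample KL estimates
print(per_sample.mean().item())                # >= 0 up to Monte Carlo noise
print((per_sample < 0).float().mean().item())  # yet a sizable fraction is negative
```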
from vits.
@BridgetteSong Thanks for your answer. There is another question, why not calculate the negative log-likelihood of Gaussian distribution based on z_p, m_p and logs_p, isn't it more convenient?
from vits.
@yanggeng1995 It is very convenient to compute ∫ q(z|x) * log(q(z|x)) dz, since q(z|x) is Gaussian. And ∫ q(z|x) * log(p(z|c)) dz is also convenient to compute once you understand the sampling approximation: ∫ q(z|x) * log(p(z|c)) dz ≈ log(p(z|c)).
p(z|c) is the product of a Gaussian and the Jacobian determinant. To compute log(p(z|c)), we first sample z from the posterior, then get z_p by passing z through the normalizing flow, and finally use z_p to compute the log-likelihood under the prior Gaussian N(m_p, logs_p).
So log(p(z|c)) = logdet(df/dz) + log(N(z_p|m_p, logs_p)) = 0 - logs_p - 0.5 * log(2*pi) - 0.5 * exp(-2 * logs_p) * (z_p - m_p) ** 2.
I think it would also be valid to directly use kl_loss ≈ log(q(z|x)) - log(p(z|c)), with log(q(z|x)) = -logs_q - 0.5 * log(2*pi) - 0.5 * exp(-2 * logs_q) * (z - m_q) ** 2 where z ~ N(m_q, logs_q), and log(p(z|c)) as above (see the sketch below). I just think the author's method is more concise and more accurate.
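Here is a small side-by-side sketch of the two variants (my own illustration, not repository code; the arguments are assumed to be elementwise tensors, e.g. of shape [batch, channels, time]):

```python
import math
import torch

LOG_2PI = math.log(2.0 * math.pi)

def kl_closed_form_q(z_p, logs_q, m_p, logs_p):
    # Author's variant: exact negative entropy for the q-term,
    # single-sample estimate for the p-term (z_p = f(z), z ~ q(z|x)).
    return logs_p - logs_q - 0.5 + 0.5 * torch.exp(-2.0 * logs_p) * (z_p - m_p) ** 2

def kl_sampled_q(z, z_p, m_q, logs_q, m_p, logs_p):
    # Alternative: both terms estimated from the same sample z.
    log_q = -logs_q - 0.5 * LOG_2PI - 0.5 * torch.exp(-2.0 * logs_q) * (z - m_q) ** 2
    log_p = -logs_p - 0.5 * LOG_2PI - 0.5 * torch.exp(-2.0 * logs_p) * (z_p - m_p) ** 2
    return log_q - log_p
```

Both are unbiased single-sample estimators of the same KL; the first has lower variance because the q-term is computed in closed form, which is presumably what "more accurate" refers to above.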
from vits.
Thank you for your reply.
According to my understanding, the posterior distribution is Gaussian, and the prior distribution is the product of the prior Gaussian and the absolute value of the Jacobian determinant (Equation 4). So the KL loss would be the following:
1. q(z|x) = torch.distributions.normal.Normal(m_q, exp(logs_q))
2. p(z|c) = torch.distributions.normal.Normal(m_p, exp(logs_p)) * torch.abs(jacobian determinant)
3. kl_loss = torch.distributions.kl.kl_divergence(q(z|x), p(z|c))
Is my understanding right? And is this kl_loss equal to your KL loss?
I would appreciate it if you could give a detailed explanation.
from vits.
BTW, since the prior is the product of a Gaussian and the Jacobian determinant, and considering the properties of the Gaussian distribution (if X ~ N(u, σ**2), then aX + b ~ N(au + b, (aσ)**2)), the prior is always a Gaussian distribution when the Jacobian determinant is a constant. So can we calculate the KL divergence as the two-Gaussian KL divergence mentioned above, or use the torch API to get it directly, like this (see the sketch below)?
kl_loss = torch.distributions.kl.kl_divergence(q(z|x), p(z|c))
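For reference, here is a small illustration (my own sketch, not repository code) of what that closed-form two-Gaussian KL computes. With an identity (or, more generally, affine) flow its value agrees, in expectation, with the per-sample expression derived earlier in the thread; with the non-linear shift-only coupling flow the prior is no longer Gaussian, so no such closed form applies:

```python
import torch

m_q, logs_q = torch.tensor(0.7), torch.tensor(-0.3)
m_p, logs_p = torch.tensor(0.0), torch.tensor(0.2)
q = torch.distributions.Normal(m_q, logs_q.exp())
p = torch.distributions.Normal(m_p, logs_p.exp())

# Closed-form KL between the two Gaussians, via the torch API.
closed_form = torch.distributions.kl.kl_divergence(q, p)

# Expectation of the per-sample expression with an identity flow (z_p = z):
# it converges to the same value.
z = q.sample((200_000,))
per_sample = logs_p - logs_q - 0.5 + 0.5 * torch.exp(-2.0 * logs_p) * (z - m_p) ** 2
print(closed_form.item(), per_sample.mean().item())
```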
from vits.
How about taking the absolute value to prevent the KL loss from being negative?
--- a/losses.py
+++ b/losses.py
@@ -54,7 +54,7 @@ def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
logs_p = logs_p.float()
z_mask = z_mask.float()
- kl = logs_p - logs_q - 0.5
+ kl = torch.abs(logs_p - logs_q - 0.5)
from vits.
How about taking the absolute value to prevent the KL loss from being negative?
Hi, does this work?
from vits.
@980202006 It will not work. Usually, the KL loss will not be negative if your inputs and network are right. When kl_loss < 0, it means your prior distribution is almost the same as the posterior distribution, so the posterior fails to learn a sufficiently complicated distribution.
When kl_loss < 0, the first thing you should do is check your inputs and network. If you must add a constraint to the loss formula, apply it to the whole expression, not just the first term, like this:
kl = logs_p - logs_q - 0.5
kl += 0.5 * ((z_p - m_p)**2) * torch.exp(-2. * logs_p)
kl = torch.clamp(kl, min=0.0)
But usually you don't need this constraint: if your KL loss goes negative, it means training has gone wrong, and adding the constraint won't give you correct results.
from vits.
@BridgetteSong Hi, I want to know why mean(log(p(z|c))) = log(p(z|c)). Why is sampling just one z enough?
from vits.