Comments (4)
Sorry for the dense calculation of the MLE loss...
I'll let you know when I clean up the clutter in the code.
For now, I'll explain the loss term by term.
The original line I implemented was:
l_mle = 0.5 * math.log(2 * math.pi) + (
    torch.sum(y_logs) + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m)**2) - torch.sum(logdet)
) / (torch.sum(y_lengths // hps.model.n_sqz) * hps.model.n_sqz * hps.data.n_mel_channels)
It can be decomposed as
l_mle_normal = torch.sum(y_logs) + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m)**2)
l_mle_jacob = -torch.sum(logdet)
l_mle_sum = l_mle_normal + l_mle_jacob
denom = torch.sum(y_lengths // hps.model.n_sqz) * hps.model.n_sqz * hps.data.n_mel_channels
l_mle = 0.5 * math.log(2 * math.pi) + l_mle_sum / denom
l_mle_normal is the negative log-likelihood of the normal distribution N(z | y_m, y_logs) (excluding the constant term 0.5*log(2*pi)), where y_m and y_logs are the mean and the logarithm of the standard deviation of the prior distribution. Please see Equation 2 in the paper.
l_mle_normal = torch.sum(y_logs) + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m)**2)
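As a sanity check (a minimal sketch with made-up tensor shapes, not code from the repo), this term can be compared against torch.distributions.Normal, whose log_prob includes the 0.5*log(2*pi) constant that l_mle_normal omits:

```python
import math
import torch

torch.manual_seed(0)
# Made-up stand-ins for the model's outputs:
z = torch.randn(3, 5)            # latent variables
y_m = torch.randn(3, 5)          # prior mean
y_logs = torch.randn(3, 5) * 0.1 # prior log standard deviation

l_mle_normal = torch.sum(y_logs) + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m) ** 2)

# Full NLL from torch.distributions; it differs from l_mle_normal only by
# the constant 0.5*log(2*pi) per element.
nll = -torch.distributions.Normal(y_m, torch.exp(y_logs)).log_prob(z).sum()
const = 0.5 * math.log(2 * math.pi) * z.numel()
assert torch.allclose(l_mle_normal + const, nll, atol=1e-4)
```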
l_mle_jacob denotes the negative log-determinant of the Jacobian of the flows. Please see Equation 1 in the paper.
l_mle_jacob = -torch.sum(logdet)
l_mle_sum denotes the total negative log-likelihood of the model, and denom is a denominator that averages the total negative log-likelihood across the batch, time steps, and mel channels. (Our model forces the mel-spectrogram lengths y_lengths to be a multiple of n_sqz.)
l_mle_sum = l_mle_normal + l_mle_jacob
denom = torch.sum(y_lengths // hps.model.n_sqz) * hps.model.n_sqz * hps.data.n_mel_channels
- Add the constant term, 0.5*log(2*pi), that was excluded earlier.
l_mle = 0.5 * math.log(2 * math.pi) + l_mle_sum / denom
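Putting it together, here is a minimal self-contained sketch (random tensors and made-up shapes standing in for the model's outputs and the hps hyperparameters) showing that the step-by-step decomposition matches the original one-line expression:

```python
import math
import torch

# Hypothetical hyperparameters standing in for hps.model.n_sqz and
# hps.data.n_mel_channels from the repo:
n_sqz, n_mel_channels = 2, 80
batch, t = 4, 100  # t is already a multiple of n_sqz

torch.manual_seed(0)
z = torch.randn(batch, n_mel_channels, t)             # latent from the flow
y_m = torch.randn(batch, n_mel_channels, t)           # prior mean
y_logs = torch.randn(batch, n_mel_channels, t) * 0.1  # prior log-std
logdet = torch.randn(batch)                           # log|det J| per utterance
y_lengths = torch.full((batch,), t)                   # frame counts

# Step-by-step decomposition:
l_mle_normal = torch.sum(y_logs) + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m) ** 2)
l_mle_jacob = -torch.sum(logdet)
l_mle_sum = l_mle_normal + l_mle_jacob
denom = torch.sum(y_lengths // n_sqz) * n_sqz * n_mel_channels
l_mle = 0.5 * math.log(2 * math.pi) + l_mle_sum / denom

# One-line form; should match the decomposition exactly.
l_mle_oneline = 0.5 * math.log(2 * math.pi) + (
    torch.sum(y_logs)
    + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m) ** 2)
    - torch.sum(logdet)
) / (torch.sum(y_lengths // n_sqz) * n_sqz * n_mel_channels)

assert torch.allclose(l_mle, l_mle_oneline)
```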
from glow-tts.
Yes, the constant term is ignored in backpropagation; I just left it in for the exact calculation of the log-likelihood. And I saw AlignTTS, which also proposes an alignment search algorithm similar to Glow-TTS's. I think it is clever, thanks for the heads up! Btw, I hope you enjoy the interesting characteristics of our model, such as manipulating the latent representation of speech :)
Thanks for your detailed explanation. I think you could ignore the constant term; it does not contribute to backpropagation. Btw, I found another paper, AlignTTS, that has the same idea of implicitly learning the duration of each character, but with a different approach.
Just wanted to say: amazing work! I love the controllability of length and expressiveness. I wanted to try a few ideas of my own using your repository as a codebase, but I've run into a strange phenomenon. It's related to the loss function, so maybe you could help me understand the cause. The strange thing is that the value of the l_mle (g0) loss depends on the value range of the mel spectrograms.
- Orange: LJSpeech wavs transformed into mel spectrograms using the default parameters. Mel values range from 0.5 to -11.5.
- Pink: my data transformed the same way as LJSpeech.
- Blue: my data transformed into mel spectrograms with different STFT parameters and then scaled to the 0.5 to -11.5 range.
- Gray: my data transformed into mel spectrograms with different STFT parameters. Values range from 0 to 0.76 (the same results if multiplied by -1).
From what I was able to check, in the case of data in the 0 to 0.76 range the values differ in the following way:
- l_mle_jacob is bigger for mel spectrograms with smaller absolute values. I think this makes sense, because the Jacobian is calculated from the weights, and they have to be bigger to produce the same output values.
- l_mle_normal is about the same.
- denom is obviously the same.
- l_mle: with a different proportion between l_mle_sum and denom, l_mle no longer normalizes to 1. I think it's a problem, because the balance between g0 and g1 is disturbed and the alignment gets worse.
Also, I find it quite strange that the grad norm keeps increasing on both the Blue and Gray curves. The only thing they have in common is mel-spectrogram STFT parameters different from the defaults.
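For intuition on the range dependence (a toy sketch, not code from the repo): rescaling data by a factor a shifts its negative log-likelihood by log(a) per dimension via the change-of-variables term, which is why l_mle values computed on differently scaled mel spectrograms aren't directly comparable:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1000)
a = 12.0  # e.g. stretching a [0, 0.76] range toward a [-11.5, 0.5]-sized range

# A model perfectly adapted to the unscaled data, and one perfectly
# adapted to the scaled data a*x:
d = torch.distributions.Normal(0.0, 1.0)
d_scaled = torch.distributions.Normal(0.0, a)

nll = -d.log_prob(x).mean()
nll_scaled = -d_scaled.log_prob(a * x).mean()

# Even with a perfectly adapted model, the two NLLs differ by exactly
# log(a) per dimension.
assert torch.allclose(nll_scaled - nll, torch.log(torch.tensor(a)), atol=1e-4)
```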