jaywalnut310 / glow-tts Goto Github PK

View Code? Open in Web Editor NEW

658.0 19.0 150.0 2.21 MB

A Generative Flow for Text-to-Speech via Monotonic Alignment Search

License: MIT License

Python 90.98% Jupyter Notebook 8.91% Shell 0.11%

pytorch speech-synthesis tts deep-learning text-to-speech

glow-tts's Introduction

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon

In our recent paper, we propose Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search.

Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis. Glow-TTS obtains an order-of-magnitude speed-up over the autoregressive model, Tacotron 2, at synthesis with comparable speech quality. We further show that our model can be easily extended to a multi-speaker setting.

Visit our demo for audio samples.

We also provide the pretrained model.

Glow-TTS at training	Glow-TTS at inference

Update Notes*

This result was not included in the paper. Lately, we found that two modifications help to improve the synthesis quality of Glow-TTS.; 1) moving to a vocoder, HiFi-GAN to reduce noise, 2) putting a blank token between any two input tokens to improve pronunciation. Specifically, we used a fine-tuned vocoder with Tacotron 2 which is provided as a pretrained model in the HiFi-GAN repo. If you're interested, please listen to the samples in our demo.

For adding a blank token, we provide a config file and a pretrained model. We also provide an inference example inference_hifigan.ipynb. You may need to initialize HiFi-GAN submodule: git submodule init; git submodule update

1. Environments we use

Python3.6.9
pytorch1.2.0
cython0.29.12
librosa0.7.1
numpy1.16.4
scipy1.3.0

For Mixed-precision training, we use apex; commit: 37cdaf4

2. Pre-requisites

a) Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY

b) Initialize WaveGlow submodule: git submodule init; git submodule update

Don't forget to download pretrained WaveGlow model and place it into the waveglow folder.

c) Build Monotonic Alignment Search Code (Cython): cd monotonic_align; python setup.py build_ext --inplace

3. Training Example

sh train_ddi.sh configs/base.json base

4. Inference Example

See inference.ipynb

Acknowledgements

Our implementation is hugely influenced by the following repos:

glow-tts's People

Contributors

Stargazers

Watchers

Forkers

ggsonic entn-at gheyret joovvhan amirstudy huguensjean wgwangang chochobo hiyoung-asr trendingtechnology rubenmccarty sungroh ml-applications imiyou jfsantos christophyoon charlottecuc sam1373 synthaether shangqwe123 ll2pakll ajinkyakulkarni14 idcore iftaken rixot-chang cookieppp marlon-br wangfn gudwns1215 lampv33 ryanmcgary blandon-lc haifengzeng 41whiteelephants thananchaiktw seantempesta nianzu-ethan-zheng tilde-nlp wizardk taotaofu elbum 121898 ntzzc zeta1999 abidlabs patdflynn liusongxiang athon-millane daxiafresh epochsimate holttechnologycorporation aliang-cn xuexidi smksyj saber5433 intel-pa stonermax dongsig shahuzi jayathungek harshit1011 sx-tts debasish-mihup anonymous1086 anchit1999 nickatomlin ishine reppy4620 mithunkumarsr idgmatrix sasukepn1999 andrey999333 lkurlandski ak391 rogervaas ensky0 wuyx517 oniondai sciai-ai welkinyang will-rice aparcho sadam1195 c0debrain syang1993 techthiyanes ppanja rielkim poveteen npujcong manhph2211 andrehe02 xinshengwang talentsprint2011 silyfox hynkang sakura2233565548 eirene-aisa wayne-wonderai moonintheriver

glow-tts's Issues

text embedding dimension

Hi. Since I tried to train Glow-TTS using Mandarin datasets, there are about 300 symbols in symbols.py. Therefore, it seems that I need to increase the text embedding depth. I notice that in your paper, you mentioned that:

Does the Embedding Dimension here stands for "text embedding dimension"?
If it is, which parameter here should I modify, hidden_channels , or hidden_channels_enc?

Thank you very much!

Sharing my results. Glow-tts is incredibly impressive!

Thank you so much for developing such a high-quality, sparse, and performant network, @jaywalnut310. I thought I'd share the results I have obtained so that others can see how promising your network is and can make an easier decision to adopt glow-tts for their use cases.

My website Vocodes chiefly employs glow-tts as a core component of speech synthesis: https://vo.codes

All of the voices now use glow-tts for mel inference and melgan for mel inversion. I briefly tried building multi-speaker embedding models, but the speakers never gained clarity or achieved natural prosody. I only conducted a limited number of experiments, but it was enough to consider my own investigation in the area to be unfruitful.

I haven't attributed an MOS to the speakers on Vocodes, but intuitively several of them seem quite good. The training data for each speaker varies between 45 minutes to four hours. One improvement I made was to remove the dual phoneme/grapheme embeddings and force ARPABET-only phoneme training.

Another series of tweaks had to me made to adapt your network to running on TorchScript JIT (the backend is in Rust), but this was relatively straightforward.

There's more work to be done here to achieve even more natural fit, but I wanted to share my results and congratulate you on your incredible work.

Feature Request: Create a new repo or branch for multilingual + multispeaker setup.

멀티-스피커 버전 버그?

안녕하세요. 좋은 synthesizer을 올려주셔서 감사합니다.
LJ-speech데이터로 실험결과 기존 synthesizer에 비해 속도측면에서 압도적인 것을 확인하였습니다.
그리고 주신 조언에 따라 multi-speaker버전으로 변경하여 구동하였는데, 저의 경우 버그가 발생하였습니다.
LJ+개인데이터(1hour)를 이용하여 학습한결과 발화는 하나 가끔 문장에서 목소리의 피치가 매우 높게 올라가는 문제가 발생하였습니다.
데이터가 적어서 생기는 문제인지 아니면 해결중인 문제이신지 궁금해서 남겨봅니다.
multi-speaker로 변경하기 위해 TextMelSpeakerLoader와 TextMelSpeakerCollater n_speakers, gin_channels 를 수정하였고 speaker_id도 데이터셋에 포함시켰습니다.
맨 처음 스텝으로 LJ데이터만 이용, 위의 모듈들을 변경 후 구동시킨 결과 성공적으로 목소리가 합성되었으며, 그 후 개인데이터 셋을 이용하여 구동시켜 개인데이터의 목소리가 짧은 시간인데도 합성되는 놀라운 결과를 얻었습니다. 하지만 위에 서술한 목소리의 피치가 매우 높게 올라가는 문제 및 LJ의 목소리가 약간 섞인듯한 문제가 발생하여 골머리를 앓고 있습니다.
혹시 결레가 아니라면 multi-speaker버전을 올리지 못하는 이유가 이 버그 때문인지 아니면 어떤 이슈가 있는지 알려주실 수 있나요?

Why putting a blank token between any two input tokens can improve pronunciation?

Hi, glow-tts is really a wonderful work! I noticed that your updating in README.md

moving to a vocoder, HiFi-GAN to reduce noise, 2) putting a blank token between any two input tokens to improve pronunciation

I experimented in Chinese, and the results suggested that add a blank token do improve pronunciation in Chinese. So, my question is why this trick can improve the pronunciation? Is there any theoretical basis？

My personal intuition is that this trick can extend the pronunciation time of each phoneme, thereby improving the pronunciation.

Thanks! @jaywalnut310

griffin-lim gives strange output

hi, I tried the code with Chinese corpus, with config:
"sampling_rate": 16000,
"filter_length": 1024,
"hop_length": 200,
"win_length": 800,
"n_mel_channels": 80,
"mel_fmin": 96.0,
"mel_fmax": 7600.0,

The corpus is about 20hours and I picked up the 160th epoch to generate my mel spec.
I tried with griffin-lim by modifiy inference.ipynb:
(y_gen_tst, *r), attn_gen, *_ = model(x_tst, x_tst_lengths, gen=True, noise_scale=noise_scale, length_scale=length_scale)
mel_np = y_gen_tst.cpu().squeeze(0).numpy()
res = librosa.feature.inverse.mel_to_audio(mel_np, sr=16000, n_fft=1024, hop_length=200, win_length=800)
and finally:
librosa.output.write_wav('sample_output.wav', res, 16000)
And it outputs a long silence like:

The question is should I wait for more epochs? Or maybe I used griffin-lim the wrong way?

BTW, the mel generated is like: [-10.xxxx, -11.xxxx, ...]

inference

model_dir = "./logs/base/"

Size of glowtts models

Hi all,
I am wondering as to the sizes of other peoples saved models and of ways to reduce them.
My saved models are .pth and are around 328MB.
I have looked at the models by MozillaTTS (based on this repo as I understand) which are .pth.tar files and are 288MB.
@echelon has also shown that he saves his models as .torchjit and they are 110MB.

I was wondering if I was doing something wrong that leads me to getting such large model sizes or if it is normal. Ultimately I would like to make my model sizes smaller and wanted to see if anyone had any ideas.

I'm looking into pytorch Dynamic Quantization (https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) but I assume there is a reason why I didn't see anyone here mention it and that it isn't used by @jaywalnut310 in the first place.

Thanks for your time and help

How to train a new model with dataset of diffirent language?

I would like to know how to train a glow-tts 2 model for another language, using another dataset which have the same structure as LJ Speech dataset?
Could you give some hints about how to train it or do a transfer learning from your pretrained models?

I have succesfully trained nvidia's tacotron2 with polish dataset as mentioned here: NVIDIA/tacotron2#321, do the step similar to those in tacotron2(add my language's symbols, use smaller learning rate for transfer learning) ?

已放弃 (吐核)python train.py -c $config -m $modeldir

hello, my friends.
when I clone this code and plan to run it, I meet this bad issures. Do you have some advices or suggestion to help me sovle this problems? Thanks in advance.

7efddd583000-7efddd585000 r--p 00000000 08:05 24539630 /data/tts/.conda/envs/glow_tts/lib/python3.6/lib-dynload/_blake2.cpython-36m-x86_64-linux-gnu.sotrain_ddi.sh: 行 5: 148887 已放弃 (吐核)python train.py -c $config -m $modeldir

Sharing Korean Glow-TTS Samples

Dear contributors,

Thank you for sharing your great works.

I have successfully reproduced your result with the LJSpeech Dataset.

In addition, I have trained your model with Korean Single Speaker Speech Dataset and G2PK grapheme-to-phoneme converting module as a Korean cleaner.

This is the link to the demo page.

I would be glad if you introduce my demo page in your README.

Thanks again for your great code.

Any method could make the result more nature?

As my experiment , the result of glow-tts sounds more like robot than a real person, do you have any method could make it more nature, like the result or autoregressive model, like tacotron eg..thanks.

Idea on modeling different features of speech

The results of sampling with higher std that the one used during training seem to be a very interesting idea to produce the speech with high expressiveness but the impact on the quality is quite significant. From what I understood the core reason is a low density of the samples in the regions far from the center of the distribution. I wanted to do some experiments with Gaussian Mixture Models as a base distribution. For example, in emotions, each peak would represent a mean of speaking manner of a certain emotion. In this setup, loss function would be calculated based on the probability that the model assigns samples of given emotion to the correct peak. This would allow for sampling emotional speech with high quality and possibly even regions between them which would be equivalent to controlling the strength of the emotion in a sample. What is your opinion on that?

There's no requirements file

Hey,

It's hard to install all required packages. Could you add requirements.txt file to repo?
It would be super usefull as we don't know all the packages versions that you use.

One question about the decoder compared with FastSpeech and Tacotron.

It's really amazing that Glow-TTS does such a good job. I have some confusion about the decoder framework:
There is no post-net at the end of the decoder. I understand that the invertible flows require that there cannot be a post net, but how can Glow-TTS get such good results without it (while Tacotron and Fastspeech both have the post-net)?

Improving prosody? Specifically the length of pauses between words?

First off, this project is amazing! I'm getting great results compared to Tacotron2 with much shorter training times and it's unbelievably stable even for long sentences. Congratulations. :)

The only thing I've found that Tacotron2 did better was capturing the manner that people speak in. Specifically the speed words are spoken and how long they tend to pause between words. Is this something that can be adjusted in the loss function to fine tune the model to pay more attention to these aspects?

Log det Jacobian is wrong in Inv1x1Conv

So, this log det jacobian is torch.slogdet(self.weight)

glow-tts/modules.py

Line 228 in 00a482d

logdet = torch.logdet(self.weight) * (c / self.n_split) * x_len # [b]

ZeroDivisionError: float division by zero when training the model

Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.691694759794e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1729236899484e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.086461844974e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.43230922487e-312
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.716154612436e-312
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.35807730622e-312
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.7903865311e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.39519326554e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.69759663277e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.487983164e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.243991582e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.121995791e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0609978955e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.304989477e-315
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.65249474e-315
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.32624737e-315
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.63123685e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3156184e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6578092e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.289046e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.144523e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0722615e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.036131e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.180654e-318
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.590327e-318
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.295163e-318
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.4758e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2379e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.61895e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.095e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0474e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0237e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.06e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.265e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.16e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Traceback (most recent call last):
File "train.py", line 189, in
main()
File "train.py", line 34, in main
mp.spawn(train_and_eval, nprocs=n_gpus, args=(n_gpus, hps,))
File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/nur-179/.temp/glow-tts/train.py", line 91, in train_and_eval
train(rank, epoch, hps, generator, optimizer_g, train_loader, None, None)
File "/home/nur-179/.temp/glow-tts/train.py", line 115, in train
scaled_loss.backward()
File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/contextlib.py", line 88, in exit
next(self.gen)
File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
scale_override=(grads_have_scale, stashed_have_scale, out_scale))
File "/home/nur-179/anaconda3/envs/gtts/lib/python3.6/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed
out_scale/grads_have_scale, # 1./scale,
ZeroDivisionError: float division by zero

my base.json file is as follows:
{
"train": {
"use_cuda": true,
"log_interval": 20,
"seed": 1234,
"epochs": 10000,
"learning_rate": 1e0,
"betas": [0.9, 0.98],
"eps": 1e-9,
"warmup_steps": 4000,
"scheduler": "noam",
"batch_size": 4,
"ddi": true,
"fp16_run": true
},
"data": {
"load_mel_from_disk": false,
"training_files":"filelists/ljs_audio_text_train_filelist.txt",
"validation_files":"filelists/ljs_audio_text_val_filelist.txt",
"text_cleaners":["transliteration_cleaners"],
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"n_mel_channels": 80,
"mel_fmin": 0.0,
"mel_fmax": 8000.0,
"add_noise": true,
"add_space": false,
"cmudict_path": "data/dict"
},
"model": {
"hidden_channels": 192,
"filter_channels": 768,
"filter_channels_dp": 256,
"kernel_size": 3,
"p_dropout": 0.1,
"n_blocks_dec": 12,
"n_layers_enc": 6,
"n_heads": 2,
"p_dropout_dec": 0.05,
"dilation_rate": 1,
"kernel_size_dec": 5,
"n_block_layers": 4,
"n_sqz": 2,
"prenet": true,
"mean_only": true,
"hidden_channels_enc": 192,
"hidden_channels_dec": 192,
"window_size": 4
}
}

Whats values of final losses?

[train.py] epoch당 진행률 계산

빠른 TTS를 위한 좋은 논문을 내주셔서 감사합니다.

제목에 해당하는 부분을 먼저 말씀드리자면,
(944525a commit) train.py의 127번째 line 에서
logger.info에서 진행도를 계산하는 부분에서 gpu 개수가 고려가 되어 있지 않습니다.

해당 부분:
logger.info('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
epoch, batch_idx * len(x), len(train_loader.dataset)

그 외에 화자 7명에 대해 training을 진행하기 위해,
n_speakers와 gin_channels를 조절해서 input을 집어넣어서 training을 진행하지만,
training에 어려움이 있습니다.
(지속된 gradient overflow로 인해 training이 되지 않습니다.)
혹시 training과 관련해서 조언을 주실 수 있으시면 감사하겠습니다.

감사합니다.

A reshape problem in InvConvNear

Hi, @jaywalnut310 。I'm trying to understand the glow-tts by reading the code. And I am a little bit confused about this piece of code in InvConvNear。

glow-tts/modules.py

Lines 214 to 215 in 13e9976

 x = x.view(b, 2, c // self.n_split, self.n_split // 2, t) 

 x = x.permute(0, 1, 3, 2, 4).contiguous().view(b, self.n_split, c // self.n_split, t)

So if the purpose is reshape the input x from [b,c,t] to [b, self.n_split, c // self.n_split, t]，what's the purpose of the L214？

Working 'version' of WaveGlow

Hi everyone! I'm getting acquainted with this repo, particularly with trying out inference.ipynb. I have a question: I followed the instructions and ran "git submodule init; git submodule update" but it seems it didn't behave as expected (no waveglow folder created).

So I manually cloned from the github of WaveGlow (cloning, initializing and updating) and tried running inference.ipynb, but then ran into lots of errors that I think are related to version mismatches of waveglow and torch, etc. e.g. 'Conv1d' object has no attribute '_ non_persistent buffers set'

So may I ask if anyone has a copy/folder of waveglow that currently works this repo that you can share?

Thanks a lot and please let me know if there's anything I can provide to make my question clearer!

GPU not being utilized?

Hi all,
This might just be me misunderstanding things but I wanted to ask about the GPU utilization.
What I use
I am using Ubuntu 18.04 and nvidia-smi to monitor my GPU
I have an NVIDIA RTX 2080Ti on my machine

What I noticed
Volatile GPU-Util seems to be on 0% most of the time, occasionally jumping to 50-80% for a second and then going back down to 0%. This corresponds to the power usage and temperature spiking for these seconds but for the most part staying low.
This being said the power usage when training is around 60W as opposed to 10W when idle and the temperature is around 47C as opposed to 32C when idle, so the GPU is definitely being used.
Also when I run the training I see the GPU memory being used as I would expect, and I can see the python process running on it.

What I expect
When I ran other models like Tacotron in the past I would have my Volatile GPU-Util at 70-98% use all the time. This is what I thought should happen here as well

The Question
Is there a reason my Volatile GPU-Util isn't being used? Is there something I am doing wrong? Does this mean I am leaving performance laying around that I can leverage somehow? If so, how?
Is this happening to anyone else?

Please let me know if you need any more information from me.
Thank you in advance for your time and any assistance available!

run issue

ModuleNotFoundError: No module named 'monotonic_align.monotonic_align'
my command is: cd glow-tts && python init.py -c c_path -m m_path
I have run 'cd monotonic_align; python setup.py build_ext --inplace'

any idea why?

Number of training steps

I have been trying to replicate the results from the paper, but I'm confused about the number of training steps. The paper mentions 240k steps, but when running this code on 8 V100 gpus 240k steps takes a lot longer that the 3 days from the paper. The base config here specifies 10000 epochs, which also doesn't seem like the correct amount. Could you clarify the correct amount of training epochs/steps?

Gradient overflow and negative loss

Hi. I tried to train the model by using a 24-hour mandarin dataset and encountered the following gradient overflow and negative loss problem.

I only changed the "data" part of the config file and modified the "text" folder (cmudict.py, & symbols.py by adding some mandarn phonemes

):

Could you give me any suggestion? Thank you!

License?

Very interesting paper!

But I don't see any licensing information in this project, which restricts any/all potential applications.
Could you please add a license to the project?

Transfer to tensorRT and using triton inference

Does anyone have experience of converting the models into tensorRT and/or using nvidia triton inference container to serve glow tts models?

any shared win prebuilt monotonic_align?

got this error:
(Hifigan384test) D:\Coding\PYFastCache\PYVenv\Hifigan384test\glow-tts-master\monotonic_align>python setup.py build_ext --inplace
running build_ext
building 'monotonic_align.core' extension
error: Unable to find vcvarsall.bat

Yes,i know it is probably because cpython or msvc.
and Yes,I had installed cpython with pip but It uses prebuilt binary.

So,it can be deduced that it is MSVC setups problem.and yes,I have that file but adding it to pth does not seems to work.

(Hifigan384test) D:\Coding\PYFastCache\PYVenv\Hifigan384test\glow-tts-master\monotonic_align>set PATH=%PATH%;F:\Win10\vsresource\win10maylatestsdksuggestbyvsessence\VC\Auxiliary\Build

(Hifigan384test) D:\Coding\PYFastCache\PYVenv\Hifigan384test\glow-tts-master\monotonic_align>python setup.py build_ext --inplace
running build_ext
building 'monotonic_align.core' extension
error: Unable to find vcvarsall.bat

So,anyone have those prebuilt binaries,Win10,64bit here.Thank you very much.

Ideal size of gin_channels for multiple speaker embeddings?

Hi Jaehyeon, I modified your code to train multiple speakers and it seems to be training and inferring pretty well. Thanks for leaving the code in a state that makes this relatively easy!

Here are my hparams:

    "n_speakers": 10, 
    "gin_channels": 16

I have nine speakers, but mistakenly didn't zero index them.

Is gin_channels too small? Should this be appreciably larger to capture the voice characteristics? 32? 64? ...?

Two of the speakers have four hours of data. Other speakers have far less. Oddly, the speaker with the smallest amount of data seems to have one of the clearest voices. Other speakers don't sound like their source at all.

I'm only epoch 1400 in so far and I had to train from zero, so this has got a long way to go. Should I abandon this and increase gin_channels, or does it seem fair to proceed?

Parallelization and GPU requirements

I see GTTS was developed to be parallelizable. Does training work on let's say several 2080 ti or 3080 rtx?

GlowTTS with MultiBand Melgan

I am trying to get GlowTTS working with Multiband Melgan but I am running into many issues with the different MB Melgan models I am trying.
I managed to get this working with normal Melgan from https://github.com/seungwonpark/melgan , but Multiband Melgan seems to be expecting different input or some normalization I can't figure out.

What I Tried

Using Mozilla-TTS Multiband Melgan and taking most of my implementation from https://colab.research.google.com/drive/1u_16ZzHjKYFn1HNVuA4Qf_i2MMFB9olY?usp=sharing#scrollTo=x8IDS6fO8uW2

Initially I got a lot of mechanical noise and nothing else.
I tried copying the normalization that is done during the melgan synthesis and that made it so that I could here and understand the words in the synthesis but with a lot of background noise
I then came across kan-bayashi/ParallelWaveGAN#169 where it was mentioned that the normalization includes decompression and logs. I used @seantempesta code with some differences (the stats file provided by MozillaTTS gives standard deviation and not variance so I skipped var and imported sigma directly). This made it so there was no background noise and it sounded like someone talking but all the words were garbled up.

I tried using https://github.com/kan-bayashi/ParallelWaveGAN Multiband Melgan but I kept running into tensor size issues during the inference and I couldn't figure out why because the tensor size is the same as what I sent to MozillaTTS as well as to the normal Melgan.

I also tried the Multiband Melgan model from https://github.com/TensorSpeech/TensorflowTTS but I ran into similar tensor size issues.

Question

Has anyone managed to get any model of Multiband Melgan working with GlowTTS? Is there a specific repository that is better to use?
Is this really up to differences in normalization prior to sending the mel spectrogram to the Multiband Melgan? What is the normalization that needs to be done to the mel spectrograms that come out of GlowTTS in order from them to work with Multiband Melgan?

Please let me know if more information is needed from me (i didn't want to elaborate on every specific error I got as to not make this post go into too many directions at once).

Thanks in advance for your time and any help you can provide!

Add new speaker voice

Hi Jaehyeon,

Could you please provide instructions how to use pretrained model and add new speaker voice?

I have created google colab file basing on your work: https://github.com/marlon-br/glow-tts-colab
Now I want to add a possibility to have more speaker voices.

Problem with running on cuda 11.2

Hi,
thanks for this great repo!

I am trying to run this repo with nvidia rtx 3090 and cuda 11.2. I have this error whole day. I was trying different versions of pytorch and no use.

I am getting this error:

Traceback (most recent call last):
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/user/tts/glow-tts/train.py", line 92, in train_and_eval
    train(rank, epoch, hps, generator, optimizer_g, train_loader, None, None)
  File "/home/user/tts/glow-tts/train.py", line 119, in train
    loss_g.backward()
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/anaconda3/envs/glowtts/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f09d7de3840> returned NULL without setting an error

I have Ubuntu 20.04, Driver Version: 460.32.03 CUDA Version: 11.2.

I was trying torch 1.7.1, 1.2.0, 1.3.0. I was even compiling torch from source ('1.9.0a0+ee04cd9').

I was removing cuda, installing it from scratch and no luck.

Do you have any idea what is causing this problem?

Thanks!

Loss value

I am wondering how loss value looks like. Could you give some pictures of the loss during training?

[Question] Where in the code can we control temperature?

I can see that we can control epsilon (noise_scale) and duration (length_scale), but I can't find in the code where we can control the temperature.

[ERROR] monotonic_align.core

ImportError: No module named 'monotonic_align.core'
해당 에러로 인해서 train_ddi.sh가 실행이 되질않습니다.

제가 무엇을 잘못한 건지 질문드립니다. core.pyx는 core.c로 빌드가 된 상태입니다.

Getting assertion error for while training

def mel_spectrogram(self, y):
assert(torch.min(y.data) >= -1) # Getting value of tensor(-34834.9805)
assert(torch.max(y.data) <= 1)

I am getting assertion error in the above function in commons.py during training. Is this a issue due to incorrect training data and how to handle the issue? Any tips?

q

CUDA error. What version of CUDA are you running?

I installed all of the requirements and apex at the provided SHA. When I attempt to train, it crashes with the following error:

CUDA runtime error: an illegal instruction was encountered (73) in magmablas_strsm at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/magmablas/strsm.cu:484
CUDA runtime error: an illegal instruction was encountered (73) in magmablas_strsm at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/magmablas/strsm.cu:485
CUDA runtime error: an illegal instruction was encountered (73) in magma_sgetri_gpu at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/src/sgetri_gpu.cpp:164
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:944
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:946
CUDA runtime error: an illegal instruction was encountered (73) in magmablas_strsm at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/magmablas/strsm.cu:484
CUDA runtime error: an illegal instruction was encountered (73) in magmablas_strsm at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/magmablas/strsm.cu:485
CUDA runtime error: an illegal instruction was encountered (73) in magma_sgetri_gpu at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/src/sgetri_gpu.cpp:164
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:944
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:946
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=296 error=73 : an illegal instruction was encountered
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
Traceback (most recent call last):
  File "train.py", line 191, in <module>
    main()
  File "train.py", line 34, in main
    mp.spawn(train_and_eval, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/home/bt/dev/2nd/glow-tts/python/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/bt/dev/2nd/glow-tts/python/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/bt/dev/2nd/glow-tts/python/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/bt/dev/2nd/glow-tts/train.py", line 93, in train_and_eval
    train(rank, epoch, hps, generator, optimizer_g, train_loader, None, None)
  File "/home/bt/dev/2nd/glow-tts/train.py", line 117, in train
    scaled_loss.backward()
  File "/home/bt/dev/2nd/glow-tts/python/lib/python3.7/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/bt/dev/2nd/glow-tts/python/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: an illegal instruction was encountered

I'm thinking that I might be running an out of date CUDA, but I wanted to confirm this before upgrading.

What version are you all using?

Thanks!

How about the intelligibility and stability

I read your paper and found that your model have a significant improvement of inference speed, long sentence and character error rate, But I would like to know your experiment about other aspect of intelligibility and stability, how about the misalignment (even you won't use attention), punctuation misalignment, skip word, worse sound at the end of the long sentence, the stability of long paragraph (like difference voice between two or more sentence of paragraph), etc,... Thank for your hard work!

Sharing the results of the Korean model learned in my voice

I used Glow-TTS and Multi-band MelGAN to create Korean TTS using my voice as a dataset.
And the sample audio of the result can be found at the bottom of the Colab page.

Demo Colab

The dataset I used was only about 3 hours long, but I got really good results.
I'm not very familiar with machine learning, but the results I could make using your code were really impressive.
I was very pleased with the quality of the final result and the fast synthesis speed.
It is especially good because it can be used even in a CPU environment.

Thanks to all of you for sharing such a great work 😄

Why you do not use pytorch native layer norm?

About speaker embedding

Hi. Could you please give me some advice on adding speaker embedding (as you mentioned in the paper) to your code? Thanks!

Would you be willing to integrate this to Mozilla TTS?

I was planning to integrate your great work to our repo. I think in terms of the license there is no problem but I just wanted to ask your permission before going further. We already have a very robust TTS performance with a highly modified Tacotron model but Glow-TTS is faster and will provide a better run-time especially for low-resource devices.

Thanks for this work again :)

https://github.com/mozilla/TTS

Autoregressive flow instead of WaveGlow?

Hi, thank you for this amazing idea. It is really nice :).
I was wondering if it would be possible to replace the WaveGlow model by some more expressive flow, for example by autoregressive WaveNetor WaveFlow? In WaveGlow, the audio/spectrograms are directly encoded into samples from some gaussian distributions and an external encoder can be used to evaluate likelihood of the sample. WaveNet instead does a scale + shift transformation of a standard normal distribution based on previous audio timesteps. The use of a model such as WaveFlow/WaveNet could boost expressivenes of the system and lower the number of necessary params, but I was not yet able to figure out how it could be integrated in your framework. Did you consider similar options when you wrote the paper?

Regarding adding a Colab Notebook

Hi @jaywalnut310.

I took your inference_hifigan.ipynb notebook and made it fully runnable inside Google Colab (here's my Colab Gist). I think it would make it easier for people to play around with the model.

If you want I can create a PR accordingly including this notebook.

Let me know.

         self.emb = nn.Embedding(n_vocab, hidden_channels, padding_idx=0)
         nn.init.normal_(self.emb.weight[1:], 0.0, hidden_channels**-0.5)
         self.stress_emb = nn.Embedding(3, hidden_channels, padding_idx=0)
         nn.init.normal_(self.stress_emb.weight[1:], 0.0, hidden_channels**-0.5)
         ...
         x = self.emb(x) + self.stress_emb(stress)
         x = x * math.sqrt(self.hidden_channels)  # [b, t, h]

Any advice?

	x = x.view(b, 2, c // self.n_split, self.n_split // 2, t)
	x = x.permute(0, 1, 3, 2, 4).contiguous().view(b, self.n_split, c // self.n_split, t)