
waveglow's Introduction

WaveGlow

WaveGlow: a Flow-based Generative Network for Speech Synthesis

Ryan Prenger, Rafael Valle, and Bryan Catanzaro

In our recent paper, we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
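For reference, a minimal sketch of that single cost function (hedged; variable names are illustrative rather than the repo's exact API). The latent z is pushed toward a zero-mean Gaussian prior, and the log-determinant terms from the affine coupling layers and invertible 1x1 convolutions keep the likelihood accounting exact:

    import torch

    def waveglow_style_nll(z, log_s_list, log_det_w_list, sigma=1.0):
        # Gaussian prior term on the latents produced by the flow
        loss = torch.sum(z * z) / (2 * sigma * sigma)
        # Subtract the log-determinants from the affine coupling layers
        # (log_s) and the invertible 1x1 convolutions (log_det_w), since
        # the loss is a negative log-likelihood.
        for log_s in log_s_list:
            loss = loss - torch.sum(log_s)
        for log_det_w in log_det_w_list:
            loss = loss - torch.sum(log_det_w)
        return loss / z.numel()  # per-element normalization (one common choice)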

Our PyTorch implementation produces audio samples at a rate of 1200 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.

Visit our website for audio samples.

Setup

  1. Clone our repo and initialize submodule

    git clone https://github.com/NVIDIA/waveglow.git
    cd waveglow
    git submodule init
    git submodule update
  2. Install requirements

    pip3 install -r requirements.txt

  3. Install Apex

Generate audio with our pre-existing model

  1. Download our published model
  2. Download mel-spectrograms
  3. Generate audio

    python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6

N.b. use convert_model.py to convert your older models to the current model with fused residual and skip connections.
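Roughly, inference.py does the following for each mel file. This is a hedged sketch, run from the repo root so glow.py is importable; it assumes the published checkpoint is a dict with a 'model' key:

    import torch

    # Load the published checkpoint and strip weight norm for inference
    waveglow = torch.load('waveglow_256channels.pt')['model']
    waveglow = waveglow.remove_weightnorm(waveglow)
    waveglow.cuda().eval()

    # Load one mel-spectrogram, add a batch dimension, and synthesize
    mel = torch.load('mel_spectrograms/LJ001-0015.wav.pt').cuda()
    mel = mel.unsqueeze(0)
    with torch.no_grad():
        audio = waveglow.infer(mel, sigma=0.6)  # sigma matches the -s flag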

Train your own model

  1. Download LJ Speech Data. In this example it's in data/

  2. Make a list of the file names to use for training/testing

    ls data/*.wav | tail -n+10 > train_files.txt
    ls data/*.wav | head -n10 > test_files.txt
  3. Train your WaveGlow networks

    mkdir checkpoints
    python train.py -c config.json

    For multi-GPU training replace train.py with distributed.py. Only tested with single node and NCCL.

    For mixed precision training set "fp16_run": true in config.json.
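    For example, the relevant fragment of config.json (other train_config keys unchanged):

        "train_config": {
            "fp16_run": true,
            ...
        }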

  4. Make test set mel-spectrograms

    python mel2samp.py -f test_files.txt -o . -c config.json

  5. Do inference with your network

    ls *.pt > mel_files.txt
    python3 inference.py -f mel_files.txt -w checkpoints/waveglow_10000 -o . --is_fp16 -s 0.6

waveglow's People

Contributors

azraelkuan, bigbigcity1986, bryancatanzaro, nmstoker, nvchai, rafaelvalle, srstevenson


waveglow's Issues

Training fails with torch.jit.script

I pulled the latest code and now training fails with error:

raise NotSupportedError(base.range(), "slicing multiple dimensions at the same time isn't supported yet")
torch.jit.frontend.NotSupportedError: slicing multiple dimensions at the same time isn't supported yet

    @torch.jit.script
    def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
        n_channels_int = n_channels[0]
        in_act = input_a + input_b
        t_act = torch.nn.functional.tanh(in_act[:, :n_channels_int, :])
                                         ~~~~~~ <--- HERE
        s_act = torch.nn.functional.sigmoid(in_act[:, n_channels_int:, :])
        acts = t_act * s_act
        return acts
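One workaround (hedged; you lose the fused-kernel benefit but keep the same numerics) is to drop the decorator so the old JIT frontend never parses the slice:

    import torch

    # Same computation as the repo's fused_add_tanh_sigmoid_multiply,
    # minus @torch.jit.script; torch.tanh/torch.sigmoid replace the
    # deprecated torch.nn.functional variants.
    def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
        n_channels_int = n_channels[0]
        in_act = input_a + input_b
        t_act = torch.tanh(in_act[:, :n_channels_int, :])
        s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
        return t_act * s_act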

pretrained model which can resume training

@rafaelvalle Thanks for sharing! It helps a lot.
I find training very slow (about 1 epoch/day with batch size 1) when training on my own data (about 12 hours of audio). Could you offer a model from which training can be resumed?

Thanks a lot !!

division by zero

Hello everyone, when I ran train.py I met this error:

    WARNING: Multiple GPUs detected but no distributed group set
    Only running 1 GPU. Use distributed.py for multiple GPUs
    output directory checkpoints
    Traceback (most recent call last):
      File "train.py", line 171, in <module>
        train(num_gpus, args.rank, args.group_name, **train_config)
      File "train.py", line 106, in train
        epoch_offset = max(0, int(iteration / len(train_loader)))
    ZeroDivisionError: division by zero

How can I solve this problem? I have already changed the batch size. Is there something wrong with NCCL? Thank you.
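This usually means len(train_loader) is 0, i.e. the file list is empty or has fewer entries than the batch size (the training DataLoader drops incomplete batches). A minimal reproduction of the arithmetic, with illustrative sizes:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.zeros(3, 1))                   # only 3 samples
    loader = DataLoader(dataset, batch_size=12, drop_last=True)  # no full batch
    print(len(loader))  # 0, so int(iteration / len(loader)) divides by zero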

During training the loss goes up and down and does not converge. Is that normal? Also, what should the final loss value look like?

I implemented a WaveGlow model in my own project. The code is almost the same as this repo, with some modifications:

  1. Upsample the mel-spectrogram to the number of groups, so n_mel_channels in WN reduces to 80.
  2. Change logdet() in the invertible 1x1 convolution to det().abs().log() as #35 did, because in my first few runs the loss became NaN after thousands of steps.

The n_channels is 256, so the model is 4-5 times smaller than the original. I run the model on two 1080 Ti cards using nn.DataParallel with a batch size of 8. After about 5k steps the loss is around -6 to -7 and I can hear some speech-like sentences in the model outputs. Then the loss starts to go up and down, even above zero, and gets no further. Adding 24 flows to the model doesn't help, and with batch size 32 the problem still exists. Maybe more steps would improve it, but after 70k steps I still cannot see any improvement. Has anyone had similar problems?

I also want to ask about the final loss as a reference. In my case -11 is the smallest value I can get before the problem above happens. In #5 @azraelkuan got a loss of -18 at 56k steps; is that the normal loss value?

@torch.jit.script is not working properly

Torch version: 0.4.1 (the latest version shown on the official website)
CUDA: 8.0
CUDNN: 7.0

I have no idea about the error, and my colleague ran into the same problem.

Here is the log:
raise NotSupportedError(base.range(), "slicing multiple dimensions at the same time isn't supported yet")
torch.jit.frontend.NotSupportedError: slicing multiple dimensions at the same time isn't supported yet

    @torch.jit.script
    def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
        n_channels_int = n_channels[0]
        in_act = input_a + input_b
        t_act = torch.nn.functional.tanh(in_act[:, :n_channels_int, :])
                <--- HERE
        s_act = torch.nn.functional.sigmoid(in_act[:, n_channels_int:, :])
        acts = t_act * s_act
        return acts

Has anyone tested the inference process using a trained model?

I have trained a model for about 60k steps.
When I test inference.py using the checkpoint waveglow_0, the wav is all noise;
but when I use the trained model (60k), the generated wav is almost all zeros, with nothing audible in it.
Has anyone had this problem?

about n_group

Has anyone changed n_group to 4 instead of 8?
I just want to remove the line noise: with n_group=4, for 24k audio, the line noise would exist only at 6k and 12k. But I cannot change it; it gives this error:

File "/running_package/waveglow/glow.py", line 231, in forward
output = self.WN[k]((audio_0, spect))
File "/tmp/venv3/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/running_package/waveglow/glow.py", line 145, in forward
audio = self.start(audio)
File "/tmp/venv3/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/tmp/venv3/lib64/python3.6/site-packages/torch/nn/modules/conv.py", line 176, in forward
self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [512, 2, 1], expected input[3, 1, 1500] to have 2 channels, but got 1 channels instead

this is the config:

"waveglow_config": {
    "n_mel_channels": 80,
    "n_flows": 6
    "n_group": 4,
    "n_early_every": 2,
    "n_early_size": 1,
    "WN_config": {
        "n_layers": 8,
        "n_channels": 512,
        "kernel_size": 3

negative loss and failed to load model for retraining

I encountered 2 issues when following this work.
The first: when I trained the model from scratch, I got negative loss values. I read your paper and think this is normal, right? However, after 7 epochs (about 22,000 iterations), the synthesized audio was white noise. Did I make a mistake somewhere? For how many epochs and iterations did you train?
The second: I retrained from your "waveglow_old.pt", but loading that checkpoint loses "iteration", "optimizer" and "learning_rate". During training I got "tensor(nan., device='cuda:0')" and the loss is NaN.

Can you help me or give some advice?

Coupling layer loss computation

It looks like the coupling layer loss is first summed over the batch, time steps, and channels (https://github.com/NVIDIA/waveglow/blob/master/glow.py#L55). Note that the number of channels is not the same for every coupling layer, because the early outputs reduce the channel count as you go deeper. But then the loss is averaged over the batch, time steps, and n_group (https://github.com/NVIDIA/waveglow/blob/master/glow.py#L59). This means that regardless of the number of channels actually present in a given coupling layer, we're averaging over n_group (8).

Is this intentional? It's unclear to me why the coupling layer loss is computed this way.
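A toy illustration of the scaling in question (shapes are illustrative):

    import torch

    batch, n_group, time = 2, 8, 100
    z_full  = torch.randn(batch, n_group, time)      # layer before any early output
    z_small = torch.randn(batch, n_group - 2, time)  # layer after early outputs
    total = (z_full * z_full).sum() + (z_small * z_small).sum()
    # Dividing by a fixed batch * n_group * time treats both layers as if
    # they had n_group channels, even though z_small has fewer.
    loss = total / (batch * n_group * time)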

Segment size, batch size and VRAM; fp16

Can somebody please share a training config and other parameters for training on LJ data (22 kHz) on a 1080 (8 GB) card?
I am trying to use the default config as in current master and am getting out-of-video-memory errors.

I have already lowered batch_size to 1 and tried to reduce segment_size, but that did not help.

Also, I do not see any training options to switch to fp16, although they exist in the inference script.

About fp16 error

File "train.py", line 144, in train
loss.backward()
File "/tmp/venv3/lib64/python3.6/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/tmp/venv3/lib64/python3.6/site-packages/torch/autograd/init.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
File "/running_package/waveglow-fp16/distributed.py", line 125, in allreduce_params
coalesced = _flatten_dense_tensors(grads)
File "/running_package/waveglow-fp16/distributed.py", line 68, in _flatten_dense_tensors
flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
RuntimeError: Expected a Tensor of type torch.cuda.HalfTensor but found a type torch.cuda.FloatTensor for sequence element 926 in sequence argument at position #1 'tensors'

about "segment_length"

I don't think a longer segment_length is necessarily better; in my experiment I just use 6k.

Set WN channel nums smaller

Hi, the original model is too big for most people to train on a single GPU, but if we set the WN channel count smaller, such as 256, we can train with a larger batch size on a single GPU. Is the final speech quality the same?

loss

I read the code, and the loss function is -log p. Why is the training loss negative?
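A negative value is expected here: the loss is a negative log-density, and the density of a continuous variable can exceed 1, so its negative log can go below zero. A quick numeric check:

    import math

    # Peak log-density of a Gaussian with sigma = 0.1, evaluated at x = 0
    sigma = 0.1
    log_p = -0.5 * math.log(2 * math.pi * sigma ** 2)
    print(-log_p)  # about -1.38: a negative "loss" for a well-modeled sample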

CPU inference?

How can I run inference on a CPU-only device? Removing cuda() in inference.py and adding map_location='cpu' on model load does not help; it still asks for a GPU:

Traceback (most recent call last):
  File "inference.py", line 73, in <module>
    args.output_dir, args.sampling_rate, args.is_fp16)
  File "inference.py", line 46, in main
    mel = torch.autograd.Variable(mel.cuda())
  File "/home/ttsynth/tacotron2-nvidia-new/tacotron2/p3.6_venv/lib/python3.6/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    _check_driver()
  File "/home/ttsynth/tacotron2-nvidia-new/tacotron2/p3.6_venv/lib/python3.6/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
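The traceback points at mel = torch.autograd.Variable(mel.cuda()) in inference.py, so that .cuda() call has to be removed as well, not just the ones on the model. A hedged sketch of a fully CPU-side path (note that glow.py's infer() also allocates torch.cuda tensors internally, so it needs a similar edit):

    import torch

    # map_location moves the stored weights to the CPU at load time
    waveglow = torch.load('waveglow_256channels.pt', map_location='cpu')['model']
    waveglow = waveglow.remove_weightnorm(waveglow).eval()

    mel = torch.load('some_mel.pt', map_location='cpu')  # path illustrative
    mel = mel.unsqueeze(0)
    with torch.no_grad():
        audio = waveglow.infer(mel, sigma=0.6)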

RuntimeError: DataLoader worker is killed by signal: Illegal instruction.

I am using PyTorch 1.0.0a0+1e05f4b

Here's the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 39C P0 27W / 300W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

When I run train.py, I get RuntimeError: DataLoader worker is killed by signal: Illegal instruction.

I tried increasing shared memory following this link, but it didn't help.

Here's the full stack trace.

    Traceback (most recent call last):
      File "train.py", line 171, in <module>
        train(num_gpus, args.rank, args.group_name, **train_config)
      File "train.py", line 110, in train
        for i, batch in enumerate(train_loader):
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 632, in __next__
        idx, batch = self._get_batch()
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 611, in _get_batch
        return self.data_queue.get()
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/queues.py", line 94, in get
        res = self._recv_bytes()
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
        buf = self._recv_bytes(maxlength)
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
        buf = self._recv(4)
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
        chunk = read(handle, remaining)
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 274, in handler
        _error_if_any_worker_fails()
    RuntimeError: DataLoader worker (pid 13993) is killed by signal: Illegal instruction.

Can single GPU get good result?

Has anyone trained this model with a single GPU (1080 Ti) and gotten good results? In this situation I can only run the model with batch size 1, since I don't have enough GPUs...

FileNotFoundError: [Errno 2] No such file or directory: '/dev/fd/63'

When running inference.py with the provided pre-trained model (sudo python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_old.pt -o . --is_fp16 -s 0.6), the error message prints: FileNotFoundError: [Errno 2] No such file or directory: '/dev/fd/63'. Is this an issue with the spectrogram?
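/dev/fd/63 is the shell's process substitution (<(...)); sudo runs the command in a child process and by default closes inherited file descriptors, so the substituted path disappears. A simple workaround (hedged) is to write a real file list first:

    ls mel_spectrograms/*.pt > mel_files.txt
    sudo python3 inference.py -f mel_files.txt -w waveglow_old.pt -o . --is_fp16 -s 0.6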

Confused concept about bijective function in WaveGlow

Glad to see such great work and paper. However, after reading the paper, I find that the raw waveform (x) and the latent space (z) are not bijective on their own, because the mapping between them is conditioned on the mel-spectrogram. So can I think of this model like a VAE, where z sampled from a normal distribution forms a kind of continuous space that leads to pronunciation diversity in the result?

ask for help

ImportError: No module named 'tacotron2.layers'

Hi friends, I met this error when I ran python3.5 train.py -c config.json. How can I solve it? Thanks.
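tacotron2 is pulled into this repo as a git submodule (see Setup above), so this error usually means the submodule was never initialized:

    git submodule init
    git submodule update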

get out of memory error

Hi, I have changed batch_size to 1, but when I run "python train.py -c config.json" I still get the error below.

[screenshot of the out-of-memory error]

GPU:

[screenshot of nvidia-smi]

Can you help me or give some advice?

problem about mel2samp.py

It's very nice work! I'm studying it. But while training the model, I found a problem: there may be a mistake in mel2samp.py, as it doesn't produce the right mel.
When I use the "mel_spectrograms" that NVIDIA provides, the results that "waveglow_old.pt" infers are good. But when I use "python mel2samp.py -f test_files.txt -o . -c config.json" to generate mels, the results that "waveglow_old.pt" infers are bad: the voice sounds like a man, not the original woman.

The result with the NVIDIA-provided "mel_spectrograms":
[spectrogram image]

The result with mel2samp.py:
[spectrogram image]

This is my audio.
my_results_wav.zip

Inference time 3 times slower than real-time on single GTX 1080ti

I have tested NVIDIA/tacotron2+waveglow using inference.ipynb with the pretrained models on a single GTX 1080 Ti. Inference was 3 times slower than real time: it took 24 s to generate 7 s of audio.

Is it possible to generate faster than real time on a single GTX 1080 Ti?
Thank you.

Here is a screenshot of nvidia-smi taken while generating audio:
[nvidia-smi screenshot]

pip install requirements fail = pytorch not yet available

As far as I know, as of today PyTorch 1.0 isn't available on pip yet.

$ pip install -r requirements.txt
Collecting torch==1.0.0a0 (from -r requirements.txt (line 1))
Could not find a version that satisfies the requirement torch==1.0.0a0 (from -r requirements.txt (line 1)) (from versions: 0.1.2, 0.1.2.post1, 0.3.1, 0.4.0, 0.4.1)
No matching distribution found for torch==1.0.0a0 (from -r requirements.txt (line 1))

loss suddenly becomes very large at some steps

step_78540: loss_-4.065555573
step_78560: loss_-4.490265846
step_78580: loss_-4.454271317
step_78600: loss_-4.322495461
step_78620: loss_4191.683593750
step_78640: loss_-2.408614397
step_78660: loss_-2.908127546

need to update mel_files excluding MACOSX realted files

I got an error loading the mel files, and the root cause is that the zip file includes mel_spectrograms/.DS_Store, so the filelist load in https://github.com/NVIDIA/waveglow/blob/master/mel2samp.py#L47 fails.

The solution would be

!rm -rf content/mel_spectrogram/.DS_Store

but please republish the mel file.

Archive:  mel_spectrograms.zip
  inflating: mel_spectrograms/LJ001-0153.wav.pt  
  inflating: mel_spectrograms/LJ001-0096.wav.pt  
  inflating: mel_spectrograms/LJ001-0094.wav.pt  
  inflating: mel_spectrograms/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/mel_spectrograms/
  inflating: __MACOSX/mel_spectrograms/._.DS_Store  
  inflating: mel_spectrograms/LJ001-0079.wav.pt  
  inflating: mel_spectrograms/LJ001-0051.wav.pt  
  inflating: mel_spectrograms/LJ001-0063.wav.pt  
  inflating: mel_spectrograms/LJ001-0173.wav.pt  
  inflating: mel_spectrograms/LJ001-0102.wav.pt  
  inflating: mel_spectrograms/LJ001-0015.wav.pt  
  inflating: mel_spectrograms/LJ001-0072.wav.pt   
python inference.py  -f <(ls /content/mel_spectrograms/*.pt)  -w /content/waveglow_old.pt -o . -s 0.6

./LJ001-0015.wav_synthesis.wav
./LJ001-0051.wav_synthesis.wav
./LJ001-0063.wav_synthesis.wav
./LJ001-0072.wav_synthesis.wav
./LJ001-0079.wav_synthesis.wav
./LJ001-0094.wav_synthesis.wav
./LJ001-0096.wav_synthesis.wav
./LJ001-0102.wav_synthesis.wav
./LJ001-0153.wav_synthesis.wav
./LJ001-0173.wav_synthesis.wav

why does the loss suddenly jump back to its initial value

123873: -5.480178833
123874: -5.352289200
123875: -5.298225403
123876: -5.460500240
123877: -5.547176361
123878: -5.597653866
123879: -5.451864243
123880: -5.445433617
123881: -5.494832516
123882: -5.272130013
123883: -5.312953949
123884: -5.550226212
123885: -5.573468208
123886: -5.405485153
123887: -4.718036175
123888: -5.176541805
123889: -4.795463085
123890: -4.772330284
123891: -4.845523357
123892: 40490.179687500
123893: -3.482588530
123894: -2.216503620
123895: -1.672664404
123896: -1.454238176
123897: -1.419931889
123898: -1.281533122
123899: -1.110476017
123900: -1.149755478
123901: -1.464488387
123902: -1.522264838
123903: -1.716608882
123904: -1.713886499
