
waveglow's Introduction

WaveGlow

WaveGlow: a Flow-based Generative Network for Speech Synthesis

Ryan Prenger, Rafael Valle, and Bryan Catanzaro

In our recent paper, we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
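For reference, a minimal sketch of that single cost function (hedged; variable names are illustrative rather than the repo's exact API). The latent z is pushed toward a zero-mean Gaussian prior, and the log-determinant terms from the affine coupling layers and invertible 1x1 convolutions keep the likelihood accounting exact:

    import torch

    def waveglow_style_nll(z, log_s_list, log_det_w_list, sigma=1.0):
        # Gaussian prior term on the latents produced by the flow
        loss = torch.sum(z * z) / (2 * sigma * sigma)
        # Subtract the log-determinants from the affine coupling layers
        # (log_s) and the invertible 1x1 convolutions (log_det_w), since
        # the loss is a negative log-likelihood.
        for log_s in log_s_list:
            loss = loss - torch.sum(log_s)
        for log_det_w in log_det_w_list:
            loss = loss - torch.sum(log_det_w)
        return loss / z.numel()  # per-element normalization (one common choice)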

Our PyTorch implementation produces audio samples at a rate of 1200 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.

Visit our website for audio samples.

Setup

  1. Clone our repo and initialize submodule

    git clone https://github.com/NVIDIA/waveglow.git
    cd waveglow
    git submodule init
    git submodule update
  2. Install requirements

    pip3 install -r requirements.txt

  3. Install Apex

Generate audio with our pre-existing model

  1. Download our published model
  2. Download mel-spectrograms
  3. Generate audio

    python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6

N.b. use convert_model.py to convert your older models to the current model with fused residual and skip connections.
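Roughly, inference.py does the following for each mel file. This is a hedged sketch, run from the repo root so glow.py is importable; it assumes the published checkpoint is a dict with a 'model' key:

    import torch

    # Load the published checkpoint and strip weight norm for inference
    waveglow = torch.load('waveglow_256channels.pt')['model']
    waveglow = waveglow.remove_weightnorm(waveglow)
    waveglow.cuda().eval()

    # Load one mel-spectrogram, add a batch dimension, and synthesize
    mel = torch.load('mel_spectrograms/LJ001-0015.wav.pt').cuda()
    mel = mel.unsqueeze(0)
    with torch.no_grad():
        audio = waveglow.infer(mel, sigma=0.6)  # sigma matches the -s flag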

Train your own model

  1. Download LJ Speech Data. In this example it's in data/

  2. Make a list of the file names to use for training/testing

    ls data/*.wav | tail -n+10 > train_files.txt
    ls data/*.wav | head -n10 > test_files.txt
  3. Train your WaveGlow networks

    mkdir checkpoints
    python train.py -c config.json

    For multi-GPU training replace train.py with distributed.py. Only tested with single node and NCCL.

    For mixed precision training set "fp16_run": true in config.json.
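    For example, the relevant fragment of config.json (other train_config keys unchanged):

        "train_config": {
            "fp16_run": true,
            ...
        }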

  4. Make test set mel-spectrograms

    python mel2samp.py -f test_files.txt -o . -c config.json

  5. Do inference with your network

    ls *.pt > mel_files.txt
    python3 inference.py -f mel_files.txt -w checkpoints/waveglow_10000 -o . --is_fp16 -s 0.6

waveglow's People

Contributors

azraelkuan, bigbigcity1986, bryancatanzaro, nmstoker, nvchai, rafaelvalle, srstevenson


waveglow's Issues

Training fails with torch.jit.script

I pulled the latest code and now training fails with error:

raise NotSupportedError(base.range(), "slicing multiple dimensions at the same time isn't supported yet")
torch.jit.frontend.NotSupportedError: slicing multiple dimensions at the same time isn't supported yet

    @torch.jit.script
    def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
        n_channels_int = n_channels[0]
        in_act = input_a + input_b
        t_act = torch.nn.functional.tanh(in_act[:, :n_channels_int, :])
                                         ~~~~~~ <--- HERE
        s_act = torch.nn.functional.sigmoid(in_act[:, n_channels_int:, :])
        acts = t_act * s_act
        return acts
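One workaround (hedged; you lose the fused-kernel benefit but keep the same numerics) is to drop the decorator so the old JIT frontend never parses the slice:

    import torch

    # Same computation as the repo's fused_add_tanh_sigmoid_multiply,
    # minus @torch.jit.script; torch.tanh/torch.sigmoid replace the
    # deprecated torch.nn.functional variants.
    def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
        n_channels_int = n_channels[0]
        in_act = input_a + input_b
        t_act = torch.tanh(in_act[:, :n_channels_int, :])
        s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
        return t_act * s_act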

pretrained model which can resume training

@rafaelvalle Thanks for sharing! It helps a lot.
I find training very slow (about 1 epoch/day with batch size 1) when training on my own data (about 12 hours of audio). Could you offer a model from which training can be resumed?

Thanks a lot !!

division by zero

Hello everyone, when I ran train.py I met this error:

    WARNING: Multiple GPUs detected but no distributed group set
    Only running 1 GPU. Use distributed.py for multiple GPUs
    output directory checkpoints
    Traceback (most recent call last):
      File "train.py", line 171, in <module>
        train(num_gpus, args.rank, args.group_name, **train_config)
      File "train.py", line 106, in train
        epoch_offset = max(0, int(iteration / len(train_loader)))
    ZeroDivisionError: division by zero

How can I solve this problem? I have already changed the batch size. Is there something wrong with NCCL? Thank you.
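This usually means len(train_loader) is 0, i.e. the file list is empty or has fewer entries than the batch size (the training DataLoader drops incomplete batches). A minimal reproduction of the arithmetic, with illustrative sizes:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.zeros(3, 1))                   # only 3 samples
    loader = DataLoader(dataset, batch_size=12, drop_last=True)  # no full batch
    print(len(loader))  # 0, so int(iteration / len(loader)) divides by zero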

During training the loss goes up and down and does not converge. Is that normal? Also, what should the final loss value look like?

I implemented a WaveGlow model in my own project. The code is almost the same as this repo, with some modifications:

  1. Upsample the mel-spectrogram to the number of groups, so n_mel_channels in WN reduces to 80.
  2. Change logdet() in the invertible 1x1 convolution to det().abs().log() as #35 did, because in my first few runs the loss became NaN after thousands of steps.

The n_channels is 256, so the model is 4-5 times smaller than the original. I run the model on two 1080 Ti cards using nn.DataParallel with a batch size of 8. After about 5k steps the loss is around -6 to -7 and I can hear some speech-like sentences in the model outputs. Then the loss starts to go up and down, even above zero, and gets no further. Adding 24 flows to the model doesn't help, and with batch size 32 the problem still exists. Maybe more steps would improve it, but after 70k steps I still cannot see any improvement. Has anyone had similar problems?

I also want to ask about the final loss as a reference. In my case -11 is the smallest value I can get before the problem above happens. In #5 @azraelkuan got a loss of -18 at 56k steps; is that the normal loss value?

@torch.jit.script is not working properly

Torch version: 0.4.1 (the latest version shown on the official website)
CUDA: 8.0
CUDNN: 7.0

I have no idea about the error, and my colleague ran into the same problem.

Here is the log:
raise NotSupportedError(base.range(), "slicing multiple dimensions at the same time isn't supported yet")
torch.jit.frontend.NotSupportedError: slicing multiple dimensions at the same time isn't supported yet

    @torch.jit.script
    def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
        n_channels_int = n_channels[0]
        in_act = input_a + input_b
        t_act = torch.nn.functional.tanh(in_act[:, :n_channels_int, :])
                <--- HERE
        s_act = torch.nn.functional.sigmoid(in_act[:, n_channels_int:, :])
        acts = t_act * s_act
        return acts

Has anyone tested the inference process using a trained model?

I have trained a model for about 60k steps.
When I test inference.py using the checkpoint waveglow_0, the wav is all noise;
but when I use the trained model (60k), the generated wav is almost all zeros, with nothing audible in it.
Has anyone had this problem?

about n_group

Has anyone changed n_group to 4 instead of 8?
I just want to remove the line noise: with n_group=4, for 24k audio, the line noise would exist only at 6k and 12k. But I cannot change it; it gives this error:

File "/running_package/waveglow/glow.py", line 231, in forward
output = self.WN[k]((audio_0, spect))
File "/tmp/venv3/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/running_package/waveglow/glow.py", line 145, in forward
audio = self.start(audio)
File "/tmp/venv3/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/tmp/venv3/lib64/python3.6/site-packages/torch/nn/modules/conv.py", line 176, in forward
self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [512, 2, 1], expected input[3, 1, 1500] to have 2 channels, but got 1 channels instead

this is the config:

"waveglow_config": {
    "n_mel_channels": 80,
    "n_flows": 6
    "n_group": 4,
    "n_early_every": 2,
    "n_early_size": 1,
    "WN_config": {
        "n_layers": 8,
        "n_channels": 512,
        "kernel_size": 3

negative loss and failed to load model for retraining

I encountered 2 issues when following this work.
The first: when I trained the model from scratch, I got negative loss values. I read your paper and think this is normal, right? However, after 7 epochs (about 22,000 iterations), the synthesized audio was white noise. Did I make a mistake somewhere? For how many epochs and iterations did you train?
The second: I retrained from your "waveglow_old.pt", but loading that checkpoint loses "iteration", "optimizer" and "learning_rate". During training I got "tensor(nan., device='cuda:0')" and the loss is NaN.

Can you help me or give some advice?

Coupling layer loss computation

It looks like the coupling layer loss is first summed over the batch, time steps, and channels (https://github.com/NVIDIA/waveglow/blob/master/glow.py#L55). Note that the number of channels is not the same for every coupling layer, because the early outputs reduce the channel count as you go deeper. But then the loss is averaged over the batch, time steps, and n_group (https://github.com/NVIDIA/waveglow/blob/master/glow.py#L59). This means that regardless of the number of channels actually present in a given coupling layer, we're averaging over n_group (8).

Is this intentional? It's unclear to me why the coupling layer loss is computed this way.
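A toy illustration of the scaling in question (shapes are illustrative):

    import torch

    batch, n_group, time = 2, 8, 100
    z_full  = torch.randn(batch, n_group, time)      # layer before any early output
    z_small = torch.randn(batch, n_group - 2, time)  # layer after early outputs
    total = (z_full * z_full).sum() + (z_small * z_small).sum()
    # Dividing by a fixed batch * n_group * time treats both layers as if
    # they had n_group channels, even though z_small has fewer.
    loss = total / (batch * n_group * time)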

Segment size, batch size and VRAM; fp16

Can somebody please share a training config and other parameters for training on LJ data (22 kHz) on a 1080 (8 GB) card?
I am trying to use the default config as in current master and am getting out-of-video-memory errors.

I have already lowered batch_size to 1 and tried to reduce segment_size, but that did not help.

Also, I do not see any training options to switch to fp16, although they exist in the inference script.

About fp16 error

File "train.py", line 144, in train
loss.backward()
File "/tmp/venv3/lib64/python3.6/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/tmp/venv3/lib64/python3.6/site-packages/torch/autograd/init.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
File "/running_package/waveglow-fp16/distributed.py", line 125, in allreduce_params
coalesced = _flatten_dense_tensors(grads)
File "/running_package/waveglow-fp16/distributed.py", line 68, in _flatten_dense_tensors
flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
RuntimeError: Expected a Tensor of type torch.cuda.HalfTensor but found a type torch.cuda.FloatTensor for sequence element 926 in sequence argument at position #1 'tensors'

about "segment_length"

I don't think a longer segment_length is necessarily better; in my experiment I just use 6k.

Set WN channel nums smaller

Hi, the original model is too big for most people to train on a single GPU, but if we set the WN channel count smaller, such as 256, we can train with a larger batch size on a single GPU. Is the final speech quality the same?

loss

I read the code, and the loss function is -log p. Why is the training loss negative?
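A negative value is expected here: the loss is a negative log-density, and the density of a continuous variable can exceed 1, so its negative log can go below zero. A quick numeric check:

    import math

    # Peak log-density of a Gaussian with sigma = 0.1, evaluated at x = 0
    sigma = 0.1
    log_p = -0.5 * math.log(2 * math.pi * sigma ** 2)
    print(-log_p)  # about -1.38: a negative "loss" for a well-modeled sample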

CPU inference?

How can I run inference on a CPU-only device? Removing cuda() in inference.py and adding map_location='cpu' on model load does not help; it still asks for a GPU:

Traceback (most recent call last):
  File "inference.py", line 73, in <module>
    args.output_dir, args.sampling_rate, args.is_fp16)
  File "inference.py", line 46, in main
    mel = torch.autograd.Variable(mel.cuda())
  File "/home/ttsynth/tacotron2-nvidia-new/tacotron2/p3.6_venv/lib/python3.6/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    _check_driver()
  File "/home/ttsynth/tacotron2-nvidia-new/tacotron2/p3.6_venv/lib/python3.6/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
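The traceback points at mel = torch.autograd.Variable(mel.cuda()) in inference.py, so that .cuda() call has to be removed as well, not just the ones on the model. A hedged sketch of a fully CPU-side path (note that glow.py's infer() also allocates torch.cuda tensors internally, so it needs a similar edit):

    import torch

    # map_location moves the stored weights to the CPU at load time
    waveglow = torch.load('waveglow_256channels.pt', map_location='cpu')['model']
    waveglow = waveglow.remove_weightnorm(waveglow).eval()

    mel = torch.load('some_mel.pt', map_location='cpu')  # path illustrative
    mel = mel.unsqueeze(0)
    with torch.no_grad():
        audio = waveglow.infer(mel, sigma=0.6)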

RuntimeError: DataLoader worker is killed by signal: Illegal instruction.

I am using PyTorch 1.0.0a0+1e05f4b

Here's the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 39C P0 27W / 300W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

When I run train.py, I get RuntimeError: DataLoader worker is killed by signal: Illegal instruction.

I tried increasing shared memory following this link, but it didn't help.

Here's the full stack trace.

    Traceback (most recent call last):
      File "train.py", line 171, in <module>
        train(num_gpus, args.rank, args.group_name, **train_config)
      File "train.py", line 110, in train
        for i, batch in enumerate(train_loader):
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 632, in __next__
        idx, batch = self._get_batch()
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 611, in _get_batch
        return self.data_queue.get()
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/queues.py", line 94, in get
        res = self._recv_bytes()
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
        buf = self._recv_bytes(maxlength)
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
        buf = self._recv(4)
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
        chunk = read(handle, remaining)
      File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 274, in handler
        _error_if_any_worker_fails()
    RuntimeError: DataLoader worker (pid 13993) is killed by signal: Illegal instruction.

Can single GPU get good result?

Has anyone trained this model with a single GPU (1080 Ti) and gotten good results? In this situation I can only run the model with batch size 1, since I don't have enough GPUs...

FileNotFoundError: [Errno 2] No such file or directory: '/dev/fd/63'

When running inference.py with the provided pre-trained model (sudo python3 inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_old.pt -o . --is_fp16 -s 0.6), the error message prints: FileNotFoundError: [Errno 2] No such file or directory: '/dev/fd/63'. Is this an issue with the spectrogram?
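/dev/fd/63 is the shell's process substitution (<(...)); sudo runs the command in a child process and by default closes inherited file descriptors, so the substituted path disappears. A simple workaround (hedged) is to write a real file list first:

    ls mel_spectrograms/*.pt > mel_files.txt
    sudo python3 inference.py -f mel_files.txt -w waveglow_old.pt -o . --is_fp16 -s 0.6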

Confused concept about bijective function in WaveGlow

Glad to see such great work and paper. However, after reading the paper, I find that the raw waveform (x) and the latent space (z) are not bijective on their own, because the mapping between them is conditioned on the mel-spectrogram. So can I think of this model like a VAE, where z sampled from a normal distribution forms a kind of continuous space that leads to pronunciation diversity in the result?

ask for help

ImportError: No module named 'tacotron2.layers'

Hi friends, I met this error when I ran python3.5 train.py -c config.json. How can I solve it? Thanks.
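tacotron2 is pulled into this repo as a git submodule (see Setup above), so this error usually means the submodule was never initialized:

    git submodule init
    git submodule update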

get out of memory error

Hi, I have changed batch_size to 1, but when I run "python train.py -c config.json" I still get the error below.

[screenshot of the out-of-memory error]

GPU:

[screenshot of nvidia-smi]

Can you help me or give some advice?

problem about mel2samp.py

It's very nice work! I'm studying it. But while training the model, I found a problem: there may be a mistake in mel2samp.py, as it doesn't produce the right mel.
When I use the "mel_spectrograms" that NVIDIA provides, the results that "waveglow_old.pt" infers are good. But when I use "python mel2samp.py -f test_files.txt -o . -c config.json" to generate mels, the results that "waveglow_old.pt" infers are bad: the voice sounds like a man, not the original woman.

The result with the NVIDIA-provided "mel_spectrograms":
[spectrogram image]

The result with mel2samp.py:
[spectrogram image]

This is my audio.
my_results_wav.zip

Inference time 3 times slower than real-time on single GTX 1080ti

I have tested NVIDIA/tacotron2+waveglow using inference.ipynb with the pretrained models on a single GTX 1080 Ti. Inference was 3 times slower than real time: it took 24 s to generate 7 s of audio.

Is it possible to generate faster than real time on a single GTX 1080 Ti?
Thank you.

Here is a screenshot of nvidia-smi taken while generating audio:
[nvidia-smi screenshot]

pip install requirements fail = pytorch not yet available

As far as I know, as of today PyTorch 1.0 isn't available on pip yet.

$ pip install -r requirements.txt
Collecting torch==1.0.0a0 (from -r requirements.txt (line 1))
Could not find a version that satisfies the requirement torch==1.0.0a0 (from -r requirements.txt (line 1)) (from versions: 0.1.2, 0.1.2.post1, 0.3.1, 0.4.0, 0.4.1)
No matching distribution found for torch==1.0.0a0 (from -r requirements.txt (line 1))

loss suddenly becomes very large at some steps

step_78540: loss_-4.065555573
step_78560: loss_-4.490265846
step_78580: loss_-4.454271317
step_78600: loss_-4.322495461
step_78620: loss_4191.683593750
step_78640: loss_-2.408614397
step_78660: loss_-2.908127546

need to update mel_files excluding MACOSX realted files

I got an error loading the mel files, and the root cause is that the zip file includes mel_spectrograms/.DS_Store, so the filelist load in https://github.com/NVIDIA/waveglow/blob/master/mel2samp.py#L47 fails.

The solution would be

!rm -rf content/mel_spectrogram/.DS_Store

but please republish the mel file.

Archive:  mel_spectrograms.zip
  inflating: mel_spectrograms/LJ001-0153.wav.pt  
  inflating: mel_spectrograms/LJ001-0096.wav.pt  
  inflating: mel_spectrograms/LJ001-0094.wav.pt  
  inflating: mel_spectrograms/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/mel_spectrograms/
  inflating: __MACOSX/mel_spectrograms/._.DS_Store  
  inflating: mel_spectrograms/LJ001-0079.wav.pt  
  inflating: mel_spectrograms/LJ001-0051.wav.pt  
  inflating: mel_spectrograms/LJ001-0063.wav.pt  
  inflating: mel_spectrograms/LJ001-0173.wav.pt  
  inflating: mel_spectrograms/LJ001-0102.wav.pt  
  inflating: mel_spectrograms/LJ001-0015.wav.pt  
  inflating: mel_spectrograms/LJ001-0072.wav.pt   
python inference.py  -f <(ls /content/mel_spectrograms/*.pt)  -w /content/waveglow_old.pt -o . -s 0.6

./LJ001-0015.wav_synthesis.wav
./LJ001-0051.wav_synthesis.wav
./LJ001-0063.wav_synthesis.wav
./LJ001-0072.wav_synthesis.wav
./LJ001-0079.wav_synthesis.wav
./LJ001-0094.wav_synthesis.wav
./LJ001-0096.wav_synthesis.wav
./LJ001-0102.wav_synthesis.wav
./LJ001-0153.wav_synthesis.wav
./LJ001-0173.wav_synthesis.wav

why does the loss suddenly jump back to its initial value

123873: -5.480178833
123874: -5.352289200
123875: -5.298225403
123876: -5.460500240
123877: -5.547176361
123878: -5.597653866
123879: -5.451864243
123880: -5.445433617
123881: -5.494832516
123882: -5.272130013
123883: -5.312953949
123884: -5.550226212
123885: -5.573468208
123886: -5.405485153
123887: -4.718036175
123888: -5.176541805
123889: -4.795463085
123890: -4.772330284
123891: -4.845523357
123892: 40490.179687500
123893: -3.482588530
123894: -2.216503620
123895: -1.672664404
123896: -1.454238176
123897: -1.419931889
123898: -1.281533122
123899: -1.110476017
123900: -1.149755478
123901: -1.464488387
123902: -1.522264838
123903: -1.716608882
123904: -1.713886499
