
vits's People

Contributors

jaywalnut310, jik876, juheeuu


vits's Issues

Can't train with fp16 on Nvidia RTX3060

Training with fp16 doesn't work for me on an RTX 3060. I'll look into fixing it, but for future reference, here is the full stack trace (torch version 1.9.0):

INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for 1 nodes.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:158, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

[Bug] 80000 exceeds legal port number range 0-65535

os.environ['MASTER_PORT'] = '80000'

pytorch/pytorch#67172 (comment)

Setting the port to 80000 (or any other number outside 0-65535) will cause the error below if the PyTorch version is >= 1.10.2:

TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. torch._C._distributed_c10d.TCPStore(host_name: str, port: int, world_size: int = -1, is_master: bool = False, timeout: datetime.timedelta = datetime.timedelta(seconds=
300), wait_for_workers: bool = True, multi_tenant: bool = False)

Invoked with: 'localhost', 80000, 1, True, datetime.timedelta(seconds=1800); kwargs: multi_tenant=True
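
The fix is presumably just choosing a port inside the valid range (8000 below is an arbitrary choice):

import os

# Any value in the legal TCP port range 0-65535 works; '80000' does not.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '8000'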

poor performance on short phrases

I trained the multi-speaker model on VCTK (~400k), and for longer input phrases (i.e., >5 words), performance is approximately comparable to the released pretrained model.

For shorter phrases (i.e., 1-2 words), pronunciation degrades significantly. Words that are pronounced correctly as part of a longer phrase become hard to understand when passed as the only word in the input.

Is anyone else experiencing this? I would love some intuition about what's causing this and how to correct it.

Question: why not convert English words to phonemes?

The training filelist looks something like this:

DUMMY1/LJ045-0096.wav|Mrs. De Mohrenschildt thought that Oswald,
DUMMY1/LJ049-0022.wav|The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent.

Why not convert the words to phonemes?
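
For what it's worth, the repo's english_cleaners2 does convert text to phonemes at preprocessing time (the *.cleaned filelists contain IPA). A minimal sketch of that conversion with the phonemizer package, assuming an espeak backend is installed:

from phonemizer import phonemize

raw = "Mrs. De Mohrenschildt thought that Oswald,"
# english_cleaners2-style settings: raw text in, IPA phoneme string out.
ipa = phonemize(raw, language='en-us', backend='espeak',
                strip=True, preserve_punctuation=True, with_stress=True)
print(ipa)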

Anybody having luck fine tuning?

I'm using a clean 40-hour dataset (female, American English) that I used with Tacotron with good results. I've trained VITS twice now, and it starts overfitting around 70k steps. The output is definitely intelligible and in the correct tone, but the prosody is way off. The first run used the default configs. For the second run I tried decreasing the learning rate and LR decay; it helped some with overall loss, but it still started overfitting around 70k.

Question regarding symbols used

Hi,

Thanks for this research!!

I have a query regarding the use of _letters in the symbols variable in symbols.py. If we are using phonemes, why do we include plain characters in symbols? The text encoder vocabulary is decided based on that list.

Wouldn't _letters_ipa alone be sufficient?
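
For context, the vocabulary is assembled roughly like this (a paraphrase from memory of text/symbols.py; the character sets are abbreviated here, so treat the exact strings as placeholders):

_pad = '_'
_punctuation = ';:,.!? '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = 'ɑɐɒæɔəɛɜɪʊʌ'  # abbreviated; the real file lists the full IPA inventory

# Plain letters and IPA symbols both land in the encoder vocabulary,
# so cleaned text may legally contain either.
symbols = [_pad] + list(_punctuation) + list(_letters) + list(_letters_ipa)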

How to fix the noise during inference time?

Hi Jaehyeon,

May I ask how to fix the stochastic noise at inference time? I want the generated audio to be reproducible, so I need to pin down the random noise. Currently it seems I can only control the noise scale.

sid = torch.LongTensor([1])  # speaker identity
stn_tst = get_text("Tell me the answer please", hps_ms)

with torch.no_grad():
    x_tst = stn_tst.unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)])
    audio = net_g_ms.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=1,
                           noise_scale_w=2, length_scale=1)[0][0, 0].data.float().numpy()
ipd.display(ipd.Audio(audio, rate=hps_ms.data.sampling_rate))
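
One workaround that should make runs repeatable (my suggestion, not an official knob): infer draws its noise from PyTorch's global RNG, so seeding it immediately before each call fixes the sample:

torch.manual_seed(1234)  # same seed before every call -> same noise -> same audio
audio = net_g_ms.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=1,
                       noise_scale_w=2, length_scale=1)[0][0, 0].data.float().numpy()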

Questions about Loss_dur

As in the paper, Loss_dur is the negative variational lower bound of Eq. (7):

(image: Eq. (7) from the paper)

Loss_dur = -log(p/q) = log q - log p (inside the expectation)

Is log q computed as follows?

(image: Eq. (8) from the paper)

In the code:

(image: screenshot of the corresponding computation in the code)

Why, in the computation of log q, is it "- logdet_tot_q" rather than "+ logdet_tot_q"?

What am I missing here?

  File "gradiodemo.py", line 13, in <module>
    from data_utils import TextAudioLoader, TextAudioCollate, TextAudioSpeakerLoader, TextAudioSpeakerCollate
  File "/root/vits/data_utils.py", line 9, in <module>
    from mel_processing import spectrogram_torch
  File "/root/vits/mel_processing.py", line 9, in <module>
    import librosa
  File "/usr/local/lib/python3.8/dist-packages/librosa/__init__.py", line 211, in <module>
    from . import core
  File "/usr/local/lib/python3.8/dist-packages/librosa/core/__init__.py", line 6, in <module>
    from .audio import *  # pylint: disable=wildcard-import
  File "/usr/local/lib/python3.8/dist-packages/librosa/core/audio.py", line 8, in <module>
    import soundfile as sf
  File "/usr/local/lib/python3.8/dist-packages/soundfile.py", line 142, in <module>
    raise OSError('sndfile library not found')
OSError: sndfile library not found
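
The soundfile package needs the system libsndfile shared library. The usual fix on Debian/Ubuntu (an assumption based on the error message, nothing repo-specific):

sudo apt-get install libsndfile1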

no performance increase for multi-GPU

I'm noticing that performance is not affected at all when using multiple GPUs, although the code seems to use DistributedDataParallel. I'm confused.

The number of batches per epoch is the same, and the seconds per iteration are the same (I use tqdm to measure them).

There is one change in my code, however: I set DDP(..., find_unused_parameters=True). I've tried it both enabled and disabled, and the speed is still the same.

For quick checking, I advise just wrapping the train loader in tqdm:

for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths, speakers_or_embeds) in enumerate(
        tqdm(train_loader, total=len(train_loader), disable=writers is None)):

Speed on the VCTK dataset, multi-GPU (4x RTX 8000):

  • 293.8784s/epoch
  • 1.63sec/batch

Find the environment information below:

$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.6.0
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Pop!_OS 20.04 LTS
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Quadro RTX 8000
GPU 1: Quadro RTX 8000
GPU 2: Quadro RTX 8000
GPU 3: Quadro RTX 8000

Nvidia driver version: 470.63.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] numpy                     1.18.5                   pypi_0    pypi
[conda] torch                     1.6.0                    pypi_0    pypi
[conda] torchvision               0.7.0                    pypi_0    pypi

Export the model to ONNX format.

Thanks for your great work. I've trained a VITS model; it synthesizes very fluently and inference is very fast. However, is it possible to export the trained model to ONNX format, so as to run inference with onnxruntime even faster? Thanks in advance.
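
A rough export sketch (my assumption of how it could be wired up, not a tested recipe; the wrapper, dummy shapes, axis names, and opset are placeholders, and the stochastic ops may need massaging before tracing succeeds):

import torch

class InferWrapper(torch.nn.Module):
    # Hypothetical wrapper so export sees a plain forward() around net_g.infer.
    def __init__(self, net_g):
        super().__init__()
        self.net_g = net_g

    def forward(self, x, x_lengths):
        return self.net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.0)[0]

x = torch.randint(0, 100, (1, 50), dtype=torch.long)  # dummy phoneme ids
x_lengths = torch.LongTensor([50])
torch.onnx.export(InferWrapper(net_g).eval(), (x, x_lengths), "vits.onnx",
                  input_names=["x", "x_lengths"], output_names=["audio"],
                  dynamic_axes={"x": {1: "text_len"}, "audio": {2: "audio_len"}},
                  opset_version=13)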

Mispronunciation, and what is the purpose of the add_blank config?

I've trained on my own dataset with the default config, except for the add_blank option (I changed it to false). I know add_blank inserts a 0 between the symbol ids, but in my case it was disabled. I trained with phonemes and have reached 200k steps, but some phonemes seem to be pronounced wrong. So I have some questions:

  1. What is the purpose of the add_blank config? (A sketch of what it does follows below.)
  2. Why does the model mispronounce some phonemes, and how can I improve the pronunciation?
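
For question 1: as far as I can tell from commons.py and the inference notebook, add_blank intersperses a blank id (0) between every symbol before the text encoder, which is reported to help alignment and pronunciation:

# commons.intersperse (paraphrased): [5, 12, 7] -> [0, 5, 0, 12, 0, 7, 0]
def intersperse(lst, item):
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

# inference.ipynb-style usage:
if hps.data.add_blank:
    text_norm = intersperse(text_norm, 0)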

CPU inference is slow

Thank you for sharing your code. I tried it on my own (Chinese) dataset with this config:

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 1234,
    "epochs": 10000,
    "learning_rate": 2e-4,
    "betas": [0.8, 0.99],
    "eps": 1e-9,
    "batch_size": 32,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 8192,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0
  },
  "data": {
    "training_files": "filelists/mt_f065_train_filelist.txt",
    "validation_files": "filelists/mt_f065_val_filelist.txt",
    "text_cleaners": ["collapse_whitespace"],
    "max_wav_value": 32768.0,
    "sampling_rate": 16000,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 0,
    "cleaned_text": false
  },
  "model": {
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "upsample_rates": [8, 8, 2, 2],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [16, 16, 4, 4],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  }
}

I ran synthesis on CPU with batch size 1, but the speed is far worse than on GPU:
total audio length: 77.76s
total cost GPU: 4.84s
total cost CPU: 151.26s
average rtf GPU: 0.06
average rtf CPU: 1.95

I checked the checkpoint and found that the G_*.pth file is 445 MB.

My question is:

  1. Is the CPU inference time expected?
  2. Is the checkpoint I'm using correct? (A guess about its size follows below.)
  3. Could you kindly share any ideas on how to make CPU inference faster?

Thank you in advance
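
On question 2 (a guess on my part: the training checkpoints also store the optimizer state, which inference does not need), the file can be slimmed down to roughly the bare weights:

import torch

ckpt = torch.load("G_200000.pth", map_location="cpu")  # placeholder filename
# Keep only what utils.load_checkpoint reads for inference; drop 'optimizer'.
slim = {"model": ckpt["model"], "iteration": ckpt["iteration"], "learning_rate": ckpt["learning_rate"]}
torch.save(slim, "G_200000_slim.pth")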

Training time on VCTK.

Thanks for your great work. I have been training a multi-speaker VITS model for 160,000 steps over 2 days on 8 V100 GPUs. The synthesized speech is clear but not that fluent. How many steps did you train on the VCTK dataset, and how long did it take? Thanks in advance.

How to continue fine tuning of model?

How can I continue training one of my finetuned models? I'm using Google Colab, so it was only able to run to 36k steps before stopping. I see that there is a generator and a discriminator model; how does this work when continuing fine tuning? Do I just load the generator and lose the training of the discriminator?

I tried to start again, just feeding the path of my finetuned G model instead of ljs_base, but the quality is considerably worse, and it seems to have started training nearly from the beginning.
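
For reference, as far as I can tell from train.py and utils.py, resuming picks up the latest G_*.pth and D_*.pth in the model directory and restores both networks and their optimizers, so keeping both files (with their step-numbered names) in the run directory should avoid the training-from-scratch behavior:

# train.py's resume path (paraphrased):
_, _, _, epoch_str = utils.load_checkpoint(
    utils.latest_checkpoint_path(hps.model_dir, "G_*.pth"), net_g, optim_g)
_, _, _, epoch_str = utils.load_checkpoint(
    utils.latest_checkpoint_path(hps.model_dir, "D_*.pth"), net_d, optim_d)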

Thanks!

How is the KL loss computed?

Thanks for the great work!
There's one thing that confuses me very much, though. In the paper, the KL loss is computed as Eq. (3): L_kl = log q_φ(z|x_lin) − log p_θ(z|c_text, A).
In vanilla VAEs, the KL loss is actually an expectation, and since the two distributions involved are both Gaussian, there is a closed-form expression. It is understandable that, as the prior in VITS is no longer a plain Gaussian, we don't calculate the expectation in closed form but instead use the sampled z to evaluate the probability densities and compute Eq. (3). Up to this point I have no problem.

Nevertheless, in the code I notice that the KL loss is calculated in a special way, in losses.py:kl_loss(z_p, logs_q, m_p, logs_p, z_mask). In this function, as far as I know, m_p and logs_p are extracted from the text encodings, and logs_q comes from the spectrogram posterior encoder; z_p is the flow-transformed latent variable from the posterior encoder. The function calculates the KL loss as the sum of (logs_p − logs_q − 0.5) and 0.5·(z_p − m_p)²·exp(−2·logs_p). How does this come about? I guess the first term comes from Eq. (4), but why is the log-determinant missing? Also, why does m_q not participate in this loss? I really cannot relate this calculation to Eq. (3) in the paper.
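
For reference, this is losses.py as I read it (paraphrased from memory, so double-check against the repo):

import torch

def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
    # z_p, logs_q: flow-transformed posterior sample and posterior log-std, [b, h, t]
    # m_p, logs_p: prior mean and log-std from the text side, [b, h, t]
    kl = logs_p - logs_q - 0.5
    kl += 0.5 * ((z_p - m_p) ** 2) * torch.exp(-2.0 * logs_p)
    kl = torch.sum(kl * z_mask)
    return kl / torch.sum(z_mask)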

There is another question, by the way. I notice that the ResidualCouplingLayer is configured with mean_only=True, which means the log-determinant returned by the flow is always 0. In this case, the transformed distribution is still a Gaussian, right?

Again, thanks for the work and code!

Phonemizer is too slow

VITS accepts phonemized input during training and inference. The phonemizer doesn't create a bottleneck during training because we use pre-processed data. But at inference, if the model forward takes 0.1 seconds, the phonemizer consumes at least 0.3-0.5 seconds, so the majority of time is spent in the phonemizer. Do you have any workaround for this? Should we train a lightweight neural phonemizer for this purpose?
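
One stopgap that helps only when inputs repeat (purely my suggestion): memoize the phonemizer, since its per-call espeak overhead dominates:

from functools import lru_cache
from phonemizer import phonemize

@lru_cache(maxsize=10000)
def cached_phonemize(text: str) -> str:
    # Same settings as english_cleaners2; cache hits skip espeak entirely.
    return phonemize(text, language='en-us', backend='espeak',
                     strip=True, preserve_punctuation=True, with_stress=True)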

Questions about training on 44 kHz audio.

Hello,

I trained at 44 kHz hoping for a higher-quality voice conversion, since the results were good when I trained on VCTK at 22 kHz.

With this setup, the TTS inference result reads the text very quickly.

Regarding this phenomenon: can you tell me whether there are parameters I need to adjust when training on 44 kHz audio rather than 22 kHz? (A guess at the coupled config fields follows below.)
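
My guess at the coupled fields (an assumption, not a verified recipe): the STFT parameters and the decoder upsampling must stay consistent, with the product of upsample_rates equal to hop_length (8*8*4*2 = 512 below). Something like:

"data": {
  "sampling_rate": 44100,
  "filter_length": 2048,
  "hop_length": 512,
  "win_length": 2048
},
"model": {
  "upsample_rates": [8, 8, 4, 2],
  "upsample_kernel_sizes": [16, 16, 8, 4]
}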

Some advice about the MAS algorithm

As in the Glow-TTS definition of MAS:

(image: MAS definition from the Glow-TTS paper)

To satisfy monotonicity and surjectivity, if z_j and {μ_i, σ_i} are aligned, the previous latent variable z_{j-1} must have been aligned to either {μ_{i-1}, σ_{i-1}} or {μ_i, σ_i}. This is equivalent to Forward Attention for Tacotron 2 (https://arxiv.org/abs/1807.06736). My question is:

  1. Would replacing max(Q_{i-1,j-1}, Q_{i,j-1}) with log(exp(Q_{i-1,j-1}) + exp(Q_{i,j-1})) be a better choice? (The sketch below shows the recursion in question.)
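
For concreteness, a NumPy sketch of the recursion in question, adapted from my reading of the Glow-TTS reference implementation (variable names are mine):

import numpy as np

def maximum_path(log_p):
    """log_p[i, j] = log N(z_j; mu_i, sigma_i); returns a 0/1 monotonic alignment."""
    t_x, t_y = log_p.shape
    Q = np.full((t_x, t_y), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, t_y):
        for i in range(t_x):
            stay = Q[i, j - 1]
            move = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = log_p[i, j] + max(stay, move)  # the max() the question proposes to soften
    # Backtrack the best path.
    path = np.zeros((t_x, t_y), dtype=np.int64)
    i = t_x - 1
    for j in range(t_y - 1, -1, -1):
        path[i, j] = 1
        if i != 0 and (i == j or Q[i - 1, j - 1] > Q[i, j - 1]):
            i -= 1
    return path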

Provide full pretrained models?

Could you please provide the pretrained discriminators corresponding to the generator models you released (LJ Speech and VCTK)?

Training on Tesla K80

Hi,
Using a Tesla K80 to train the model gives the following error. Does the model require a specific GPU architecture for training?

File "train.py", line 290, in
main()
File "train.py", line 50, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/anaconda/envs/vits/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/anaconda/envs/vits/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/anaconda/envs/vits/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/anaconda/envs/vits/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/media/hdd1tb/tts-VITS/vits-main/train.py", line 117, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/media/hdd1tb/tts-VITS/vits-main/train.py", line 162, in train_and_evaluate
hps.data.mel_fmax
File "/media/hdd1tb/tts-VITS/vits-main/mel_processing.py", line 105, in mel_spectrogram_torch
center=center, pad_mode='reflect', normalized=False, onesided=True)
File "/anaconda/envs/vits/lib/python3.7/site-packages/torch/functional.py", line 465, in stft
return _VF.stft(input, n_fft, hop_length, win_length, window, normalized, onesided)
RuntimeError: cuFFT doesn't support signals of half type with compute capability less than SM_53, but the device containing input half tensor only has SM_37
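
The message says cuFFT handles half-precision signals only on compute capability >= 5.3, and the K80 is SM 3.7, so my reading is that mixed precision simply isn't available on this GPU. Disabling it in the config should let training proceed in fp32 (merge into your existing config):

{ "train": { "fp16_run": false } }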

Can you provide typical loss figures? KL diverging

Hello,

I am training from scratch on custom data. The HiFi-GAN part converged relatively quickly, and the generated samples in the evaluation TensorBoard sound really good.

However, the inference samples from phonemes don't seem to improve. Moreover, the KL loss, which, as I understand it, should be the next loss to converge, is diverging instead.

Here are my generator losses:

(image: screenshot of the generator loss curves)

The jump is because training was interrupted and the step number is wrong when loading from a checkpoint.

So, is this normal? Do you have typical loss figures to share for comparison? Thanks.

Is the segment_size different between the paper and the code?

In the paper, the segment size (the window length) for sliced audio reconstruction is stated as 32.

Whereas in the code, I see the default segment_size=4 in commons.rand_slice_segments(x, x_lengths=None, segment_size=4).

Am I misunderstanding something?
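
As far as I can tell, the two numbers agree once you convert samples to latent frames: 4 is only the function's default argument, and train.py builds the model with the paper's value:

# train.py (paraphrased): segment size in latent frames is
# 8192 samples // 256 hop_length = 32, matching the paper's window of 32.
net_g = SynthesizerTrn(len(symbols),
                       hps.data.filter_length // 2 + 1,
                       hps.train.segment_size // hps.data.hop_length,
                       **hps.model)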

A question about "Glow based TTS"

Thank you for your great work and excellent results. I read your papers "Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search" and "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech". I'm a beginner with flow models (I have previously reproduced WaveGlow), and I have a question about Glow-TTS.

In Glow-TTS, the only loss outside the duration module is the likelihood of the hidden variable z (produced by the decoder) under the Gaussian distribution predicted by the encoder. However, at the beginning of training, both encoder and decoder are in their initial states: the encoder has not learned how to encode content information, and the decoder has not learned how to transform the acoustic features into a distribution that suits the encoder. In WaveGlow, the base distribution has a fixed mean and variance, that is, one "end" is determined, while in Glow-TTS neither "end" is determined. Yet training converges well and quickly, and I can't figure out why.

I would appreciate any ideas.

Results get worse when I use ground-truth durations.

Dear author, thank you for your contribution to TTS; this is a big step for end-to-end TTS. When I use ground-truth durations, aiming to train faster and get more accurate durations, the duration loss drops fast but the KL loss drops slowly. I only changed the attn matrix to use the true durations. I checked the composition of the loss but could not find the alignment-related part. Could you please give me some help with this problem?

questions about z_p

In the model:
z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)  # posterior encoder on the linear spectrogram
z_p = self.flow(z, y_mask, g=g)                         # posterior sample mapped toward the prior space

I do not understand: since enc_q is essentially a spectrogram encoder, i.e., the posterior, why is the name "z_p" used here, as if it were an abbreviation of "z_prior"?

In the paper, you note that a normalizing flow is used in the prior encoder. I am confused: in the code, the flow is applied after enc_q instead of after enc_p.

But in the computation of kl_loss, the loss is between (z_p, logs_q) and (m_p, logs_p), like "posterior" and "prior".

At inference, the flow is applied after enc_p (in the reverse direction).

What is z_p, really? What is its function in the model?

Training Time

How much time do you need to train to reach the performance in your paper and web demo? I recently started experiments on LJ Speech and found training really slow (10+ days for 800k steps).

Can I use SDP for Fastspeech2?

When I use the StochasticDurationPredictor as below, in place of the normal duration predictor and pitch predictor, the duration loss and pitch loss are quite large (4000-5000).

Even though attn_hard_dur is still being learned during training (by the alignment encoder) and is not yet stable, the loss seems far too large. What could be the problem? (One guess appears after the log below.)

self.duration_predictor = StochasticDurationPredictor(model_config)

# output : [batch_size, hidden_dim, text_seq_len] text encoder output
output = output + speaker_embedding
sdp_mask = torch.unsqueeze(sequence_mask(text_seq_lens, output.shape[-1]), 1).to(output.dtype)
duration_prediction = self.duration_predictor(
    x=output, 
    x_mask=sdp_mask, 
    w=attn_hard_dur.unsqueeze(1), 
    reverse=False
)

duration_loss = torch.sum(duration_prediction.float())
[Train step : 100] total_loss 10906.626953, mel_loss 0.977667, d_loss 5461.666504, p_loss 5441.733887, ctc_loss 2.249039, bin_loss 1.860192,
[Train step : 200] total_loss 11053.518555, mel_loss 0.775560, d_loss 5531.041016, p_loss 5519.537598, ctc_loss 2.165245, bin_loss 1.741701,
[Train step : 300] total_loss 10651.076172, mel_loss 0.666218, d_loss 5329.517578, p_loss 5318.736328, ctc_loss 2.155767, bin_loss 1.736991,
[Train step : 400] total_loss 10753.271484, mel_loss 0.627934, d_loss 5380.976074, p_loss 5369.540039, ctc_loss 2.126560, bin_loss 1.712614,
[Train step : 500] total_loss 11444.052734, mel_loss 0.591087, d_loss 5737.601562, p_loss 5703.707520, ctc_loss 2.152741, bin_loss 1.709005,
[Train step : 600] total_loss 11534.011719, mel_loss 0.564516, d_loss 5772.588867, p_loss 5758.684082, ctc_loss 2.173516, bin_loss 1.651289,
[Train step : 700] total_loss 12391.351562, mel_loss 0.555285, d_loss 6199.072754, p_loss 6189.496582, ctc_loss 2.226302, bin_loss 1.634849,
[Train step : 800] total_loss 10847.539062, mel_loss 0.542545, d_loss 5435.289062, p_loss 5409.686523, ctc_loss 2.021489, bin_loss 1.557770,
[Train step : 900] total_loss 10487.540039, mel_loss 0.532718, d_loss 5260.755371, p_loss 5224.231445, ctc_loss 2.020667, bin_loss 1.525733,
[Train step : 1000] total_loss 9263.677734, mel_loss 0.536135, d_loss 4639.945312, p_loss 4621.299805, ctc_loss 1.896345, bin_loss 1.430400,
[Train step : 1100] total_loss 10892.701172, mel_loss 0.537346, d_loss 5445.554688, p_loss 5444.647461, ctc_loss 1.962303, bin_loss 1.467209,
[Train step : 1200] total_loss 9963.730469, mel_loss 0.528609, d_loss 4972.430664, p_loss 4988.891113, ctc_loss 1.879873, bin_loss 1.405760,
[Train step : 1300] total_loss 9535.383789, mel_loss 0.527506, d_loss 4766.958496, p_loss 4766.083008, ctc_loss 1.815230, bin_loss 1.356215,
[Train step : 1400] total_loss 10367.413086, mel_loss 0.529463, d_loss 5190.863281, p_loss 5174.230469, ctc_loss 1.789981, bin_loss 1.329592,
[Train step : 1500] total_loss 10163.126953, mel_loss 0.525895, d_loss 5072.743164, p_loss 5088.081543, ctc_loss 1.776225, bin_loss 1.312975,
[Train step : 1600] total_loss 10285.883789, mel_loss 0.518091, d_loss 5121.499023, p_loss 5162.050781, ctc_loss 1.815271, bin_loss 1.384229,
[Train step : 1700] total_loss 10007.465820, mel_loss 0.515487, d_loss 4998.065918, p_loss 5007.191406, ctc_loss 1.692332, bin_loss 1.380546,
[Train step : 1800] total_loss 10438.118164, mel_loss 0.506113, d_loss 5225.224609, p_loss 5210.665527, ctc_loss 1.721988, bin_loss 1.504314,
[Train step : 1900] total_loss 9777.532227, mel_loss 0.515738, d_loss 4897.006836, p_loss 4878.411133, ctc_loss 1.598633, bin_loss 1.363741,
[Train step : 2000] total_loss 10859.443359, mel_loss 0.488703, d_loss 5405.339844, p_loss 5451.948242, ctc_loss 1.666379, bin_loss 1.473202,
[Train step : 2100] total_loss 10407.706055, mel_loss 0.495060, d_loss 5194.208008, p_loss 5211.336914, ctc_loss 1.665549, bin_loss 1.437147,
[Train step : 2200] total_loss 10450.838867, mel_loss 0.491688, d_loss 5227.919922, p_loss 5220.825684, ctc_loss 1.601205, bin_loss 1.470080,
[Train step : 2300] total_loss 9259.790039, mel_loss 0.489379, d_loss 4645.092773, p_loss 4612.664551, ctc_loss 1.544137, bin_loss 1.450988,
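
One guess (based on how VITS itself uses the SDP, so take it with a grain of salt): the SDP returns a summed negative log-likelihood, and VITS divides it by the number of unmasked tokens before adding it to the total loss, which would bring these ~5000 values down to a few units:

# models.py (paraphrased): normalize the SDP NLL by the token count.
l_length = self.dp(x, x_mask, w, g=g)
l_length = l_length / torch.sum(x_mask)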

Training on GTX2080

Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/data3/liuhaogeng/anaconda3/envs/vits/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/data3/liuhaogeng/test/vits-main/train.py", line 120, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, None], None, None)
  File "/data3/liuhaogeng/test/vits-main/train.py", line 138, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(train_loader):
  File "/data3/liuhaogeng/anaconda3/envs/vits/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/data3/liuhaogeng/anaconda3/envs/vits/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/data3/liuhaogeng/anaconda3/envs/vits/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/data3/liuhaogeng/anaconda3/envs/vits/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/data3/liuhaogeng/anaconda3/envs/vits/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/data3/liuhaogeng/anaconda3/envs/vits/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data3/liuhaogeng/anaconda3/envs/vits/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data3/liuhaogeng/test/vits-main/data_utils.py", line 94, in __getitem__
    return self.get_audio_text_pair(self.audiopaths_and_text[index])
  File "/data3/liuhaogeng/test/vits-main/data_utils.py", line 62, in get_audio_text_pair
    spec, wav = self.get_audio(audiopath)
  File "/data3/liuhaogeng/test/vits-main/data_utils.py", line 74, in get_audio
    spec = torch.load(spec_filename)
  File "/data3/liuhaogeng/anaconda3/envs/vits/lib/python3.8/site-packages/torch/serialization.py", line 577, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/data3/liuhaogeng/anaconda3/envs/vits/lib/python3.8/site-packages/torch/serialization.py", line 241, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: [enforce fail at inline_container.cc:144] . PytorchStreamReader failed reading zip archive: failed finding central directory

I set the batch size to 16.
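
"failed finding central directory" usually means a truncated file. data_utils caches spectrograms next to the wavs as *.spec.pt, so my guess is that one of those was only partially written (e.g., a previous run was killed mid-save). Deleting the cache so it regenerates is a cheap test ("path/to/wavs" is a placeholder):

import pathlib

# data_utils.get_audio() recomputes and re-saves any missing .spec.pt files.
for p in pathlib.Path("path/to/wavs").rglob("*.spec.pt"):
    p.unlink()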

A simple multi-process version of preprocess.py

import argparse
import text
from utils import load_filepaths_and_text
from multiprocessing import Pool, cpu_count
from tqdm import tqdm

def process(inputs):
  # args is inherited by the worker processes (default fork start method on Linux).
  i, filepaths_and_text = inputs
  original_text = filepaths_and_text[args.text_index]
  cleaned_text = text._clean_text(original_text, args.text_cleaners)
  filepaths_and_text[args.text_index] = cleaned_text
  return i, filepaths_and_text

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument("--out_extension", default="cleaned")
  parser.add_argument("--text_index", default=1, type=int)
  parser.add_argument("--filelists", nargs="+", default=["filelists/ljs_audio_text_val_filelist.txt", "filelists/ljs_audio_text_test_filelist.txt"])
  parser.add_argument("--text_cleaners", nargs="+", default=["english_cleaners2"])

  args = parser.parse_args()

  for filelist in args.filelists:
    print("START:", filelist)
    filepaths_and_text = load_filepaths_and_text(filelist)
    inputs = [(i, filepaths_and_text[i]) for i in range(len(filepaths_and_text))]
    with Pool(processes=cpu_count() - 1) as pool:
      # imap_unordered yields results as workers finish; the index restores order.
      for i, line in tqdm(pool.imap_unordered(process, inputs), total=len(inputs)):
        filepaths_and_text[i] = line

    new_filelist = filelist + "." + args.out_extension
    with open(new_filelist, "w", encoding="utf-8") as f:
      f.writelines(["|".join(x) + "\n" for x in filepaths_and_text])

It still takes hours to generate xx.txt.cleaned for 88k sentences, though.

Inference benchmarks on CPU / single-thread performance

Hi @jaywalnut310 ,

Many thanks for your work!
As usual this is very thorough, open and inspiring.

In your paper you publish the GPU speed benchmarks:

We measured the synchronized elapsed time over the entire process to generate raw waveforms from phoneme sequences with 100 sentences randomly selected from the test set of the LJ Speech dataset. We used a single NVIDIA V100 GPU with a batch size of 1.

(image: inference speed table from the paper)

But somehow you do not say anything about running on CPU (one CPU thread). Notably, this is also omitted from papers like TalkNet / FastSpeech / Glow-TTS (I believe this paper is essentially Glow-TTS meets HiFi-GAN).

The only paper saying anything about CPU speed is LightSpeech:

(image: CPU benchmark table from the LightSpeech paper)

Is this because flow-based models do not lend themselves well to CPU inference?

About multi-speaker data

Thanks for your great work. I want to train with my own dataset, so I want to ask: what does the number '67' mean in your data, and how is it calculated?

E.g.: DUMMY2/p229/p229_128.wav|67|The whole process is a vicious circle at the moment.
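
In case it helps anyone: as far as I can tell from data_utils.TextAudioSpeakerLoader, the middle field is just the integer speaker id that indexes the speaker embedding table (0 .. n_speakers-1), not something computed from the audio:

# data_utils.py (paraphrased): each filelist line is "path|speaker_id|text".
audiopath, sid, text = audiopath_sid_text
sid = torch.LongTensor([int(sid)])  # later selects row `sid` of the speaker embedding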

Can't train with fp16 on Nvidia P100

Training with fp16 doesn't work for me on a P100. I'll look into fixing it, but for future reference, here is the full stack trace (torch version 1.9.0):

2021-06-29 10:29:09.537741: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
/usr/local/lib/python3.7/dist-packages/torch/functional.py:472: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /pytorch/aten/src/ATen/native/SpectralOps.cpp:664.)
  normalized, onesided, return_complex)
Traceback (most recent call last):
  File "train_ms.py", line 294, in <module>
    main()
  File "train_ms.py", line 50, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/content/vits/train_ms.py", line 118, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/content/vits/train_ms.py", line 192, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: "fill_cuda" not implemented for 'ComplexHalf' 

Stochastic duration prediction failed for fastspeech2

I applied the stochastic duration predictor to the FastSpeech 2 model.

The duration loss falls smoothly (1.2 to 0.2):

(image: duration loss curve)

But at inference, the duration predictor does not work at all (noise scale = 0.333):

(image: predicted durations at inference)

Does anyone know the cause of this problem? The pseudocode I used is below, followed by a reference to how VITS itself drives the SDP at inference.

# in variance adaptor
inputs = text_encoder_output + extended_speaker_embedding
sdp_mask = torch.unsqueeze(sequence_mask(text_lens, inputs.shape[-1]), 1).to(inputs.dtype)

if training:
    duration_prediction = self.duration_predictor(
        inputs, sdp_mask, torch.log(attn_hard_dur.float() + 1).unsqueeze(1)
    )
    duration_prediction = duration_prediction / torch.sum(sdp_mask)
else:
    duration_prediction = self.duration_predictor(inputs, sdp_mask, reverse=True, noise_scale=0.333)
    duration_prediction = duration_prediction.squeeze(1)

duration_rounded = torch.clamp(
    torch.round(torch.exp(duration_prediction) - 1) * d_control,
    min=1,
)

# loss
duration_loss = torch.sum(duration_prediction.float())
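
For comparison, this is how VITS itself drives the SDP at inference, as I read models.py (note that it trains on raw durations w = attn.sum(2) rather than log(d + 1), and the demo uses noise_scale_w = 0.8 rather than 0.333):

logw = self.dp(x, x_mask, g=g, reverse=True, noise_scale=noise_scale_w)
w = torch.exp(logw) * x_mask * length_scale
w_ceil = torch.ceil(w)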
