
vits2_pytorch's Introduction

VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design

Jungil Kong, Jihoon Park, Beomjeong Kim, Jeongmin Kim, Dohee Kong, Sangjin Kim

Unofficial implementation of the VITS2 paper, the sequel to the VITS paper. (Thanks to the authors for their work!)

[model architecture figure]

Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes a more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and present that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach.

Credits

  • This repo is built on top of the VITS repo. The goal is to make it easy to do transfer learning from a pretrained VITS model!
  • (08-17-2023) - The authors were very kind to guide me through the paper and answer my questions. I am open to discussing any changes or answering questions regarding the implementation. Please feel free to open an issue or contact me directly.

Pretrained checkpoints

  • LJSpeech-no-sdp (refer to config.yaml in this checkpoint folder) | 64k steps | proof that training works! I would recommend that experienced users rename the checkpoints to *_0.pth and start training from them via transfer learning. (I will add a notebook for this soon to help beginners.)
  • Check the 'Discussions' page for training logs, TensorBoard links, and other community contributions.

Sample audio

Prerequisites

  1. Python >= 3.10
  2. Tested on PyTorch 1.13.1 with Google Colab and LambdaLabs cloud.
  3. Clone this repository
  4. Install Python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
  5. Download datasets
    1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
    2. For the multi-speaker setting, download and extract the VCTK dataset, and downsample the wav files to 22050 Hz. Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
  6. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have already been provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt

How to run (dry-run)

  • model forward pass (dry-run)
import torch
from models import SynthesizerTrn

net_g = SynthesizerTrn(
    n_vocab=256,
    spec_channels=80, # <--- vits2 parameter (changed from 513 to 80)
    segment_size=8192,
    inter_channels=192,
    hidden_channels=192,
    filter_channels=768,
    n_heads=2,
    n_layers=6,
    kernel_size=3,
    p_dropout=0.1,
    resblock="1", 
    resblock_kernel_sizes=[3, 7, 11],
    resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    upsample_rates=[8, 8, 2, 2],
    upsample_initial_channel=512,
    upsample_kernel_sizes=[16, 16, 4, 4],
    n_speakers=0,
    gin_channels=0,
    use_sdp=True, 
    use_transformer_flows=True, # <--- vits2 parameter
    # (choose from "pre_conv", "fft", "mono_layer_inter_residual", "mono_layer_post_residual")
    transformer_flow_type="fft", # <--- vits2 parameter 
    use_spk_conditioned_encoder=True, # <--- vits2 parameter
    use_noise_scaled_mas=True, # <--- vits2 parameter
    use_duration_discriminator=True, # <--- vits2 parameter
)

x = torch.LongTensor([[1, 2, 3],[4, 5, 6]]) # token ids
x_lengths = torch.LongTensor([3, 2]) # token lengths
y = torch.randn(2, 80, 100) # mel spectrograms
y_lengths = torch.Tensor([100, 80]) # mel spectrogram lengths

net_g(
    x=x,
    x_lengths=x_lengths,
    y=y,
    y_lengths=y_lengths,
)

# calculate loss and backpropagate

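  • model inference pass (dry-run) - a hedged sketch that reuses the net_g built above; it assumes the VITS-style infer() signature (as quoted in the issues below) and uses placeholder token ids instead of the real text frontend

# reuse torch and the randomly initialized net_g from the forward-pass example above
net_g.eval()
x = torch.LongTensor([[1, 2, 3]])   # placeholder token ids
x_lengths = torch.LongTensor([3])
with torch.no_grad():
    o = net_g.infer(
        x,
        x_lengths,
        noise_scale=0.667,     # values commonly used in the VITS inference notebook
        noise_scale_w=0.8,
        length_scale=1.0,
    )[0]                       # first return value is the generated waveform tensor [b, 1, T]
audio = o[0, 0].cpu().numpy()
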
Training Example

Open In Colab

# LJ Speech
python train.py -c configs/vits2_ljs_nosdp.json -m ljs_base # no-sdp; (recommended)
python train.py -c configs/vits2_ljs_base.json -m ljs_base # with sdp;

# VCTK
python train_ms.py -c configs/vits2_vctk_base.json -m vctk_base

# for onnx export of trained models
python export_onnx.py --model-path="G_64000.pth" --config-path="config.json" --output="vits2.onnx"
python infer_onnx.py --model="vits2.onnx" --config-path="config.json" --output-wav-path="output.wav" --text="hello world, how are you?"

TODOs, features and notes

Duration predictor (fig 1a)

  • Added LSTM discriminator to duration predictor.
  • Added adversarial loss to duration predictor. ("use_duration_discriminator" flag in config file; default is "True")
  • Monotonic Alignment Search with Gaussian Noise added; might need expert verification (Section 2.2)
  • Added "use_noise_scaled_mas" flag in config file. Choose from True or False; updates noise while training based on number of steps and never goes below 0.0
  • Update models.py/train.py/train_ms.py
  • Update config files (vits2_vctk_base.json; vits2_ljs_base.json)
  • Update losses in train.py and train_ms.py
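
A minimal sketch of the noise schedule mentioned above (hedged: the attribute names follow the mas_noise_scale_initial / noise_scale_delta / current_mas_noise_scale fields quoted in the issues further down; the actual update lives in the training scripts):

def update_mas_noise_scale(model, global_step):
    # linear decay of the MAS noise scale, clamped so it never goes below 0.0
    current = model.mas_noise_scale_initial - model.noise_scale_delta * global_step
    model.current_mas_noise_scale = max(current, 0.0)

# under DDP, update the wrapped module instead:
# update_mas_noise_scale(net_g.module, global_step)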

Transformer block in the normalizing flow (fig 1b)

  • Added transformer block to the normalizing flow. There are three types of transformer blocks: pre-convolution (my implementation), FFT (from so-vits-svc repo) and mono-layer.
  • Added "transformer_flow_type" flag in config file. Choose from "pre_conv", "fft", "mono_layer_inter_residual", "mono_layer_post_residual".
  • Added layers and blocks in models.py (ResidualCouplingTransformersLayer, ResidualCouplingTransformersBlock, FFTransformerCouplingLayer, MonoTransformerFlowLayer)
  • Add in config file (vits2_ljs_base.json; can be turned on using "use_transformer_flows" flag)

Speaker-conditioned text encoder (fig 1c)

  • Added speaker embedding to the text encoder in models.py (TextEncoder; backward compatible with VITS)
  • Add in config file (vits2_ljs_base.json; can be turned on using "use_spk_conditioned_encoder" flag)

Mel spectrogram posterior encoder (Section 3)

  • Added mel spectrogram posterior encoder in train.py
  • Added new config file (vits2_ljs_base.json; can be turned on using the "use_mel_posterior_encoder" flag)
  • Updated 'data_utils.py' to use the "use_mel_posterior_encoder" flag for vits2

Training scripts

  • Added vits2 flags to train.py (single-speaker model)
  • Added vits2 flags to train_ms.py (multi-speaker model)

ONNX export

  • Add ONNX export support.

Gradio Demo

  • Add Gradio demo support.

Special mentions

vits2_pytorch's People

Contributors

awas666, choihkk, fenrlr, kdaip, kevinwang676, leminhnguyen, lexkoro, luhavis, p0p4k, shigabeev, subarasheese, w11wo

vits2_pytorch's Issues

model compute error

RuntimeError: Given groups=1, weight of size [192, 80, 1], expected input[64, 513, 297] to have 80 channels, but got 513 channels instead
After I preprocessed the data, everything went fine. However, this model computation error arises during training. Is it because I preprocessed the data incorrectly?
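
A likely cause, hedged: the posterior encoder expects 80-channel mel input (spec_channels=80) while the data loader is still producing 513-bin linear spectrograms, which usually means the mel-posterior-encoder flag is not set consistently in the data and model sections of the config. A small sanity check (config path is a placeholder; the key name varies across configs in this repo):

import json

with open("configs/vits2_ljs_base.json") as f:   # replace with your own config
    hps = json.load(f)

data_flag = hps["data"].get("use_mel_posterior_encoder",
                            hps["data"].get("use_mel_spec_posterior", False))
model_flag = hps["model"].get("use_mel_posterior_encoder", False)
assert data_flag == model_flag, (
    "mel posterior encoder flag must match in the data and model sections, "
    "otherwise the posterior encoder gets 513 linear bins instead of 80 mel bins"
)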

about duration-discriminator training objective

Thank you for your hard work. I have a question while attempting to train with your code.

During the training of the duration predictor, I noticed that the "loss_dur" fluctuates significantly compared to previous work. Upon investigation, I found that "grad_norm_dur_disc" is spiking very high. In my opinion, this might be due to the adversarial loss being calculated for a single batch, which is much larger compared to the weights, especially in contrast to the few convolution layers in the discriminator.

As far as I know, in HiFiGAN, the discriminator is composed of several sub-discriminators. Therefore, I understand that there is a for loop inside the "discriminator_loss" and "generator_loss" functions to calculate the loss for each sub-discriminator.

  • hifigan discriminator output shape -> [N, B, 1, ...]

Since the "DurationDiscriminator" you implemented does not consist of sub-discriminator layers, when calculating the loss in the "discriminator_loss" and "generator_loss" functions, it is computed for a single batch size and then summed without any scaling.

  • your discriminator output shape -> [B, 1, ...]

In my opinion, this might make the training of the "DurationDiscriminator", which has very few parameters, unstable. I'm curious whether it was intentional to compute it for a single batch without scaling. If not, I'm also wondering whether it would be acceptable to wrap the output in a list inside the append() in the discriminator forward pass. Currently, I'm training the model with the relu non-linearity and layernorm that you wrote but commented out, plus the list form. If I get good results, I will share them in this issue.

class DurationDiscriminator(nn.Module):  # vits2
    # TODO : not using "spk conditioning" for now according to the paper.
    # Can be a better discriminator if we use it.
    def __init__(
        self, in_channels, filter_channels, kernel_size, p_dropout, gin_channels=0
    ):
        super().__init__()

        self.in_channels = in_channels
        self.filter_channels = filter_channels
        self.kernel_size = kernel_size
        self.p_dropout = p_dropout
        self.gin_channels = gin_channels

        self.conv_1 = nn.Conv1d(
            in_channels, filter_channels, kernel_size, padding=kernel_size // 2
        )
        self.norm_1 = modules.LayerNorm(filter_channels)
        self.conv_2 = nn.Conv1d(
            filter_channels, filter_channels, kernel_size, padding=kernel_size // 2
        )
        self.norm_2 = modules.LayerNorm(filter_channels)
        self.dur_proj = nn.Conv1d(1, filter_channels, 1)

        self.pre_out_conv_1 = nn.Conv1d(
            2 * filter_channels, filter_channels, kernel_size, padding=kernel_size // 2
        )
        self.pre_out_norm_1 = modules.LayerNorm(filter_channels)
        self.pre_out_conv_2 = nn.Conv1d(
            filter_channels, filter_channels, kernel_size, padding=kernel_size // 2
        )
        self.pre_out_norm_2 = modules.LayerNorm(filter_channels)

        # if gin_channels != 0:
        #   self.cond = nn.Conv1d(gin_channels, in_channels, 1)

        self.output_layer = nn.Sequential(nn.Linear(filter_channels, 1), nn.Sigmoid())

    def forward_probability(self, x, x_mask, dur, g=None):
        dur = self.dur_proj(dur)
        x = torch.cat([x, dur], dim=1)
        x = self.pre_out_conv_1(x * x_mask)
        x = torch.relu(x)
        x = self.pre_out_norm_1(x)
        x = self.pre_out_conv_2(x * x_mask)
        x = torch.relu(x)
        x = self.pre_out_norm_2(x)
        x = x * x_mask
        x = x.transpose(1, 2)
        output_prob = self.output_layer(x)
        return output_prob

    def forward(self, x, x_mask, dur_r, dur_hat, g=None):
        x = torch.detach(x)
        # if g is not None:
        #   g = torch.detach(g)
        #   x = x + self.cond(g)
        x = self.conv_1(x * x_mask)
        x = torch.relu(x)
        x = self.norm_1(x)
        x = self.conv_2(x * x_mask)
        x = torch.relu(x)
        x = self.norm_2(x)

        output_probs = []
        for dur in [dur_r, dur_hat]:
            output_prob = self.forward_probability(x, x_mask, dur, g)
            output_probs.append([output_prob])

        return output_probs

Are the parameters for the Duration Discriminator being incorrectly passed?

https://github.com/p0p4k/vits2_pytorch/blob/dbdf9362ff9b8033d93e4acaa96ef2bc7a6b7646/train_ms.py#L289C54-L289C54
The last two parameters of the discriminator should be dur_real and dur_hat, respectively.

def forward(self, x, x_mask, dur_r, dur_hat, g=None):

However, the parameters actually passed are logw and logw_:

logw_ = torch.log(w + 1e-6) * x_mask

logw is the predicted duration and logw_ is the real duration,

so should it be net_dur_disc(hidden_x, x_mask, logw_, logw)?

Error in code when training

vits2_pytorch\models.py", line 886, in forward
    return o, l_length, attn, ids_slice, x_mask, y_mask, (z, z_p, m_p, logs_p, m_q, logs_q), (x, logw, logw_)
UnboundLocalError: local variable 'logw' referenced before assignment

Seems like the variables logw and logw_ are not defined when using sdp.

Also, when training with nosdp, this error pops up:

vits2_pytorch\train.py", line 194, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc], [scheduler_g, scheduler_d, scheduler_dur_disc], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
UnboundLocalError: local variable 'optim_dur_disc' referenced before assignment

AlignerNet instead of MAS

Is it possible to use AlignerNet (aligner.py in pflow-tts repo) instead of MAS in VITS2?

What should be changed in the code? I am a bit confused on what the inputs should be.

ValueError: too many values to unpack (expected 2)

Hi, I was using cjke_cleaners2 to clean the texts. However, I got the error ValueError: too many values to unpack (expected 2) as shown below

[screenshot of the traceback]

But I don't think I got more than 2 values to unpack. Here is the structure of my cleaned texts

[screenshot of the cleaned filelist]

So I wonder which step could go wrong. Thank you!

Two errors in using noise MAS

Thank you for your implementation on VITS2! I copied your noise MAS part into the original VITS and attempted to train it with multiple speakers across 2 GPUs. However, I encountered two errors in the process:

The first error arises when employing DDP (DistributedDataParallel), displaying the message: "AttributeError: 'DistributedDataParallel' object has no attribute 'net_g.mas_noise_scale_initial'". Solutions found online suggest using model.module instead of model. The code in train.py (line 181-182) may need modification as follows:

current_mas_noise_scale = net_g.module.mas_noise_scale_initial - net_g.module.noise_scale_delta * global_step
net_g.module.current_mas_noise_scale = max(current_mas_noise_scale, 0.0)

The second error originates in models.py (line 703):
epsilon = torch.sum(logs_p, dim=1).exp() * torch.randn_like(neg_cent) * self.current_mas_noise_scale
The error message is: "RuntimeError: The size of tensor 'a' must match the size of tensor 'b' at non-singleton dimension 1". Upon examining notebooks/MAS_with_noise.ipynb, the code appears to function correctly. Yet, altering the batch size from 1 to other values triggers the same error. I guess, when the batch size is not 1, the two tensors fail to meet the broadcasting condition. Introducing a dimension using unsqueeze(1) resolves the error (but I am unsure whether it is right).
The code adjustment could be as follows:
epsilon = torch.sum(logs_p, dim=1).exp().unsqueeze(1) * torch.randn_like(neg_cent) * self.current_mas_noise_scale

Upon introducing the noise, I observed a deteriorating trend in the MAS conclusion as the training steps increased. Additionally, the audio generated consisted mostly of silence and electric current-like sound. It seems that there might be some problem in the noise generation formula.
[attached screenshots]

Where is L_mse?

The paper has three losses for the duration discriminator: $L_{adv}(D)$, $L_{adv}(G)$, and $L_{mse}$. But the code only implemented $L_{adv}(D)$ and $L_{adv}(G)$:

) = discriminator_loss(y_dur_hat_r, y_dur_hat_g)

loss_dur_gen, losses_dur_gen = generator_loss(y_dur_hat_g)

$L_{mse}$ should be:

loss_mse = F.mse_loss(logw, logw_)

And add to loss_gen_all.

loss_gen_all = loss_gen + loss_fm + loss_mel + loss_dur + loss_kl
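
For reference, a hedged sketch of what the issue proposes inside the training loop (logw, logw_, and the other loss terms are assumed to already be in scope in train.py, and no extra weighting coefficient is applied here):

import torch.nn.functional as F

# L_mse between the predicted log-durations and the MAS ("real") log-durations
loss_mse = F.mse_loss(logw, logw_)
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_dur + loss_kl + loss_mse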

Duration Discriminator problem

In train_ms.py lines 408-411, are the last two inputs to the duration discriminator net passed in reversed order?
I saw another VITS2 implementation with the order reversed, but logw_ is the MAS result and logw is the duration predictor result, so I think this repo's order is correct.
Can you help me answer this?

Duration Discriminator

        if net_dur_disc is not None:
            y_dur_hat_r, y_dur_hat_g = net_dur_disc(
                hidden_x.detach(), x_mask.detach(), logw_.detach(), logw.detach()
            )

->

another Duration Discriminator

       if net_dur_disc is not None:
            y_dur_hat_r, y_dur_hat_g = net_dur_disc(
                hidden_x.detach(), x_mask.detach(), logw.detach(), logw_.detach()
            )

Inference error, scipy

Isn't the audio file write call here mixed up, or did this only occur to me because I'm using Python 3.8?

Docs say: filename, rate, data

but the implementation here is: data, rate, filename

Same in ms.
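
For reference, a standalone example following the documented order write(filename, rate, data):

import numpy as np
from scipy.io import wavfile

sr = 22050
audio = np.zeros(sr, dtype=np.int16)    # one second of silence standing in for model output
wavfile.write("output.wav", sr, audio)  # filename, rate, data -- per the SciPy docs quoted above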

New feature: Deleting the old .pth files when training

Hi, I found that there is no function to delete old .pth files, which makes the training process require a lot of disk space. So it would be very helpful if you could add a function that deletes the previous .pth files during training. Thanks!

Checkpoint saves?

Should we not save the iteration/global_step here:

epoch,

instead of the epoch?

Edit:

As well as:

optim_g, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2

Should the scheduler not step per iteration/global_step?

I am very, very new to this, just asking as I did some reading online. Thanks

colab error

I had to run this because espeak was not set up:
!apt-get install espeak

This error also occurred:
AttributeError: 'HParams' object has no attribute 'duration_discriminator_type'

Training using SDP (and with DP by ratio?)

This is a follow-up to the previous discussion threads regarding the stochastic duration predictor in #11 and #68 (comment), as well as a reference to Bert-VITS2:

Regarding training using SDP, I have a few points of feedback:

  1. A few months ago my experiments using use_sdp at earlier steps (100K ~ 500K) showed below-average results compared to those trained without use_sdp; the audio did not sound natural and certain pronunciations were not clear. Now I plan to transfer-learn from a more well-trained checkpoint with SDP (as mentioned in the thread above); I would be curious to hear from anyone who has done similar experiments.

  2. I am curious to learn whether adding an sdp_ratio and training both SDP and DP simultaneously would offer any improvement to the results. I'm not sure how much code change is involved, but I would love to open a PR if this sounds good to you!

  3. About training both SDP & DP together and comparing the results to save time (#11 (comment)): if we train from scratch using this method, my assumption is it will not sound as good as two-stage training.

  4. The DurationPredictor works very well in my experience, but is there any improvement that can be made to either DP variant?

==========================

A summary of my experience using use_sdp so far (I will update later when I have more results):

  • train using SDP from scratch: does not sound good at all.
  • train without SDP from scratch: sounds natural; best-performing checkpoint to date
  • train without SDP from scratch, then continue training using SDP: ?
  • train with both SDP & DP by ratio: ?

custom training

Hi,

I have prepared my training data in the LJSpeech format. I copied the vits2_ljs_base.yaml and changed the paths for train and val.

Now I want to train with the command:

python train.py -c configs/vits2_ljs_base_de.json -m vits2_ljs_base_de

Unfortunately I get the following error message:
Do you have an advice ?

Loading train data: 0%| | 0/1417 [00:13<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 336, in
main()
File "train.py", line 51, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/opt/8tbdrive1/experiments/vits2_pytorch/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/8tbdrive1/experiments/vits2_pytorch/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/opt/8tbdrive1/experiments/vits2_pytorch/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/8tbdrive1/experiments/vits2_pytorch/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/opt/8tbdrive1/experiments/vits2_pytorch/train.py", line 156, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/opt/8tbdrive1/experiments/vits2_pytorch/train.py", line 181, in train_and_evaluate
if net_g.use_noise_scaled_mas:
File "/opt/8tbdrive1/experiments/vits2_pytorch/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 947, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'use_noise_scaled_mas'

a trivial bug

Thanks for your great work. I am using your code to train a multilingual model, but you seem to have forgotten to initialize self.use_mel_spec_posterior in TextAudioSpeakerLoader.

Training stuck

Hello,

Thanks for all the effort put into this repo. When I launch training, it runs for a few steps and then I see no progress at all. It's just stuck without any progress for a long time, and it still hasn't progressed.
INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/G_0.pth
INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/D_0.pth
INFO:ljs_base:Saving model and optimizer state at iteration 1 to ./logs/ljs_base/DUR_0.pth
Loading train data: 4%|████████████▍

Have you encountered this before? Any help would be extremely appreciated.

Trying to use VITS2's new flow in SVC

Hello, I am extremely grateful for your contribution to VITS2. I have been working in SVC for a long time, and recently I have become interested in the FLOW of VITS2.

I have coded a FLOW in SVC for VITS2, which is similar to your pre-conv FLOW. However, the results were slightly different, and I am unable to determine if it is progressing as intended in SVC.

I found that you mentioned that the mono-layer flow aligns with the authors' intuition. So I would like to know: is it the best flow option for VITS2? I will try training it for a few days.

Discussion about using the blank token in VITS2

The "add_blank" is True in vits2_ljs_base.json. In the original VITS paper, the authors said that they did not use the blank token.
Can you explain to me why you set "add_blank" as True?
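
For context, a small sketch of what "add_blank": true does in the data pipeline (assuming the commons.intersperse helper is unchanged from the original VITS repo): a blank token with id 0 is interspersed between every symbol id, which roughly doubles the input length.

from commons import intersperse   # VITS helper, assumed unchanged in this repo

token_ids = [12, 53, 27]          # toy symbol ids
print(intersperse(token_ids, 0))  # -> [0, 12, 0, 53, 0, 27, 0]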

Training doesn't start when speaker IDs aren't sequential from 0

I was scratching my head over why training was always crashing with multiple errors like this:

../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [1,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [1,0,0], thread: [97,0,0] Assertion srcIndex < srcSelectDimSize failed.

It looks like it's because my custom VCTK-like dataset doesn't number speaker IDs from 0 to N; instead I kept the VCTK speaker ids (like "374", etc.). Why is the speaker ID ordering/naming so strict?
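
A hedged sketch of the usual workaround (not repo code; filelist paths are placeholders): nn.Embedding(n_speakers, gin_channels) needs ids in [0, n_speakers), so remap VCTK-style ids to a contiguous 0..N-1 range in the path|sid|text filelists before training.

def remap_speaker_ids(in_path, out_path):
    rows = [line.rstrip("\n").split("|") for line in open(in_path, encoding="utf-8")]
    sid_map = {sid: str(i) for i, sid in enumerate(sorted({r[1] for r in rows}))}
    with open(out_path, "w", encoding="utf-8") as f:
        for r in rows:
            r[1] = sid_map[r[1]]
            f.write("|".join(r) + "\n")
    return sid_map   # keep this mapping around for inference

remap_speaker_ids("filelists/custom_sid_text_train_filelist.txt",
                  "filelists/custom_sid_text_train_filelist.remapped.txt")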

Necessity of the adversarial duration predictor

"note - duration predictor is not adversarial yet. In my earlier experiments with VITS-1, I used deterministic duration predictor (no-sdp) and found that it is quite good. So, I am not sure if adversarial duration predictor is necessary. But, I will add it sooner or later if it is necessary. Also, I want to combine parallel tacotron-2 and naturalspeech-1's learnable upsampling layer to remove MAS completely for E2E differentiable model."

On this issue, I would like to give my feedback:

  1. The experience most of us have had training VITS is that its rhythm is relatively flat and not expressive enough. I think this problem probably comes from MAS and the duration predictor.

  2. So, for now, the adversarial duration predictor is an important improvement of VITS2.

  3. Deterministic duration predictor (no-sdp): in the VITS-1 paper, there seems to be an ablation experiment showing this is not good enough.

  4. StyleTTS2 is better than VITS in rhythmic diversity and expressiveness, which also points to this shortcoming of VITS. https://styletts2.github.io/

  5. So, I think the adversarial duration predictor is necessary. I look forward to you updating the code, and then experimenting and discussing with you.

Training code error

Hi, p0p4k, Thanks for sharing the code. It's a great project. I have been following this project for a long time and have tried it many times. But when I run the training code, I still get the following error. I guess some parameters were passed wrong in the code. The actual parameters are not completely obtained from vits2_ljs_base.json. I tried to debug and modify it, but it didn't work. Look forward to your review and reply.

when i run :
python train.py -c configs/vits2_ljs_base.json -m ljs_base

i meet the error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 157, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 191, in train_and_evaluate
    (z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/models.py", line 748, in forward
    z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/models.py", line 495, in forward
    x = self.pre(x) * x_mask
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 80, 1], expected input[32, 513, 298] to have 80 channels, but got 513 channels instead

DDP allocates too much memory on GPU0

Hello, thank you for your awesome work. When training with two GPUs over the last two days, I find too many processes on GPU0.
How can I deal with this?
[attached screenshot]

DurationDiscriminatorType error

Hi there, thank you for all this amazing work! I really loved it when I tested it a few days ago, and it was working perfectly. However, today, when I tried the new version, I encountered some issues.

I just wanted to point out something I noticed in your code, specifically in the train.py script. It looks like something is wrong in the lines where you define DurationDiscriminator.

In file train.py, lines 168-170:
DurationDiscriminator = AVAILABLE_DURATION_DISCRIMINATOR_TYPES[duration_discriminator_type]

net_dur_disc = DurationDiscriminator(

Here DurationDiscriminator ends up being a str, not the DurationDiscriminator class.

I also noticed that the JSON configs are missing the new key

"duration_discriminator_type": "dur_disc_2",

and otherwise raise an error, in the files

vits2_ljs_base.json and
vits2_ljs_base_nospd.json

I fixed the JSON, but when running train.py I still get the error about DurationDiscriminator being a str, as I said above.

Can you please explain the differences between 'dur_disc_1' and 'dur_disc_2' and recommend which one is preferable, or do I need to see for myself which works best? I am a complete beginner at this and all of this is new to me.

Keep up the great work :)

Export to JIT script

Hi, I tried to export the model to a JIT script but got this error:
reversed(Tensor 0) -> (Tensor 0):
Expected a value of type 'Tensor' for argument '0' but instead found type 'torch.torch.nn.modules.container.ModuleList'
It seems JIT doesn't support reversed(); how can I solve this?

Why is 'duration_discriminator_type' redefined?

vits2_pytorch/train_ms.py

Lines 172 to 188 in 1f4f379

duration_discriminator_type = AVAILABLE_DURATION_DISCRIMINATOR_TYPES
if duration_discriminator_type == "dur_disc_1":
    net_dur_disc = DurationDiscriminatorV1(
        hps.model.hidden_channels,
        hps.model.hidden_channels,
        3,
        0.1,
        gin_channels=hps.model.gin_channels if hps.data.n_speakers != 0 else 0,
    ).cuda(rank)
elif duration_discriminator_type == "dur_disc_2":
    net_dur_disc = DurationDiscriminatorV2(
        hps.model.hidden_channels,
        hps.model.hidden_channels,
        3,
        0.1,
        gin_channels=hps.model.gin_channels if hps.data.n_speakers != 0 else 0,
    ).cuda(rank)

Because of this line, net_dur_disc stays None.

Trained on the KSS dataset for 490 epochs but the quality is not as good as I expected

First of all, thank you for sharing such a wonderful code.
I trained using the KSS dataset for 490 epochs, but the quality is not as good as I expected.
It seems that the TTS speaks a bit fast.
[attached wav sample]
What might have gone wrong during the training?

{
  "train": {
    "log_interval": 200,
    "eval_interval": 3000,
    "seed": 1234,
    "epochs": 20000,
    "learning_rate": 2e-4,
    "betas": [0.8, 0.99],
    "eps": 1e-9,
    "batch_size": 32,
    "fp16_run": false,
    "lr_decay": 0.999875,
    "segment_size": 8192,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0,
    "fft_sizes": [384, 683, 171],
    "hop_sizes": [30, 60, 10],
    "win_lengths": [150, 300, 60],
    "window": "hann_window"
  },
  "data": {
    "use_mel_posterior_encoder": true,
    "training_files": "kss/kss_cjke_train.txt.cleaned",
    "validation_files": "kss/kss_cjke_val.txt.cleaned",
    "text_cleaners": ["cjke_cleaners2"],
    "max_wav_value": 32768.0,
    "sampling_rate": 44100,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 0,
    "cleaned_text": true
  },
  "model": {
    "use_mel_posterior_encoder": true,
    "use_transformer_flows": true,
    "transformer_flow_type": "pre_conv",
    "use_spk_conditioned_encoder": false,
    "use_noise_scaled_mas": true,
    "use_duration_discriminator": true,
    "ms_istft_vits": false,
    "mb_istft_vits": true,
    "istft_vits": false,
    "subbands": 4,
    "gen_istft_n_fft": 16,
    "gen_istft_hop_size": 4,
    "inter_channels": 192,
    "hidden_channels": 96,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 3,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "upsample_rates": [4, 4],
    "upsample_initial_channel": 256,
    "upsample_kernel_sizes": [16, 16],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "use_sdp": false
  }
}

hop size issue

Hello,

When I train using a custom dataset, I encounter the following error with the following parameters:

"filter_length": 2048,
"hop_length": 512,
"win_length": 2048,

File "/mnt/Linux_DATA/synthesis/model/vits2_pytorch/train_ms.py", line 441, in train_and_evaluate
loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel
File "/home/p76111652/.conda/envs/vits/lib/python3.8/site-packages/torch/nn/functional.py", line 3263, in l1_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/home/p76111652/.conda/envs/vits/lib/python3.8/site-packages/torch/functional.py", line 74, in broadcast_tensors
return _VF.broadcast_tensors(tensors) # type: ignore[attr-defined]
RuntimeError: The size of tensor a (16) must match the size of tensor b (8) at non-singleton dimension 2

It seems that the shape of y_mel does not match the shape of y_hat_mel when the hop_size is increased beyond 256.
Any help would be extremely helpful, thanks!
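
A likely cause, hedged: in the HiFi-GAN-style decoder the product of upsample_rates is the number of waveform samples generated per spectrogram frame, so it has to equal hop_length; the default [8, 8, 2, 2] gives 256, not 512. A quick check (values shown are illustrative):

import math

hop_length = 512                  # data section of the config
upsample_rates = [8, 8, 2, 2]     # model section (product = 256)
assert math.prod(upsample_rates) == hop_length, (
    "product of upsample_rates must equal hop_length, e.g. [8, 8, 4, 2] for 512"
)

If the rates change, the matching upsample_kernel_sizes typically need to be adjusted as well.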

How can this be combined with Bert-VITS2?

It seems Bert-VITS2 performs better, and I have converted Bert-VITS2 to TensorRT (Bert-VITS2-Faster; some structural changes to its vits2 are still needed), but its vits2 is split into six sub-modules. For simplicity, I think it is better to merge the six sub-modules into one, as you did; combining your vits2 with Bert-VITS2 may still need some detailed checking.
Another issue is that the emotion of vits2 is not good, especially in Chinese, but the authors of Bert-VITS2 are working on it.

Bad quality audio when infer with custom condition

Hi, I tried your code for training with both multi-speaker and single-speaker conditions, and it worked well for both training and inference. However, I then made some minor changes to models.py, modifying the forward and inference functions (e.g., replacing the speaker ID embedding with speaker embeddings from a pre-trained speaker recognition model):

        if n_speakers > 1:
            self.emb_g = nn.Embedding(n_speakers, gin_channels)
            self.linlin = nn.Linear(768, gin_channels)

    def forward(self, x, x_lengths, y, y_lengths, sid=None):
        if self.n_speakers > 0:
            # g = self.emb_g(sid).unsqueeze(-1)  # [b, h, 1]
            g = sid
            g = self.linlin(g).unsqueeze(-1) # [b, h, 1]
        else:
            g = None

....

    def infer(
        self,
        x,
        x_lengths,
        sid=None,
        noise_scale=1,
        length_scale=1,
        noise_scale_w=1.0,
        max_len=None,
    ):

        if self.n_speakers > 0:
            # g = self.emb_g(sid).unsqueeze(-1)  # [b, h, 1]
            g = sid
            g = self.linlin(g).unsqueeze(-1) # [b, h, 1]
        else:
            g = None
        x, m_p, logs_p, x_mask = self.enc_p(x, x_lengths, g=g)

....

Training works well and the losses converge:
[2.560758352279663, 2.2537946701049805, 3.8962457180023193, 20.862136840820312, 0.8815252184867859, 2.2975285053253174, 24100, 0.00019459892692329838]

But when I run inference, the audio quality is very bad (the audio captures the speaker's style but no intelligible words).
Do you have any idea for this?

sample:
Text: Scarcely had he uttered the name when Pierre's closing eyes shot open
Audio: https://drive.google.com/file/d/1OtWPVw82alLTV3n4i7e9kb1FaKPJ0OBW/view?usp=sharing

Has anyone tried to decrease the feature channels of the model?

Hi all,
We are trying to decrease the model size. As this project is configured, the feature channels are 192 (inter_channels=192, hidden_channels=192). Has anyone tried lowering the channels, e.g. to 160? Please share your results if you have tried, thank you very much. A config file for the reduced model is also welcome.

Fine tuning

Hi.

First of all, thanks for your contributions on VITS2!

I was wondering if you have any tips on how to train a different (English) voice than the LJSpeech one. I don't have access to very powerful GPUs, so it'd be great if it were possible to fine-tune on top of the 64k-step checkpoint that you've already posted (again, thanks!).
Possibly something similar to https://github.com/nivibilla/efficient-vits-finetuning?
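
For reference, the transfer-learning setup hinted at in the 'Pretrained checkpoints' section above (copy the released checkpoints into a fresh log dir as *_0.pth so train.py resumes from them) would look roughly like this sketch; all paths are placeholders:

import os
import shutil

src = "/path/to/ljspeech_nosdp_checkpoint"     # folder containing G_64000.pth, D_64000.pth (and DUR_64000.pth if present)
dst = "logs/my_voice_finetune"                 # matches `-m my_voice_finetune`
os.makedirs(dst, exist_ok=True)
for name in ("G", "D", "DUR"):
    ckpt = os.path.join(src, f"{name}_64000.pth")
    if os.path.exists(ckpt):
        shutil.copy(ckpt, os.path.join(dst, f"{name}_0.pth"))

# then continue training with the matching config:
#   python train.py -c configs/vits2_ljs_nosdp.json -m my_voice_finetune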

Failed to run inference with the provided checkpoint and inference code

Greetings,

Unfortunately this project only has a Colab notebook for inference (the inference.ipynb file), so I had to adapt it into a "local" script and made some changes to save wav files, etc.

The script runs fine; however, the saved .wav files are all silent (despite not being 0 KB). I even opened Audacity to check, and the wave files indeed contain nothing.

Can the author or someone else provide code that works with the provided checkpoint, running outside of a Colab environment? Thank you

Error when training model: AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown'

Greetings,

As seen on #10 (comment), someone successfully trained models, so I decided to try it myself.

I used the following command:

python train.py -c configs/vits2_voice_training.json -m mydataset

However, the following happens:

INFO:mydataset:{'train': {'log_interval': 867, 'eval_interval': 867, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 16, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/train_voice_1_filelist_v4.txt', 'validation_files': 'filelists/val_voice_1_filelist_v4.txt', 'text_cleaners': ['basic_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': False, 'n_speakers': 0, 'cleaned_text': True, 'use_mel_spec_posterior': False}, 'model': {'use_mel_posterior_encoder': False, 'use_transformer_flows': True, 'transformer_flow_type': 'pre_conv', 'use_spk_conditioned_encoder': False, 'use_noise_scaled_mas': True, 'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'max_text_len': 500, 'model_dir': './logs/mydataset'}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
Using lin posterior encoder for VITS1
Using transformer flows pre_conv for VITS2
Using normal encoder for VITS1
Using noise scaled MAS for VITS2
NOT using any duration discriminator like VITS1
Loading train data:   0%|                                                                                                                                 | 0/4 [00:00<?, ?it/s]
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f0f9a049240>
Traceback (most recent call last):
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1397, in _shutdown_workers
    if not self._shutdown:
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown'
Traceback (most recent call last):
  File "/vits2_pytorch/train.py", line 417, in <module>
    main()
  File "/vits2_pytorch/train.py", line 54, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/vits2_pytorch/train.py", line 196, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc], [scheduler_g, scheduler_d, scheduler_dur_disc], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "/vits2_pytorch/train.py", line 225, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(loader):
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 988, in __init__
    super(_MultiProcessingDataLoaderIter, self).__init__(loader)
  File "/vits2_pytorch/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 598, in __init__
    self._sampler_iter = iter(self._index_sampler)
  File "/vits2_pytorch/data_utils.py", line 400, in __iter__
    ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]
ZeroDivisionError: integer division or modulo by zero

What could be the problem here, and what could I try doing to fix this?
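
The traceback points at the bucket sampler: rem // len_bucket divides by zero when a length bucket contains no utterances, often because the training filelist is very small or the audio paths/lengths are off. A hedged standalone diagnostic (the boundaries below are the usual VITS defaults and the length proxy mirrors the loader's file-size estimate; adjust both to your setup):

import os
import bisect

boundaries = [32, 300, 400, 500, 600, 700, 800, 900, 1000]   # check train.py for the actual values
hop_length = 256                                             # from the data section of the config
filelist = "filelists/train_voice_1_filelist_v4.txt"

counts = [0] * (len(boundaries) - 1)
with open(filelist, encoding="utf-8") as f:
    for line in f:
        wav_path = line.split("|")[0]
        spec_len = os.path.getsize(wav_path) // (2 * hop_length)   # 16-bit samples per hop
        idx = bisect.bisect_right(boundaries, spec_len) - 1
        if 0 <= idx < len(counts):
            counts[idx] += 1

for lo, hi, n in zip(boundaries[:-1], boundaries[1:], counts):
    print(f"bucket {lo}-{hi} frames: {n} utterances" + ("   <-- empty" if n == 0 else ""))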

VCTK dataset training reproduction

Can you please specify how to reproduce training on the VCTK dataset?
I've downsampled all the .wavs and even changed the naming:
p362_074_mic1.wav -> p362_074.wav

But something is wrong. For example, your script looks for p362_073.wav, but nothing similar exists in the original dataset.

custom inference with vits2

Hey there,
thanks to your help I was able to train on custom data with vits2 :)

Now I wanted to tackle my first inference with the custom model - I used the inference notebook as a guideline.
Unfortunately I get an error in the voice conversion part.
I assume it has something to do with the speaker index - though in vits2 we are using multi-speaker, right?
This is the error I get:
I am looking forward to your answer!

Traceback (most recent call last):
File "test_inf.py", line 56, in
dataset = TextAudioSpeakerLoader(hps.data.validation_files, hps.data)
File "/opt/8tbdrive1/experiments/vits_copy/vits2_pytorch/data_utils.py", line 197, in init
self._filter()
File "/opt/8tbdrive1/experiments/vits_copy/vits2_pytorch/data_utils.py", line 209, in _filter
for audiopath, sid, text in self.audiopaths_sid_text:
ValueError: not enough values to unpack (expected 3, got 2)
