
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

License: MIT License

Python 77.78% Jupyter Notebook 22.22%
deep-learning pytorch speaker-adaptation speech-synthesis text-to-speech tts wavlm diffusion-models latent-diffusion latent-diffusion-models

styletts2's Introduction

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.

Paper: https://arxiv.org/abs/2306.07691

Audio samples: https://styletts2.github.io/

Online demo: Hugging Face (thanks to @fakerybakery for the wonderful online demo)

Open In Colab Discord

TODO

  • Training and inference demo code for single-speaker models (LJSpeech)
  • Test training code for multi-speaker models (VCTK and LibriTTS)
  • Finish demo code for multispeaker model and upload pre-trained models
  • Add a finetuning script for new speakers with base pre-trained multispeaker models
  • Fix DDP (accelerator) for train_second.py (I have tried everything I could to fix this but had no success, so if you are willing to help, please see #7)

Pre-requisites

  1. Python >= 3.7
  2. Clone this repository:
git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
  3. Install Python requirements:
pip install -r requirements.txt

On Windows add:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -U

Also install phonemizer and espeak if you want to run the demo:

pip install phonemizer
sudo apt-get install espeak-ng
  4. Download and extract the LJSpeech dataset, unzip it to the data folder, and upsample the audio to 24 kHz (a sketch of one way to do this follows this list). The text aligner and pitch extractor are pre-trained on 24 kHz data, but you can easily change the preprocessing and re-train them on your own data. For LibriTTS, you will need to combine train-clean-360 with train-clean-100 and rename the combined folder train-clean-460 (see val_list_libritts.txt as an example).
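If you need to do the upsampling yourself, a minimal Python sketch along these lines should work (this is not part of the repository; it assumes librosa and soundfile are installed, and the paths are placeholders):

import os
import librosa
import soundfile as sf

src_dir = "data/LJSpeech-1.1/wavs"  # placeholder: extracted dataset
dst_dir = "data/wavs"               # placeholder: 24 kHz output folder
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if not name.endswith(".wav"):
        continue
    # librosa resamples on load when an explicit sr is given
    wav, _ = librosa.load(os.path.join(src_dir, name), sr=24000)
    sf.write(os.path.join(dst_dir, name), wav, 24000)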

Training

First stage training:

accelerate launch train_first.py --config_path ./Configs/config.yml

Second stage training (DDP version not working, so the current version uses DP, again see #7 if you want to help):

python train_second.py --config_path ./Configs/config.yml

You can run both commands consecutively to train the first and second stages. Models will be saved as "epoch_1st_%05d.pth" and "epoch_2nd_%05d.pth". Checkpoints and Tensorboard logs will be saved at log_dir.

The data list format needs to be filename.wav|transcription|speaker; see val_list.txt for an example. Speaker labels are needed for multi-speaker models because reference audio must be sampled for style diffusion training.
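For illustration, a multi-speaker list might look like this (the filenames, transcriptions, and speaker IDs below are made up):

wavs/spk1_0001.wav|Printing, in the only sense with which we are at present concerned.|0
wavs/spk2_0001.wav|How much variation is there between speakers?|1

Single-speaker lists use the same format with a constant speaker label.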

Important Configurations

In config.yml, there are a few important configurations to take care of:

  • OOD_data: The path for out-of-distribution texts for SLM adversarial training. The format should be text|anything.
  • min_length: Minimum length of OOD texts for training. This is to make sure the synthesized speech has a minimum length.
  • max_len: Maximum length of audio for training, in frames. Since the default hop size is 300, one frame is approximately 300 / 24000 (0.0125) seconds (see the conversion sketch after this list). Lower this if you run into out-of-memory issues.
  • multispeaker: Set to true if you want to train a multispeaker model. This is needed because the architecture of the denoiser is different for single and multispeaker models.
  • batch_percentage: This keeps SLM adversarial training within memory limits. If you encounter OOM problems, set this to a lower value.
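As a quick sanity check for max_len, the frame/second conversion in the list above can be computed directly (a rough sketch using the default hop size and sample rate; adjust if your preprocessing differs):

sample_rate = 24000  # Hz, default in the provided configs
hop_size = 300       # samples per mel frame

def frames_to_seconds(frames):
    return frames * hop_size / sample_rate

def seconds_to_frames(seconds):
    return int(seconds * sample_rate / hop_size)

print(frames_to_seconds(400))  # 5.0 -> the default max_len of 400 frames is about 5 s of audio
print(seconds_to_frames(10))   # 800 -> roughly the max_len needed for 10 s clips (needs more VRAM)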

Pre-trained modules

In the Utils folder, there are three pre-trained modules:

  • ASR folder: It contains the pre-trained text aligner, which was pre-trained on English (LibriTTS), Japanese (JVS), and Chinese (AiShell) corpora. It works well for most other languages without fine-tuning, but you can always train your own text aligner with the code here: yl4579/AuxiliaryASR.
  • JDC folder: It contains the pre-trained pitch extractor, which was pre-trained on the English (LibriTTS) corpus only. However, it works well for other languages too because F0 is independent of language. If you want to train on a singing corpus, it is recommended to train a new pitch extractor with the code here: yl4579/PitchExtractor.
  • PLBERT folder: It contains the pre-trained PL-BERT model, which was pre-trained on an English (Wikipedia) corpus only. It probably does not work very well for other languages, so you will need to train a different PL-BERT for each language using the repo here: yl4579/PL-BERT. You can also use the multilingual PL-BERT, which supports 14 languages.

Common Issues

  • Loss becomes NaN: In the first stage, make sure you do not use mixed precision, as it can cause the loss to become NaN for some datasets when the batch size is not set properly (it needs to be more than 16 to work well). For the second stage, also experiment with different batch sizes; higher batch sizes are more likely to cause NaN loss values. We recommend a batch size of 16. You can refer to issues #10 and #11 for more details.
  • Out of memory: Please use a lower batch_size or a smaller max_len. You may refer to issue #10 for more information.
  • Non-English dataset: You can train on any language you want, but you will need a pre-trained PL-BERT model for that language. We have a pre-trained multilingual PL-BERT that supports 14 languages. You may refer to yl4579/StyleTTS#10 and #70 for some examples of training on Chinese datasets.

Finetuning

The script is modified from train_second.py, which uses DP, as DDP does not work for train_second.py. Please see the note above (and #7) if you are willing to help with this problem.

python train_finetune.py --config_path ./Configs/config_ft.yml

Please make sure you have the LibriTTS checkpoint downloaded and unzipped under the folder. The default configuration config_ft.yml finetunes on LJSpeech with 1 hour of speech data (around 1k samples) for 50 epochs. This took about 4 hours to finish on four NVIDIA A100 GPUs. The quality is slightly worse (similar to NaturalSpeech on LJSpeech) than the LJSpeech model trained from scratch on 24 hours of speech data, which took around 2.5 days on four A100s. The samples can be found at #65 (comment).

If you are using a single GPU (because the script doesn't work with DDP) and want to speed up training and save VRAM, you can run (thanks to @korakoe for making the script at #100):

accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml

Open In Colab

Common Issues

@Kreevoz has made detailed notes on common issues in finetuning, with suggestions for maximizing audio quality: #81. Some of these also apply to training from scratch. @IIEleven11 has also made a guideline for fine-tuning: #128.

  • Out of memory after joint_epoch: This is likely because your GPU RAM is not big enough for the SLM adversarial training run. You may skip it, but the quality could be worse. Setting joint_epoch to a number larger than epochs skips SLM adversarial training.

Inference

Please refer to Inference_LJSpeech.ipynb (single-speaker) and Inference_LibriTTS.ipynb (multi-speaker) for details. For LibriTTS, you will also need to download reference_audio.zip and unzip it under the Demo directory before running the demo.

You can import StyleTTS 2 and run it in your own code. However, inference depends on a GPL-licensed package, so it is not included directly in this repository. A GPL-licensed fork provides an importable script as well as an experimental streaming API. A fully MIT-licensed package that uses gruut instead (albeit with lower quality due to the mismatch between phonemizer and gruut) is also available.
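For context, the GPL-licensed dependency is phonemizer, which relies on espeak-ng. Below is a minimal sketch of how input text can be converted to phonemes before synthesis, similar to what the demo notebooks do (the exact inference API differs between the forks and packages mentioned above):

from phonemizer.backend import EspeakBackend

# requires espeak-ng to be installed on the system
backend = EspeakBackend(language="en-us", preserve_punctuation=True, with_stress=True)

text = "StyleTTS 2 converts text to phonemes before synthesis."
phonemes = backend.phonemize([text], strip=True)
print(phonemes[0])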

Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.

Common Issues

  • High-pitched background noise: This is caused by numerical float differences in older GPUs. For more details, please refer to issue #13. Basically, you will need to use more modern GPUs or do inference on CPUs.
  • Pre-trained model license: You only need to abide by the above rules if you use the pre-trained models and the voices are NOT in the training set, i.e., your reference speakers are not from any open access dataset. For more details of rules to use the pre-trained models, please see #37.

References

License

Code: MIT License

Pre-Trained Models: Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.

styletts2's People

Contributors

ameerazam08, astricks, awas666, danielmsu, devidw, eltociear, fakerybakery, haoqi, kmn1024, phields, yl4579


styletts2's Issues

Link to pretrained weights broken

Hello,

Thank you for sharing your wonderful text to speech model.

Unfortunately, when trying to download the pretrained weights, the Google Drive link gives an error.
I get the error for both the LJSpeech and LibriTTS models.

Perhaps you could host the model, including pretrained weights, on a site such as huggingface.co? That way more people will be able to find and use your TTS, and it has more reliable file storage.

Why can StyleTTS 2 produce emotions?

Hello, I want to know how StyleTTS 2 produces emotions. How does it know what emotion a sentence should have? I see no difference between neutral speech synthesis and multi-emotional speech synthesis in the inference code.

Finetune Error Message

I get this error message when I try to finetune. I set batch_size to 12 and max_len to 14. I'm using torch-2.1.1 torchaudio-2.1.1 torchvision-0.16.1 if that matters.

python train_finetune.py --config_path ./Configs/config_ft.yml
Some weights of the model checkpoint at microsoft/wavlm-base-plus were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']

  • This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-base-plus and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'encoder.pos_conv_embed.conv.parametrizations.weight.original1']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    bert loaded
    bert_encoder loaded
    predictor loaded
    decoder loaded
    text_encoder loaded
    predictor_encoder loaded
    style_encoder loaded
    text_aligner loaded
    pitch_extractor loaded
    mpd loaded
    msd loaded
    wd loaded
    BERT AdamW (
    Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.9, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 1e-05
    lr: 1e-05
    max_lr: 2e-05
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.01
    )
    decoder AdamW (
    Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.0, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 0.0001
    lr: 0.0001
    max_lr: 0.0002
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.0001

Traceback (most recent call last):
File "/home/user/StyleTTS2/train_finetune.py", line 714, in
main()
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/user/StyleTTS2/train_finetune.py", line 302, in main
s = model.predictor_encoder(mel.unsqueeze(0).unsqueeze(1))
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 183, in forward
return self.module(*inputs[0], **module_kwargs[0])
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/models.py", line 160, in forward
h = self.shared(x)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
input = module(input)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (5 x 4). Kernel size: (5 x 5). Kernel size can't be greater than actual input size

I get an error message when trying to finetune a model

python train_finetune.py --config_path ./Configs/config_ft.yml
Some weights of the model checkpoint at microsoft/wavlm-base-plus were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_v', 'encoder.pos_conv_embed.conv.weight_g']

  • This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-base-plus and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    bert loaded
    bert_encoder loaded
    predictor loaded
    decoder loaded
    text_encoder loaded
    predictor_encoder loaded
    style_encoder loaded
    diffusion loaded
    Traceback (most recent call last):
    File "/home/user/StyleTTS2/train_finetune.py", line 714, in
    main()
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1157, in call
    return self.main(*args, **kwargs)
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
    File "/home/user/StyleTTS2/train_finetune.py", line 211, in main
    model, optimizer, start_epoch, iters = load_checkpoint(model, optimizer, config['pretrained_model'],
    File "/home/user/StyleTTS2/models.py", line 702, in load_checkpoint
    model[key].load_state_dict(params[key])
    File "/home/user/StyleTTS2/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for MyDataParallel:
    Missing key(s) in state_dict: "module.diffusion.net.blocks.0.attention.norm.weight", "module.diffusion.net.blocks.0.attention.norm.bias", "module.diffusion.net.blocks.0.attention.norm_context.weight", "module.diffusion.net.blocks.0.attention.norm_context.bias", "module.diffusion.net.blocks.1.attention.norm.weight", "module.diffusion.net.blocks.1.attention.norm.bias", "module.diffusion.net.blocks.1.attention.norm_context.weight", "module.diffusion.net.blocks.1.attention.norm_context.bias", "module.diffusion.net.blocks.2.attention.norm.weight", "module.diffusion.net.blocks.2.attention.norm.bias", "module.diffusion.net.blocks.2.attention.norm_context.weight", "module.diffusion.net.blocks.2.attention.norm_context.bias", "module.unet.blocks.0.attention.norm.weight", "module.unet.blocks.0.attention.norm.bias", "module.unet.blocks.0.attention.norm_context.weight", "module.unet.blocks.0.attention.norm_context.bias", "module.unet.blocks.1.attention.norm.weight", "module.unet.blocks.1.attention.norm.bias", "module.unet.blocks.1.attention.norm_context.weight", "module.unet.blocks.1.attention.norm_context.bias", "module.unet.blocks.2.attention.norm.weight", "module.unet.blocks.2.attention.norm.bias", "module.unet.blocks.2.attention.norm_context.weight", "module.unet.blocks.2.attention.norm_context.bias".
    Unexpected key(s) in state_dict: "module.diffusion.net.to_features.0.weight", "module.diffusion.net.to_features.0.bias", "module.diffusion.net.blocks.0.attention.norm.fc.weight", "module.diffusion.net.blocks.0.attention.norm.fc.bias", "module.diffusion.net.blocks.0.attention.norm_context.fc.weight", "module.diffusion.net.blocks.0.attention.norm_context.fc.bias", "module.diffusion.net.blocks.1.attention.norm.fc.weight", "module.diffusion.net.blocks.1.attention.norm.fc.bias", "module.diffusion.net.blocks.1.attention.norm_context.fc.weight", "module.diffusion.net.blocks.1.attention.norm_context.fc.bias", "module.diffusion.net.blocks.2.attention.norm.fc.weight", "module.diffusion.net.blocks.2.attention.norm.fc.bias", "module.diffusion.net.blocks.2.attention.norm_context.fc.weight", "module.diffusion.net.blocks.2.attention.norm_context.fc.bias", "module.unet.to_features.0.weight", "module.unet.to_features.0.bias", "module.unet.blocks.0.attention.norm.fc.weight", "module.unet.blocks.0.attention.norm.fc.bias", "module.unet.blocks.0.attention.norm_context.fc.weight", "module.unet.blocks.0.attention.norm_context.fc.bias", "module.unet.blocks.1.attention.norm.fc.weight", "module.unet.blocks.1.attention.norm.fc.bias", "module.unet.blocks.1.attention.norm_context.fc.weight", "module.unet.blocks.1.attention.norm_context.fc.bias", "module.unet.blocks.2.attention.norm.fc.weight", "module.unet.blocks.2.attention.norm.fc.bias", "module.unet.blocks.2.attention.norm_context.fc.weight", "module.unet.blocks.2.attention.norm_context.fc.bias".

Multispeaker Config

Hi @yl4579 thanks so much for your work.

To train multispeaker, do we just need to generate train|val_list.txt and change multispeaker to True in the config?

Any chance you could share (for example) your VCTK training script?

Epoch-based loss_param values for LibriTTS

The paper describes reducing the total training epochs for stage 1 (100 -> 30) and for stage 2 (60 -> 25) when moving from LJSpeech to LibriTTS. I'm wondering about the epoch-based loss_params like TMA_epoch, diff_epoch, and joint_epoch. How were those changed for LibriTTS?

Continuing after an interruption

Hi and thank you for the excellent work you're providing in this repository! It's much appreciated.
I have a question about running this project on Google Colab with a single GPU on the free plan: is there a way to continue from where the first or second stage stopped when I (or Google) interrupt the notebook run?

(Question) Max Length and datasets

A lot of useful information is found in the other (closed) issues, but these questions come to mind.

  • How does max_len impact the training/finetuning process exactly?

In the LJSpeech dataset, there are audio files with a duration far longer than the max_len: 400 (= 5 seconds) specified in the example config file. Many files are 10 seconds long and a great majority are longer than 5 seconds. They are also included in train_list.txt. Was this intentional?

  • Are audio files truncated once the maximum number of frames is reached?

Should the datasets be carefully edited so that audio files do not exceed the maximum duration set in the config file? Is there a detrimental effect on adherence to punctuation or spelling when the model only sees short or clipped speech?

  • Is there a maximum permissible length / does the architecture impose restrictions? Could max_len be set to something like 1200 and thus make full use of long audio files? (Ignoring the VRAM requirements in the current DP implementation)

Additional requirements for README

Saw the code released and just got a chance to poke around. Great results testing out the default inference model!

From my install, I have a couple of install notes. As mentioned in #5, I also ran into the PL-BERT error, but the fix worked and that was the only code problem for inference.

For dependencies from a fresh mamba env, nltk and matplotlib need to be installed as well for the ipynb (although matplotlib isn't used in my Python code). I also used soundfile for WAV output.

The only other gotcha in getting things up and running is that currently PyTorch doesn't work with Python 3.12 (which just released and is what's installed if you just install python or pip); 3.11 is fine with PyTorch nightly, although maybe 3.10 is still required for stable releases.

Oh, also, in case anyone doesn't know, pip install gdown is great for grabbing Google Drive links onto a server.

Happy to submit a PR adding these to the docs if you'd like, otherwise, just leaving a note here for anyone else getting the code up and running.

text_encoder and text_aligner are not optimized in train_second.py

I'm trying to reconcile a difference I'm seeing between the paper and the code. Figure 1a says, "joint training then follows to optimize all components except the pitch extractor". When I look at train_second.py, though, I see two components that are not being optimized: text_encoder and text_aligner. These two components are used only in no_grad contexts in train_second.py, which makes sense to me, but I wanted to check that this is correct. Thanks!

about text aligner

Maybe the ASR module (text aligner) can be replaced with VITS's approach?

Python 3.10 Colab

RuntimeError: Error(s) in loading state_dict for CustomAlbert:
Unexpected key(s) in state_dict: "embeddings.position_ids".

Awesome in English but no support for other languages - please add an example for another language (German, Italian, French, etc.)

The readme makes it sound very simple: "Replace bert with xphonebert".
Looking a bit closer, it seems it's quite a feat to make StyleTTS 2 speak non-English languages (#28).

StyleTTS 2 looks like the best approach we have right now, but English-only is a deal-breaker for many, as it means any app will be limited to English with no prospect for other users in sight.

Some help to get this going in other languages would be awesome.

It appears we need to change the inference code and re-train the text and phonetic components. Any demo/guide would be great.

Alternatively, re-train the current PL-BERT for other languages, though that needs a corpus and I have no idea of the cost?
(https://github.com/yl4579/PL-BERT)

How to replace PL-BERT with XPhoneBERT?

Hi,

I'm looking to generate Hindi audio but it was mentioned that PL-BERT doesn't work well with other languages and I either need to train a different PL-BERT or replace the module with XPhoneBERT.

I'm having trouble understanding how I could go about replacing the module with XPhoneBERT. The XPhoneBERT repository describes using the model through transformers, but I'm unsure how I can apply that here, and this issue thread suggests that the pre-trained model is not public. How do I go about replacing PL-BERT with XPhoneBERT here?

Thanks!

Portability? (iOS, etc.)

Sorry for the naive question - is there any suggested direction to porting this to iOS given that it's all Python, or would that require a completely original project?

Laugh, Chuckle, giggling, 'Uhuh'(Nodding sound), etc

I'm wondering how to extend this excellent work to synthesize more types of expressions like in the title?

For laughs, just using the text "Haha" works, but it often sounds like a fake laugh, and I have no idea what text makes a giggle sound; it might not even have an IPA representation.
I guess a better approach would be to add some VQS symbols, add some appropriate training data that includes the extra VQS symbols, and hack the phonemizer to allow something like "*giggle* That's funny!" to be phonemized by using a custom dictionary for the giggle part and espeak for the rest. Finally, retrain the ASR with an expanded vocabulary. Is this a good approach?

If the vocabulary expands, would I need to retrain the ASR from scratch or can it be finetuned?

Alternatively, sampling a laughing, giggling, etc. style might work, but I think there would be less fine-grained speech control than expanding the vocabulary, especially if laughing and giggling appear in the same sentence, wouldn't there?

Finetuning and dataset preparation

First of all, is it already possible to finetune a single-speaker model?
If so, what should one pay attention to?

Second:
How do you prepare a dataset?
The train and val lists are pretty clear, but the OOD texts confuse me a little; how do I get those?

Contextual learning

Sorry if this question is more generic but I'm fairly new to the TTS field and I'm not sure if what I want to ask is a project-specific thing or a TTS-specific thing.

I'd like to ask how much is the context of sentences in WAV files important for training.

I'm working on my own set of WAV files using the LJSpeech structure. While testing this project's demo output using the model generously provided by you (thank you!), I noticed that the output sometimes sounds "wrong" at the end of a sentence.

In some sentences, the voice goes up, as if there were a comma or a question mark at the end instead of a full stop. In other sentences this does not occur.

When I played back some of the LJSpeech sentences used for training, I found that this is exactly the same.

What I'm not sure about is whether the model learns to make the same mistake from the context of the sentence itself, or whether it's just repeating the same thing because there are the same or similar words towards the end of the sentence.

I'm trying to understand how to best create my WAV files, so the model is trained well with regards to the emotional context of the sentence.

Example: "Oh my god! How did that happen?!", exclaimed Anna with a tone of irritation in her voice.

If I use this sentence as a whole, would the TTS learn to use an irritated surprise emotion where the context of irritation is present? Or does this not matter and the model would only learn the irritated tone from the sound in that quote, regardless of the context following it?

Thanks for reading and sorry again for a super-long question!

Poor audio quality after fine-tuning

I'm trying to fine-tune the LibriTTS checkpoint on ~1 hour of LJSpeech but get poor results. Could you please give me some directions or help to spot the issue?

How I fine-tuned:

  1. Pulled the latest changes from the repo
  2. Replaced Data/train_list.txt with a copy that only has the first 1000 lines (~1 hour for training)
  3. Changed batch_size to 4 and max_len to 100, otherwise it doesn't fit into the memory of my 4090 (24GB).
  4. After training it for 50-100 epochs, I tested new checkpoints with both Inference_LibriTTS.ipynb and Inference_LJSpeech.ipynb notebooks by changing the multispeaker parameter in the config to true/false.
  5. Inference_LJSpeech.ipynb produces very noisy results with a poor pronunciation.
  6. Inference_LibriTTS.ipynb with reference audio from LJSpeech has a good pronunciation, but there are noticeable noises (example - https://voca.ro/1nQ8Ltjhsh9y)

Thank you again for the awesome project!

questions on training

a) When loading the checkpoint for train_second.py, apart from the nets for mpd, msd and wd, I saw the need to add the prefix "module." to make the key names compatible. Is this expected?

diff --git a/models.py b/models.py
index 99b4f3d..fc03a8b 100644
--- a/models.py
+++ b/models.py

+from collections import OrderedDict
 
 class LearnedDownSample(nn.Module):
     def __init__(self, layer_type, dim_in):
@@ -697,9 +700,19 @@ def load_checkpoint(model, optimizer, path, load_only_params=True, ignore_module
     state = torch.load(path, map_location='cpu')
     params = state['net']
     for key in model:
+        new_state_dict = OrderedDict()
+        for k, v in params[key].items():
+            name = 'module.' + k # add `module.`
+            new_state_dict[name] = v
+
+        if key in ['mpd', 'msd', 'wd'] : 
+            new_state_dict = params[key]    
+
         if key in params and key not in ignore_modules:
             print('%s loaded' % key)
-            model[key].load_state_dict(params[key])
+            #model[key].load_state_dict(params[key])
+            model[key].load_state_dict(new_state_dict)
+
     _ = [model[key].eval() for key in model]

b) Passing a pre-trained model to train_second.py:
After training the first stage for around 50 epochs, I defined the params first_stage_path and pretrained_model.
Is anything more needed?

diff --git a/Configs/config.yml b/Configs/config.yml
index b74b8ee..2a9f93c 100644
--- a/Configs/config.yml
+++ b/Configs/config.yml
@@ -1,13 +1,13 @@
 log_dir: "Models/LJSpeech"
-first_stage_path: "first_stage.pth"
+first_stage_path: "epoch_1st_00048.pth"
 save_freq: 2
-batch_size: 16
-max_len: 400 # maximum number of frames
-pretrained_model: ""
+batch_size: 2
+max_len: 100 # maximum number of frames
+pretrained_model: "Models/LJSpeech/epoch_1st_00048.pth"
 second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
 load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters

c) Resume training in the first stage?
Does defining the param first_stage_path suffice?
Further, the modifications from a) do not seem to be needed.

Weird error when finetuning using colab in repo

Weird error when finetuning. I tried to put 'embeddings' in ignore_modules but it didn't change anything.

bert
bert loaded
Traceback (most recent call last):
  File "train_finetune.py", line 714, in <module>
    main()
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "train_finetune.py", line 212, in main
    load_only_params=config.get('load_only_params', True))
  File "/data/Repos/Forsen2/StyleTTS2/models.py", line 703, in load_checkpoint
    model[key].load_state_dict(params[key])
  File "/home/shiro/miniconda3/envs/styletts_fine/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1672, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for MyDataParallel:
	Missing key(s) in state_dict: "module.embeddings.position_ids". 

High-pitched noise in the background when using old GPUs

Previously discussed here: #1 (comment)

The model produces some high-pitched noise in the background when I use my old GPU for inference (NVIDIA Quadro P5000, Driver Version: 515.105.01, CUDA Version: 11.7)

Audio examples:

I solved this problem by switching to CPU device, so this issue is just for reference, as asked by the author.

Thank you for your work!

PIP package

Hi, are you planning to allow us to install this via pip by creating a setup.py file?

Very high memory usage when training Stage 1

Hello,
Thanks for the great work.
I'm trying to train a model on my dataset using an A5000 (24GB VRAM). I kept getting OOM at the beginning of Stage 1. I kept reducing batch size, and finally, the training could go on with a batch size of 4.
Is this normal? What hardware were you using?
Thanks!

Extremely weird DDP issue for train_second.py

So far train_second.py only works with DataParallel (DP) but not DistributedDataParallel (DDP). One major problem with this is that if we simply translate DP to DDP (code in the comment section), we encounter the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 6; expected version 5 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
It is insanely difficult to debug. The tensor has no batch dimension, indicating it might be a parameter in the neural network. I found the tensor to be the bias term of the last Conv1D layer of predictor_encoder (prosodic style encoder): https://github.com/yl4579/StyleTTS2/blob/main/models.py#L152. This is extremely weird because the problem does not trigger for any Conv1D layer before this.

More mysteriously, the issue disappears if we add model.predictor_encoder.train() near line 250 of train_second.py. However, this causes the F0 loss to be much higher than without this line. This is true for both DP and DDP, so the higher F0 loss value is caused by model.predictor_encoder.train(), not DDP. Unfortunately, predictor_encoder, which is a StyleEncoder, has no module that changes its behavior depending on whether it is in train or eval mode. The output is exactly the same whether it is set to train or eval.

TLDR: There are three issues with train_second.py:

  1. DDP does not work because of the in-place operation error. The error disappears if model.predictor_encoder.train() is called before training.
  2. However, model.predictor_encoder.train() causes F0 loss to be much higher after convergence. This issue is independent of using DP or DDP.
  3. model.predictor_encoder is an instantiation of StyleEncoder, which has no components that change the output depending on its train or eval mode.

This problem has bugged me for more than a month, but I can't find a solution to it. It would be greatly appreciated if anyone has any insight into how to fix this problem. I have pasted the broken DDP code with accelerator below.

Licensing issue

Hi,
This package uses the phonemizer library which is GPL licensed (because it depends on espeak-ng by Jonathan Duddington and nothing's been heard of him for years). That means all software that uses it must also be GPL licensed. Might it be possible to switch to an alternate library (preferably deep_phonemizer or g2p_en)? Thanks!
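For reference, g2p_en can be used roughly like this (a sketch; note that it outputs ARPAbet symbols rather than the IPA strings produced by phonemizer with espeak, so it would not be a drop-in replacement without adapting the symbol set):

from g2p_en import G2p

# a non-GPL alternative suggested above; may download NLTK data on first use
g2p = G2p()
phonemes = g2p("Hello world, this is a test.")
print(phonemes)  # a list of ARPAbet symbols, e.g. ['HH', 'AH0', 'L', 'OW1', ...]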

About the code of the decoder part

Thank you for your great work!🙂
The idea of combining an encoder and a vocoder gave me great inspiration, and I am trying to implement it now. Could you provide this part of the code for reference?
Thank you again and look forward to your reply.

SLM adversarial training: 3 - 6 seconds in duration?

I'm having trouble reconciling the paper and the code when it comes to the min_len and max_len for the slmadv_params. They are set to 400 and 500 respectively here, but the paper states:
"For SLM adversarial training, both the ground truth and generated samples were ensured to be 3 to 6 seconds in duration".
I'm not sure how exactly to interpret the units on min_len and max_len, but that ratio definitely doesn't line up with 3-6 seconds.
Those parameters get used here, where they're halved and compared against the number of mel frames. If that's the correct interpretation, then with the mel transform here giving us ~80 frames per second of 24k audio, I think min_len would be set to 480 and max_len would be set to 960 to match the paper. Is that correct? Can you help clear this up for me?
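For what it's worth, the arithmetic above works out as follows (assuming a hop size of 300 at 24 kHz and the halving mentioned above):

sample_rate = 24000
hop_size = 300
frames_per_second = sample_rate / hop_size  # 80 mel frames per second

# slmadv min_len/max_len are halved before being compared to mel frame counts,
# so min_len=480 and max_len=960 would correspond to:
print(480 / 2 / frames_per_second)  # 3.0 seconds
print(960 / 2 / frames_per_second)  # 6.0 seconds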

inference

I used the LJSpeech pre-trained model you provided for inference and found that directly using the Inference_LJSpeech.ipynb file under the /Demo directory works well. However, if I first use the compute_style function to compute the style of an audio clip (from the LJSpeech dataset) and then combine it, the result is slightly worse. May I ask why?

iSTFTNet and LibriTTS

I'm curious if you can share any observations about using iSTFTNet with LibriTTS. The paper implies that the performance of iSTFTNet was insufficient for LibriTTS and so HiFiGAN was adopted, but I was wondering if you did any experiments with iSTFTNet and LibriTTS and what you saw.

speaker selection on inference on finetuned libritts

Hello - thanks again for sharing this project. The output quality is very impressive.
I was able to finetune the LibriTTS model you shared with another voice for 199 steps.
Is there a way to select the speaker from the model? I'm getting different speaker outputs each time I run inference. Also, is a reference clip required? I would like to get inference from the finetuned model without using a reference clip, to see how it performs.

Possibly misleading license info

This repo claims to use an MIT license, but there are additional license requirements buried in the readme file:

Before using these models, you agree to inform the listeners that the speech samples are synthesized by StyleTTS 2 models, unless you have the permission to use the voice you synthesize. That is, when using it for voice cloning, you also agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public.

This is very misleading because someone simply checking the license file before using the repo would make the assumption that only the MIT license requirements apply.

I understand that the intention is probably to license the code in the repo as MIT, while having additional license requirements for the pre-trained models. However, because the only apparent way to get the models is a Google Drive link contained in the repo, it still seems misleading. Additionally, the wording above suggests that this license requirement applies not only to the pre-trained models but also to any StyleTTS 2 models created using the code in the repo.

IMO, any additional license requirements for using the code in the repo as intended should be mentioned in the repo license file.

If I am misunderstanding something, I take no offense to you simply closing this issue without comment.

about model and code

Thank you for your outstanding work. I am also very interested in this paper and the diffusion model. I wonder when the source code and the trained models will be made public. Thank you!

Audio streaming

Hi,
Can we stream audio as it is being generated for longer texts?
Thank you!
