nvidia / radtts Goto Github PK

Provides training, inference and voice conversion recipes for RADTTS and RADTTS++: Flow-based TTS models with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control over of Low Dimensional (F0 and Energy) Speech Attributes.

License: MIT License

Python 6.68% Roff 93.31% Dockerfile 0.01%

radtts's Introduction

Flow-based TTS with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control over of Low Dimensional (F0 and Energy) Speech Attributes.

This repository contains the source code and several checkpoints for our work based on RADTTS. RADTTS is a normalizing-flow-based TTS framework with state of the art acoustic fidelity and a highly robust audio-transcription alignment module. Our project page and some samples can be found here, with relevant works listed here.

This repository can be used to train the following models:

A normalizing-flow bipartite architecture for mapping text to mel spectrograms
A variant of the above, conditioned on F0 and Energy
Normalizing flow models for explicitly modeling text-conditional phoneme duration, fundamental frequency (F0), and energy
A standalone alignment module for learning unspervised text-audio alignments necessary for TTS training

HiFi-GAN vocoder pre-trained models

We provide a checkpoint and config for a HiFi-GAN vocoder trained on LibriTTS 100 and 360.
For a HiFi-GAN vocoder trained on LJS, please download the v1 model provided by the HiFi-GAN authors here, .

RADTTS pre-trained models

Model name	Description	Dataset
RADTTS++DAP-LJS	RADTTTS model conditioned on F0 and Energy with deterministic attribute predictors	LJSpeech Dataset

We will soon provide more pre-trained RADTTS models with generative attribute predictors trained on LJS and LibriTTS. Stay tuned!

Setup

Clone this repo: git clone https://github.com/NVIDIA/RADTTS.git
Install python requirements or build docker image
- Install python requirements: pip install -r requirements.txt
Update the filelists inside the filelists folder and json configs to point to your data
- basedir – the folder containing the filelists and the audiodir
- audiodir – name of the audiodir
- filelist – | (pipe) separated text file with relative audiopath, text, speaker, and optionally categorical label and audio duration in seconds

Training RADTTS (without pitch and energy conditioning)

Train the decoder
python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir
Further train with the duration predictor python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir_dir train_config.warmstart_checkpoint_path=model_path.pt model_config.include_modules="decatndur"

Training RADTTS++ (with pitch and energy conditioning)

Train the decoder
python train.py -c config_ljs_decoder.json -p train_config.output_directory=outdir
Train the attribute predictor: autoregressive flow (agap), bi-partite flow (bgap) or deterministic (dap)
python train.py -c config_ljs_{agap,bgap,dap}.json -p train_config.output_directory=outdir_wattr train_config.warmstart_checkpoint_path=model_path.pt

Training starting from a pre-trained model, ignoring the speaker embedding table

Download our pre-trained model
python train.py -c config.json -p train_config.ignore_layers_warmstart=["speaker_embedding.weight"] train_config.warmstart_checkpoint_path=model_path.pt

Multi-GPU (distributed)

python -m torch.distributed.launch --use_env --nproc_per_node=NUM_GPUS_YOU_HAVE train.py -c config.json -p train_config.output_directory=outdir

Inference demo

python inference.py -c CONFIG_PATH -r RADTTS_PATH -v HG_PATH -k HG_CONFIG_PATH -t TEXT_PATH -s ljs --speaker_attributes ljs --speaker_text ljs -o results/

Inference Voice Conversion demo

python inference_voice_conversion.py --radtts_path RADTTS_PATH --radtts_config_path RADTTS_CONFIG_PATH --vocoder_path HG_PATH --vocoder_config_path HG_CONFIG_PATH --f0_mean=211.413 --f0_std=46.6595 --energy_mean=0.724884 --energy_std=0.0564605 --output_dir=results/ -p data_config.validation_files="{'Dummy': {'basedir': 'data/', 'audiodir':'22khz', 'filelist': 'vc_audiopath_txt_speaker_emotion_duration_filelist.txt'}}"

Config Files

Filename	Description	Nota bene
config_ljs_decoder.json	Config for the decoder conditioned on F0 and Energy
config_ljs_radtts.json	Config for the decoder not conditioned on F0 and Energy
config_ljs_agap.json	Config for the Autoregressive Flow Attribute Predictors	Requires at least pre-trained alignment module
config_ljs_bgap.json	Config for the Bi-Partite Flow Attribute Predictors	Requires at least pre-trained alignment module
config_ljs_dap.json	Config for the Deterministic Attribute Predictors	Requires at least pre-trained alignment module

LICENSE

Unless otherwise specified, the source code within this repository is provided under the MIT License

Acknowledgements

The code in this repository is heavily inspired by or makes use of source code from the following works:

Tacotron implementation from Keith Ito
STFT code from Prem Seetharaman
Masked Autoregressive Flows
Flowtron
Source for neural spline functions used in this work: https://github.com/ndeutschmann/zunis
Original Source for neural spline functions: https://github.com/bayesiains/nsf
Bipartite Architecture based on code from WaveGlow
HiFi-GAN
Glow-TTS

Relevant Papers

Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro.
One TTS Alignment to Rule Them All. ICASSP 2022

Kevin J Shih, Rafael Valle, Rohan Badlani, Adrian Lancucki, Wei Ping, Bryan Catanzaro.
RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis.
ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models 2021

Kevin J Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro.
Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows. Technical Report

radtts's People

Contributors

Stargazers

Watchers

radtts's Issues

How to transfer speech style from another reference audio?

Hi, thank you for providing the pretrained weights. I can now synthesize speeches by providing texts. My question is how to replicate the example of conditioning on another reference audio for the pitch and f0 information? (As shown in the example to rap in the project page)

Thanks!

Training for singing models

We are trying to train a singing model. We are satisfied with the timbre of the sound being produced through the decoder - it sounds like singing, at least using ground truth features from the training data. However, the lyrics are typically not recognizable, at least with the amount of training that typically generates recognizable speech from text. We know that the phoneme encodings are reasonable since we can train text to speech models, and have tried warmstarting from a text to speech model. Have you trained a singing model, and what sort of data / training curriculum did you use? Thanks!

Output of voice conversion has source model's timbre, not destination models timbre

Hey all, I'm running this basic command from the radtts repo here:

python inference_voice_conversion.py --radtts_path RADTTS_PATH --radtts_config_path RADTTS_CONFIG_PATH --vocoder_path HG_PATH --vocoder_config_path HG_CONFIG_PATH --f0_mean=211.413 --f0_std=46.6595 --energy_mean=0.724884 --energy_std=0.0564605 --output_dir=results/ -p data_config.validation_files="{'Dummy': {'basedir': 'data/', 'audiodir':'22khz', 'filelist': 'vc_audiopath_txt_speaker_emotion_duration_filelist.txt'}}"

The problem is that audio output by this voice conversion command still uses the timbre of the .wav that I'm trying to transfer the style from. The timbre of the audio output does NOT sound like the radtts model's voice; it sounds like the file I passed in inside the validation files.

Can anyone please give me advice on what this means, and how to fix it? Is my radtts model not trained well enough? Am I doing it wrong?

I copied the file whose style I'm trying to transfer to /data/22khz, and then wrote its details inside vc_audiopath_txt_speaker_emotion_duration_filelist.txt. Any help would be greatly appreciated. Thanks!

Certain texts in LJ speech unloadable

I am getting an EOF error on certain data points within the LJ speech dataset. In particular, the text was after 1807, through the exertions of the keeper of the jail, spent in the purchase of necessaries. does not work, while both was after 1807 through the exertions of the keeper of the jail spent in the purchase of necessaries. and was after 1807, through the exertions of the keeper of the jail, spent in the purchase of necessaries do. Any idea why this is occurring and how I can fix this issue? Thanks

Required amount of data and iterations to train the model

Hi, I'm training your model from scratch on 60 votes, each with 3-15 minutes of data. Surprisingly, the model starts to retrain already at 26k iterations with batch 12, given that the total duration of all audio files is about 7-8 hours. Unfortunately, I got unsatisfactory results, the speech of many speakers is completely illegible. I attach screenshots of the decoder training.

Trouble with inferencing without pitch and energy condition

Hi, I trained LJSpeech datasets without pitch and energy conditioning, and tried to inference it.
However, I got
AttributeError: 'RADTTS' object has no attribute 'dur_pred_layer'
error.

When training, I followed the code as stated in the readme.
And here is the colab notebook that I tried to inference:
https://colab.research.google.com/drive/1pwqjZri7k_hoLUNFK3Cxmk12BPad67Kb?usp=sharing

Thank You

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I have successfully trained a model on the first step (with the decoder config), but training fails on the second step when I am trying to train a model with config_ljs_agap.json config.

The error:

...
setting up tboard log in outdir_model/logs
Training voiced prediction
Training voiced embeddings
Epoch: 0
/home/yehor/.local/lib/python3.8/site-packages/torch/nn/modules/rnn.py:777: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at ../aten/src/ATen/native/cudnn/RNN.cpp:968.)
  result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,
/home/yehor/.local/lib/python3.8/site-packages/torch/nn/modules/rnn.py:777: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at ../aten/src/ATen/native/cudnn/RNN.cpp:968.)
  result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,
Traceback (most recent call last):
  File "train.py", line 499, in <module>
    train(n_gpus, rank, **train_config)
  File "train.py", line 417, in train
    scaler.scale(loss).backward()
  File "/home/yehor/.local/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/yehor/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

A fork of the source code is here: https://github.com/egorsmkv/radtts/

I am changed the code to support Ukrainian.

Environment:

python 3.8.10
torch 1.13.0+cu117

why mix phone and word embedding

        text_phoneme = [
           re.sub(r'\s(\d)', r'\1', word[1].upper()) if word[0] == '' else (
                self.get_phoneme(word[0])
                if np.random.uniform() < self.p_phoneme
                else word[0])
            for word in words]

in text_processing.py ，this code will return word and phoneme, why use both word and phoneme embedding ? Thanks

AttributeError: 'RADTTS' object has no attribute 'dur_pred_layer'

I receive an error when I inferenced the text with the pretrained model

Question about spectrogram normalization

Since we are using a variety of pretrained vocoders, we need to be able to generate outputs with different spectrogram normalizations. I notice that spectrogram normalization in

https://github.com/NVIDIA/radtts/blob/main/data.py#L270

seems somewhat arbitrary. What was the reason for this choice?

Cannot train starting from pre-trained model b/c audio files not found

I'm running radtts on an AWS g4dn.xlarge, not in a Docker container (although God bless you for providing one!)

Here's the command I'm running that's failing:
ubuntu@ip-172-31-93-8:~/radtts$ python3 train.py -c ./configs/config_ljs_decoder.json -p train_config.output_directory=./output

Here's the output from running that command:

ubuntu@ip-172-31-93-8:~/radtts$ python3 train.py -c ./configs/config_ljs_decoder.json -p train_config.output_directory=./output
train_config.output_directory=./output
output_directory=./output
overriding output_directory with ./output
{'train_config': {'output_directory': './output', 'epochs': 10000000, 'optim_algo': 'RAdam', 'learning_rate': 0.0001, 'weight_decay': 1e-06, 'sigma': 1.0, 'iters_per_checkpoint': 2500, 'batch_size': 16, 'seed': None, 'checkpoint_path': '', 'ignore_layers': [], 'ignore_layers_warmstart': [], 'finetune_layers': [], 'include_layers': [], 'vocoder_config_path': 'models/hifigan_config_22khz.json', 'vocoder_checkpoint_path': 'models/hifigan_ljs_generator_v1', 'log_attribute_samples': False, 'log_decoder_samples': True, 'warmstart_checkpoint_path': '', 'use_amp': False, 'grad_clip_val': 1.0, 'loss_weights': {'blank_logprob': -1, 'ctc_loss_weight': 0.1, 'binarization_loss_weight': 1.0, 'dur_loss_weight': 1.0, 'f0_loss_weight': 1.0, 'energy_loss_weight': 1.0, 'vpred_loss_weight': 1.0}, 'binarization_start_iter': 6000, 'kl_loss_start_iter': 18000, 'unfreeze_modules': 'all'}, 'data_config': {'training_files': {'LJS': {'basedir': 'filelists/', 'audiodir': 'wavs', 'filelist': 'ljs_audiopath_text_speaker_train_filelist.txt', 'lmdbpath': ''}}, 'validation_files': {'LJS': {'basedir': 'filelists/', 'audiodir': 'wavs', 'filelist': 'ljs_audiopath_text_speaker_val_filelist.txt', 'lmdbpath': ''}}, 'dur_min': 0.1, 'dur_max': 10.2, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': 8000.0, 'f0_min': 80.0, 'f0_max': 640.0, 'max_wav_value': 32768.0, 'use_f0': True, 'use_log_f0': 0, 'use_energy_avg': True, 'use_scaled_energy': True, 'symbol_set': 'radtts', 'cleaner_names': ['radtts_cleaners'], 'heteronyms_path': 'tts_text_processing/heteronyms', 'phoneme_dict_path': 'tts_text_processing/cmudict-0.7b', 'p_phoneme': 1.0, 'handle_phoneme': 'word', 'handle_phoneme_ambiguous': 'ignore', 'include_speakers': None, 'n_frames': -1, 'betabinom_cache_path': 'data_cache/', 'lmdb_cache_path': '', 'use_attn_prior_masking': True, 'prepend_space_to_text': True, 'append_space_to_text': True, 'add_bos_eos_to_text': False, 'betabinom_scaling_factor': 1.0, 'distance_tx_unvoiced': False, 'mel_noise_scale': 0.0}, 'dist_config': {'dist_backend': 'nccl', 'dist_url': 'tcp://localhost:54321'}, 'model_config': {'n_speakers': 1, 'n_speaker_dim': 16, 'n_text': 185, 'n_text_dim': 512, 'n_flows': 8, 'n_conv_layers_per_step': 4, 'n_mel_channels': 80, 'n_hidden': 1024, 'mel_encoder_n_hidden': 512, 'dummy_speaker_embedding': False, 'n_early_size': 2, 'n_early_every': 2, 'n_group_size': 2, 'affine_model': 'wavenet', 'include_modules': 'decatnvpred', 'scaling_fn': 'tanh', 'matrix_decomposition': 'LUS', 'learn_alignments': True, 'use_speaker_emb_for_alignment': False, 'attn_straight_through_estimator': True, 'use_context_lstm': True, 'context_lstm_norm': 'spectral', 'context_lstm_w_f0_and_energy': True, 'text_encoder_lstm_norm': 'spectral', 'n_f0_dims': 1, 'n_energy_avg_dims': 1, 'use_first_order_features': False, 'unvoiced_bias_activation': 'relu', 'decoder_use_partial_padding': True, 'decoder_use_unvoiced_bias': True, 'ap_pred_log_f0': True, 'ap_use_unvoiced_bias': True, 'ap_use_voiced_embeddings': True, 'dur_model_config': None, 'f0_model_config': None, 'energy_model_config': None, 'v_model_config': {'name': 'dap', 'hparams': {'n_speaker_dim': 16, 'take_log_of_input': False, 'bottleneck_hparams': {'in_dim': 512, 'reduction_factor': 16, 'norm': 'weightnorm', 'non_linearity': 'relu'}, 'arch_hparams': {'out_dim': 1, 'n_layers': 2, 'n_channels': 256, 'kernel_size': 3, 'p_dropout': 0.5, 'lstm_type': '', 'use_linear': 1}}}}}
> got rank 0 and world size 1 ...
./output
Using seed 1806
Applying spectral norm to text encoder LSTM
Applying spectral norm to context encoder LSTM
/home/ubuntu/radtts/common.py:391: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2497.)
  W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
Initializing RAdam optimizer
RADTTS(
  (speaker_embedding): Embedding(1, 16)
  (embedding): Embedding(185, 512)
  (flows): ModuleList(
    (0): FlowStep(
      (invtbl_conv): Invertible1x1ConvLUS()
      (affine_tfn): AffineTransformationLayer(
        (affine_param_predictor): WN(
          (in_layers): ModuleList(
            (0): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(4,), dilation=(2,))
            )
            (2): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(8,), dilation=(4,))
            )
            (3): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(16,), dilation=(8,))
            )
          )
          (res_skip_layers): ModuleList(
            (0): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (1): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (2): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (3): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
          )
          (start): Conv1d(1120, 1024, kernel_size=(1,), stride=(1,))
          (softplus): Softplus(beta=1, threshold=20)
          (end): Conv1d(1024, 160, kernel_size=(1,), stride=(1,))
        )
      )
    )
    (1): FlowStep(
      (invtbl_conv): Invertible1x1ConvLUS()
      (affine_tfn): AffineTransformationLayer(
        (affine_param_predictor): WN(
          (in_layers): ModuleList(
            (0): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(4,), dilation=(2,))
            )
            (2): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(8,), dilation=(4,))
            )
            (3): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(16,), dilation=(8,))
            )
          )
          (res_skip_layers): ModuleList(
            (0): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (1): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (2): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (3): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
          )
          (start): Conv1d(1120, 1024, kernel_size=(1,), stride=(1,))
          (softplus): Softplus(beta=1, threshold=20)
          (end): Conv1d(1024, 160, kernel_size=(1,), stride=(1,))
        )
      )
    )
    (2): FlowStep(
      (invtbl_conv): Invertible1x1ConvLUS()
      (affine_tfn): AffineTransformationLayer(
        (affine_param_predictor): WN(
          (in_layers): ModuleList(
            (0): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(4,), dilation=(2,))
            )
            (2): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(8,), dilation=(4,))
            )
            (3): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(16,), dilation=(8,))
            )
          )
          (res_skip_layers): ModuleList(
            (0): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (1): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (2): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (3): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
          )
          (start): Conv1d(1119, 1024, kernel_size=(1,), stride=(1,))
          (softplus): Softplus(beta=1, threshold=20)
          (end): Conv1d(1024, 158, kernel_size=(1,), stride=(1,))
        )
      )
    )
    (3): FlowStep(
      (invtbl_conv): Invertible1x1ConvLUS()
      (affine_tfn): AffineTransformationLayer(
        (affine_param_predictor): WN(
          (in_layers): ModuleList(
            (0): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(4,), dilation=(2,))
            )
            (2): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(8,), dilation=(4,))
            )
            (3): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(16,), dilation=(8,))
            )
          )
          (res_skip_layers): ModuleList(
            (0): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (1): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (2): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (3): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
          )
          (start): Conv1d(1119, 1024, kernel_size=(1,), stride=(1,))
          (softplus): Softplus(beta=1, threshold=20)
          (end): Conv1d(1024, 158, kernel_size=(1,), stride=(1,))
        )
      )
    )
    (4): FlowStep(
      (invtbl_conv): Invertible1x1ConvLUS()
      (affine_tfn): AffineTransformationLayer(
        (affine_param_predictor): WN(
          (in_layers): ModuleList(
            (0): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(4,), dilation=(2,))
            )
            (2): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(8,), dilation=(4,))
            )
            (3): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(16,), dilation=(8,))
            )
          )
          (res_skip_layers): ModuleList(
            (0): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (1): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (2): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (3): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
          )
          (start): Conv1d(1118, 1024, kernel_size=(1,), stride=(1,))
          (softplus): Softplus(beta=1, threshold=20)
          (end): Conv1d(1024, 156, kernel_size=(1,), stride=(1,))
        )
      )
    )
    (5): FlowStep(
      (invtbl_conv): Invertible1x1ConvLUS()
      (affine_tfn): AffineTransformationLayer(
        (affine_param_predictor): WN(
          (in_layers): ModuleList(
            (0): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(4,), dilation=(2,))
            )
            (2): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(8,), dilation=(4,))
            )
            (3): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(16,), dilation=(8,))
            )
          )
          (res_skip_layers): ModuleList(
            (0): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (1): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (2): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (3): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
          )
          (start): Conv1d(1118, 1024, kernel_size=(1,), stride=(1,))
          (softplus): Softplus(beta=1, threshold=20)
          (end): Conv1d(1024, 156, kernel_size=(1,), stride=(1,))
        )
      )
    )
    (6): FlowStep(
      (invtbl_conv): Invertible1x1ConvLUS()
      (affine_tfn): AffineTransformationLayer(
        (affine_param_predictor): WN(
          (in_layers): ModuleList(
            (0): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(4,), dilation=(2,))
            )
            (2): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(8,), dilation=(4,))
            )
            (3): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(16,), dilation=(8,))
            )
          )
          (res_skip_layers): ModuleList(
            (0): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (1): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (2): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (3): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
          )
          (start): Conv1d(1117, 1024, kernel_size=(1,), stride=(1,))
          (softplus): Softplus(beta=1, threshold=20)
          (end): Conv1d(1024, 154, kernel_size=(1,), stride=(1,))
        )
      )
    )
    (7): FlowStep(
      (invtbl_conv): Invertible1x1ConvLUS()
      (affine_tfn): AffineTransformationLayer(
        (affine_param_predictor): WN(
          (in_layers): ModuleList(
            (0): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(2,))
            )
            (1): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(4,), dilation=(2,))
            )
            (2): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(8,), dilation=(4,))
            )
            (3): ConvNorm(
              (conv): PartialConv1d(1024, 1024, kernel_size=(5,), stride=(1,), padding=(16,), dilation=(8,))
            )
          )
          (res_skip_layers): ModuleList(
            (0): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (1): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (2): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
            (3): Conv1d(1024, 1024, kernel_size=(1,), stride=(1,))
          )
          (start): Conv1d(1117, 1024, kernel_size=(1,), stride=(1,))
          (softplus): Softplus(beta=1, threshold=20)
          (end): Conv1d(1024, 154, kernel_size=(1,), stride=(1,))
        )
      )
    )
  )
  (encoder): Encoder(
    (convolutions): ModuleList(
      (0): Sequential(
        (0): ConvNorm(
          (conv): PartialConv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): InstanceNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      )
      (1): Sequential(
        (0): ConvNorm(
          (conv): PartialConv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): InstanceNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      )
      (2): Sequential(
        (0): ConvNorm(
          (conv): PartialConv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): InstanceNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      )
    )
    (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
  )
  (length_regulator): LengthRegulator()
  (attention): ConvAttention(
    (softmax): Softmax(dim=3)
    (log_softmax): LogSoftmax(dim=3)
    (key_proj): Sequential(
      (0): ConvNorm(
        (conv): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
      )
      (1): ReLU()
      (2): ConvNorm(
        (conv): Conv1d(1024, 80, kernel_size=(1,), stride=(1,))
      )
    )
    (query_proj): Sequential(
      (0): ConvNorm(
        (conv): Conv1d(80, 160, kernel_size=(3,), stride=(1,), padding=(1,))
      )
      (1): ReLU()
      (2): ConvNorm(
        (conv): Conv1d(160, 80, kernel_size=(1,), stride=(1,))
      )
      (3): ReLU()
      (4): ConvNorm(
        (conv): Conv1d(80, 80, kernel_size=(1,), stride=(1,))
      )
    )
  )
  (context_lstm): LSTM(1044, 520, batch_first=True, bidirectional=True)
  (unfold): Unfold(kernel_size=(2, 1), dilation=1, padding=0, stride=2)
  (unvoiced_bias_module): Sequential(
    (0): LinearNorm(
      (linear_layer): Linear(in_features=512, out_features=1, bias=True)
    )
    (1): ReLU()
  )
  (v_pred_module): DAP(
    (bottleneck_layer): BottleneckLayerLayer(
      (projection_fn): ConvNorm(
        (conv): Conv1d(512, 32, kernel_size=(3,), stride=(1,), padding=(1,))
      )
      (non_linearity): ReLU()
    )
    (feat_pred_fn): ConvLSTMLinear(
      (dropout): Dropout(p=0.5, inplace=False)
      (convolutions): ModuleList(
        (0): Conv1d(48, 256, kernel_size=(3,), stride=(1,), padding=(1,))
        (1): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
      )
      (dense): Linear(in_features=256, out_features=1, bias=True)
    )
  )
  (v_embeddings): Embedding(4, 512)
)
initializing training dataloader
Number of speakers: 1
Speaker IDS {'ljs': 0}
Number of files 12442
Number of files after duration filtering 12442
Dataloader initialized with no augmentations
initializing validation dataloader
Number of files 58
Number of files after duration filtering 58
Dataloader initialized with no augmentations
/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
saving current configuration in output dir
alignment.py
attribute_prediction_model.py
audio_processing.py
autoregressive_flow.py
common.py
data.py
distributed.py
hifigan_denoiser.py
hifigan_env.py
hifigan_models.py
hifigan_utils.py
inference.py
inference_voice_conversion.py
loss.py
partialconv1d.py
plotting_utils.py
radam.py
radtts.py
splines.py
train.py
transformer.py
setting up tboard log in ./output/logs
Training everything
Epoch: 0
Traceback (most recent call last):
  File "train.py", line 498, in <module>
    train(n_gpus, rank, **train_config)
  File "train.py", line 382, in train
    for batch in train_loader:
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/radtts/data.py", line 318, in __getitem__
    audio, sampling_rate = load_wav_to_torch(audiopath)
  File "/home/ubuntu/radtts/data.py", line 74, in load_wav_to_torch
    sampling_rate, data = read(full_path)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/scipy/io/wavfile.py", line 647, in read
    fid = open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'filelists/wavs/LJ048-0022.wav'

Is there somewhere I could get the LG048-0022.wav files, and the other LG/LJ files that the error is complaining about? Or am I merely betraying my complete ignorance about the subject of ML by asking this question?

My understanding is this:

filelists/ljs_audiopath_text_speaker_train_filelist.txt is the training data
filelists/ljs_audiopath_text_speaker_val_filelist.txt is the test data
Those two files contain on each line an audio file before the pipe character, and that audio file's transcription after the pipe character
Training a model will require the .wav files to exist in the place stipulated in the config by audiodir.

Apologies if these are dilettante questions, I'm a web dev by trade, definitely not an ML person. I'd appreciate any help you have to offer. I'm still going through a couple Udemy courses to learn more.

Thanks for the repo!

Best,

Martin

Is it possible to do inference in real time?

Hey! It is an absolutely mind blowing work! Thanks a lot. I wanted to ask would it be possible to do inference (voice conversion) in real time. Or will it be possible in near future?

muffled audio output for custom dataset

I have followed your work with Flowtron and it produces audio really nicely. I am using the same dataset of 1.5 hrs of a single speaker but I have trouble getting some output that is not as muffled here. The alignment is decent:

and I am using the following parameters, am I missing something? Or is it just because I am training from scratch (I already did the decoder step)? Any help will be appreciated, thanks!

{
"train_config": {
"output_directory": "/debug",
"epochs": 10000000,
"optim_algo": "RAdam",
"learning_rate": 1e-4,
"weight_decay": 1e-6,
"sigma": 1.0,
"iters_per_checkpoint": 185,
"batch_size": 4,
"seed": null,
"checkpoint_path": "",
"ignore_layers": [],
"ignore_layers_warmstart": [],
"finetune_layers": [],
"include_layers": [],
"vocoder_config_path": "/content/drive/MyDrive/hifigan_22khz_config.json",
"vocoder_checkpoint_path": "/content/drive/MyDrive/hifigan_libritts100360_generator0p5.pt",
"log_attribute_samples": true,
"log_decoder_samples": true,

    "warmstart_checkpoint_path": "path/to/pretrained/decoder",
    "use_amp": false,
    "grad_clip_val": 1.0,
    "loss_weights": {
        "blank_logprob": -1, 
        "ctc_loss_weight": 0.1,
        "binarization_loss_weight": 1.0,
        "dur_loss_weight": 1.0,
        "f0_loss_weight": 1.0,
        "energy_loss_weight": 1.0,
        "vpred_loss_weight": 1.0
    },
    "binarization_start_iter": 0,
    "kl_loss_start_iter": 0,
    "unfreeze_modules": "durf0energyvpred"
},
"data_config": {
    "training_files": {
        "LJS": {
            "basedir": "",
            "audiodir": "",
            "filelist":  "/content/drive/MyDrive/bastila_wav/filelists/bastila_train_rad.txt",
            "lmdbpath": ""
        }
    },
    "validation_files": {
        "LJS": {
            "basedir": "",
            "audiodir": "",
            "filelist":  "/content/drive/MyDrive/bastila_wav/filelists/bastila_val_rad.txt",
            "lmdbpath": ""
        }
    },
    "dur_min": 0.1,
    "dur_max": 10.2,
    "sampling_rate": 22050,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0,
    "f0_min": 80.0,
    "f0_max": 640.0,
    "max_wav_value": 32768.0,
    "use_f0": true,
    "use_log_f0": 0,
    "use_energy_avg": true,
    "use_scaled_energy": true,
    "symbol_set": "radtts",
    "cleaner_names": ["radtts_cleaners"],
    "heteronyms_path": "tts_text_processing/heteronyms",
    "phoneme_dict_path": "tts_text_processing/cmudict-0.7b",
    "p_phoneme": 1.0,
    "handle_phoneme": "word",
    "handle_phoneme_ambiguous": "ignore",
    "include_speakers": null,
    "n_frames": -1,
    "betabinom_cache_path": "data_cache/",
    "lmdb_cache_path": "", 
    "use_attn_prior_masking": true,
    "prepend_space_to_text": true,
    "append_space_to_text": true,
    "add_bos_eos_to_text": false,
    "betabinom_scaling_factor": 1.0,
    "distance_tx_unvoiced": false,
    "mel_noise_scale": 0.0
},
"dist_config": {
    "dist_backend": "nccl",
    "dist_url": "tcp://localhost:54321"
},
"model_config": {
    "n_speakers": 1,
    "n_speaker_dim": 16,
    "n_text": 185,
    "n_text_dim": 512,
    "n_flows": 8,
    "n_conv_layers_per_step": 4,
    "n_mel_channels": 80,
    "n_hidden": 1024,
    "mel_encoder_n_hidden": 512,
    "dummy_speaker_embedding": false,
    "n_early_size": 2,
    "n_early_every": 2,
    "n_group_size": 2,
    "affine_model": "wavenet",
    "include_modules": "decatndpmvpredapm",
    "scaling_fn": "tanh",
    "matrix_decomposition": "LUS",
    "learn_alignments": true,
    "use_speaker_emb_for_alignment": false,
    "attn_straight_through_estimator": true,
    "use_context_lstm": true,
    "context_lstm_norm": "spectral",
    "context_lstm_w_f0_and_energy": true,
    "text_encoder_lstm_norm": "spectral",
    "n_f0_dims": 1,
    "n_energy_avg_dims": 1,
    "use_first_order_features": false,
    "unvoiced_bias_activation": "relu",
    "decoder_use_partial_padding": true,
    "decoder_use_unvoiced_bias": true,
    "ap_pred_log_f0": true,
    "ap_use_unvoiced_bias": true,
    "ap_use_voiced_embeddings": true,
    "dur_model_config": {
        "name": "dap",
        "hparams": {
            "n_speaker_dim": 16,
            "bottleneck_hparams": {
                "in_dim": 512,
                "reduction_factor": 16,
                "norm": "weightnorm",
                "non_linearity": "relu"
            },
            "take_log_of_input": true,
            "arch_hparams": {
                "out_dim": 1,
                "n_layers": 2,
                "n_channels": 256,
                "kernel_size": 3,
                "p_dropout": 0.25
            }
        }
    },
    "f0_model_config": {
        "name": "agap",
        "hparams": {
            "n_in_dim": 1,
            "n_group_size": 1,
            "take_log_of_input": false,
            "n_speaker_dim": 22,
            "n_flows": 2,
            "n_hidden": 128,
            "n_lstm_layers": 1,
            "scaling_fn": "tanh",
            "bottleneck_hparams": {
                "in_dim": 512,
                "reduction_factor": 16,
                "norm": "weightnorm",
                "non_linearity": "relu"
            },
            "spline_flow_params": {
                "n_in_channels": 1,
                "n_context_dim": 128,
                "n_layers": 4,
                "n_bins": 24,
                "use_quadratic": true
            }
        }
    },
    "energy_model_config": {
        "name": "agap",
        "hparams": {
            "n_in_dim": 1,
            "n_group_size": 1,
            "take_log_of_input": false,
            "n_speaker_dim": 22,
            "n_flows": 4,
            "n_hidden": 128,
            "n_lstm_layers": 1,
            "scaling_fn": "tanh",
            "bottleneck_hparams": {
                "in_dim": 512,
                "reduction_factor": 16,
                "norm": "weightnorm",
                "non_linearity": "relu"
            },
            "spline_flow_params": {
                "n_in_channels": 1,
                "n_context_dim": 128,
                "n_layers": 4,
                "n_bins": 24,
                "use_quadratic": true
            }
        }
    },
    "v_model_config": {
        "name": "dap",
        "hparams": {
            "n_speaker_dim": 16,
            "take_log_of_input": false,
            "bottleneck_hparams": {
                "in_dim": 512,
                "reduction_factor": 16,
                "norm": "weightnorm",
                "non_linearity": "relu"
            },
            "arch_hparams": {
                "out_dim": 1,
                "n_layers": 2,
                "n_channels": 256,
                "kernel_size": 3,
                "p_dropout": 0.5,
                "lstm_type": "",
                "use_linear": 1
            }
        }
    }
}

}

waveglow

Hi, how can i use waveglow?

Here's a Colab notebook for using RADTTS [Documentation]

For anyone who just wants the Colab, check out this guy's comment here. It needs some updating, but if you copy it and change the args to reflect your own you'll be good to go.

Took me forever to find, so I just want to make it easier to see for anyone cruising the repo issues page. Thanks!

train decatndur

in readme, when I finished train attention and decoder, next it to train duration. the code is radtts.py
if 'dpm' in include_modules:
dur_model_config['hparams']['n_speaker_dim'] = n_speaker_dim
self.dur_pred_layer = get_attribute_prediction_model(
dur_model_config)

I just want to ask if i want train duration， the include_modules should be decatndur ? Thanks

size mismatch for context_lstm.weight_ih_l0 and context_lstm.weight_ih_l0_reverse

When I try to run the basic inference demo I get mismatches for the dimensionality of the pretrained models and context_lstm. There isn't really any location in any of the project files where 1044 or 1040 is directly specified if you search the content of every file, so this leaves me without really any context for how to start the inference. Here is a full command with verbose absolute paths for all input files using the downloaded files.

python inference.py -c '/home/skyler/Documents/RADTTS/configs/config_ljs_radtts.json' -r '/home/skyler/Documents/RADTTS/model_archives/radtts++ljs-dap.pt' -v '/home/skyler/Documents/RADTTS/model_archives/hifigan_libritts100360_generator0p5.pt' -k '/home/skyler/Documents/RADTTS/model_archives/hifigan_22khz_config.json' -t '/home/skyler/Documents/RADTTS/input/test.txt' --speaker_attributes ljs --speaker_text ljs -o results/

Applying spectral norm to text encoder LSTM
/home/skyler/Documents/RADTTS/common.py:391: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2497.)
  W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
Applying spectral norm to context encoder LSTM
Traceback (most recent call last):
  File "/home/skyler/Documents/RADTTS/inference.py", line 201, in <module>
    infer(args.radtts_path, args.vocoder_path, args.config_vocoder,
  File "/home/skyler/Documents/RADTTS/inference.py", line 95, in infer
    radtts.load_state_dict(state_dict, strict=False)
  File "/home/skyler/miniconda3/envs/radtts/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RADTTS:
	size mismatch for context_lstm.weight_ih_l0: copying a param with shape torch.Size([2080, 1044]) from checkpoint, the shape in current model is torch.Size([2080, 1040]).
	size mismatch for context_lstm.weight_ih_l0_reverse: copying a param with shape torch.Size([2080, 1044]) from checkpoint, the shape in current model is torch.Size([2080, 1040]).

No such file or directory: '/path_to_ljs/ljs_audiopath_text_sid_emotion_duration_train_filelist.txt

While traininng the model i am getting this error:

initializing training dataloader
Traceback (most recent call last):
File "/home/ubuntu/rahul/experiment/nvidia-radtts/radtts/train.py", line 498, in
train(n_gpus, rank, **train_config)
File "/home/ubuntu/rahul/experiment/nvidia-radtts/radtts/train.py", line 369, in train
train_loader, valset, collate_fn = prepare_dataloaders(
File "/home/ubuntu/rahul/experiment/nvidia-radtts/radtts/train.py", line 134, in prepare_dataloaders
trainset = Data(data_config['training_files'],
File "/home/ubuntu/rahul/experiment/nvidia-radtts/radtts/data.py", line 95, in init
self.data = self.load_data(datasets)
File "/home/ubuntu/rahul/experiment/nvidia-radtts/radtts/data.py", line 180, in load_data
with open(filelist_path, encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/path_to_ljs/ljs_audiopath_text_sid_emotion_duration_train_filelist.txt'

recommend the steps of 1st-stage training.

radtts and radtts++ both have 1st stage training(without the attribute predictor), I wonder how many steps of 1st stage is proper?
Can anyone recommend a value?

emotion = 'other' if len(d) == 3 else d[3] IndexError: list index out of range

emotion = 'other' if len(d) == 3 else d[3]
IndexError: list index out of range

what m i doing wrong? trying to train using pre train model, but not moving anywhere.

Dependencies are broken/incomplete, project deployment is hard to follow

Requirements for this project are incomplete, you run into errors saying you need to install torch and lmdb post installing requirements, no python version is specified in the readme, matplotlib gives errors with v. 2.1.0 and numpy version conflicts with other libraries when you run pip install -r requirements

If this is supposed to be only run inside a preconfigured NGC environment or some preconfigured image, and if so can that be specified in the readme (along with python version)

Finally, since there are so many files and configurations expected between these different steps can the readme also include a table of all these files are. A notebook pulling all these together to do a step by step deploy would with concrete example would be great too, and I'd be happy to help put something like that together if I had a table or some kind of clearer reference in the readme about how all this should be organized.

Train custom voice instead of the default ljs speaker.

I am attempting to train a custom speaker to be used with the provided inference script, replacing the default ljs speaker. However, I've encountered an issue where the inference outputs are muffled, similar to the problem described in this issue. I'm uncertain about the appropriate course of action. In my current training pipeline, I train the decoder, and the checkpoints are saved to /decoder_checkpoints/ag_decoder. Subsequently, I perform a Warm start training for the dap model in the directory /dap_checkpoints/rad_ag. During inference, I use the rad_ag checkpoint as the rad_tts checkpoint and utilize the provided vocoder checkpoint hifigan_libritts100360_generator0p5.pt. As a result, the ag_decoder checkpoint seems to be unused. Am I making a mistake in my approach? Should I train the decoder and the dap on the same checkpoint path? You can refer to this colab notebook for more details. I would greatly appreciate your guidance through the process or any relevant documentation you can provide.

Straight through on unsupervised aligner

Hello, I see that there is a straight through operation on the attention hard of the unsupervised aligner.
Why is this useful? There is no mention of it I think in any paper.

Inference with bgap models

I am trying to do inference with a trained BGAP model (config_ljs_bgap.json) but getting an error:

Loaded checkpoint 'outdir_pp_bgap_model/model_50000')
Number of speakers: 3
Speaker IDS {'lada': 0, 'mykyta': 1, 'tetiana': 2}

0/2: к+ам'ян+ець-под+ільський - м+істо в хмельн+ицькій +області укра+їни, ц+ентр кам'ян+ець-под+ільської міськ+ої об'+єднаної територі+альної гром+ади +і кам'ян+ець-под+ільського рай+ону
/home/yehor/Tools/anaconda3/envs/my/lib/python3.8/site-packages/torch/nn/modules/rnn.py:774: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at ../aten/src/ATen/native/cudnn/RNN.cpp:968.)
  result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
Traceback (most recent call last):
  File "inference.py", line 217, in <module>
    infer(args.radtts_path, args.vocoder_path, args.config_vocoder,
  File "inference.py", line 125, in infer
    outputs = radtts.infer(
  File "/home/yehor/RADTTS-Multiple-Voices/radtts/radtts.py", line 642, in infer
    f0 * voiced_mask + f0_bias, energy_avg)
RuntimeError: The size of tensor a (1048) must match the size of tensor b (1046) at non-singleton dimension 1

How to slow down the speed of the response?

Hey, I noticed sometimes that the results I get with inference.py are too fast to understand. Is there any argument I could pass in to slow down the speed of the response? I've played around with energy but so far, no luck.

Thank you!

Inference: size mismatch for context_lstm.weight_ih_l0: copying a param with shape torch.Size([2080, 1044]) from checkpoint, the shape in current model is torch.Size([2080, 1040]).

I'm sorry to trouble you. I'm trying to use the project to assist a patient on a ventilator -- I'm just trying to get inference working right now, but am unable to figure out some of the options:

CONFIG_PATH=configs/config_ljs_radtts.json
RADTTS_PATH=??
HG_PATH=data/archive/
HG_CONFIG_PATH=data/hifigan_22khz_config.json
TEXT_PATH=test.txt

python inference.py -c $CONFIG_PATH -r $RADTTS_PATH \
	-v $HG_PATH -k $HG_CONFIG_PATH -t $TEXT_PATH -s ljs \
	--speaker_attributes ljs --speaker_text ljs -o results/

I have hifigan_libritts100360_generator0p5.pt.zip unzipped into data/archive/*, like:

data/archive/data.pkl
data/archive/data/94135883059968
data/hifigan_22khz_config.json

I'm not sure what to put in for TEXT_PATH, nor if the config or, really, the other options should really point to.

Thanks for the help and your time.

with open(config_path) as f: FileNotFoundError: [Errno 2] No such file or directory:

I don't understand whats really wrong with this. I have the file in the folder and yet it is giving me the same error over and over.

can anyone help?

Setting model paths in Voice conversion demo

It seems there are specific requirements for the model paths and names to get the voice conversion demo to work, such as line 120 in inference_voice_conversion.py:

int(model_name.split('/')[-1].split('_')[-1])

Is there a specific model and config that makes this work out of the box?

cannot start training from a pre-trained model

i've downloaded the model provided for RADTTS
and im trying to use it to start training but i get the following error:

python train.py -c config.json -p train_config.ignore_layers=["speaker_embedding.weight"] train_config.checkpoint_path='models/radtts++ljs-dap.pt'

> got rank 0 and world size 1 ...
/debug
Using seed 1007
Applying spectral norm to text encoder LSTM
/root/radtts/common.py:391: UserWarning: torch.qr is deprecated in favor of torch.linalg.qr and will be removed in a future PyTorch release.
The boolean parameter 'some' has been replaced with a string parameter 'mode'.
Q, R = torch.qr(A, some)
should be replaced with
Q, R = torch.linalg.qr(A, 'reduced' if some else 'complete') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:1980.)
  W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
Applying spectral norm to context encoder LSTM

Initializing RAdam optimizer
Traceback (most recent call last):
  File "/root/radtts/train.py", line 498, in <module>
    train(n_gpus, rank, **train_config)
  File "/root/radtts/train.py", line 357, in train
    model, optimizer, iteration = load_checkpoint(
  File "/root/radtts/train.py", line 181, in load_checkpoint
    iteration = checkpoint_dict['iteration']
KeyError: 'iteration'

Is it a mistake in README?

Am I right that in the first arrow it must be config_ljs_decoder.json instead of config_ljs_radtts.json?