
cross-lingual-voice-cloning's People

Contributors

cobr123, dependabot[bot], grzegorz-k-karch, jeevesh8, jybaek, rafaelvalle, raulpuric, yoks

cross-lingual-voice-cloning's Issues

Attention Alignment Not Working

I am currently training the provided model with Korean and English datasets, with a total of 27 speakers.
As stated in the README.md, I added Korean to "symbols" as follows:

[screenshot: Korean characters added to the symbols list]

The problem is that even after training the model for over 45,000 steps, the attention alignment is not forming.

[attention alignment plot]

The target and predicted mel-spectrograms seem similar enough.

[target and predicted mel-spectrograms]

To anyone who has used this repo, and to @Jeevesh8: how long does it normally take for the attention to start aligning properly? Should I continue training?

Any help and advice would be greatly appreciated.

RuntimeError: CUDA error: out of memory

Sorry to bother you.
On a V100-SXM2 GPU with 32 GB of memory, the following always appears:
```
python train.py --output_directory=outdir/ --log_directory=logdir/ -c tacotron2_statedict.pt --warm_start
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Traceback (most recent call last):
File "train.py", line 292, in
args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
File "train.py", line 169, in train
model = load_model(hparams)
File "train.py", line 74, in load_model
model = Tacotron2(hparams).cuda()
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 265, in cuda
return self._apply(lambda t: t.cuda(device))
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 199, in _apply
param.data = fn(param.data)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 265, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
```

I modified hparams.py to set
batch_size=1
but the error remains.

Is this expected? How can I fix it, please?
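One thing worth checking, since an OOM while moving the model to the device, even with batch_size=1, usually means the memory is already taken by another process. A minimal diagnostic sketch (my suggestion, not repo code; GPU index 0 is an assumption):

```python
# Minimal diagnostic sketch (not repo code): verify PyTorch can allocate on the
# GPU at all. An OOM at model.cuda() even at batch_size=1 often means another
# process already holds the memory (cross-check with nvidia-smi).
import torch

props = torch.cuda.get_device_properties(0)  # assumption: GPU index 0
print(torch.cuda.get_device_name(0))
print(f"total memory: {props.total_memory / 2**30:.1f} GiB")

x = torch.empty(1024, 1024, device="cuda")  # tiny allocation; fails fast if the GPU is full
print(f"allocated after test tensor: {torch.cuda.memory_allocated(0) / 2**30:.3f} GiB")
```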

Input text type

The website example is awesome.

I have a question about the input text type.
For Mandarin, I know of the following types:

  1. kǎěrpǔ péi wàisūn wán huátī。
  2. ka3er3pu3 pei2 wai4sun1 wan2 hua2ti1。
  3. k a3 er3 p u3 p ei2 w ai4 s un1 w an2 h ua2 t i1 。

Could you tell me which of the above types you used?

For English:
I just use plain English words, such as: in being comparatively modern.

Is that how you handled it?


I just read your paper; you tried three input types, Characters, UTF-8 Encoded Bytes, and Phonemes, and the Phoneme type was the best.
In fact, I have also trained with phonemes before, using the Python package phonemizer==2.1 to generate the phonemes for Mandarin. Some of the generated phonemes have no tone, so the result was just as your paper said:

CN raters commented that it sounded like “a foreigner speaking Chinese”

The reason is that the model judges tone poorly, so it cannot distinguish the four tones of Mandarin.

Therefore, what tool do you use to generate phonemes for each language?
And finally, could you give me an example of train.txt? Like this:

<path-to-wav-file>|b ao3 an1 y ong4 sh ou3 q ia1 zh u4 j i4 zh e3 b o2 z i q iang3 x iang4 j i1 。|0|Mandarin
<path-to-wav-file>|it had arrangements to be notified about release from confinement in roughly one thousand cases;|1|English
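Not an answer from the author, but one way to keep the four tones in the Mandarin input is pypinyin's tone3 style, which matches the tone-numbered format in the example above. A minimal sketch (using pypinyin is my assumption; the paper or repo may use a different frontend):

```python
# Minimal sketch (assumption, not the repo's frontend): tone-preserving Mandarin
# transcription with pypinyin. Style.TONE3 appends the tone number to each
# syllable (5 = neutral tone), so tone information is never silently dropped.
from pypinyin import lazy_pinyin, Style

def mandarin_to_tone3(text: str) -> str:
    syllables = lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)
    return " ".join(syllables)

print(mandarin_to_tone3("保安用手掐住记者脖子抢相机。"))
# -> bao3 an1 yong4 shou3 qia1 zhu4 ji4 zhe3 bo2 zi5 qiang3 xiang4 ji1 。
```

Splitting each syllable further into initial and final, as in the train.txt example, can be done with pypinyin's Style.INITIALS together with Style.FINALS_TONE3.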

The code seems to be incomplete, e.g. the preprocessing of the Mandarin and English data.

The README is really not very detailed; I don't know how to train Mandarin-English code-switching.

Sample dataset and train, validation files?

Hi @Jeevesh8,

Thanks for this incredible implementation of the "fluently speak foreign language" paper! (It seems to be the only open-source implementation available.) I never had a chance to play with this code before, but now I need it.

I wonder if you could provide a tiny sample dataset that this model can be fit on right away. That would make it easier for anyone (including me) experimenting with your code to modify and expand the dataset. To be exact, I plan to train it on the Common Voice dataset.

Thanks!
Rahul Bhalley

Code-switched speech has different voices

I used your model, training on the open-source Biaobei (Chinese) dataset and the LJ Speech dataset. After 22,000 steps it successfully synthesized mixed Chinese-English speech, but the Chinese audio is in the Biaobei speaker's voice and the English audio is in the LJ Speech speaker's voice.
Is the number of training steps insufficient?
Thanks

Available pre-trained model?

Hi,

I would like to know if you could share the pre-trained model used to run the clvc-infer-gh notebook.
Best,

index out of range

Traceback (most recent call last):
File "train.py", line 294, in
args.warm_start, args.n_gpus, args.local_rank, args.group_name, hparams)
File "train.py", line 211, in train
for i, batch in enumerate(train_loader):
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in next
data = self._next_data()
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zero/TTS/Cross-Lingual-Voice-Cloning/data_utils.py", line 62, in getitem
return self.get_mel_text_pair(self.audiopaths_and_text[index])
File "/home/zero/TTS/Cross-Lingual-Voice-Cloning/data_utils.py", line 35, in get_mel_text_pair
mel = self.get_mel(audiopath)
File "/home/zero/TTS/Cross-Lingual-Voice-Cloning/data_utils.py", line 43, in get_mel
sampling_rate, self.stft.sampling_rate))
IndexError: tuple index out of range

I got the error "index out of range" and I can't find the reason. (A quick sanity check is sketched after the hparams below.)

And these are the hparams:

    use_saved_learning_rate=False,
    learning_rate=1e-3,
    weight_decay=1e-5,
    grad_clip_thresh=1.0,
    batch_size=10,
    mask_padding=True,  # set model's padded outputs to padded values

    ###############################
    # Speaker and Lang Embeddings #
    ###############################
    speaker_embedding_dim = 64,
    lang_embedding_dim = 3,
    n_langs = 2,
    n_speakers = 9,

    ###############################
    ## Speaker Classifier Params ##
    ###############################
    hidden_sc_dim=256,

    ##############################
    ## Residual Encoder Params  ##
    ##############################
    residual_encoding_dim = 32,          #16 for q(z_l|X) and 16 for q(z_o|X)
    dim_yo = 6,                          #(==n_speakers) dim(y_{o})
    dim_yl = 10,                         #K
    # mcn = 8                              #n for monte carlo sampling of q(z_l|X)and q(z_o|X)
    mcn = 6                              #n for monte carlo sampling of q(z_l|X)and q(z_o|X)
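The raising line in get_mel formats sampling_rate against self.stft.sampling_rate, which points toward a sample-rate mismatch between some wav files and the hparams. A minimal check, assuming the path|text|speaker|lang filelist format from the earlier issue; the filelist path and expected rate are assumptions:

```python
# Minimal debugging sketch (not repo code): find wav files whose sampling rate
# differs from hparams.sampling_rate, a common trigger for errors raised from
# get_mel() in data_utils.py.
from scipy.io import wavfile

EXPECTED_SR = 22050               # assumption: match hparams.sampling_rate
FILELIST = "filelists/train.txt"  # assumption: your training filelist

with open(FILELIST, encoding="utf-8") as f:
    for line_no, line in enumerate(f, 1):
        path = line.split("|")[0]
        try:
            sr, _ = wavfile.read(path)
        except Exception as exc:
            print(f"line {line_no}: cannot read {path}: {exc}")
            continue
        if sr != EXPECTED_SR:
            print(f"line {line_no}: {path} has sr={sr}, expected {EXPECTED_SR}")
```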

Train failed because the loss is nan

I trained the network on PyTorch 1.5 with 3 GPUs, and training failed because the loss is nan.

hparams
################################
# Optimization Hyperparameters #
################################
use_saved_learning_rate=False,
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1.0,
batch_size=24,
mask_padding=True, # set model's padded outputs to padded values

nvidia-smi output when it failed:

[nvidia-smi screenshot]

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed
Epoch: 0
Train loss 0 17732198.000000 Grad Norm 2364579774464.000000 3.59s/it
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead (warning repeated once per worker)
Validation loss 0: 1936390491.411058
Saving model and optimizer state at iteration 0 to outdir/checkpoint_0
Train loss 1 3056206.750000 Grad Norm 110319550464.000000 2.15s/it
Train loss 2 3184566.750000 Grad Norm 245978316800.000000 2.15s/it
Train loss 3 4360.366699 Grad Norm 405133.906250 1.97s/it
Train loss 4 511.444702 Grad Norm 10648.502930 1.79s/it
Train loss 5 162.903091 Grad Norm 1219.052490 1.82s/it
Train loss 6 76.945312 Grad Norm 334.356018 1.77s/it
Train loss 7 41.395836 Grad Norm 134.526993 2.25s/it
Train loss 8 27.028851 Grad Norm nan 1.94s/it
Train loss 9 nan Grad Norm nan 1.90s/it
Train loss 10 nan Grad Norm nan 2.47s/it
Train loss 11 nan Grad Norm nan 1.82s/it
Train loss 12 nan Grad Norm nan 2.03s/it

Train loss 982 nan Grad Norm nan 2.11s/it
Epoch: 1
Train loss 983 nan Grad Norm nan 2.10s/it
[Train loss 984 through 999: all nan, Grad Norm nan]
Train loss 1000 nan Grad Norm nan 1.68s/it
Validation loss 1000: nan
Traceback (most recent call last):
File "train.py", line 292, in <module>
args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
File "train.py", line 250, in train
hparams.distributed_run, rank)
File "train.py", line 147, in validate
logger.log_validation(val_loss, model, y, y_pred, iteration)
File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/logger.py", line 27, in log_validation
self.add_histogram(tag, value.data.cpu().numpy(), iteration)
File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 425, in add_histogram
histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 226, in histogram
hist = make_histogram(values.astype(float), bins, max_bins)
File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 264, in make_histogram
raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
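The loss and grad norm explode in the first few iterations before turning nan, so by iteration 8 the run is already unrecoverable. Two common mitigations (my suggestions, not repo code): lower the initial learning_rate (e.g. 1e-4 instead of 1e-3), and fail fast on a non-finite loss instead of logging nan for a thousand iterations. A minimal sketch of the latter; the call site is an assumption about where the loss is computed in train.py:

```python
# Minimal sketch (not repo code): abort the run as soon as the loss turns
# non-finite, instead of training for hours on nan values.
import torch

def check_finite(loss: torch.Tensor, iteration: int) -> None:
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss at iteration {iteration}: {loss.item()}")

# assumed usage inside the training loop, right after the loss is computed:
#   loss = criterion(y_pred, y)
#   check_finite(loss, iteration)
```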

Problem with the input mel length

Hi there,
I'm getting a RuntimeError: CUDA error: device-side assert triggered when trying to train your model on the LJ Speech dataset. Is it caused by the mel length of the data exceeding the default decoder limit, or by something else?
Many thanks.
[screenshot of the error]
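Not an answer from the author, but device-side asserts are reported asynchronously, so the line in the trace is often not the op that actually failed; forcing synchronous kernel launches usually reveals it (frequently an out-of-range embedding index, e.g. a symbol or speaker ID, rather than mel length). A minimal sketch:

```python
# Minimal sketch: make CUDA kernel launches synchronous so a device-side assert
# is attributed to the op that actually failed. The variable must be set before
# CUDA is initialized, i.e. before importing torch.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import only after setting the variable
```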

Unable to achieve cross-lingual cloning

Hi, we trained the model on four languages. The model synthesizes well when the speaker ID and language ID match their training pairing, but when we change the speaker ID and language ID, cross-lingual voice cloning does not happen; instead, it synthesizes samples in the original speaker's voice.
