
cross-lingual-voice-cloning's People

Contributors

cobr123, dependabot[bot], grzegorz-k-karch, jeevesh8, jybaek, rafaelvalle, raulpuric, yoks

cross-lingual-voice-cloning's Issues

Attention Alignment Not Working

I am currently training the provided model with Korean and English datasets, with a total of 27 speakers.
As stated in the README.md, I added Korean to "symbols" as follows:

[screenshot: Korean characters added to the symbols list]

The problem is that even after training the model for over 45,000 steps, the attention alignment is not forming.

[attention alignment plot]

The target and predicted mel-spectrograms seem similar enough.

[target and predicted mel-spectrograms]

To anyone who has used this repo, and to @Jeevesh8: how long does it normally take for the attention to start aligning properly? Should I continue training?

Any help and advice would be greatly appreciated.

RuntimeError: CUDA error: out of memory

Sorry to bother you.
On a V100-SXM2 GPU with 32 GB of memory, the following always appears:
```
python train.py --output_directory=outdir/ --log_directory=logdir/ -c tacotron2_statedict.pt --warm_start
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Traceback (most recent call last):
File "train.py", line 292, in
args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
File "train.py", line 169, in train
model = load_model(hparams)
File "train.py", line 74, in load_model
model = Tacotron2(hparams).cuda()
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 265, in cuda
return self._apply(lambda t: t.cuda(device))
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 199, in _apply
param.data = fn(param.data)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 265, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
```

I modified hparams.py to set
batch_size=1
but the error remains.

Is this expected? How can I fix it, please?
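One thing worth checking, since an OOM while moving the model to the device, even with batch_size=1, usually means the memory is already taken by another process. A minimal diagnostic sketch (my suggestion, not repo code; GPU index 0 is an assumption):

```python
# Minimal diagnostic sketch (not repo code): verify PyTorch can allocate on the
# GPU at all. An OOM at model.cuda() even at batch_size=1 often means another
# process already holds the memory (cross-check with nvidia-smi).
import torch

props = torch.cuda.get_device_properties(0)  # assumption: GPU index 0
print(torch.cuda.get_device_name(0))
print(f"total memory: {props.total_memory / 2**30:.1f} GiB")

x = torch.empty(1024, 1024, device="cuda")  # tiny allocation; fails fast if the GPU is full
print(f"allocated after test tensor: {torch.cuda.memory_allocated(0) / 2**30:.3f} GiB")
```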

Input text type

The website example is awesome.

I have a question about the input text type.
For Mandarin, I know of the following types:

  1. kǎěrpǔ péi wàisūn wán huátī。
  2. ka3er3pu3 pei2 wai4sun1 wan2 hua2ti1。
  3. k a3 er3 p u3 p ei2 w ai4 s un1 w an2 h ua2 t i1 。

Could you tell me which of the above types you used?

For English:
I just use plain English words, such as: in being comparatively modern.

Is that how you handled it?


I just read your paper; you tried three input types, Characters, UTF-8 Encoded Bytes, and Phonemes, and the Phoneme type was the best.
In fact, I have also trained with phonemes before, using the Python package phonemizer==2.1 to generate the phonemes for Mandarin. Some of the generated phonemes have no tone, so the result was just as your paper said:

CN raters commented that it sounded like “a foreigner speaking Chinese”

The reason is that the model judges tone poorly, so it cannot distinguish the four tones of Mandarin.

Therefore, what tool do you use to generate phonemes for each language?
And finally, could you give me an example of train.txt? Like this:

<path-to-wav-file>|b ao3 an1 y ong4 sh ou3 q ia1 zh u4 j i4 zh e3 b o2 z i q iang3 x iang4 j i1 。|0|Mandarin
<path-to-wav-file>|it had arrangements to be notified about release from confinement in roughly one thousand cases;|1|English
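Not an answer from the author, but one way to keep the four tones in the Mandarin input is pypinyin's tone3 style, which matches the tone-numbered format in the example above. A minimal sketch (using pypinyin is my assumption; the paper or repo may use a different frontend):

```python
# Minimal sketch (assumption, not the repo's frontend): tone-preserving Mandarin
# transcription with pypinyin. Style.TONE3 appends the tone number to each
# syllable (5 = neutral tone), so tone information is never silently dropped.
from pypinyin import lazy_pinyin, Style

def mandarin_to_tone3(text: str) -> str:
    syllables = lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)
    return " ".join(syllables)

print(mandarin_to_tone3("保安用手掐住记者脖子抢相机。"))
# -> bao3 an1 yong4 shou3 qia1 zhu4 ji4 zhe3 bo2 zi5 qiang3 xiang4 ji1 。
```

Splitting each syllable further into initial and final, as in the train.txt example, can be done with pypinyin's Style.INITIALS together with Style.FINALS_TONE3.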

The code seems to be incomplete, e.g. the preprocessing of the Mandarin and English data.

The README is really not very detailed; I don't know how to train Mandarin-English code-switching.

Sample dataset and train, validation files?

Hi @Jeevesh8,

Thanks for this incredible implementation of the "fluently speak foreign language" paper! (It seems to be the only open-source implementation available.) I never had a chance to play with this code before, but now I need it.

I wonder if you could provide a tiny sample dataset that this model can be fit on right away. That would make it easier for anyone (including me) experimenting with your code to modify and expand the dataset. To be exact, I plan to train it on the Common Voice dataset.

Thanks!
Rahul Bhalley

Code-switched speech has different voices

I used your model, training on the open-source Biaobei (Chinese) dataset and the LJ Speech dataset. After 22,000 steps it successfully synthesized mixed Chinese-English speech, but the Chinese audio is in the Biaobei speaker's voice and the English audio is in the LJ Speech speaker's voice.
Is the number of training steps insufficient?
Thanks

Available pre-trained model?

Hi,

I would like to know if you could share the pre-trained model used to run the clvc-infer-gh notebook.
Best,

index out of range

Traceback (most recent call last):
File "train.py", line 294, in
args.warm_start, args.n_gpus, args.local_rank, args.group_name, hparams)
File "train.py", line 211, in train
for i, batch in enumerate(train_loader):
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in next
data = self._next_data()
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zero/anaconda3/envs/tacotron/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zero/TTS/Cross-Lingual-Voice-Cloning/data_utils.py", line 62, in getitem
return self.get_mel_text_pair(self.audiopaths_and_text[index])
File "/home/zero/TTS/Cross-Lingual-Voice-Cloning/data_utils.py", line 35, in get_mel_text_pair
mel = self.get_mel(audiopath)
File "/home/zero/TTS/Cross-Lingual-Voice-Cloning/data_utils.py", line 43, in get_mel
sampling_rate, self.stft.sampling_rate))
IndexError: tuple index out of range

I got the error "index out of range" and I can't find the reason. (A quick sanity check is sketched after the hparams below.)

And these are the hparams:

    use_saved_learning_rate=False,
    learning_rate=1e-3,
    weight_decay=1e-5,
    grad_clip_thresh=1.0,
    batch_size=10,
    mask_padding=True,  # set model's padded outputs to padded values

    ###############################
    # Speaker and Lang Embeddings #
    ###############################
    speaker_embedding_dim = 64,
    lang_embedding_dim = 3,
    n_langs = 2,
    n_speakers = 9,

    ###############################
    ## Speaker Classifier Params ##
    ###############################
    hidden_sc_dim=256,

    ##############################
    ## Residual Encoder Params  ##
    ##############################
    residual_encoding_dim = 32,          #16 for q(z_l|X) and 16 for q(z_o|X)
    dim_yo = 6,                          #(==n_speakers) dim(y_{o})
    dim_yl = 10,                         #K
    # mcn = 8                              #n for monte carlo sampling of q(z_l|X)and q(z_o|X)
    mcn = 6                              #n for monte carlo sampling of q(z_l|X)and q(z_o|X)
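The raising line in get_mel formats sampling_rate against self.stft.sampling_rate, which points toward a sample-rate mismatch between some wav files and the hparams. A minimal check, assuming the path|text|speaker|lang filelist format from the earlier issue; the filelist path and expected rate are assumptions:

```python
# Minimal debugging sketch (not repo code): find wav files whose sampling rate
# differs from hparams.sampling_rate, a common trigger for errors raised from
# get_mel() in data_utils.py.
from scipy.io import wavfile

EXPECTED_SR = 22050               # assumption: match hparams.sampling_rate
FILELIST = "filelists/train.txt"  # assumption: your training filelist

with open(FILELIST, encoding="utf-8") as f:
    for line_no, line in enumerate(f, 1):
        path = line.split("|")[0]
        try:
            sr, _ = wavfile.read(path)
        except Exception as exc:
            print(f"line {line_no}: cannot read {path}: {exc}")
            continue
        if sr != EXPECTED_SR:
            print(f"line {line_no}: {path} has sr={sr}, expected {EXPECTED_SR}")
```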

Train failed because the loss is nan

I trained the network on PyTorch 1.5 with 3 GPUs, and training failed because the loss is nan.

hparams
################################
# Optimization Hyperparameters #
################################
use_saved_learning_rate=False,
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1.0,
batch_size=24,
mask_padding=True, # set model's padded outputs to padded values

nvidia-smi output when it failed:

[nvidia-smi screenshot]

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed
Epoch: 0
Train loss 0 17732198.000000 Grad Norm 2364579774464.000000 3.59s/it
/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead (warning repeated once per worker)
Validation loss 0: 1936390491.411058
Saving model and optimizer state at iteration 0 to outdir/checkpoint_0
Train loss 1 3056206.750000 Grad Norm 110319550464.000000 2.15s/it
Train loss 2 3184566.750000 Grad Norm 245978316800.000000 2.15s/it
Train loss 3 4360.366699 Grad Norm 405133.906250 1.97s/it
Train loss 4 511.444702 Grad Norm 10648.502930 1.79s/it
Train loss 5 162.903091 Grad Norm 1219.052490 1.82s/it
Train loss 6 76.945312 Grad Norm 334.356018 1.77s/it
Train loss 7 41.395836 Grad Norm 134.526993 2.25s/it
Train loss 8 27.028851 Grad Norm nan 1.94s/it
Train loss 9 nan Grad Norm nan 1.90s/it
Train loss 10 nan Grad Norm nan 2.47s/it
Train loss 11 nan Grad Norm nan 1.82s/it
Train loss 12 nan Grad Norm nan 2.03s/it

Train loss 982 nan Grad Norm nan 2.11s/it
Epoch: 1
Train loss 983 nan Grad Norm nan 2.10s/it
[Train loss 984 through 999: all nan, Grad Norm nan]
Train loss 1000 nan Grad Norm nan 1.68s/it
Validation loss 1000: nan
Traceback (most recent call last):
File "train.py", line 292, in <module>
args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
File "train.py", line 250, in train
hparams.distributed_run, rank)
File "train.py", line 147, in validate
logger.log_validation(val_loss, model, y, y_pred, iteration)
File "/data/glusterfs_speech_tts/public_data/11104653/multiLingual_voice_cloning/Cross-Lingual-Voice-Cloning/logger.py", line 27, in log_validation
self.add_histogram(tag, value.data.cpu().numpy(), iteration)
File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 425, in add_histogram
histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 226, in histogram
hist = make_histogram(values.astype(float), bins, max_bins)
File "/data/glusterfs_speech_tts/public_data/11104653/tools/miniconda3/envs/myenv3.6/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 264, in make_histogram
raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
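The loss and grad norm explode in the first few iterations before turning nan, so by iteration 8 the run is already unrecoverable. Two common mitigations (my suggestions, not repo code): lower the initial learning_rate (e.g. 1e-4 instead of 1e-3), and fail fast on a non-finite loss instead of logging nan for a thousand iterations. A minimal sketch of the latter; the call site is an assumption about where the loss is computed in train.py:

```python
# Minimal sketch (not repo code): abort the run as soon as the loss turns
# non-finite, instead of training for hours on nan values.
import torch

def check_finite(loss: torch.Tensor, iteration: int) -> None:
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss at iteration {iteration}: {loss.item()}")

# assumed usage inside the training loop, right after the loss is computed:
#   loss = criterion(y_pred, y)
#   check_finite(loss, iteration)
```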

Problem with the input mel length

Hi there,
I'm getting a RuntimeError: CUDA error: device-side assert triggered when trying to train your model on the LJ Speech dataset. Is it caused by the mel length of the data exceeding the default decoder limit, or by something else?
Many thanks.
[screenshot of the error]
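Not an answer from the author, but device-side asserts are reported asynchronously, so the line in the trace is often not the op that actually failed; forcing synchronous kernel launches usually reveals it (frequently an out-of-range embedding index, e.g. a symbol or speaker ID, rather than mel length). A minimal sketch:

```python
# Minimal sketch: make CUDA kernel launches synchronous so a device-side assert
# is attributed to the op that actually failed. The variable must be set before
# CUDA is initialized, i.e. before importing torch.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import only after setting the variable
```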

Unable to achieve cross-lingual cloning

Hi, we trained the model on four languages. The model synthesizes well when the speaker ID and language ID match their training pairing, but when we change the speaker ID and language ID, cross-lingual voice cloning does not happen; instead, it synthesizes samples in the original speaker's voice.
