
Comments (8)

aidosRepoint commented on August 15, 2024:

Hi, I found that the synthesized speech files differ each time I run inference; their style and intonation are always varying. How can I control the style or intonation of the synthesized speech?

Another question: how should I choose the best model from a series of checkpoints? I found that more iterations do not necessarily correspond to better synthesis quality.

Hi!
For those interested in the same issue of getting varying results across inference runs: it seems that instead of using only

_ = model.cuda().eval().half()

you can use

_ = model.cuda().eval().half()
__ = model.decoder.prenet.eval()

and this will give you the same results every time.
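For context, here is a minimal sketch of how these two lines fit into the repo's usual inference flow, following the pattern in inference.ipynb (the checkpoint path is illustrative, not a file shipped with the repo):

```python
# Minimal sketch of deterministic inference setup, assuming the
# create_hparams/load_model helpers from NVIDIA/tacotron2.
import torch
from hparams import create_hparams
from train import load_model

hparams = create_hparams()
model = load_model(hparams)
# "checkpoint_80000" is an illustrative checkpoint path.
model.load_state_dict(torch.load("checkpoint_80000")["state_dict"])

_ = model.cuda().eval().half()
# Caveat: this extra call only disables prenet dropout if Prenet.forward
# passes training=self.training to F.dropout; the stock repo hardcodes
# training=True, so model.py may need the one-line edit discussed below.
__ = model.decoder.prenet.eval()
```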


rafaelvalle commented on August 15, 2024:

The variation comes from dropout in the Prenet, which remains active even at inference time. One can train without dropout, or set training to False here: https://github.com/NVIDIA/tacotron2/blob/master/model.py#L99.
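For reference, the relevant spot looks roughly like this (paraphrased from the repo's model.py around the linked line; treat it as a sketch, not the exact current source):

```python
# Sketch of the Prenet forward pass in NVIDIA/tacotron2 (model.py, ~L99).
# Dropout is applied with training=True hardcoded, so it stays stochastic
# even after model.eval(); switching to training=self.training (or False)
# makes inference deterministic.
import torch.nn.functional as F

def forward(self, x):
    for linear in self.layers:
        # original: F.dropout(F.relu(linear(x)), p=0.5, training=True)
        x = F.dropout(F.relu(linear(x)), p=0.5, training=self.training)
    return x
```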

Assuming your validation loss plateaus, any checkpoint in that plateau region is a reasonable choice.


rafaelvalle commented on August 15, 2024:

OK, so by tone you really mean F0.
Without dropout, the decoder has access to the full mel-spectrogram and can learn, during training, to condition mostly on it, ignoring the encoder outputs. If this happens, the model doesn't learn attention and is not able to synthesize speech.

We're in the process of releasing code that learns attention faster and may be useful in your case. Keep an eye on #55.


NewEricWang commented on August 15, 2024:

@rafaelvalle, thank you!
By setting dropout=0.0 I now get consistent speech across runs, but the speech quality is awful. Maybe I should retrain the model with dropout=0.0.

In my case, I found that the learned attention alignment does not hold as training iterations increase. For example, I get good alignment and synthesized speech from the 80k-step checkpoint, but worse alignment from the 85k-step checkpoint, and therefore worse (even awful) synthesized speech, even though its validation loss is better than at 80k steps.

I am trying to build a Chinese synthesis model with tacotron2. The tone of the speech I got is very strange.
This is a raw speech sample:
post_r_8001_1.wav.zip

This is the corresponding synthesized speech sample:
test.chn.61k.r1.wav.zip


rafaelvalle commented on August 15, 2024:

Attention is very important for proper synthesis: a low validation loss with bad alignment won't produce good samples at inference time, when we're not teacher forcing, i.e. providing the ground-truth previous mel frame to the decoder.
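To make the teacher-forcing distinction concrete, here is a schematic sketch; the step()/go_frame() interface is illustrative, not the actual NVIDIA/tacotron2 API:

```python
# Contrast between teacher-forced decoding (training) and free-running
# decoding (inference) for an autoregressive mel decoder.
def decode_teacher_forced(decoder, memory, target_mels):
    outputs, prev = [], decoder.go_frame()
    for t in range(target_mels.size(1)):
        frame, _ = decoder.step(prev, memory)  # attend over encoder memory
        outputs.append(frame)
        prev = target_mels[:, t]               # feed the REAL previous frame
    return outputs

def decode_free_running(decoder, memory, max_steps):
    outputs, prev = [], decoder.go_frame()
    for _ in range(max_steps):
        frame, stop = decoder.step(prev, memory)
        outputs.append(frame)
        if stop:                               # gate predicts end of speech
            break
        prev = frame                           # feed the model's OWN output
    return outputs
```

In the free-running loop, errors in the fed-back frame compound step by step, which is why a low training or validation loss alone doesn't guarantee good samples when alignment is bad.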

Audio quality depends heavily on the mel-to-audio decoder.
If by tone you mean the robotic sound, you can improve it by using a WaveNet vocoder, the same way you did in #52.

If you mean something else, please explain what it is and we can work on it together. I don't speak Chinese and, apart from the robotic tone from using Griffin-Lim, I think your sample sounds pretty good!


NewEricWang commented on August 15, 2024:

@rafaelvalle, tone is word-dependent in Chinese: each word has its own tone, and words with different tones have different meanings. The acoustic correlate of tone is pitch (F0). The speech sounds strange if even a single word's tone is incorrect.

When I set dropout=0 for the training procedure, attention still cannot be learned after 100k steps, even though a very low validation loss is reached.


NewEricWang commented on August 15, 2024:

@rafaelvalle, thank you very much. When synthesizing with dropout=0, the tones of sentences synthesized by models from different training steps differ.
This is a sample synthesized by the 126k-step model:
test.126k.v1.wav.zip

This is a sample synthesized by the 122k-step model:
test.122k.v1.1.wav.zip


rafaelvalle commented on August 15, 2024:

Closing due to inactivity.

