
Comments (8)

aidosRepoint commented on August 15, 2024:

Hi, I found that the synthesized speech files differ each time I run inference; their style and intonation are always varying. How can I control the style or intonation of the synthesized speech?

Another question: how should I choose the best model from a series of checkpoints? I found that more iterations do not necessarily correspond to better synthesis quality.

Hi!
For those interested in the same issue of getting varying results across inference runs: it seems that instead of using only

_ = model.cuda().eval().half()

you can use

_ = model.cuda().eval().half()
__ = model.decoder.prenet.eval()

and this will give you the same results every time.
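For context, here is a minimal sketch of how these two lines fit into the repo's usual inference flow, following the pattern in inference.ipynb (the checkpoint path is illustrative, not a file shipped with the repo):

```python
# Minimal sketch of deterministic inference setup, assuming the
# create_hparams/load_model helpers from NVIDIA/tacotron2.
import torch
from hparams import create_hparams
from train import load_model

hparams = create_hparams()
model = load_model(hparams)
# "checkpoint_80000" is an illustrative checkpoint path.
model.load_state_dict(torch.load("checkpoint_80000")["state_dict"])

_ = model.cuda().eval().half()
# Caveat: this extra call only disables prenet dropout if Prenet.forward
# passes training=self.training to F.dropout; the stock repo hardcodes
# training=True, so model.py may need the one-line edit discussed below.
__ = model.decoder.prenet.eval()
```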


rafaelvalle commented on August 15, 2024:

The variation comes from dropout in the Prenet, which remains active even at inference time. One can train without dropout, or set training to False here: https://github.com/NVIDIA/tacotron2/blob/master/model.py#L99.
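For reference, the relevant spot looks roughly like this (paraphrased from the repo's model.py around the linked line; treat it as a sketch, not the exact current source):

```python
# Sketch of the Prenet forward pass in NVIDIA/tacotron2 (model.py, ~L99).
# Dropout is applied with training=True hardcoded, so it stays stochastic
# even after model.eval(); switching to training=self.training (or False)
# makes inference deterministic.
import torch.nn.functional as F

def forward(self, x):
    for linear in self.layers:
        # original: F.dropout(F.relu(linear(x)), p=0.5, training=True)
        x = F.dropout(F.relu(linear(x)), p=0.5, training=self.training)
    return x
```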

Assuming your validation loss plateaus, any checkpoint in that plateau region is a reasonable choice.


rafaelvalle commented on August 15, 2024:

OK, so by tone you really mean F0.
Without dropout, the decoder has access to the full mel-spectrogram and can learn, during training, to condition mostly on it, ignoring the encoder outputs. If this happens, the model doesn't learn attention and is not able to synthesize speech.

We're in the process of releasing code that learns attention faster and may be useful in your case. Keep an eye on #55.


NewEricWang commented on August 15, 2024:

@rafaelvalle, thank you!
By setting dropout=0.0 I now get consistent speech across runs, but the speech quality is awful. Maybe I should retrain the model with dropout=0.0.

In my case, I found that the learned attention alignment does not hold as training iterations increase. For example, I get good alignment and synthesized speech from the 80k-step checkpoint, but worse alignment from the 85k-step checkpoint, and therefore worse (even awful) synthesized speech, even though its validation loss is better than at 80k steps.

I am trying to build a Chinese synthesis model with tacotron2. The tone of the speech I got is very strange.
This is a raw speech sample:
post_r_8001_1.wav.zip

This is the corresponding synthesized speech sample:
test.chn.61k.r1.wav.zip


rafaelvalle commented on August 15, 2024:

Attention is very important for proper synthesis: a low validation loss with bad alignment won't produce good samples at inference time, when we're not teacher forcing, i.e. providing the ground-truth previous mel frame to the decoder.
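To make the teacher-forcing distinction concrete, here is a schematic sketch; the step()/go_frame() interface is illustrative, not the actual NVIDIA/tacotron2 API:

```python
# Contrast between teacher-forced decoding (training) and free-running
# decoding (inference) for an autoregressive mel decoder.
def decode_teacher_forced(decoder, memory, target_mels):
    outputs, prev = [], decoder.go_frame()
    for t in range(target_mels.size(1)):
        frame, _ = decoder.step(prev, memory)  # attend over encoder memory
        outputs.append(frame)
        prev = target_mels[:, t]               # feed the REAL previous frame
    return outputs

def decode_free_running(decoder, memory, max_steps):
    outputs, prev = [], decoder.go_frame()
    for _ in range(max_steps):
        frame, stop = decoder.step(prev, memory)
        outputs.append(frame)
        if stop:                               # gate predicts end of speech
            break
        prev = frame                           # feed the model's OWN output
    return outputs
```

In the free-running loop, errors in the fed-back frame compound step by step, which is why a low training or validation loss alone doesn't guarantee good samples when alignment is bad.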

Audio quality depends heavily on the mel-to-audio decoder.
If by tone you mean the robotic sound, you can improve it by using a WaveNet vocoder, the same way you did in #52.

If you mean something else, please explain what it is and we can work on it together. I don't speak Chinese and, apart from the robotic tone from using Griffin-Lim, I think your sample sounds pretty good!


NewEricWang commented on August 15, 2024:

@rafaelvalle, tone is word-dependent in Chinese: each word has its own tone, and words with different tones have different meanings. The acoustic correlate of tone is pitch (F0). The speech sounds strange if even a single word's tone is incorrect.

When I set dropout=0 for the training procedure, attention still cannot be learned after 100k steps, even though a very low validation loss is reached.


NewEricWang commented on August 15, 2024:

@rafaelvalle, thank you very much. When synthesizing with dropout=0, the tones of sentences synthesized by models from different training steps differ.
This is a sample synthesized by the 126k-step model:
test.126k.v1.wav.zip

This is a sample synthesized by the 122k-step model:
test.122k.v1.1.wav.zip


rafaelvalle commented on August 15, 2024:

Closing due to inactivity.

