
Comments (31)

daxiangpanda commented on August 23, 2024

Let us know what works best!

Lack of GPU resources; only one P40, so it's a bit slow.

from mellotron.

rafaelvalle commented on August 23, 2024

Yes, it will work!
We would love to see it trained on multi-language datasets like "Common Voice: A Massively-Multilingual Speech Corpus"
https://arxiv.org/abs/1912.06670

daxiangpanda commented on August 23, 2024

I trained Mellotron with two different datasets (BIAOBEI and THCHS-30). The alignment map for BIAOBEI looks good, but the alignment map for the THCHS-30 dataset is not as good as BIAOBEI's.
[image]
Any idea why?
I also want to synthesize a simple song with the model trained on the BIAOBEI dataset. I use the function inference_noattention with a rhythm matrix built from the MIDI rhythm, but the result is not good (the word timing is right, but the audio sounds terrible). Any idea why?
Can you share the function for making the alignment map?

z592694590 commented on August 23, 2024

I trained Mellotron with two different datasets (BIAOBEI and THCHS-30). The alignment map for BIAOBEI looks good, but the alignment map for the THCHS-30 dataset is not as good as BIAOBEI's.
[image]
Any idea why?
I also want to synthesize a simple song with the model trained on the BIAOBEI dataset. I use the function inference_noattention with a rhythm matrix built from the MIDI rhythm, but the result is not good (the word timing is right, but the audio sounds terrible). Any idea why?
Can you share the function for making the alignment map?

Hi, I had the same problem as you. I trained this model on BIAOBEI. The alignment seems to be good, but I got a terrible result after using inference_noattention to synthesize a song. Do you have any ideas?

Meanwhile, the alignment trained on THCHS-30 is the same as yours.

daxiangpanda commented on August 23, 2024

I had the same problem as you. I trained this model on BIAOBEI. The alignment seems to be good, but I got a terrible result…

@z592694590 what's your QQ? Or can we add each other on WeChat?

z592694590 commented on August 23, 2024

I had the same problem as you. I trained this model on BIAOBEI. The alignment seems to be good, but I got a terrible result…

@z592694590 what's your QQ? Or can we add each other on WeChat?

592694590 is my QQ number.

rafaelvalle commented on August 23, 2024

Please share the rhythm, pitch contour, mel, and audio outputs that you obtained from the model trained on BIAOBEI so that we can help.

z592694590 commented on August 23, 2024

@rafaelvalle
Thank you very much.
The rhythm and mel figures for the training dataset are below. The training loss is 0.22.
[image]
[image]

The rhythm, pitch, and mel figures for the test dataset are below. The original wav is a segment of a song. I used model.forward() to obtain the rhythm, then model.inference_noattention() to synthesize a song. The result does not seem good.
[image]
synthesis.zip

rafaelvalle commented on August 23, 2024

It seems to be an issue with the rhythm (alignment map) and pitch (F0).
Rhythm: between the 0th and 50th frames there's some unexpected back and forth, and after the 300th frame it's multimodal.
Pitch: the F0 at the onset of the first syllable is 0, but the phoneme, I suppose, is a vowel. Mellotron invents a pitch because none exists.

Try one of these things:

  1. Run forward a few more times to see if you can get better attention.
  2. Try making the distribution over each frame more peaky. Something like this should work (note that torch.softmax needs a dim argument):
     temperature = 0.1
     rhythm = torch.softmax(rhythm / temperature, dim=1)
  3. Try adjusting the rhythm by hand.

For the pitch contour, try changing the parameters of the pitch extraction algorithm or try to adjust the pitch contour manually.
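The manual pitch-contour adjustment suggested above can be sketched in NumPy. This is an illustrative sketch, not Mellotron code: it assumes the pitch contour is a 1-D array where 0 marks unvoiced frames, and it fills those gaps by interpolating from the nearest voiced frames so a voiced onset doesn't start at F0 = 0.

```python
import numpy as np

def fill_unvoiced(f0):
    """Replace zero (unvoiced) F0 frames by linear interpolation from the
    nearest voiced frames; endpoints are clamped to the closest voiced value."""
    f0 = np.asarray(f0, dtype=float).copy()
    voiced = f0 > 0
    if not voiced.any():
        return f0  # nothing voiced to interpolate from
    idx = np.arange(len(f0))
    f0[~voiced] = np.interp(idx[~voiced], idx[voiced], f0[voiced])
    return f0

contour = np.array([0.0, 0.0, 220.0, 230.0, 0.0, 240.0, 0.0])
print(fill_unvoiced(contour))  # -> [220. 220. 220. 230. 235. 240. 240.]
```

Whether a smoothed contour actually sounds natural still needs listening tests; this only removes the spurious zeros that make the model invent a pitch.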

z592694590 commented on August 23, 2024

It seems to be an issue with the rhythm (alignment map) and pitch (F0).
Rhythm: between the 0th and 50th frames there's some unexpected back and forth, and after the 300th frame it's multimodal.
Pitch: the F0 at the onset of the first syllable is 0, but the phoneme, I suppose, is a vowel. Mellotron invents a pitch because none exists.

Try one of these things:

  1. Run forward a few more times to see if you can get better attention.
  2. Try making the distribution over each frame more peaky. Something like this should work (note that torch.softmax needs a dim argument):
     temperature = 0.1
     rhythm = torch.softmax(rhythm / temperature, dim=1)
  3. Try adjusting the rhythm by hand.

For the pitch contour, try changing the parameters of the pitch extraction algorithm or try to adjust the pitch contour manually.

Thank you for replying! I will try your suggestions. Thanks again!

rafaelvalle commented on August 23, 2024

Let us know what works best!

VirtualMoon commented on August 23, 2024

Is it because BIAOBEI is single-speaker and THCHS-30 is multi-speaker?

daxiangpanda commented on August 23, 2024

Is it because BIAOBEI is single-speaker and THCHS-30 is multi-speaker?

Have you tried Mellotron with a Mandarin dataset? Any advice?

VirtualMoon commented on August 23, 2024

Emmm, I didn't try; I just compared the two datasets.

daxiangpanda commented on August 23, 2024

Emmm, I didn't try; I just compared the two datasets.

OK, thanks.

VirtualMoon commented on August 23, 2024

Could you please share your prepare.py for THCHS-30? I'd like to try it.

daxiangpanda commented on August 23, 2024

Could you please share your prepare.py for THCHS-30? I'd like to try it.

You can add my QQ: 313514820.

karkirowle commented on August 23, 2024

I'm having similar problems with a multi-speaker Dutch corpus (Mozilla Common Voice): approximately 17 hours of audio data with 400+ speakers. I'm using the pre-trained LibriTTS model as a warm-up, and I accidentally used the English cleaner. I think the latter should not be a huge problem, as it only handles abbreviations compared to the "transliteration" setting. I'm not entirely sure how the CMU ARPABET interacts with the Dutch sentences.

The loss seems to converge, although somewhere around 0.05 (typical LJSpeech Tacotron level) would be better:
Screenshot from 2020-01-01 15-28-14

Attention is diagonal:
Screenshot from 2020-01-01 15-30-37

Spectrograms
Screenshot from 2020-01-01 15-33-38

The synthesised audio follows the rhythm and pitch, but it is obviously blurred, not well-articulated speech, somewhere midway between Dutch and English.

rafaelvalle commented on August 23, 2024

@karkirowle

ARPABET is based on English phonemes, and I would expect the model not to work well directly on languages other than English. Try training again with p_arpabet = 0.

In addition, take a look at the suggestions above.
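For readers unfamiliar with the flag, the intended p_arpabet behavior can be sketched roughly as follows. This is a hypothetical simplification, not the repo's actual text_to_sequence implementation: each dictionary word is converted to ARPABET with probability p_arpabet and left as graphemes otherwise, so p_arpabet = 0 keeps graphemes everywhere.

```python
import random

def maybe_arpabet(word, cmu_dict, p_arpabet):
    """Return the ARPABET form (in curly braces) with probability p_arpabet
    when the word is in the dictionary; otherwise keep the graphemes."""
    if word.lower() in cmu_dict and random.random() < p_arpabet:
        return "{" + cmu_dict[word.lower()] + "}"
    return word

cmu = {"hello": "HH AH0 L OW1"}
print(maybe_arpabet("hello", cmu, p_arpabet=0.0))        # -> hello (graphemes kept)
print(maybe_arpabet("hello", cmu, p_arpabet=1.0))        # -> {HH AH0 L OW1}
print(maybe_arpabet("goedemorgen", cmu, p_arpabet=1.0))  # -> goedemorgen (not in dict)
```

The last call illustrates the point made later in the thread: words absent from the dictionary fall back to graphemes regardless of p_arpabet.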

karkirowle commented on August 23, 2024

@rafaelvalle

Note that p_arpabet does not currently do anything; see #27.
Nevertheless, I tried your other suggestions.

  1. Not using ARPABET/cmudict, I get similar loss values as with ARPABET; I'm going to do a more rigorous loss-curve comparison in TensorBoard.
  2. The temperature trick seems to improve the rhythm.

rafaelvalle commented on August 23, 2024

@karkirowle I updated the repo before sending you the message so that p_arpabet would work.
Assuming that most Dutch words are not in the ARPABET dictionary, changing to p_arpabet=0 might not make a large difference, because the method returns the grapheme representation whenever the phoneme (ARPABET) representation is not present. Still worth trying p_arpabet=0 after you pull from master.

Trimming silences can also help improve the rhythm.
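Silence trimming can be done with a simple energy threshold before extracting mels; librosa.effects.trim does the same kind of thing in dB terms. Below is a minimal NumPy sketch; the threshold and frame length are illustrative assumptions, not the repo's preprocessing.

```python
import numpy as np

def trim_silence(wav, threshold=0.01, frame_len=4):
    """Drop leading and trailing frames whose peak amplitude is below threshold."""
    n_frames = len(wav) // frame_len
    frames = np.abs(wav[: n_frames * frame_len]).reshape(n_frames, frame_len)
    loud = np.where(frames.max(axis=1) >= threshold)[0]
    if len(loud) == 0:
        return wav[:0]  # all silence
    start, end = loud[0] * frame_len, (loud[-1] + 1) * frame_len
    return wav[start:end]

wav = np.array([0.0, 0.0, 0.0, 0.0, 0.5, -0.4, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0])
print(trim_silence(wav))  # -> [ 0.5 -0.4  0.3  0.2]
```

Only leading and trailing silence is removed; internal pauses are kept, which matters for rhythm.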

daxiangpanda commented on August 23, 2024

[image]
Any idea why I got this alignment map?
I used the BIAOBEI dataset with phone tags as input. This alignment map is the result at 30k iterations.
@rafaelvalle

z592694590 commented on August 23, 2024

[image]
Any idea why I got this alignment map?
I used the BIAOBEI dataset with phone tags as input. This alignment map is the result at 30k iterations.
@rafaelvalle

I think the dataset is not large enough.

rafaelvalle commented on August 23, 2024

@z592694590 ARPABET maps English graphemes to phonemes. You can try re-training with p_arpabet equal to 0, or use a representation that works for the Chinese language.

freenowill commented on August 23, 2024

[image]
Any idea why I got this alignment map?
I used the BIAOBEI dataset with phone tags as input. This alignment map is the result at 30k iterations.
@rafaelvalle

Have you solved this problem? How about training for more steps?

gongchenghhu commented on August 23, 2024

I trained Mellotron with two different datasets (BIAOBEI and THCHS-30). The alignment map for BIAOBEI looks good, but the alignment map for the THCHS-30 dataset is not as good as BIAOBEI's.
[image]
Any idea why?
I also want to synthesize a simple song with the model trained on the BIAOBEI dataset. I use the function inference_noattention with a rhythm matrix built from the MIDI rhythm, but the result is not good (the word timing is right, but the audio sounds terrible). Any idea why?
Can you share the function for making the alignment map?

Hi, I had the same problem as you. I trained this model on BIAOBEI. The alignment seems to be good, but I got a terrible result after using inference_noattention to synthesize a song. Do you have any ideas?

Meanwhile, the alignment trained on THCHS-30 is the same as yours.

I met the same problem. Have you solved it? #31 (comment)

arijitx commented on August 23, 2024

The original Tacotron 2 code in https://github.com/NVIDIA/tacotron2/blob/master/text/symbols.py#L18 has
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet

whereas in this repo the symbols are
symbols = list(_punctuation + _math + _special + _accented + _numbers + _letters) + _arpabet

Since zero-padding is applied to the text, it pads the rest of the text with the first symbol. Can this be a reason for not learning good attention when using a different symbol set?
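The padding concern above can be illustrated with a small sketch (the symbol strings here are truncated and hypothetical, not the repos' full sets): when sequences are zero-padded and index 0 is not a dedicated pad symbol, every padded position decodes to a real symbol that attention must learn to skip.

```python
_pad = "_"
_punctuation = "!'\",.:;? "
_letters = "abc"  # truncated for illustration

# Without an explicit pad symbol first, index 0 is "!", so zero-padding
# effectively appends "!" characters to every short sequence in the batch.
symbols_no_pad = list(_punctuation + _letters)

# With the pad symbol first (as in the original Tacotron 2 repo),
# index 0 is a dedicated "_" the model can learn to ignore.
symbols_with_pad = [_pad] + list(_punctuation + _letters)

print(symbols_no_pad[0])    # -> !
print(symbols_with_pad[0])  # -> _
```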

arijitx commented on August 23, 2024

Some learnings from my recent experiments; firstly, thanks for providing this awesome code.

I was having a similar issue with attention: when training from scratch, the model could not learn attention after 30k iterations, and the attention output was very similar to the ones above. I found that even if you are using a different symbol set or another language, initializing the model from the pretrained LibriTTS checkpoint actually makes it learn attention and converge faster. You can ignore the text embedding while loading the model if you are using a different language:

ignore_layers=['embedding.weight','speaker_embedding.weight']
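The warm-start idea above can be sketched like this. A simplified illustration, assuming checkpoints are plain name-to-tensor dicts; the repo's own warm-start logic may differ in details.

```python
def warm_start_state_dict(model_state, pretrained_state, ignore_layers):
    """Copy pretrained weights into model_state, skipping ignored layers
    (e.g. text/speaker embeddings whose shapes differ across languages)."""
    merged = dict(model_state)
    for name, weight in pretrained_state.items():
        if name in ignore_layers or name not in merged:
            continue
        merged[name] = weight
    return merged

model = {"embedding.weight": "randomly_initialized",
         "decoder.weight": "randomly_initialized"}
pretrained = {"embedding.weight": "libritts", "decoder.weight": "libritts"}
merged = warm_start_state_dict(
    model, pretrained,
    ignore_layers=["embedding.weight", "speaker_embedding.weight"])
print(merged["embedding.weight"], merged["decoder.weight"])
# -> randomly_initialized libritts
```

The ignored embedding keeps its fresh (language-appropriate) initialization while everything else inherits the LibriTTS weights.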

gongchenghhu commented on August 23, 2024

The original Tacotron 2 code in https://github.com/NVIDIA/tacotron2/blob/master/text/symbols.py#L18 has
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet

whereas in this repo the symbols are
symbols = list(_punctuation + _math + _special + _accented + _numbers + _letters) + _arpabet

Since zero-padding is applied to the text, it pads the rest of the text with the first symbol. Can this be a reason for not learning good attention when using a different symbol set?

@arijitx Thanks for your reply, but I have already changed Mellotron's symbols and text_to_sequence to match Tacotron 2's.

rasenganai commented on August 23, 2024

@gongchenghhu any progress there? I am facing a similar issue trying to train on Hindi, but there is no progress.
