Comments (31)
Let us know what works best!
Lack of GPU resources. I only have one P40, so training is a bit slow.
from mellotron.
Yes, it will work!
We would love to see it trained on multi-language datasets like "Common Voice: A Massively-Multilingual Speech Corpus"
https://arxiv.org/abs/1912.06670
from mellotron.
I trained Mellotron on two different datasets (BIAOBEI and THCHS-30). The alignment map for BIAOBEI looks good, but the alignment map for THCHS-30 is not as good as BIAOBEI's.
Any idea why?
Also, I want to synthesize a simple song with the model trained on BIAOBEI. I use the inference_noattention function with a rhythm matrix built from the MIDI rhythm; the word timing is good, but the audio sounds terrible. Any idea why?
Can you share the function for making the alignment map?
from mellotron.
Hi, I had the same problem as you. I trained this model on BIAOBEI, and the alignment seems good, but I got a terrible result after using inference_noattention to synthesize a song. Do you have any ideas?
from mellotron.
Meanwhile, the alignment I get when training on THCHS-30 is the same as yours.
from mellotron.
Is 592694590 your QQ? Or can we add each other on WeChat?
from mellotron.
It is my QQ.
from mellotron.
Please share the rhythm, pitch contour, mel, and audio outputs that you obtained from the model trained on BIAOBEI so that we can help.
from mellotron.
@rafaelvalle
Thank you very much.
The rhythm and mel figures for the training dataset are as follows, and the training loss is 0.22.
The rhythm, pitch, and mel figures for the test dataset are as follows. The original wav is a segment of a song. I used model.forward() to obtain the rhythm, then model.inference_noattention() to synthesize a song. The result does not seem good.
synthesis.zip
from mellotron.
It seems to be an issue with the rhythm (alignment map) and pitch (f0).
Rhythm: between the 0th and 50th frames there's some unexpected back-and-forth; after the 300th frame it's multimodal.
Pitch: the F0 at the onset of the first syllable is 0, but the phoneme, I suppose, is a vowel. Mellotron invents a pitch because none exists.
Try one of these things:
- Run forward a few more times to see if you can get better attention.
- Try making the distribution over each frame more peaky. These lines should work:
temperature = 0.1
rhythm = torch.softmax(rhythm / temperature, dim=-1)  # torch.softmax needs an explicit dim
- Try adjusting the rhythm by hand.
For the pitch contour, try changing the parameters of the pitch extraction algorithm or try to adjust the pitch contour manually.
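As a concrete sketch of the temperature trick (toy numbers; in practice `rhythm` would be the attention map returned by the model's forward pass, and which dim to soften over depends on its layout):

```python
import torch

# Toy alignment map: (mel_frames, text_tokens); each row is a distribution
# over text tokens for one mel frame
rhythm = torch.tensor([[0.4, 0.3, 0.3],
                       [0.2, 0.5, 0.3],
                       [0.1, 0.2, 0.7]])

# Dividing by a small temperature before the softmax sharpens each row
temperature = 0.1
peaky = torch.softmax(rhythm / temperature, dim=-1)

# Rows still sum to 1, but probability mass concentrates on each row's argmax
print(peaky.argmax(dim=-1))  # tensor([0, 1, 2]), same argmax as before
```

Lower temperatures give peakier rows without changing which token each frame attends to most.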
from mellotron.
Thank you for replying! I will try your suggestions. Thanks again!
from mellotron.
Let us know what works best!
from mellotron.
Is it because BIAOBEI is single-speaker and THCHS-30 is multi-speaker?
from mellotron.
Did you try Mellotron with a Mandarin dataset? Any advice?
from mellotron.
Hmm, I didn't try; I just compared the two datasets.
from mellotron.
OK, thanks.
from mellotron.
Could you please share your prepare.py for THCHS-30? I'd like to try it.
from mellotron.
You can add my QQ: 313514820.
from mellotron.
I'm having similar problems with a multi-speaker Dutch corpus (Mozilla Common Voice): approximately 17 hours of audio with 400+ speakers. I'm using the pre-trained LibriTTS model as a warm start, and I accidentally used the English cleaner. I think the latter should not be a huge problem, as it only handles abbreviations compared to the "transliteration" setting. I'm not entirely sure how the CMU ARPAbet interacts with the Dutch sentences.
The loss seems to converge, although somewhere around 0.05 would be better (LJSpeech/Tacotron-ish level):
The synthesized audio follows the rhythm and pitch, but it is obviously blurred, not well-articulated speech, somewhere midway between Dutch and English.
from mellotron.
ARPAbet is based on English phonemes, so I would expect the model not to work well directly on languages other than English. Try training again with p_arpabet = 0.
In addition, take a look at the suggestions above.
from mellotron.
Note that p_arpabet does not currently do anything, see: #27
Nevertheless, I tried your other suggestions.
- Not using ARPAbet/cmudict, I get loss values similar to those with ARPAbet; I'm going to do a more rigorous loss-curve comparison in TensorBoard.
- The temperature trick seems to improve the rhythm.
from mellotron.
@karkirowle I updated the repo before sending you the message such that p_arpabet would work.
Assuming that most of the words in Dutch are not in the ARPAbet dictionary, changing to p_arpabet=0 might not make a large difference, because the method falls back to the grapheme representation whenever the phoneme (ARPAbet) representation is not present. Still worth trying p_arpabet=0 after you pull from master.
Trimming silences can also help improve the rhythm.
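To illustrate what silence trimming buys you, here is a crude energy-threshold trim (a toy sketch; in practice something like librosa.effects.trim is the usual tool, and the threshold and frame length here are arbitrary):

```python
import numpy as np

def trim_silence(wav, threshold=0.01, frame_len=256):
    """Crude leading/trailing silence trim based on per-frame peak amplitude."""
    n_frames = len(wav) // frame_len
    frames = wav[: n_frames * frame_len].reshape(n_frames, frame_len)
    voiced = np.abs(frames).max(axis=1) > threshold
    if not voiced.any():
        return wav[:0]  # all silence
    first = np.argmax(voiced)                      # first voiced frame
    last = n_frames - 1 - np.argmax(voiced[::-1])  # last voiced frame
    return wav[first * frame_len : (last + 1) * frame_len]

# Silence, a burst of signal, silence
wav = np.concatenate([np.zeros(512), 0.5 * np.ones(512), np.zeros(512)])
trimmed = trim_silence(wav)
print(len(wav), len(trimmed))  # 1536 512
```

Removing leading and trailing silence tightens the correspondence between text tokens and mel frames, which is exactly what attention has to learn.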
from mellotron.
Any idea why I got this alignment map?
I used the BIAOBEI dataset with phone tags as input. This alignment map is the result at 30k iterations.
@rafaelvalle
from mellotron.
I think the dataset is not large enough.
from mellotron.
@z592694590 ARPAbet maps English graphemes to phonemes. You can try re-training with
p_arpabet equal to 0, or using a representation that works for Chinese.
from mellotron.
Have you solved this problem? How about training for more steps?
from mellotron.
I met the same problem. Have you solved it? #31 (comment)
from mellotron.
The original tacotron2 code in https://github.com/NVIDIA/tacotron2/blob/master/text/symbols.py#L18
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet
whereas in this repo the symbols are
symbols = list(_punctuation + _math + _special + _accented + _numbers + _letters) + _arpabet
Since zero padding is applied to the text, the rest of each text sequence is padded with the first symbol. Can this be a reason for not learning good attention when using a different symbol set?
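The padding concern can be demonstrated concretely (toy symbol lists for illustration; the point is that index 0 is whatever symbol happens to be first):

```python
# In the original Tacotron 2, index 0 is the dedicated pad symbol "_";
# if a repo's symbol list starts with punctuation instead, zero padding
# silently pads every sequence with that punctuation mark.
tacotron2_symbols = ["_", "-", "!", "'", "a", "b", "c"]   # pad symbol first
other_symbols = ["!", "'", "-", "a", "b", "c"]            # no pad symbol

def pad_to(seq_ids, length):
    """Zero-pad a sequence of symbol IDs, as tensor-based collation does."""
    return seq_ids + [0] * (length - len(seq_ids))

ids = [4, 5]  # some short text in a batch whose max length is 4
print([tacotron2_symbols[i] for i in pad_to(ids, 4)])  # ['a', 'b', '_', '_']
print([other_symbols[i] for i in pad_to(ids, 4)])      # ['b', 'c', '!', '!']
```

In the second case the model sees trailing punctuation instead of an explicit pad token, which plausibly muddies what attention should do at the end of each sequence.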
from mellotron.
Some learnings from my recent experiments. Firstly, thanks for providing this awesome code.
I was having a similar issue with attention: training from scratch, the model could not learn attention after 30k iterations, and the attention output was very similar to the ones above. I found that even if you are using a different symbol set or another language, initializing the model from the pretrained LibriTTS checkpoint actually makes it learn attention and converge faster. You can ignore the text embedding while loading the model if you are using a different language:
ignore_layers=['embedding.weight','speaker_embedding.weight']
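A minimal sketch of what such a warm start does under the hood (illustrative layer names and shapes; the real training scripts load an actual checkpoint):

```python
def warm_start_filter(state_dict, ignore_layers):
    """Drop layers whose shapes won't match the new model, e.g. text and
    speaker embeddings when switching language or speaker set."""
    return {k: v for k, v in state_dict.items() if k not in ignore_layers}

# Toy 'checkpoint' keyed like a Tacotron-style model (shapes stand in
# for tensors; the names here are illustrative)
ckpt = {
    "embedding.weight": (148, 512),          # English symbol table
    "speaker_embedding.weight": (123, 128),  # pretrained speaker set
    "decoder.linear_projection.weight": (80, 1024),
}
kept = warm_start_filter(ckpt, ["embedding.weight", "speaker_embedding.weight"])
print(sorted(kept))  # ['decoder.linear_projection.weight']
```

With a real checkpoint you would then call model.load_state_dict(kept, strict=False), so the dropped layers keep their fresh, correctly-sized initialization.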
from mellotron.
@arijitx Thanks for your reply, but I have already changed Mellotron's symbols and text_to_sequence to match Tacotron 2's.
from mellotron.
@gongchenghhu Any progress there? I am facing a similar issue trying to train on Hindi, but there is no progress.
from mellotron.
Related Issues (20)
- NoneType' object is not iterable
- Mismatch model volume
- Training on a different language HOT 1
- Inference without rhythm and pitch
- parse_output error with Blizzard2013 data
- Training on EmovDB HOT 2
- Voice synthesis by model is not the same as the voice with speaker ID HOT 1
- Try to train some new words
- inference speed on CPU
- Training time
- Two key points of training multispeaker mellotron
- how to train?
- colab demo for inferenece
- How to generate .musicxml files like the examples in `/data`? HOT 1
- Synthesize own text without style transfer gives poor audio results HOT 1
- Here's some code to start mellotron inference by calling a .py file from CLI [Docs]
- What is the reason of filtering "_" and "~" symbols?
- Something wrong with text padding HOT 5
- Can I use TensorRT to speed up model inference?
- colab error HOT 2