Hi, I'm a little confused about the speaker id in the reference audio and text. When d

confused about source speaker id in style and rhythm transfer about mellotron HOT 4 OPEN

nvidia commented on August 23, 2024

confused about source speaker id in style and rhythm transfer

from mellotron.

Comments (4)

rafaelvalle commented on August 23, 2024

What you mentioned could've happened during training, for example, when the training and validation filelists have different number of speakers. We circumvent this by first getting a mellotron speaker ids dictionary from the training data and using it for the validation data.
https://github.com/NVIDIA/mellotron/blob/master/train.py#L44

from mellotron.

JeffpanUK commented on August 23, 2024

What you mentioned could've happened during training, for example, when the training and validation filelists have different number of speakers. We circumvent this by first getting a mellotron speaker ids dictionary from the training data and using it for the validation data.
https://github.com/NVIDIA/mellotron/blob/master/train.py#L44

Thanks for your reply. I've noticed this part of codes in training.
However, my concern is that in the inference stage, when we tried to get the rhythm from some reference audios, we need to load the reference filelist with the TextMelLoader. I found that there is no speaker id dictionary given to the TextMelLoader.

arpabet_dict = cmudict.CMUDict('data/cmu_dictionary')
audio_paths = 'data/examples_filelist.txt'
dataloader = TextMelLoader(audio_paths, hparams)
datacollate = TextMelCollate(1)

file_idx = 0
print(dataloader.audiopaths_and_text)
audio_path, text, sid = dataloader.audiopaths_and_text[file_idx]

# get audio path, encoded text, pitch contour and mel for gst
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()    
pitch_contour = dataloader[file_idx][3][None].cuda()
mel = load_mel(audio_path)
print(audio_path, text)

# load source data to obtain rhythm using tacotron 2 as a forced aligner
x, y = mellotron.parse_batch(datacollate([dataloader[file_idx]]))

In this case, the mellotron speaker ids are related to the number of speakers in the reference filelist. Then, we do mellotron.forward to get the reference rhythm as below:

with torch.no_grad():
    # get rhythm (alignment map) using tacotron 2
    mel_outputs, mel_outputs_postnet, gate_outputs, rhythm = mellotron.forward(x)
    rhythm = rhythm.permute(1, 0, 2)

where the x contains ref_text, ref_mel, ref_f0 and ref_melltron_speaker_ids, and the generated rhythm will changed if the number of speakers in the reference filelist changed, for the same reference audio.

from mellotron.

rafaelvalle commented on August 23, 2024

During experiments, we noticed that the rhythm (alignment map) we get from Tacotron seems to be independent of providing the correct speaker id. You can try, for example, to provide different speaker ids while using Tacotron as a forced aligner and observe if there is a significant difference.

from mellotron.

rafaelvalle commented on August 23, 2024

Closing due to inactivity.

from mellotron.

confused about source speaker id in style and rhythm transfer about mellotron HOT 4 OPEN

Comments (4)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent