Request from issue https://github.com/NVIDIA/tacotron2/issues/24

Problem with the Mel Spectrogram Representation about tacotron2 HOT 6 CLOSED

nvidia commented on August 15, 2024 1

Problem with the Mel Spectrogram Representation

from tacotron2.

Comments (6)

rafaelvalle commented on August 15, 2024 3

Yeah, it matches.

yliess86 commented on August 15, 2024 2

@rafaelvalle It is working now. Thank you for the help!
Once I finish my prototype, I will probably retrain both models with the same mel representation.

rafaelvalle commented on August 15, 2024

The representation of the mel-spectrograms output by the Tacotron 2 model you trained does not match the mel-spectrogram used in r9y9's MoL WaveNet. More specifically, the minimum and maximum mel-spectrogram frequencies are different.
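To see why the minimum and maximum mel frequencies matter, here is a minimal sketch that computes the center frequency of each mel channel for both settings. It assumes the HTK-style mel formula and 80 channels; the filterbank actually built in either repo may use a different mel-scale variant, and `mel_band_centers` is a hypothetical helper, not a function from either codebase. The range 0 to 11025 Hz stands for mel_fmin=0.0 with mel_fmax=None at a 22050 Hz sampling rate, and 125 to 7600 Hz stands for the WaveNet settings.

```python
import math

def hz_to_mel(f):
    # HTK-style mel formula (an assumption for illustration)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_centers(n_mels, fmin, fmax):
    """Center frequency in Hz of each of n_mels triangular mel filters."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    # n_mels triangular filters need n_mels + 2 equally spaced mel points
    points = [lo + (hi - lo) * i / (n_mels + 1) for i in range(n_mels + 2)]
    return [mel_to_hz(m) for m in points[1:-1]]

taco = mel_band_centers(80, 0.0, 22050 / 2)    # fmin=0, fmax=sampling_rate/2
wavenet = mel_band_centers(80, 125.0, 7600.0)  # fmin=125, fmax=7600

# Print a few channels side by side: index, Tacotron 2 center, WaveNet center
for ch in (0, 39, 79):
    print(ch, round(taco[ch], 1), round(wavenet[ch], 1))
```

The same channel index covers a different frequency band in each representation, so a mel computed with one range cannot be fed directly to a model trained on the other.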

The code below converts a mel trained with the default mel-spectrogram representation in this repo to the representation used in r9y9's shared WaveNet MoL. Ideally one would train Tacotron 2 and WaveNet with the same mel representation, especially the same minimum and maximum mel frequencies.

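import numpy as np
import torch
from layers import TacotronSTFT  # assumes this runs from the tacotron2 repo root

# load mel file output by Tacotron 2 and add a batch dimension
mel = torch.autograd.Variable(torch.from_numpy(
    np.load('mel_spec.npy'))[None,:])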

# Tacotron 2 Training Params
filter_length = 1024
hop_length = 256
win_length = 1024
sampling_rate = 22050
mel_fmin = 0.0
mel_fmax = None
taco_stft = TacotronSTFT(
    filter_length, hop_length, win_length, 
    sampling_rate=sampling_rate, mel_fmin=mel_fmin, 
    mel_fmax=mel_fmax)

# Project from Mel-Spectrogram to Spectrogram
mel_decompress = taco_stft.spectral_de_normalize(mel)
mel_decompress = mel_decompress.transpose(1, 2).data.cpu()
spec_from_mel_scaling = 1000
spec_from_mel = torch.mm(mel_decompress[0], taco_stft.mel_basis)
spec_from_mel = spec_from_mel.transpose(0, 1)
spec_from_mel = spec_from_mel * spec_from_mel_scaling

# r9y9 WaveNet Training Params
filter_length = 1024
hop_length = 256
win_length = 1024
sampling_rate = 22050
mel_fmin = 125
mel_fmax = 7600

taco_stft_other = TacotronSTFT(
    filter_length, hop_length, win_length, 
    sampling_rate=sampling_rate, mel_fmin=mel_fmin, mel_fmax=mel_fmax)

# Project from Spectrogram to r9y9's WaveNet Mel-Spectrogram
mel_minmax = taco_stft_other.spectral_normalize(
    torch.matmul(taco_stft_other.mel_basis, spec_from_mel))
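The spectral_de_normalize and spectral_normalize calls above undo and reapply the dynamic range compression on the mel values. As a rough sketch, assuming that compression is a clamped natural log and that its inverse is an exponential (my reading of the repo, not something confirmed in this thread; the function names and the 1e-5 floor below are illustrative assumptions):

```python
import math

CLIP_VAL = 1e-5  # assumed floor that avoids taking log(0)

def compress(x, c=1.0, clip_val=CLIP_VAL):
    # Log compression of a non-negative magnitude
    return math.log(max(x, clip_val) * c)

def decompress(y, c=1.0):
    # Inverse of compress for values at or above the floor
    return math.exp(y) / c

magnitudes = [0.0, 1e-3, 0.5, 2.0]
compressed = [compress(m) for m in magnitudes]
restored = [decompress(v) for v in compressed]

for m, r in zip(magnitudes, restored):
    print(m, r)
```

Values below the floor come back as the floor rather than their original value, so the round trip is exact only for magnitudes at or above it.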

rafaelvalle commented on August 15, 2024

The first few frames of the mel-spectrogram you provided sound like this:
yliess86_audio_trim.wav.zip

MXGray commented on August 15, 2024

@rafaelvalle
Could you help me figure out whether the default Tacotron 2 hparams of this repo match the nv-wavenet hparams I used below?
If not, how can I ensure that they match?

config.json of nv-wavenet/pytorch:

{
"data_config": {
    "training_files": "train_files.txt",
    "segment_length": 22050,
    "mu_quantization": 256,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "sampling_rate": 22050
},

"dist_config": {
    "dist_backend": "nccl",
    "dist_url": "tcp://localhost:54321"
},

"wavenet_config": {
    "n_in_channels": 256,
    "n_layers": 16,
    "max_dilation": 128,
    "n_residual_channels": 64,
    "n_skip_channels": 256,
    "n_out_channels": 256,
    "n_cond_channels": 80,
    "upsamp_window": 1024,
    "upsamp_stride": 256
}

}

And, these are my Tacotron2 hparams:

    # Data Parameters
    load_mel_from_disk=False,
    training_files='filelists/ljs_audio_text_train_filelist.txt',
    validation_files='filelists/ljs_audio_text_val_filelist.txt',
    text_cleaners=['english_cleaners'],
    sort_by_length=False,

    # Audio Parameters
    max_wav_value=32768.0,
    sampling_rate=22050,
    filter_length=1024,
    hop_length=256,
    win_length=1024,
    n_mel_channels=80,
    mel_fmin=0.0,
    mel_fmax=None,  # if None, half the sampling rate

    # Model Parameters
    n_symbols=len(symbols),
    symbols_embedding_dim=512,

    # Encoder parameters
    encoder_kernel_size=5,
    encoder_n_convolutions=3,
    encoder_embedding_dim=512,

    # Decoder parameters
    n_frames_per_step=1,  # currently only 1 is supported
    decoder_rnn_dim=1024,
    prenet_dim=256,
    max_decoder_steps=1000,
    gate_threshold=0.6,

    # Attention parameters
    attention_rnn_dim=1024,
    attention_dim=128,

    # Location Layer parameters
    attention_location_n_filters=32,
    attention_location_kernel_size=31,

    # Mel-post processing network parameters
    postnet_embedding_dim=512,
    postnet_kernel_size=5,
    postnet_n_convolutions=5,

    # Optimization Hyperparameters
    use_saved_learning_rate=False,
    learning_rate=1e-3,
    weight_decay=1e-6,
    grad_clip_thresh=1,
    batch_size=12,
    mask_padding=False  # set model's padded outputs to padded values
)

Would greatly appreciate your help. Thanks!
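One way to check the two configurations against each other is to compare the audio parameters they share. The sketch below copies the values from the two listings above; pairing nv-wavenet's n_cond_channels with Tacotron 2's n_mel_channels is an assumption on my part, and the dictionary and variable names are illustrative. Note that mel_fmin and mel_fmax do not appear in the posted config.json, so this check cannot cover them.

```python
# Audio-related values copied from the nv-wavenet config.json above
nv_wavenet = {
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "sampling_rate": 22050,
    "n_cond_channels": 80,
}

# Audio-related values copied from the Tacotron 2 hparams above
tacotron2 = {
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "sampling_rate": 22050,
    "n_mel_channels": 80,
}

# (nv-wavenet key, Tacotron 2 key); the last pairing is an assumption
pairs = [
    ("filter_length", "filter_length"),
    ("hop_length", "hop_length"),
    ("win_length", "win_length"),
    ("sampling_rate", "sampling_rate"),
    ("n_cond_channels", "n_mel_channels"),
]

# Collect every pair whose values differ
mismatches = [
    (a, b, nv_wavenet[a], tacotron2[b])
    for a, b in pairs
    if nv_wavenet[a] != tacotron2[b]
]

print(mismatches or "shared audio parameters agree")
```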

rafaelvalle commented on August 15, 2024

Closing. Please re-open if new issues appear!
