I encountered 2 issues when following the work. The first one is when I trained th

negative loss and failed to load model for retraining about waveglow HOT 13 CLOSED

benlaitang commented on May 18, 2024

negative loss and failed to load model for retraining

from waveglow.

Comments (13)

chaiyujin commented on May 18, 2024 2

@yoyololicon

When training with the default model provided by this repo, I encountered NaN.
[Solved]: I initialized the upsample layer weight to 1.0 and the bias to 0.0.
When training with multiple GPUs, I encountered NaN again.
[Solved]: Use one GPU with batch_size = 4.

I'm not sure how to avoid NaN in general.
Training works with the initialized upsample layer on a single GPU.
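
For readers who want to try the same workaround, here is a minimal sketch of the constant initialization described above. The layer shape (80 mel channels, kernel size 1024, stride 256) is an assumption about the upsampling layer and should be adjusted to match your own config; the variable names are illustrative only.

```python
import torch
from torch import nn

# Assumed shape of the mel-spectrogram upsampling layer; adjust to your config.
upsample = nn.ConvTranspose1d(80, 80, kernel_size=1024, stride=256)

# Constant initialization as described in the comment: weight = 1.0, bias = 0.0.
nn.init.constant_(upsample.weight, 1.0)
nn.init.constant_(upsample.bias, 0.0)

# Quick sanity check on a dummy input of shape (batch, mel channels, frames).
mel = torch.randn(1, 80, 4)
out = upsample(mel)
print(tuple(out.shape), torch.isfinite(out).all().item())
```

This only shows how to set the values; whether it prevents NaN in a given setup is something to verify on your own run.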

chaiyujin commented on May 18, 2024

I also got a negative loss and NaN during training.

I moved WaveGlow into my own training framework and trained it from scratch with DataParallel. The NLL seemed to recover from NaN and then became NaN again.

Besides, the inferred samples even reach values as large as 1e17.

I used hparams:

        sigma                       = 1.0,
        n_flows                     = 12,
        n_group                     = 8,
        n_early_every               = 4,
        n_early_size                = 2,
        wn                          = Config(n_layers=8, n_channels=256, kernel_size=3)

benlaitang commented on May 18, 2024

@chaiyujin So, what is your training framework? Did retraining from the provided model work? I'm having trouble following what you wrote.

chaiyujin commented on May 18, 2024

@benlaitang Sorry about my English. I have my own training framework and trained the model from scratch. I never tested fine-tuning from the provided model.

wenyong-h commented on May 18, 2024

I got exactly the same problem.

benlaitang commented on May 18, 2024

@wenyong-h did you solve the problem?

wenyong-h commented on May 18, 2024

No, I'm training from scratch now.

rafaelvalle commented on May 18, 2024

Yes, the loss should be negative. We trained the model for 580k iterations with batch size 24. 22k iterations with batch size 3 is probably not enough to produce intelligible speech.
The second model is shared for inference only, not for resuming training.
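
As a toy illustration of why a negative loss is not a bug here: the loss is a negative log-likelihood of a continuous density, and a density can exceed 1, which makes its negative log drop below zero. The sketch below uses a plain 1-D Gaussian, not the actual WaveGlow loss; `gaussian_nll` is a hypothetical helper written for this example.

```python
import math

def gaussian_nll(x, mean, std):
    """Negative log-density of x under a normal distribution N(mean, std^2)."""
    var = std * std
    return 0.5 * math.log(2 * math.pi * var) + (x - mean) ** 2 / (2 * var)

# Wide distribution: density at the mean is below 1, so the NLL is positive.
wide = gaussian_nll(0.0, 0.0, 1.0)

# Narrow distribution: density at the mean is above 1, so the NLL is negative.
narrow = gaussian_nll(0.0, 0.0, 0.01)

print(wide, narrow)
```

So as the model concentrates probability mass more tightly around the training audio, the reported loss can keep decreasing past zero.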

benlaitang commented on May 18, 2024

@rafaelvalle thanks a lot. I will try from scratch again.

rafaelvalle commented on May 18, 2024

Closing. Please re-open if necessary.

yoyololicon commented on May 18, 2024

@chaiyujin Did you solve the NaN loss issue? I'm running into a similar problem. My training curve looks something like this:

rishikksh20 commented on May 18, 2024

@rafaelvalle Is there any way to fine-tune, retrain, or resume training of the model?

scimagian commented on May 18, 2024

@chaiyujin Thanks for your solution. I also ran into the NaN problem. With one GPU and batch size 1, the training loss is fine. But with 8 GPUs using torch.nn.parallel.data_parallel and batch size 8, I got a NaN loss after a few thousand steps. Lowering the learning rate from 1e-4 to 1e-5 solved the NaN problem.
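
A minimal sketch of the learning-rate change described above, for an optimizer that already exists. The `nn.Linear` model and the MSE loss are hypothetical stand-ins for the real model and loss, and the non-finite-loss check is an extra safeguard added for this example, not something reported in this thread.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # hypothetical stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Lower the learning rate from 1e-4 to 1e-5 on the existing optimizer.
for group in optimizer.param_groups:
    group["lr"] = 1e-5

def train_step(x, y):
    """Run one optimization step; skip the update if the loss is NaN or inf."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    if not torch.isfinite(loss):
        return None  # assumption: skipping the batch is acceptable here
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(8, 4), torch.randn(8, 1)))
```

Skipping a bad batch only hides the symptom, so it is still worth logging how often it happens.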
