total_loss: nan? about tensorflow2.0-examples HOT 16 OPEN

yunyang1994 commented on May 9, 2024

total_loss: nan?

from tensorflow2.0-examples.

Comments (16)

forever208 commented on May 9, 2024 5

Any update on this?

If your giou loss is the first term to turn NaN, there is likely something wrong in the giou function as defined. In my experiment I found cases where union_area = 0, so the IoU division is a division by zero and the result is infinity or NaN. You can debug this by editing the giou function. My workaround, which is not a proper fix because I haven't found the root cause of this bug yet, is to add a small constant at the end of this line:

union_area = boxes1_area + boxes2_area - inter_area + 1e-10
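
For readers who want to see where that constant sits in the whole computation, here is a minimal sketch of a GIoU function in plain Python. It is not the repository's implementation: the function name `giou`, the `(xmin, ymin, xmax, ymax)` box format, and the extra epsilon on the enclosing area are assumptions made for illustration. Only the `union_area` line mirrors the workaround above.

```python
def giou(box1, box2, eps=1e-10):
    """GIoU of two boxes given as (xmin, ymin, xmax, ymax).

    Hypothetical helper for illustration. eps is the small constant from the
    workaround above; it keeps both divisions finite for zero-area boxes.
    """
    x1, y1, x2, y2 = box1
    a1, b1, a2, b2 = box2

    # Box areas, clamped at zero so malformed boxes cannot go negative.
    area1 = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
    area2 = max(a2 - a1, 0.0) * max(b2 - b1, 0.0)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(min(x2, a2) - max(x1, a1), 0.0)
    inter_h = max(min(y2, b2) - max(y1, b1), 0.0)
    inter_area = inter_w * inter_h

    # Same guard as in the comment above.
    union_area = area1 + area2 - inter_area + eps
    iou = inter_area / union_area

    # Smallest axis-aligned box enclosing both inputs (assumed guard on it too).
    enclose_w = max(max(x2, a2) - min(x1, a1), 0.0)
    enclose_h = max(max(y2, b2) - min(y1, b1), 0.0)
    enclose_area = enclose_w * enclose_h + eps

    return iou - (enclose_area - union_area) / enclose_area


if __name__ == "__main__":
    print(giou((0, 0, 0, 0), (0, 0, 0, 0)))  # degenerate boxes stay finite
    print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
```

Without the epsilon, two zero-area boxes would make both denominators zero. Plain Python raises `ZeroDivisionError` in that case, while float tensors produce inf or NaN, which then propagates into the loss.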

YunYang1994 commented on May 9, 2024

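It looks like the NaN is caused by the learning rate rising the whole time. You could try making the learning rate smaller. By the way, which dataset are you training on?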

dvlee1024 commented on May 9, 2024

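It looks like the NaN is caused by the learning rate rising the whole time. You could try making the learning rate smaller. By the way, which dataset are you training on?

A face dataset, WIDER FACE.
Isn't the learning rate supposed to keep decreasing? @YunYang1994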

dvlee1024 commented on May 9, 2024

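I figured it out. My dataset is large: steps_per_epoch is 1250, so with a warmup of 10 epochs, warmup_steps is 12500.
My global_steps stayed below warmup_steps the whole time, so lr was still in its rising phase.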

steps_per_epoch = len(trainset)
warmup_steps = cfg.TRAIN.WARMUP_EPOCHS * steps_per_epoch
total_steps = cfg.TRAIN.EPOCHS * steps_per_epoch

if global_steps < warmup_steps:
    # Linear warmup: lr rises from 0 towards LR_INIT.
    lr = global_steps / warmup_steps * cfg.TRAIN.LR_INIT
else:
    # Cosine decay: lr falls from LR_INIT to LR_END.
    lr = cfg.TRAIN.LR_END + 0.5 * (cfg.TRAIN.LR_INIT - cfg.TRAIN.LR_END) * (
        1 + tf.cos((global_steps - warmup_steps) / (total_steps - warmup_steps) * np.pi)
    )
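
To make the shape of this schedule easier to inspect without running a training job, the same formula can be sketched as a standalone function. This is an illustrative rewrite in plain Python, not code from the repository: the function name `learning_rate`, its parameters, and the values `total_epochs=30`, `lr_init=1e-4`, `lr_end=1e-6` are assumptions. Only `steps_per_epoch=1250` and the 10 warmup epochs come from the comment above.

```python
import math


def learning_rate(global_step, steps_per_epoch, warmup_epochs, total_epochs,
                  lr_init, lr_end):
    """Linear warmup followed by cosine decay (hypothetical helper).

    Assumes total_epochs > warmup_epochs, so the decay phase is non-empty.
    """
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch

    if global_step < warmup_steps:
        # Warmup: lr grows linearly from 0 to lr_init.
        return global_step / warmup_steps * lr_init

    # Decay: progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (global_step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_init - lr_end) * (1 + math.cos(progress * math.pi))


if __name__ == "__main__":
    # steps_per_epoch and warmup_epochs as reported above; the rest is assumed.
    for step in (0, 6250, 12500, 25000, 37500):
        lr = learning_rate(step, steps_per_epoch=1250, warmup_epochs=10,
                           total_epochs=30, lr_init=1e-4, lr_end=1e-6)
        print(step, lr)
```

With these numbers the learning rate keeps climbing for the first 12500 steps, so a run that is inspected or diverges before that point only ever sees a rising learning rate.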

YunYang1994 commented on May 9, 2024

Just open TensorBoard and you will see.

YunYang1994 commented on May 9, 2024

__C.TRAIN.LR_INIT             = 1e-4
__C.TRAIN.LR_END              = 1e-6
__C.TRAIN.WARMUP_EPOCHS       = 4

Give these a try?
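
As a quick check of what the shorter warmup means for the dataset discussed in this thread, the arithmetic can be sketched as follows. The helper name `warmup_steps` is hypothetical; `steps_per_epoch = 1250` is the figure reported earlier in the thread.

```python
def warmup_steps(warmup_epochs, steps_per_epoch):
    """Number of training steps during which the learning rate ramps up."""
    return warmup_epochs * steps_per_epoch


if __name__ == "__main__":
    steps_per_epoch = 1250  # reported earlier in this thread
    for epochs in (10, 4):
        print(f"WARMUP_EPOCHS={epochs}: lr rises for "
              f"{warmup_steps(epochs, steps_per_epoch)} steps")
```

Under that assumption, dropping WARMUP_EPOCHS from 10 to 4 shortens the rising phase from 12500 steps to 5000, and the peak it climbs towards is the suggested LR_INIT of 1e-4.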

dvlee1024 commented on May 9, 2024

__C.TRAIN.LR_INIT             = 1e-4
__C.TRAIN.LR_END              = 1e-6
__C.TRAIN.WARMUP_EPOCHS       = 4

Give these a try?

What is warmup actually for? I was even planning to set it to 0.

YunYang1994 commented on May 9, 2024

Seriously, what is it for? Read this for yourself: https://arxiv.org/pdf/1812.01187.pdf

dvlee1024 commented on May 9, 2024

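If I restore the weights from the previous run and continue training, is warmup still needed?
I'm a newcomer from outside the field, so I should find some time to read up 😂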

YunYang1994 commented on May 9, 2024

If the loss does not go NaN, you don't need warmup.

SinclairHudson commented on May 9, 2024

I'm having the same issue. Could I please get an English explanation?

SinclairHudson commented on May 9, 2024

@YunYang1994 could I get a quick English translation, please?

aHandToHelp commented on May 9, 2024

Any update on this?

k-maheshkumar commented on May 9, 2024

I am facing the same problem. Any updates on this?

SinclairHudson commented on May 9, 2024

I solved the issue by reducing the learning rate and using warmup epochs. The learning rate slowly increases and then decreases, and never gets too high. This will prevent the model from diverging (NaN loss). Hope this helps!

IqbalLx commented on May 9, 2024

Any update on this?

If your giou loss is the first term to turn NaN, there is likely something wrong in the giou function as defined. In my experiment I found cases where union_area = 0, so the IoU division is a division by zero and the result is infinity or NaN. You can debug this by editing the giou function. My workaround, which is not a proper fix because I haven't found the root cause of this bug yet, is to add a small constant at the end of this line:

union_area = boxes1_area + boxes2_area - inter_area + 1e-10

I already tried this, and it seems to work fine. Thanks!
