
Comments (25)

TZYSJTU commented on July 24, 2024

I'm declaring this issue solved. The people above blaming do_classifier_free_guidance are simply wrong; that is not the cause.
The real cause is the batch size: the author did not set drop_last=True when creating the dataloader, yet the code assumes every batch is full. If the last batch is not full, you get
IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, your_batch_size*768, 320] at index 0

PS: How I found it: with the UBC dataset (499 clips) and batch_size=2, training always completed 249 steps and crashed on step 250. With my own dataset of 1000 clips there was no error. After adding drop_last=True everything runs normally.
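
For reference, a minimal sketch of that dataloader change (train_dataset and train_batch_size are placeholders for whatever train_stage_1.py actually builds; drop_last is the only point being illustrated):

from torch.utils.data import DataLoader

# Sketch of the suggested fix: drop the final incomplete batch so every batch
# the model sees is full. Dataset and batch-size names are placeholders.
train_dataloader = DataLoader(
    train_dataset,
    batch_size=train_batch_size,
    shuffle=True,
    num_workers=4,
    drop_last=True,  # discard the last, smaller-than-batch_size batch
)

With 499 clips and batch_size=2 this turns 250 batches per epoch (the last containing a single clip) into 249 full batches, which matches the step count reported above.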

from moore-animateanyone.

HaiLin545 commented on July 24, 2024

A simple workaround: when running eval, pass a deep copy of the net to log_validation so the training net is never overwritten. Change

reference_unet = ori_net.reference_unet
denoising_unet = ori_net.denoising_unet

to (add import copy at the top of the file if it is not already there)

reference_unet = copy.deepcopy(ori_net.reference_unet)
denoising_unet = copy.deepcopy(ori_net.denoising_unet)

from moore-animateanyone.

EzrealLee9527 commented on July 24, 2024

Because eval turns on do_classifier_free_guidance in ReferenceAttentionControl, it needs to be reset for training.

from moore-animateanyone.

EzrealLee9527 commented on July 24, 2024

Because eval turns on do_classifier_free_guidance in ReferenceAttentionControl, it needs to be reset for training.

The ReferenceAttentionControl for training and testing is initialized separately, and there should be no impact between them.

The eval-time initialization overwrites the training-time one, and because it is set up in __init__ rather than in forward, it never gets reset. Print the do_classifier_free_guidance behavior during training and you will see it differs before and after validation.
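
To make the mismatch concrete, here is a tiny standalone reproduction of the error message from this thread (a sketch only, not the repository's hook code): boolean-mask indexing fails as soon as the mask's length no longer matches the tensor's batch dimension, which is what happens when a mask sized at registration time meets a different batch at runtime.

import torch

# Minimal illustration (not the actual mutual_self_attention.py code): a stale
# boolean mask whose length was fixed earlier no longer matches the current batch.
hidden_states = torch.randn(1, 4096, 320)         # the batch that actually arrives
stale_uc_mask = torch.zeros(0, dtype=torch.bool)  # mask built for a different setup

try:
    _ = hidden_states[stale_uc_mask]
except IndexError as err:
    print(err)
    # The shape of the mask [0] at index 0 does not match the shape of the
    # indexed tensor [1, 4096, 320] at index 0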

from moore-animateanyone.

renrenzsbbb commented on July 24, 2024

Because eval turns on do_classifier_free_guidance in ReferenceAttentionControl, it needs to be reset for training.

The ReferenceAttentionControl for training and testing is initialized separately, and there should be no impact between them.

The eval-time initialization overwrites the training-time one, and because it is set up in __init__ rather than in forward, it never gets reset. Print the do_classifier_free_guidance behavior during training and you will see it differs before and after validation.

Yes, that is exactly what causes this.

Setting it up in forward would fix it. By the way, is your training data all static, solid-color backgrounds? The motion module I trained is far worse than AnimateDiff's results; the unnatural jitter is very strong. Have you run experiments to check whether this is a data-bias problem or a problem with the method itself?

Another option is to wrap them again after inference:
# Re-register the attention hooks for training with do_classifier_free_guidance off:
reference_control_writer = ReferenceAttentionControl(
    reference_unet,
    do_classifier_free_guidance=False,
    mode="write",
    fusion_blocks="full",
)
reference_control_reader = ReferenceAttentionControl(
    denoising_unet,
    do_classifier_free_guidance=False,
    mode="read",
    fusion_blocks="full",
)

from moore-animateanyone.

lixunsong commented on July 24, 2024

Because eval turns on do_classifier_free_guidance in ReferenceAttentionControl, it needs to be reset for training.

The ReferenceAttentionControl for training and testing is initialized separately, and there should be no impact between them.

The eval-time initialization overwrites the training-time one, and because it is set up in __init__ rather than in forward, it never gets reset. Print the do_classifier_free_guidance behavior during training and you will see it differs before and after validation.

Yes, that is exactly what causes this.

from moore-animateanyone.

EzrealLee9527 commented on July 24, 2024

Because eval turns on do_classifier_free_guidance in ReferenceAttentionControl, it needs to be reset for training.

The ReferenceAttentionControl for training and testing is initialized separately, and there should be no impact between them.

The eval-time initialization overwrites the training-time one, and because it is set up in __init__ rather than in forward, it never gets reset. Print the do_classifier_free_guidance behavior during training and you will see it differs before and after validation.

Yes, that is exactly what causes this.

Setting it up in forward would fix it. By the way, is your training data all static, solid-color backgrounds? The motion module I trained is far worse than AnimateDiff's results; the unnatural jitter is very strong. Have you run experiments to check whether this is a data-bias problem or a problem with the method itself?
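
As an illustration of the "set it up in forward" idea, the mask could be derived from the runtime batch size on every call instead of being frozen when the hook is registered (a rough sketch only; build_uc_mask is a hypothetical helper, and the real hook in src/models/mutual_self_attention.py differs in its details):

import torch

def build_uc_mask(batch_size: int, do_classifier_free_guidance: bool, device) -> torch.Tensor:
    # Hypothetical helper: with classifier-free guidance, treat the first half of
    # the batch as the unconditional branch; without it, nothing is masked.
    if do_classifier_free_guidance:
        half = batch_size // 2
        mask = torch.cat([torch.ones(half), torch.zeros(batch_size - half)])
        return mask.bool().to(device)
    return torch.zeros(batch_size, dtype=torch.bool, device=device)

# Inside the hacked forward, sized to whatever batch actually arrives:
#   uc_mask = build_uc_mask(hidden_states.shape[0], do_cfg, hidden_states.device)
#   norm_hidden_states[uc_mask] = ...  # cannot mismatch, even for a partial last batch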

from moore-animateanyone.

TZYSJTU commented on July 24, 2024

We didn't encounter this issue in our environment, whether with a single GPU or multiple GPUs. Can you consistently reproduce this problem during your training?

Yes, it reproduces consistently, but I don't understand why the error shows up only after training several more steps following the first validation, instead of on the very first step after validation finishes.

from moore-animateanyone.

WuHuadong3 commented on July 24, 2024

I did not modify any config or source code; I only changed the dataset paths in the config.

from moore-animateanyone.

lixunsong commented on July 24, 2024

Does the error occur while running evaluation during training?

from moore-animateanyone.

WuHuadong3 commented on July 24, 2024

Does the error occur while running evaluation during training?

Yes. Here is some of the log leading up to it:

01/18/2024 20:46:06 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'force_upcast', 'scaling_factor'} was not found in config. Values will be initialized to default values.
{'attention_type', 'transformer_layers_per_block', 'resnet_out_scale_factor', 'time_cond_proj_dim', 'resnet_time_scale_shift', 'time_embedding_type', 'num_attention_heads', 'conv_out_kernel', 'addition_embed_type', 'addition_embed_type_num_heads', 'dropout', 'conv_in_kernel', 'reverse_transformer_layers_per_block', 'cross_attention_norm', 'class_embed_type', 'timestep_post_act', 'encoder_hid_dim', 'resnet_skip_time_act', 'time_embedding_dim', 'mid_block_type', 'class_embeddings_concat', 'time_embedding_act_fn', 'mid_block_only_cross_attention', 'projection_class_embeddings_input_dim', 'upcast_attention', 'encoder_hid_dim_type', 'addition_time_embed_dim'} was not found in config. Values will be initialized to default values.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
01/18/2024 20:46:26 - INFO - src.models.unet_3d - loaded temporal unet's pretrained weights from pretrained_weights/sd-image-variations-diffusers/unet ...
{'motion_module_decoder_only', 'motion_module_mid_block', 'class_embed_type', 'use_inflated_groupnorm', 'unet_use_cross_frame_attention', 'motion_module_type', 'motion_module_resolutions', 'upcast_attention', 'resnet_time_scale_shift', 'motion_module_kwargs'} was not found in config. Values will be initialized to default values.
01/18/2024 20:46:34 - INFO - src.models.unet_3d - Loaded 0.0M-parameter motion module
01/18/2024 20:46:49 - INFO - main - Missing key for pose guider: 2
01/18/2024 20:46:49 - INFO - main - ***** Running training *****
01/18/2024 20:46:49 - INFO - main - Num examples = 499
01/18/2024 20:46:49 - INFO - main - Num Epochs = 240
01/18/2024 20:46:49 - INFO - main - Instantaneous batch size per device = 4
01/18/2024 20:46:49 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 4
01/18/2024 20:46:49 - INFO - main - Gradient Accumulation steps = 1
01/18/2024 20:46:49 - INFO - main - Total optimization steps = 30000
Steps: 1%|▋ | 200/30000 [04:36<11:37:43, 1.40s/it, lr=1e-5, step_loss=0.0665]01/18/2024 20:51:26 - INFO - main - Running validation...
2024-01-18 20:51:28.890883153 [W:onnxruntime:, session_state.cc:1162 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-18 20:51:28.890915295 [W:onnxruntime:, session_state.cc:1164 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:07<00:00, 2.78it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 53.12it/s]
The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:07<00:00, 2.78it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 201.49it/s]
The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:07<00:00, 2.78it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 206.58it/s]
The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:07<00:00, 2.78it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 197.38it/s]
Steps: 1%|▊ | 249/30000 [06:25<11:27:41, 1.39s/it, lr=1e-5, step_loss=0.111]Traceback (most recent call last):00:00<?, ?it/s]

from moore-animateanyone.

lixunsong commented on July 24, 2024

This looks like a single-GPU training issue; I will try to fix it. For now, you can skip evaluation during training, save checkpoints at intervals, and test them offline.
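
A minimal way to do that while waiting for a fix (a sketch only; run_validation_during_training, global_step, validation_steps and the log_validation call are stand-ins, not the script's actual names):

# Hypothetical switch to disable in-training validation until the bug is fixed;
# checkpoints saved during training can then be evaluated offline.
run_validation_during_training = False

if run_validation_during_training and global_step % validation_steps == 0:
    log_validation(...)  # skipped while the switch is off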

from moore-animateanyone.

WuHuadong3 commented on July 24, 2024

This looks like a single-GPU training issue; I will try to fix it. For now, you can skip evaluation during training, save checkpoints at intervals, and test them offline.

Thanks!

from moore-animateanyone.

WuHuadong3 commented on July 24, 2024

OSError: [Errno 16] Device or resource busy: './Moore-AnimateAnyone/mlruns/577895558829709631/215c420ddf57411b86a37f6cd75c5667/meta.yaml'
Steps: 31%|█████████████▊ | 9171/30000 [8:10:47<18:34:40, 3.21s/it, lr=1e-5, step_loss=0.0424]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1054950 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1054949) of binary: ./anaconda3/envs/moore/bin/python

I also ran into an error (above) when training with multiple GPUs.

from moore-animateanyone.

ypflll commented on July 24, 2024

I hit the same problem training on 2 GPUs:
Steps: 1%|▋ | 200/30000 [04:18<9:39:59, 1.17s/it, lr=1e-5, step_loss=0.0821]Traceback (most recent call last):
File "train_stage_1.py", line 728, in
main(config)
File "train_stage_1.py", line 564, in main
model_pred = net(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/operations.py", line 581, in forward
return model_forward(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/operations.py", line 569, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/opt/conda/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "train_stage_1.py", line 87, in forward
model_pred = self.denoising_unet(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "Moore-AnimateAnyone/src/models/unet_3d.py", line 493, in forward
sample, res_samples = downsample_block(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "Moore-AnimateAnyone/src/models/unet_3d_blocks.py", line 442, in forward
hidden_states = attn(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "Moore-AnimateAnyone/src/models/transformer_3d.py", line 140, in forward
hidden_states = block(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "Moore-AnimateAnyone/src/models/mutual_self_attention.py", line 180, in hacked_basic_transformer_inner_forward
norm_hidden_states[_uc_mask],
IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, 4096, 320] at index 0
Steps: 1%|▋ | 200/30000 [04:19<10:43:40, 1.30s/it, lr=1e-5, step_loss=0.

It looks like the error appears right after the first validation.

from moore-animateanyone.

WuHuadong3 commented on July 24, 2024

I hit the same problem training on 2 GPUs: [...] IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, 4096, 320] at index 0

It looks like the error appears right after the first validation.

I did not run into this problem when training on two GPUs, but I did hit a different one (pasted above).

from moore-animateanyone.

lixunsong commented on July 24, 2024

We didn't encounter this issue in our environment, whether with a single GPU or multiple GPUs. Can you consistently reproduce this problem during your training?

from moore-animateanyone.

HaiLin545 commented on July 24, 2024

We didn't encounter this issue in our environment, whether with a single GPU or multiple GPUs. Can you consistently reproduce this problem during your training?

On a 3090 it reproduces consistently with both single and multiple GPUs; the only workaround was to comment out the evaluation code.

from moore-animateanyone.

theSha1do1w commented on July 24, 2024

We didn't encounter this issue in our environment, whether with a single GPU or multiple GPUs. Can you consistently reproduce this problem during your training?

On a 3090 it reproduces consistently with both single and multiple GPUs; the only workaround was to comment out the evaluation code.

Can a 3090 run stage-1 training at all?

from moore-animateanyone.

HaiLin545 commented on July 24, 2024

We didn't encounter this issue in our environment, whether with a single GPU or multiple GPUs. Can you consistently reproduce this problem during your training?

On a 3090 it reproduces consistently with both single and multiple GPUs; the only workaround was to comment out the evaluation code.

Can a 3090 run stage-1 training at all?

Full fine-tuning doesn't fit, but LoRA training works.

from moore-animateanyone.

TZYSJTU commented on July 24, 2024

Because eval turns on do_classifier_free_guidance in ReferenceAttentionControl, it needs to be reset for training.

I ran into the same problem and have not looked at the code carefully yet.
How exactly do you reset it? Could you give a concrete change?

from moore-animateanyone.

TZYSJTU commented on July 24, 2024

Because eval turns on do_classifier_free_guidance in ReferenceAttentionControl, it needs to be reset for training.

The ReferenceAttentionControl for training and testing is initialized separately, and there should be no impact between them.

The eval-time initialization overwrites the training-time one, and because it is set up in __init__ rather than in forward, it never gets reset. Print the do_classifier_free_guidance behavior during training and you will see it differs before and after validation.

Yes, that is exactly what causes this.

I ran into the same problem. How exactly do you reset it? Could you give a concrete change?

from moore-animateanyone.

TZYSJTU commented on July 24, 2024

Does the error occur while running evaluation during training?

Yes. Here is some of the log leading up to it: [...] Steps: 1%|▊ | 249/30000 [06:25<11:27:41, 1.39s/it, lr=1e-5, step_loss=0.111]Traceback (most recent call last):

I got exactly the same error, also at step 249, and it is solved now.
You are using the UBC dataset, right? One of its clips hits a bug during dwpose extraction, so after deleting it 499 clips remain. And your batch size is 2, right? That leaves only a single sample in the last batch, and the author did not handle the case of a non-full batch, which is what triggers the error.

The fix is simple: pass drop_last=True when creating the dataloader.

from moore-animateanyone.

jim-1ee commented on July 24, 2024

I'm declaring this issue solved. The people above blaming do_classifier_free_guidance are simply wrong; that is not the cause. The real cause is the batch size: the author did not set drop_last=True when creating the dataloader, yet the code assumes every batch is full. If the last batch is not full, you get IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, your_batch_size*768, 320] at index 0

PS: How I found it: with the UBC dataset (499 clips) and batch_size=2, training always completed 249 steps and crashed on step 250. With my own dataset of 1000 clips there was no error. After adding drop_last=True everything runs normally.

I set batch_size=1 and still get the same error.

from moore-animateanyone.

ButoneDream commented on July 24, 2024

A simple workaround: when running eval, pass a deep copy of the net to log_validation so the training net is never overwritten. Change

reference_unet = ori_net.reference_unet
denoising_unet = ori_net.denoising_unet

to

reference_unet = copy.deepcopy(ori_net.reference_unet)
denoising_unet = copy.deepcopy(ori_net.denoising_unet)

You may run into OOM if VRAM is not enough, though.
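
If memory is tight, one mitigation (a sketch; it assumes the deep copies are created and used only inside the validation helper) is to drop them and release cached GPU memory as soon as validation is done:

import gc
import torch

# Sketch: once validation has finished with the deep-copied UNets, delete them
# and let PyTorch release the cached memory they occupied.
del reference_unet, denoising_unet
gc.collect()
torch.cuda.empty_cache()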

from moore-animateanyone.
