I am fine-tuning EVA on my custom dataset and ran into the following error (it also happens when fine-tuning on COCO):
File "train.py", line 187, in <module>
main(args)
File "train.py", line 169, in main
trainer.train(0, cfg.train.max_iter)
File "/home/appuser/eva_repo/det/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/home/appuser/eva_repo/det/detectron2/engine/train_loop.py", line 421, in run_step
self.grad_scaler.scale(losses).backward()
File "/home/appuser/.local/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/appuser/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
File "/home/appuser/.local/lib/python3.7/site-packages/torch/autograd/function.py", line 87, in apply
return self._forward_cls.backward(self, *args) # type: ignore[attr-defined]
File "/home/appuser/.local/lib/python3.7/site-packages/fairscale/nn/checkpoint/checkpoint_activations.py", line 331, in backward
outputs = ctx.run_function(*unpacked_args, **unpacked_kwargs)
File "/home/appuser/eva_repo/det/detectron2/modeling/backbone/vit.py", line 289, in forward
x = self.attn(x)
File "/home/appuser/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/appuser/eva_repo/det/detectron2/modeling/backbone/vit.py", line 139, in forward
attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W))
File "/home/appuser/eva_repo/det/detectron2/modeling/backbone/utils.py", line 133, in add_decomposed_rel_pos
Rh = get_rel_pos(q_h, k_h, rel_pos_h)
File "/home/appuser/eva_repo/det/detectron2/modeling/backbone/utils.py", line 100, in get_rel_pos
z = rel_pos[:, i].view(src_size).cpu().float().numpy()
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
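
Following the suggestion in the error message, I added `.detach()` before the `.numpy()` call in `get_rel_pos` (sketch of the edit in `detectron2/modeling/backbone/utils.py`):

```python
# detectron2/modeling/backbone/utils.py, inside get_rel_pos (line 100 in the traceback)
# before:
z = rel_pos[:, i].view(src_size).cpu().float().numpy()
# after -- detach from the autograd graph before converting to numpy:
z = rel_pos[:, i].view(src_size).detach().cpu().float().numpy()
```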
After this change, the training runs without issue and the loss decreases steadily.
But I am not sure I understand the full implications of this change. Calling `.detach()` cuts the tensor out of the autograd graph, so `rel_pos_h`/`rel_pos_w` would not receive gradients through this call. Or is that not an issue here? Did you not get this error during training?
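
For context, here is a minimal standalone sketch (plain PyTorch, unrelated to the EVA code) of my understanding of what `.detach()` does to gradient flow:

```python
import torch

w = torch.ones(2, requires_grad=True)

# Normal path: gradients flow back to w.
loss = (w * 3.0).sum()
loss.backward()
print(w.grad)  # tensor([3., 3.])

# Detached path: w is cut out of the autograd graph,
# so nothing downstream can produce gradients for it.
w.grad = None
loss2 = (w.detach() * 3.0).sum()
print(loss2.requires_grad)  # False -- calling loss2.backward() would raise an error
print(w.grad)               # None -- w gets no gradient through this path
```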
I am running EVA inside Docker with CUDA 11.1, Python 3.7, torch 1.9.0, torchvision 0.10.0, and mmcv-full 1.6.1, but I doubt this is a versioning issue.