I'm trying to train a small network on my own dataset, on an AWS P2 instance with ~12 GB of GPU memory.
Training fails at the first iteration with the CUDA out-of-memory error below. What can I do about it? Would reducing the batch size help, and if so, how do I do that?
[05/16 14:46:32 d2.data.build]: Using training sampler TrainingSampler
[05/16 14:46:32 fvcore.common.checkpoint]: Loading checkpoint from https://www.dropbox.com/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1
[05/16 14:46:32 fvcore.common.file_io]: URL https://www.dropbox.com/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1 cached in /home/ubuntu/.torch/fvcore_cache/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1
[05/16 14:46:33 fvcore.common.checkpoint]: Some model parameters or buffers are not in the checkpoint:
backbone.fpn_output5.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_predictor.bbox_pred.{weight, bias}
roi_heads.mask_head.mask_fcn3.{weight, bias}
roi_heads.mask_head.predictor.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output3.{weight, bias}
proposal_generator.anchor_generator.cell_anchors.{0, 2, 3, 4, 1}
proposal_generator.rpn_head.conv.{weight, bias}
roi_heads.box_predictor.cls_score.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{weight, bias}
roi_heads.mask_head.mask_fcn1.{weight, bias}
roi_heads.mask_head.mask_fcn2.{weight, bias}
backbone.fpn_output2.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
backbone.fpn_lateral2.{bias, weight}
backbone.fpn_lateral4.{weight, bias}
backbone.fpn_lateral5.{weight, bias}
backbone.fpn_lateral3.{weight, bias}
[05/16 14:46:33 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
backbone.bottom_up.stem.stem_1/norm.num_batches_tracked
backbone.bottom_up.stem.stem_2/norm.num_batches_tracked
backbone.bottom_up.stem.stem_3/norm.num_batches_tracked
backbone.bottom_up.stage2.OSA2_1.layers.0.OSA2_1_0/norm.num_batches_tracked
backbone.bottom_up.stage2.OSA2_1.layers.1.OSA2_1_1/norm.num_batches_tracked
backbone.bottom_up.stage2.OSA2_1.layers.2.OSA2_1_2/norm.num_batches_tracked
backbone.bottom_up.stage2.OSA2_1.concat.OSA2_1_concat/norm.num_batches_tracked
backbone.bottom_up.stage3.OSA3_1.layers.0.OSA3_1_0/norm.num_batches_tracked
backbone.bottom_up.stage3.OSA3_1.layers.1.OSA3_1_1/norm.num_batches_tracked
backbone.bottom_up.stage3.OSA3_1.layers.2.OSA3_1_2/norm.num_batches_tracked
backbone.bottom_up.stage3.OSA3_1.concat.OSA3_1_concat/norm.num_batches_tracked
backbone.bottom_up.stage4.OSA4_1.layers.0.OSA4_1_0/norm.num_batches_tracked
backbone.bottom_up.stage4.OSA4_1.layers.1.OSA4_1_1/norm.num_batches_tracked
backbone.bottom_up.stage4.OSA4_1.layers.2.OSA4_1_2/norm.num_batches_tracked
backbone.bottom_up.stage4.OSA4_1.concat.OSA4_1_concat/norm.num_batches_tracked
backbone.bottom_up.stage5.OSA5_1.layers.0.OSA5_1_0/norm.num_batches_tracked
backbone.bottom_up.stage5.OSA5_1.layers.1.OSA5_1_1/norm.num_batches_tracked
backbone.bottom_up.stage5.OSA5_1.layers.2.OSA5_1_2/norm.num_batches_tracked
backbone.bottom_up.stage5.OSA5_1.concat.OSA5_1_concat/norm.num_batches_tracked
[05/16 14:46:33 d2.engine.train_loop]: Starting training from iteration 0
ERROR [05/16 14:46:38 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 215, in run_step
loss_dict = self.model(data)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 121, in forward
features = self.backbone(images.tensor)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
bottom_up_features = self.bottom_up(x)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 367, in forward
x = getattr(self, name)(x)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 234, in forward
xt = self.concat(x)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/detectron2/detectron2/layers/batch_norm.py", line 55, in forward
return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 1.03 GiB (GPU 0; 11.17 GiB total capacity; 8.48 GiB already allocated; 845.31 MiB free; 10.03 GiB reserved in total by PyTorch)
[05/16 14:46:38 d2.engine.hooks]: Total training time: 0:00:05 (0:00:00 on hooks)
Traceback (most recent call last):
File "train_net_docs.py", line 115, in <module>
dist_url=args.dist_url,
File "/home/ubuntu/detectron2/detectron2/engine/launch.py", line 57, in launch
main_func(*args)
File "train_net_docs.py", line 93, in main
trainer.resume_or_load(resume=args.resume)
File "/home/ubuntu/detectron2/detectron2/engine/defaults.py", line 401, in train
super().train(self.start_iter, self.max_iter)
File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 215, in run_step
loss_dict = self.model(data)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 121, in forward
features = self.backbone(images.tensor)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
bottom_up_features = self.bottom_up(x)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 367, in forward
x = getattr(self, name)(x)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 234, in forward
xt = self.concat(x)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/detectron2/detectron2/layers/batch_norm.py", line 55, in forward
return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 1.03 GiB (GPU 0; 11.17 GiB total capacity; 8.48 GiB already allocated; 845.31 MiB free; 10.03 GiB reserved in total by PyTorch)
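
For context on the batch-size part of the question, here is a minimal sketch of the kind of override that might apply. It assumes that `train_net_docs.py` follows the usual detectron2 pattern of merging command-line `opts` into the config with `cfg.merge_from_list(...)`, and that the standard config keys `SOLVER.IMS_PER_BATCH`, `INPUT.MIN_SIZE_TRAIN` and `INPUT.MAX_SIZE_TRAIN` are honored by the VoVNet project. The values and the helper names `to_opts` and `scaled_lr` are hypothetical, chosen only for illustration and not tuned.

```python
# Hypothetical memory-saving overrides; the values are illustrative, not tuned.
MEMORY_OVERRIDES = {
    "SOLVER.IMS_PER_BATCH": 2,       # total images per iteration (the batch size)
    "INPUT.MIN_SIZE_TRAIN": (512,),  # shorter image side during training
    "INPUT.MAX_SIZE_TRAIN": 853,     # cap on the longer image side
}


def to_opts(overrides):
    """Flatten {key: value} into the [key, value, key, value, ...] string list
    that a yacs-style cfg.merge_from_list() accepts (same form as CLI opts)."""
    opts = []
    for key, value in overrides.items():
        opts.extend([key, str(value)])
    return opts


def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling heuristic: shrink the learning rate in proportion to the
    batch size. This is a common rule of thumb, not a guarantee."""
    return base_lr * new_batch / base_batch


if __name__ == "__main__":
    print(to_opts(MEMORY_OVERRIDES))
    # Example inputs only; substitute the values from your own config.
    print(scaled_lr(base_lr=0.02, base_batch=16, new_batch=2))
```

If the script does accept `opts`, the resulting list could presumably be passed as `cfg.merge_from_list(to_opts(MEMORY_OVERRIDES))` before the trainer is built, or the same key/value pairs appended to the command line. Smaller input sizes are included because the failing allocation happens inside the backbone, where activation memory grows with both batch size and image resolution.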