youngwanlee / vovnet-detectron2

[CVPR 2020] VoVNet backbone networks for detectron2

License: Other

Python 100.00%

vovnet detectron2 object-detection instance-segmentation pytorch panoptic-segmentation cvpr2020

vovnet-detectron2's Introduction

👋 Hi there! I'm Youngwan, a senior researcher at ETRI and Ph.D student in Graduate school of AI at KAIST, where I'm advised by Prof. Sung Ju Hwang in the Machine Learning and Artificial Intelligence (MLAI) lab.

My research interest is how computers understand the world, including efficient 2D/3D neural network design, object detection, instance segmentation, semantic segmentation, and video classification. 🖥️🌏

Representative publications and Codes

See Google scholar for full list.

RC-MAE: Exploring the Role of Mean Teachers in Self-supervised Masked Auto-Encoders, ICLR 2023.
MPViT : Multi-Path Vision Transformer for Dense Prediction, CVPR 2022.
CenterMask : Real-Time Anchor-Free Instance Segmentation, CVPR 2020.
2D convolutional neural network : VoVNet
3D convolutional neural network : VoV3D

About me

📝 I enjoy ~~teaching~~ talking what I ~~know~~ learn, so I am giving lectures on AI as an AI Facilitator at ETRI AI Academy.
🌏🌱🌲🌊 ⛰️ I love to appreciate the beautiful nature.
🎾 🏀 I enjoy playing tennis and basketball.
📫 How to reach me: [email protected] | [email protected]

vovnet-detectron2's Issues

Is the input supposed to be RGB or BGR?

The default detectron2 config uses BGR input. Is that also the case here?

KeyError: 'Non-existent config key: MODEL.VOVNET'

I got the following error:

WARNING [02/08 15:44:32 d2.config.compat]: Config '/home/detectron2/vovnet-detectron2/configs/faster_rcnn_V_99_FPN_3x.yaml' has no VERSION. Assuming it to be compatible with latest v2.
Traceback (most recent call last):
  File "vovnet-detectron2/custom_vovnet_train.py", line 75, in <module>
    cfg = prepareConfig()
  File "vovnet-detectron2/custom_vovnet_train.py", line 69, in prepareConfig
    cfg.merge_from_file(config_file)
  File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/config/config.py", line 45, in merge_from_file
    self.merge_from_other_cfg(loaded_cfg)
  File "/opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/fvcore/common/config.py", line 121, in merge_from_other_cfg
    return super().merge_from_other_cfg(cfg_other)
  File "/opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/yacs/config.py", line 217, in merge_from_other_cfg
    _merge_a_into_b(cfg_other, self, self, [])
  File "/opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/yacs/config.py", line 460, in _merge_a_into_b
    _merge_a_into_b(v, b[k], root, key_list + [k])
  File "/opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/yacs/config.py", line 473, in _merge_a_into_b
    raise KeyError("Non-existent config key: {}".format(full_key))
KeyError: 'Non-existent config key: MODEL.VOVNET'
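The traceback shows a strict merge rejecting a key that the default config does not declare. Another issue on this page reports that calling add_vovnet_config(cfg) fixed an earlier error, so a likely cause (an assumption, not confirmed here) is that the VoVNet defaults were not added before cfg.merge_from_file was called. The sketch below is a pure-Python stand-in for that behaviour. It is not yacs itself, and `strict_merge` and `add_vovnet_defaults` are hypothetical names used only for illustration.

```python
def strict_merge(defaults: dict, loaded: dict, prefix: str = "") -> None:
    """Merge `loaded` into `defaults` in place, rejecting unknown keys."""
    for key, value in loaded.items():
        full_key = f"{prefix}{key}"
        if key not in defaults:
            raise KeyError(f"Non-existent config key: {full_key}")
        if isinstance(value, dict) and isinstance(defaults[key], dict):
            strict_merge(defaults[key], value, prefix=f"{full_key}.")
        else:
            defaults[key] = value


def add_vovnet_defaults(cfg: dict) -> None:
    """Hypothetical stand-in for add_vovnet_config: declare MODEL.VOVNET."""
    cfg["MODEL"]["VOVNET"] = {"CONV_BODY": ""}  # placeholder default value


# What a YAML file with a MODEL.VOVNET section would load into.
loaded_yaml = {"MODEL": {"VOVNET": {"CONV_BODY": "V-39-eSE"}}}

cfg = {"MODEL": {"WEIGHTS": ""}}
try:
    strict_merge(cfg, loaded_yaml)  # defaults not registered yet
    merge_failed = False
except KeyError as err:
    merge_failed = True
    error_message = str(err)

add_vovnet_defaults(cfg)        # register the extra keys first
strict_merge(cfg, loaded_yaml)  # the same merge now succeeds
```

In a real training script the equivalent order would be: build the default config, add the VoVNet keys, and only then merge the YAML file.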

SE param is not used in OSA module

In the OSA stage, I noticed that the SE param is set to False in some cases, e.g. when block_per_stage != 1. I guess this means the following OSA module should not include an SE module.

vovnet-detectron2/vovnet/vovnet.py

Lines 260 to 265 in f96f534

    if block_per_stage != 1:
        SE = False
    module_name = f"OSA{stage_num}_1"
    self.add_module(
        module_name, _OSA_module(in_ch, stage_ch, concat_ch, layer_per_block, module_name, SE, depthwise=depthwise)
    )

But it seems that the SE param defined in OSA module is never used, so SE module will be applied in every OSA module.

vovnet-detectron2/vovnet/vovnet.py

Lines 186 to 189 in f96f534

    class _OSA_module(nn.Module):
        def __init__(
            self, in_ch, stage_ch, concat_ch, layer_per_block, module_name, SE=False, identity=False, depthwise=False
        ):

Is this a bug, or did I misunderstand it?
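For reference, the pattern the question implies (a constructor flag that actually gates the attention step) can be sketched without PyTorch. This is a toy stand-in, not the repository's _OSA_module: the class name, the `_squeeze_excite` placeholder, and the list-based "features" are all hypothetical.

```python
class OSAModuleSketch:
    """Toy stand-in for an OSA block: shows a flag gating an optional step."""

    def __init__(self, module_name: str, SE: bool = False):
        self.module_name = module_name
        self.use_se = SE  # store the flag so forward() can consult it

    def _squeeze_excite(self, features):
        # Placeholder for channel attention: scale each value by the mean.
        mean = sum(features) / len(features)
        return [value * mean for value in features]

    def forward(self, features):
        # The attention step runs only when the flag was set.
        if self.use_se:
            features = self._squeeze_excite(features)
        return features


with_se = OSAModuleSketch("OSA5_1", SE=True).forward([1.0, 3.0])
without_se = OSAModuleSketch("OSA5_1", SE=False).forward([1.0, 3.0])
```

If the flag is accepted but never stored or checked, both calls would return the same result, which is the behaviour the issue describes.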

Keypoint RCNN

Hi,

Thanks for this great extension. I'm currently looking to try different backbones for keypoint detection in detectron2. I have to say I'm somewhat lost on how to go about replacing the backbone of an existing Keypoint RCNN architecture. Do you have any VoVNet implemented in Keypoint RCNN? Also, I'm looking for a good tutorial/example for swapping the backbone architecture in Detectron2. I'd appreciate any help. Thank you!

Used pretrained model mask_rcnn_V_99_FPN_3x on custom dataset

I got this error while training with a custom dataset:

RuntimeError: The size of tensor a (81) must match the size of tensor b (6) at non-singleton dimension 0
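A size mismatch of 81 versus 6 typically appears when a checkpoint trained for 80 classes plus background is loaded into a head built for 5 classes plus background. That reading of the numbers is an assumption, since the issue gives no further detail. One common workaround is to skip checkpoint entries whose shapes do not match the new model and let those layers train from scratch. The sketch below shows the filtering idea on plain shape tuples. The helper name, the backbone key, and the shapes are hypothetical; with real tensors the comparison would use each tensor's shape.

```python
def drop_mismatched(checkpoint_shapes: dict, model_shapes: dict) -> dict:
    """Keep only checkpoint entries whose shape matches the model's."""
    kept = {}
    for name, shape in checkpoint_shapes.items():
        if model_shapes.get(name) == shape:
            kept[name] = shape
    return kept


# Hypothetical shapes: a pretrained head with 81 outputs vs. a new head with 6.
checkpoint_shapes = {
    "backbone.stem.weight": (64, 3, 3, 3),
    "roi_heads.box_predictor.cls_score.weight": (81, 1024),
    "roi_heads.box_predictor.cls_score.bias": (81,),
}
model_shapes = {
    "backbone.stem.weight": (64, 3, 3, 3),
    "roi_heads.box_predictor.cls_score.weight": (6, 1024),
    "roi_heads.box_predictor.cls_score.bias": (6,),
}

loadable = drop_mismatched(checkpoint_shapes, model_shapes)
```

Only the backbone entry survives the filter here. The class-dependent head entries are left out, so they would be initialized fresh for the custom dataset.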

Can I use this code to apply VoVNet to other meta-architectures?

Add setup.py for easy installation

Consider adding a setup.py for easier use of the vovnet backbones in other projects.
Then we could install the vovnet package as easily as

pip install git+https://github.com/youngwanLEE/vovnet-detectron2

and import it as import vovnet.

A simple setup.py will do the job (of course, that is not the best solution).
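As a rough sketch of what such a file might contain, the arguments below could be passed to setuptools. Every value is illustrative: the version is a placeholder, and the Python requirement is an assumption based on the interpreter versions visible in the logs on this page.

```python
# Arguments a minimal setup.py might pass to setuptools.setup().
SETUP_KWARGS = {
    "name": "vovnet",            # matches the `import vovnet` usage above
    "version": "0.0.0",          # hypothetical placeholder
    "packages": ["vovnet"],      # the package directory in this repository
    "python_requires": ">=3.6",  # assumption, not stated by the project
}

# In an actual setup.py the call would be:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
```

Dependencies such as detectron2 are deliberately left out of the sketch, since how they should be pinned is a decision for the maintainer.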

I found that there may be ambiguity in the code, or maybe I don't understand it?

Regarding the code framed in red in the picture: when block_per_stage is 3 or 4, the condition block_per_stage != 1 is true, so SE is set to False. As a result, the SE value passed to the modules below is False no matter how block_per_stage changes.
I have a feeling that this may not be what you originally intended. Judging from the "last block" comment, I think you may have wanted to add attention to the last block, or to a few blocks before the last block.

backbone weights pretrained on ImageNet-1k dataset

Hi Youngwan,
I really love your project, but the ImageNet pretrained weights are on Dropbox, which is inaccessible to me. Would you mind putting them on Google Drive? Thank you very much!

Does this include CenterMask?

Are the FCOS head and CenterMask included?

Loss NaN about using vovnet as backbone in RetinaNet

Hi! Thank you for your great work.
I wanted to improve the RetinaNet project in detectron2/projects by replacing "retinanet_resnet_fpn_backbone" with "retinanet_vovnet_fpn_backbone".
However, I always encountered "loss NaN" within the first 1000 iterations of training.
Training by "retinanet_resnet_fpn_backbone" is OK.

I want to make sure that I wasn't doing something wrong.

my config yaml:

_BASE_: "../Base-RetinaNet.yaml"
MODEL:
  WEIGHTS: "./pre_train/vovnet39_ese_detectron2.pth"
  RETINANET:
    NUM_CLASSES: 2
  BACKBONE:
    NAME: "build_retinanet_vovnet_fpn_backbone"
    FREEZE_AT: 0
  VOVNET:
    CONV_BODY : "V-39-eSE"
    OUT_FEATURES: ["stage3", "stage4", "stage5"]
  FPN:
    IN_FEATURES: ["stage3", "stage4", "stage5"]
SOLVER:
  STEPS: (210000, 250000)
  MAX_ITER: 270000
OUTPUT_DIR: "output/retina/V_39_ms_3x"

build_retinanet_vovnet_fpn_backbone

@BACKBONE_REGISTRY.register()
def build_retinanet_vovnet_fpn_backbone(cfg, input_shape: ShapeSpec):
    """
    Args:
        cfg: a detectron2 CfgNode

    Returns:
        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
    """

    bottom_up = build_vovnet_backbone(cfg, input_shape)
    in_features = cfg.MODEL.FPN.IN_FEATURES
    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
    in_channels_top = out_channels
    top_block = LastLevelP6P7(in_channels_top, out_channels, "p5")
    # in_channels_p6p7 = bottom_up.output_shape()["res5"].channels
    backbone = FPN(
        bottom_up=bottom_up,
        in_features=in_features,
        out_channels=out_channels,
        norm=cfg.MODEL.FPN.NORM,
        top_block=top_block,
        # top_block=LastLevelP6P7(in_channels_p6p7, out_channels),
        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
    )
    return backbone

On designing thinner VovNet

Thanks for your great work!
I'm working on an instance segmentation project in which target_classes=2. I tried to use VoVNet in the project to replace the ResNet that I had been using. Because the target classes are very different from COCO, the FPN output of my model has only 64 channels, and I think the VoVNet here should be a lot thinner than the COCO version.
I have designed a VoVNet like the model below, but the result is worse than ResNet. Do you have any suggestions on how to design a thinner VoVNet? Thanks

from collections import namedtuple

StageSpec = namedtuple(
    "StageSpec",
    [
        "index",  # Index of the stage, e.g. 1, 2, ..., 5
        "block_count",  # Number of residual blocks in the stage
        "layer_per_block",  # Number of OSA modules per block
        "return_features",  # True => return the last feature map from this stage
        "in_channels",
        "out_channels",
    ],
)
VoVNet67_eSE = tuple(
    StageSpec(index=i, block_count=b, layer_per_block=l, in_channels=in_c, out_channels=out_c, return_features=r)
    for (i, b, l, in_c, out_c, r) in (
        (1, 1, 5, 16, 48, True),
        (2, 3, 5, 32, 96, True),
        (3, 6, 5, 64, 192, True),
        (4, 3, 5, 128, 256, True),
    )
)

perform great in person detection

How can I use MobileNet as a backbone? I put it in detectron2/projects but it doesn't work

Is there a pretrained model?

AttributeError: Attempted to set WEIGHTS to /checkpoints/FRCN-V2-39-3x/model_final.pth, but CfgNode is immutable

During evaluation, I get the following error:

AttributeError: Attempted to set WEIGHTS to /checkpoints/FRCN-V2-39-3x/model_final.pth, but CfgNode is immutable
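This message is what a frozen config produces when a value is assigned after freezing. The yacs CfgNode used by detectron2 provides defrost() and freeze() for this situation, so the usual remedy is either to set the value before the config is frozen or to defrost it first. The class below is a toy stand-in that imitates those semantics in pure Python. It is not yacs, and the class name is hypothetical.

```python
class FrozenConfigSketch:
    """Toy config with freeze/defrost semantics (illustrative only)."""

    def __init__(self, **values):
        object.__setattr__(self, "_frozen", False)
        for key, value in values.items():
            setattr(self, key, value)

    def __setattr__(self, name, value):
        # Reject every assignment while the config is frozen.
        if self._frozen:
            raise AttributeError(
                f"Attempted to set {name} to {value}, but config is immutable"
            )
        object.__setattr__(self, name, value)

    def freeze(self):
        object.__setattr__(self, "_frozen", True)

    def defrost(self):
        object.__setattr__(self, "_frozen", False)


weights_path = "/checkpoints/FRCN-V2-39-3x/model_final.pth"
cfg = FrozenConfigSketch(WEIGHTS="")
cfg.freeze()
try:
    cfg.WEIGHTS = weights_path  # fails: the config is frozen
    blocked = False
except AttributeError:
    blocked = True

cfg.defrost()               # make the config mutable again
cfg.WEIGHTS = weights_path  # the assignment now succeeds
cfg.freeze()
```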

Error trying to export models to caffe2

When I try to run the standard caffe2 export script, I get an error:

(detectron_env_2) sal9000@sal9000-XPS-13-9370:~/Sources/detectron2/tools/deploy$ ./caffe2_converter_guitars.py --config-file /home/sal9000/Sources/detectron2/projects/vovnet-detectron2/checkpoints/MRCN-V2-19-FPNLite-3x/config.yaml  --output ./caffe2_model_guitars_lite --run-eval MODEL.WEIGHTS /home/sal9000/Sources/detectron2/projects/vovnet-detectron2/checkpoints/MRCN-V2-19-FPNLite-3x/model_final.pth  MODEL.DEVICE cpu
[05/17 15:15:55 detectron2]: Command line arguments: Namespace(config_file='/home/sal9000/Sources/detectron2/projects/vovnet-detectron2/checkpoints/MRCN-V2-19-FPNLite-3x/config.yaml', format='caffe2', opts=['MODEL.WEIGHTS', '/home/sal9000/Sources/detectron2/projects/vovnet-detectron2/checkpoints/MRCN-V2-19-FPNLite-3x/model_final.pth', 'MODEL.DEVICE', 'cpu'], output='./caffe2_model_guitars_lite', run_eval=True)
Traceback (most recent call last):
  File "./caffe2_converter_guitars.py", line 81, in <module>
    torch_model = build_model(cfg)
  File "/home/sal9000/Sources/detectron2/detectron2/modeling/meta_arch/build.py", line 21, in build_model
    model = META_ARCH_REGISTRY.get(meta_arch)(cfg)
  File "/home/sal9000/Sources/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 32, in __init__
    self.backbone = build_backbone(cfg)
  File "/home/sal9000/Sources/detectron2/detectron2/modeling/backbone/build.py", line 31, in build_backbone
    backbone = BACKBONE_REGISTRY.get(backbone_name)(cfg, input_shape)
  File "/home/sal9000/virtualenvs/detectron_env_2/lib/python3.6/site-packages/fvcore/common/registry.py", line 70, in get
    "No object named '{}' found in '{}' registry!".format(name, self._name)
KeyError: "No object named 'build_vovnet_fpn_backbone' found in 'BACKBONE' registry!"

I had already inserted a line to add_vovnet_config(cfg), which fixed an earlier error, but I'm not sure how to proceed with this missing backbone error.

P.S. which is the fastest backbone for CPU inference? Eventually I'd like to try putting this model on a mobile device.
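Regarding the "No object named 'build_vovnet_fpn_backbone'" error: a registry only knows functions whose defining module has been imported in the running process, because registration happens when the decorator executes. One possible cause (an assumption; the export script is not shown) is that the module defining the VoVNet backbone builders was never imported by the converter script. The sketch below reproduces the mechanism with a minimal registry. The `Registry` class here is a simplified stand-in, not the fvcore implementation.

```python
class Registry:
    """Minimal name-to-function registry (simplified stand-in)."""

    def __init__(self, name):
        self._name = name
        self._objects = {}

    def register(self):
        def decorator(func):
            # Registration happens when the decorated definition is executed.
            self._objects[func.__name__] = func
            return func
        return decorator

    def get(self, name):
        if name not in self._objects:
            raise KeyError(
                f"No object named '{name}' found in '{self._name}' registry!"
            )
        return self._objects[name]


BACKBONES = Registry("BACKBONE")

# Before the defining code has run, the lookup fails.
try:
    BACKBONES.get("build_vovnet_fpn_backbone")
    missing_before = False
except KeyError:
    missing_before = True


@BACKBONES.register()
def build_vovnet_fpn_backbone(cfg, input_shape):
    return ("vovnet-fpn", cfg, input_shape)  # placeholder return value


builder = BACKBONES.get("build_vovnet_fpn_backbone")  # now found
```

In a real script, the analogue of the decorated definition running is importing the package that contains it before the model is built.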

How to set the params, such as channels?

Hello, I want to change the output channel to 128. Could you give me some advice on how to modify the params? With the params I set, I cannot get a good result; it is worse than ResNet34.

Looking forward to your reply.

lower AP than Resnet backbone in my training

Hi! Thank you for your great work. I wanted to improve the DensePose project in detectron2/projects by replacing the resnet-fpn backbone with vovnet, but during training I always get lower results than with the original resnet backbone. The two results and the commands I used are below:
python train_net.py --config-file configs/densepose_rcnn_R_50_FPN_s1x_legacy.yaml SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.001 (the original one)
results:
[03/12 13:05:36 d2.evaluation.testing]: copypaste: Task: bbox
[03/12 13:05:36 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[03/12 13:05:36 d2.evaluation.testing]: copypaste: 53.1516,84.5066,57.3643,26.5924,51.5737,66.5134
[03/12 13:05:36 d2.evaluation.testing]: copypaste: Task: densepose_gps
[03/12 13:05:36 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APm,APl
[03/12 13:05:36 d2.evaluation.testing]: copypaste: 44.6044,83.1504,43.4994,38.4582,46.2426
[03/12 13:05:36 d2.evaluation.testing]: copypaste: Task: densepose_gpsm
[03/12 13:05:36 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APm,APl
[03/12 13:05:36 d2.evaluation.testing]: copypaste: 48.4785,86.5655,50.8903,40.1852,50.2448
[03/12 13:05:36 d2.utils.events]: eta: 0:00:00 iter: 129999 total_loss: 2.144 loss_cls: 0.104 loss_box_reg: 0.153 loss_densepose_U: 0.444 loss_densepose_V: 0.487 loss_densepose_I: 0.176 loss_densepose_S: 0.648 loss_rpn_cls: 0.011 loss_rpn_loc: 0.021 time: 0.4843 data_time: 0.0164 lr: 0.000010 max_mem: 2939M

python train_net.py --config-file configs/densepose_rcnn_R_50_FPN_s1x_legacy.yaml SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0001 (the backbone replaced one)
results:
[03/15 06:48:25 d2.evaluation.testing]: copypaste: Task: bbox
[03/15 06:48:25 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[03/15 06:48:25 d2.evaluation.testing]: copypaste: 50.5212,83.5444,52.5758,22.3658,48.5157,64.9114
[03/15 06:48:25 d2.evaluation.testing]: copypaste: Task: densepose_gps
[03/15 06:48:25 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APm,APl
[03/15 06:48:25 d2.evaluation.testing]: copypaste: 41.6630,82.0449,37.6683,31.7831,43.3619
[03/15 06:48:25 d2.evaluation.testing]: copypaste: Task: densepose_gpsm
[03/15 06:48:25 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APm,APl
[03/15 06:48:25 d2.evaluation.testing]: copypaste: 46.4402,86.1345,45.5157,35.2128,48.2686
[03/15 06:48:25 d2.utils.events]: eta: 0:00:00 iter: 269999 total_loss: 2.757 loss_cls: 0.075 loss_box_reg: 0.136 loss_densepose_U: 0.707 loss_densepose_V: 0.751 loss_densepose_I: 0.215 loss_densepose_S: 0.635 loss_rpn_cls: 0.008 loss_rpn_loc: 0.010 time: 0.2863 data_time: 0.0129 lr: 0.000001 max_mem: 4289M

the changes I've made in Base-DensePose-RCNN-FPN.yaml:

MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  BACKBONE:
    NAME: "build_vovnet_fpn_backbone"
    FREEZE_AT: 0
    #NAME: "build_resnet_fpn_backbone"
  VOVNET:
    OUT_FEATURES: ["stage2", "stage3", "stage4", "stage5"]
  #RESNETS:
    #OUT_FEATURES: ["res2", "res3", "res4", "res5"]

  FPN:
    IN_FEATURES: ["stage2", "stage3", "stage4", "stage5"]
    #IN_FEATURES: ["res2", "res3", "res4", "res5"]

and in densepose_rcnn_R_50_FPN_s1x_legacy.yaml:

_BASE_: "Base-DensePose-RCNN-FPN.yaml"
MODEL:
  #WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  #RESNETS:
    #DEPTH: 50
  WEIGHTS: "vovnet39_ese_detectron2.pth"#"https://www.dropbox.com/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1"
  VOVNET:
    CONV_BODY : "V-39-eSE"

  ROI_DENSEPOSE_HEAD:
    NUM_COARSE_SEGM_CHANNELS: 15
    POOLER_RESOLUTION: 14
    HEATMAP_SIZE: 56
    INDEX_WEIGHTS: 2.0
    PART_WEIGHTS: 0.3
    POINT_REGRESSION_WEIGHTS: 0.1
    DECODER_ON: False
SOLVER:
  BASE_LR: 0.002
  #MAX_ITER: 130000
  #STEPS: (100000, 120000)
  STEPS: (210000, 250000)
  MAX_ITER: 270000
OUTPUT_DIR: "checkpoints/MRCN-V2-39-3x"

Have I done something wrong, such as using unsuitable learning rates? I've been deeply impressed by how your backbone can improve a model's AP, so could you tell me how to train this new backbone to get better results? Thanks a lot in advance!

AP value too low

Hello, I'm doing cartoon character detection. The training set size and the number of training iterations are sufficient, but the average AP (0.5-0.95) is only a little over 30%. What could be the reason? If you could take the time to answer, I would be grateful.

About fasterrcnn-vovnet

vovnet.py 342:
'LastLevelMaxPool()' is not defined

Inconsistent evaluation results for same model and dataset

Command: python train_net.py --num-gpus 1 --config-file configs/faster_rcnn_V_39_FPN_3x.yaml --resume --eval-only MODEL.WEIGHTS checkpoints/FRCN-V2-39-3x_1/model_final.pth MODEL.ROI_HEADS.SCORE_THRESH_TEST 0.5

The content of config file:
_BASE_: "Base-RCNN-VoVNet-FPN.yaml"
MODEL:
  WEIGHTS: "https://www.dropbox.com/s/q98pypf96rhtd8y/vovnet39_ese_detectron2.pth?dl=1"
  MASK_ON: False
  VOVNET:
    CONV_BODY: "V-39-eSE"
  ROI_HEADS:
    NUM_CLASSES: 30
SOLVER:
  STEPS: (210000, 250000)
  MAX_ITER: 200000
  IMS_PER_BATCH: 8
  BASE_LR: 0.0001
  CHECKPOINT_PERIOD: 1000
DATASETS:
  TRAIN: ("train_dataset_leafi",)
  TEST: ("train_dataset_leafi","val_dataset_leafi")
OUTPUT_DIR: "checkpoints/FRCN-V2-39-3x_crops/"
DATALOADER:
  NUM_WORKERS: 4
TEST:
  EVAL_PERIOD: 1000

The evaluation results (for validation dataset) running the command:

Run 1:

[06/17 10:07:33 d2.evaluation.coco_evaluation]: 'val_dataset_leafi' is not registered by register_coco_instances. Therefore trying to convert it to COCO format ...
[06/17 10:07:33 d2.evaluation.evaluator]: Start inference on 66 images
[06/17 10:07:35 d2.evaluation.fast_eval_api]: Evaluate annotation type bbox
[06/17 10:07:35 d2.evaluation.fast_eval_api]: COCOeval_opt.evaluate() finished in 0.01 seconds.
[06/17 10:07:35 d2.evaluation.fast_eval_api]: Accumulating evaluation results...
[06/17 10:07:35 d2.evaluation.fast_eval_api]: COCOeval_opt.accumulate() finished in 0.04 seconds.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.086
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.209
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.052
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.102
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.070
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.137
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.093
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.093
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.093
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.104
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.085
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.144
[06/17 10:07:35 d2.evaluation.coco_evaluation]: Evaluation results for bbox:

AP	AP50	AP75	APs	APm	APl	AR1	AR10	AR100	ARs	ARm	ARl
8.572	20.874	5.198	10.182	6.964	13.746	9.268	9.326	9.326	10.449	8.500	14.444

Run 2:

[06/17 10:11:20 d2.evaluation.coco_evaluation]: 'val_dataset_leafi' is not registered by register_coco_instances. Therefore trying to convert it to COCO format ...
WARNING [06/17 10:11:20 d2.data.datasets.coco]: Using previously cached COCO format annotations at 'checkpoints/FRCN-V2-39-3x_crops/inference/val_dataset_leafi_coco_format.json'. You need to clear the cache file if your dataset has been modified.
[06/17 10:11:20 d2.evaluation.evaluator]: Start inference on 66 images
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.094
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.242
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.043
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.110
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.196
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.116
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.116
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.116
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.004
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.142
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.233
[06/17 10:11:22 d2.evaluation.coco_evaluation]: Evaluation results for bbox:

AP	AP50	AP75	APs	APm	APl	AR1	AR10	AR100	ARs	ARm	ARl
9.403	24.244	4.297	0.042	10.984	19.574	11.642	11.642	11.642	0.353	14.227	23.333

This seems to be an issue similar to facebookresearch/detectron2#739.
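The Run 2 log warns that cached COCO-format annotations are being reused and says the cache file must be cleared if the dataset has changed. Whether that explains the difference between the two runs is not established here, but ruling it out is cheap. The sketch below deletes such cache files from an output directory. It assumes the `inference/<dataset>_coco_format.json` layout seen in the log, and the helper name is hypothetical.

```python
import tempfile
from pathlib import Path


def clear_coco_cache(output_dir):
    """Delete cached '*_coco_format.json' files under <output_dir>/inference."""
    removed = []
    for cached in sorted(Path(output_dir, "inference").glob("*_coco_format.json")):
        cached.unlink()
        removed.append(cached.name)
    return removed


# Demonstration on a throwaway directory rather than a real output folder.
with tempfile.TemporaryDirectory() as tmp:
    inference_dir = Path(tmp, "inference")
    inference_dir.mkdir()
    (inference_dir / "val_dataset_leafi_coco_format.json").write_text("{}")
    (inference_dir / "predictions.txt").write_text("keep me")  # unrelated file
    removed_files = clear_coco_cache(tmp)
    remaining_files = sorted(p.name for p in inference_dir.iterdir())
```

Only files matching the cache naming pattern are removed. Other evaluation outputs in the same folder are left alone.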

Problems in Multi Task

VoVNet performs better than ShuffleNetV2 in object detection on our own dataset, but when we use it as the backbone for multi-task learning, its performance drops sharply compared with ShuffleNetV2.
Any suggestions? Thanks!

Training loss goes into nan values

I got NaN values when I used the default config in vovnet. I then tried reducing the base learning rate (bs_lr) to 0.001 and 0.00025. That solved the NaN issue, but the training loss is not decreasing much (it starts at 1.9 and reaches 0.7), and the AP is 11 after 75000 iterations.

Dataset: 57000 images with one class; the images have different resolutions.

inference time

Hi, I noticed that you used a V100 GPU machine to measure the inference time. Could you tell me what image size you used when measuring it?

Out of memory error - how to reduce batch size?

I'm trying to train a small net on my own dataset. AWS P2 machine with ~12GB of GPU memory.

Getting the error below. Do you know what I can do, perhaps reduce batch size or something? How do I do that?

[05/16 14:46:32 d2.data.build]: Using training sampler TrainingSampler
[05/16 14:46:32 fvcore.common.checkpoint]: Loading checkpoint from https://www.dropbox.com/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1
[05/16 14:46:32 fvcore.common.file_io]: URL https://www.dropbox.com/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1 cached in /home/ubuntu/.torch/fvcore_cache/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1
[05/16 14:46:33 fvcore.common.checkpoint]: Some model parameters or buffers are not in the checkpoint:
  backbone.fpn_output5.{bias, weight}
  roi_heads.box_head.fc1.{bias, weight}
  roi_heads.box_predictor.bbox_pred.{weight, bias}
  roi_heads.mask_head.mask_fcn3.{weight, bias}
  roi_heads.mask_head.predictor.{bias, weight}
  backbone.fpn_output4.{bias, weight}
  backbone.fpn_output3.{weight, bias}
  proposal_generator.anchor_generator.cell_anchors.{0, 2, 3, 4, 1}
  proposal_generator.rpn_head.conv.{weight, bias}
  roi_heads.box_predictor.cls_score.{bias, weight}
  proposal_generator.rpn_head.objectness_logits.{bias, weight}
  roi_heads.mask_head.deconv.{bias, weight}
  roi_heads.box_head.fc2.{bias, weight}
  proposal_generator.rpn_head.anchor_deltas.{weight, bias}
  roi_heads.mask_head.mask_fcn1.{weight, bias}
  roi_heads.mask_head.mask_fcn2.{weight, bias}
  backbone.fpn_output2.{bias, weight}
  roi_heads.mask_head.mask_fcn4.{bias, weight}
  backbone.fpn_lateral2.{bias, weight}
  backbone.fpn_lateral4.{weight, bias}
  backbone.fpn_lateral5.{weight, bias}
  backbone.fpn_lateral3.{weight, bias}
[05/16 14:46:33 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
  backbone.bottom_up.stem.stem_1/norm.num_batches_tracked
  backbone.bottom_up.stem.stem_2/norm.num_batches_tracked
  backbone.bottom_up.stem.stem_3/norm.num_batches_tracked
  backbone.bottom_up.stage2.OSA2_1.layers.0.OSA2_1_0/norm.num_batches_tracked
  backbone.bottom_up.stage2.OSA2_1.layers.1.OSA2_1_1/norm.num_batches_tracked
  backbone.bottom_up.stage2.OSA2_1.layers.2.OSA2_1_2/norm.num_batches_tracked
  backbone.bottom_up.stage2.OSA2_1.concat.OSA2_1_concat/norm.num_batches_tracked
  backbone.bottom_up.stage3.OSA3_1.layers.0.OSA3_1_0/norm.num_batches_tracked
  backbone.bottom_up.stage3.OSA3_1.layers.1.OSA3_1_1/norm.num_batches_tracked
  backbone.bottom_up.stage3.OSA3_1.layers.2.OSA3_1_2/norm.num_batches_tracked
  backbone.bottom_up.stage3.OSA3_1.concat.OSA3_1_concat/norm.num_batches_tracked
  backbone.bottom_up.stage4.OSA4_1.layers.0.OSA4_1_0/norm.num_batches_tracked
  backbone.bottom_up.stage4.OSA4_1.layers.1.OSA4_1_1/norm.num_batches_tracked
  backbone.bottom_up.stage4.OSA4_1.layers.2.OSA4_1_2/norm.num_batches_tracked
  backbone.bottom_up.stage4.OSA4_1.concat.OSA4_1_concat/norm.num_batches_tracked
  backbone.bottom_up.stage5.OSA5_1.layers.0.OSA5_1_0/norm.num_batches_tracked
  backbone.bottom_up.stage5.OSA5_1.layers.1.OSA5_1_1/norm.num_batches_tracked
  backbone.bottom_up.stage5.OSA5_1.layers.2.OSA5_1_2/norm.num_batches_tracked
  backbone.bottom_up.stage5.OSA5_1.concat.OSA5_1_concat/norm.num_batches_tracked
[05/16 14:46:33 d2.engine.train_loop]: Starting training from iteration 0
ERROR [05/16 14:46:38 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 215, in run_step
    loss_dict = self.model(data)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 121, in forward
    features = self.backbone(images.tensor)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
    bottom_up_features = self.bottom_up(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 367, in forward
    x = getattr(self, name)(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 234, in forward
    xt = self.concat(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/layers/batch_norm.py", line 55, in forward
    return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 1.03 GiB (GPU 0; 11.17 GiB total capacity; 8.48 GiB already allocated; 845.31 MiB free; 10.03 GiB reserved in total by PyTorch)
[05/16 14:46:38 d2.engine.hooks]: Total training time: 0:00:05 (0:00:00 on hooks)
Traceback (most recent call last):
  File "train_net_docs.py", line 115, in <module>
    dist_url=args.dist_url,
  File "/home/ubuntu/detectron2/detectron2/engine/launch.py", line 57, in launch
    main_func(*args)
  File "train_net_docs.py", line 93, in main
    trainer.resume_or_load(resume=args.resume)
  File "/home/ubuntu/detectron2/detectron2/engine/defaults.py", line 401, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 215, in run_step
    loss_dict = self.model(data)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 121, in forward
    features = self.backbone(images.tensor)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
    bottom_up_features = self.bottom_up(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 367, in forward
    x = getattr(self, name)(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 234, in forward
    xt = self.concat(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/layers/batch_norm.py", line 55, in forward
    return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 1.03 GiB (GPU 0; 11.17 GiB total capacity; 8.48 GiB already allocated; 845.31 MiB free; 10.03 GiB reserved in total by PyTorch)
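The batch size in detectron2 configs is controlled by SOLVER.IMS_PER_BATCH, which can be overridden on the command line the same way other issues on this page override SOLVER.BASE_LR. When shrinking the batch, a common heuristic is to scale the learning rate linearly with it. The sketch below builds such overrides. The reference batch of 16 and learning rate of 0.02 are assumptions for illustration, so check the base config actually in use. The helper name is hypothetical.

```python
def batch_overrides(ref_batch: int, ref_lr: float, new_batch: int) -> list:
    """Return command-line overrides for a smaller batch with a scaled LR."""
    if ref_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    # Linear scaling heuristic: the learning rate shrinks with the batch size.
    scaled_lr = ref_lr * new_batch / ref_batch
    return [
        "SOLVER.IMS_PER_BATCH", str(new_batch),
        "SOLVER.BASE_LR", f"{scaled_lr:g}",
    ]


# Assumed reference schedule: 16 images per batch at a learning rate of 0.02.
overrides = batch_overrides(ref_batch=16, ref_lr=0.02, new_batch=2)
command_suffix = " ".join(overrides)
```

The resulting key/value pairs can be appended to the training command after the --config-file argument. Reducing the input image size is another option if a smaller batch alone is not enough.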
