Comments (7)
I use `torch.nn.utils.clip_grad_norm_` to clip the gradients, which solves the NaN problem.
from gcnet.
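For reference, a minimal runnable sketch of this kind of clipping (the toy model is illustrative; the thread uses `max_norm=35`, `norm_type=2`):

```python
import torch
from torch.nn.utils import clip_grad_norm_

model = torch.nn.Linear(4, 2)  # toy model, just for illustration
loss = model(torch.randn(8, 4)).sum()
loss.backward()

# Rescales all gradients in place so their total L2 norm is at most
# max_norm; returns the total norm measured before clipping.
total_norm = clip_grad_norm_(model.parameters(), max_norm=35, norm_type=2)
```

Note that clipping caps the gradient magnitude but does not remove the underlying cause of a NaN.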
The setting looks good to me. I suggest you first run Mask R-CNN without GC, with 16 images on 8 GPUs. I don't think GC would cause gradient explosion.
Yes, I ran Mask R-CNN both with and without GC, but using Sync BN. If I don't use clip_grads, both runs hit NaNs. For now I work around it by clipping gradients.
Thanks.
Hi @xvjiarui,
Sorry to bother you again.
When I finish training and test the final model, the performance is close to zero. How should Sync BN be handled at test time?
Sync BN is fully fixed during testing, just like plain BN. What does your loss look like? I suspect the problem may be due to `clip_grad_norm_`.
I also suggest you check your code. I don't think the maskrcnn-benchmark baseline would encounter gradient explosion even with Sync BN.
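To illustrate the "fixed during test" point above: in eval mode a BN layer uses its running statistics and stops updating them, and `SyncBatchNorm` behaves the same way. A minimal sketch with plain `BatchNorm2d` (the shapes are illustrative):

```python
import torch

bn = torch.nn.BatchNorm2d(8)
bn.eval()  # eval mode: normalize with running_mean/running_var, no updates
before = bn.running_mean.clone()
with torch.no_grad():
    out = bn(torch.randn(2, 8, 4, 4))
# The running statistics stay frozen in eval mode.
stats_frozen = torch.equal(bn.running_mean, before)
```

So at test time it is enough to call `model.eval()`; no special handling of Sync BN is needed beyond loading the trained running statistics correctly.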
Hi,
I use Sync BN from apex, like this:
```python
class BottleneckWithAPSyncBN(Bottleneck):
    def __init__(
        self,
        in_channels,
        bottleneck_channels,
        out_channels,
        num_groups=1,
        stride_in_1x1=True,
        stride=1,
        dilation=1,
        configs={},
    ):
        super(BottleneckWithAPSyncBN, self).__init__(
            in_channels=in_channels,
            bottleneck_channels=bottleneck_channels,
            out_channels=out_channels,
            num_groups=num_groups,
            stride_in_1x1=stride_in_1x1,
            stride=stride,
            dilation=dilation,
            norm_func=ap.SyncBatchNorm,
            configs=configs,
        )
```
and apply clip_grad after `losses.backward()` as below, with max_norm set to 35 and norm_type set to 2:
```python
from torch.nn.utils import clip_grad_norm_ as clip_grad_norm

clip_grad_norm(model.parameters(), max_norm, norm_type=norm_type)
```
After training with these configs, the mAP on COCO val2017 is 0.0. The log is below:
2019-04-29 16:15:49,151 maskrcnn_benchmark.trainer INFO: eta: 1 day, 3:45:33 iter: 20 loss: 3.1540 (5.0863) loss_box_reg: 0.0448 (0.0604) loss_classifier: 0.9114 (1.8578) loss_mask: 0.8628 (2.3964) loss_objectness: 0.3248 (0.4229) loss_rpn_box_reg: 0.2593 (0.3488) ftime: 0.2230 (0.5223) backbone_ftime: 0.0715 (0.3736) roi_heads_ftime: 0.0337 (0.0381) rpn_ftime: 0.1085 (0.1106) time: 0.6449 (1.1106) data: 0.0086 (0.1168) lr: 0.007173 max mem: 4027
2019-04-29 16:16:02,300 maskrcnn_benchmark.trainer INFO: eta: 22:05:28 iter: 40 loss: 1.7453 (3.4724) loss_box_reg: 0.1120 (0.0850) loss_classifier: 0.5803 (1.2815) loss_mask: 0.6930 (1.5453) loss_objectness: 0.2294 (0.3306) loss_rpn_box_reg: 0.1077 (0.2299) ftime: 0.2254 (0.3735) backbone_ftime: 0.0676 (0.2215) roi_heads_ftime: 0.0475 (0.0427) rpn_ftime: 0.1078 (0.1093) time: 0.6584 (0.8840) data: 0.0102 (0.0640) lr: 0.007707 max mem: 4027
2019-04-29 16:16:15,600 maskrcnn_benchmark.trainer INFO: eta: 20:15:43 iter: 60 loss: 1.8415 (2.9447) loss_box_reg: 0.1149 (0.0985) loss_classifier: 0.5927 (1.0666) loss_mask: 0.6890 (1.2598) loss_objectness: 0.2689 (0.3151) loss_rpn_box_reg: 0.1651 (0.2048) ftime: 0.2295 (0.3256) backbone_ftime: 0.0709 (0.1714) roi_heads_ftime: 0.0507 (0.0455) rpn_ftime: 0.1077 (0.1087) time: 0.6604 (0.8110) data: 0.0131 (0.0475) lr: 0.008240 max mem: 4027
2019-04-29 16:16:28,584 maskrcnn_benchmark.trainer INFO: eta: 19:14:49 iter: 80 loss: 1.6740 (2.6838) loss_box_reg: 0.1086 (0.1015) loss_classifier: 0.4860 (0.9273) loss_mask: 0.6849 (1.1166) loss_objectness: 0.2614 (0.3094) loss_rpn_box_reg: 0.1261 (0.2290) ftime: 0.2221 (0.3000) backbone_ftime: 0.0684 (0.1458) roi_heads_ftime: 0.0497 (0.0461) rpn_ftime: 0.1057 (0.1081) time: 0.6513 (0.7706) data: 0.0099 (0.0383) lr: 0.008773 max mem: 4027
2019-04-29 16:16:41,528 maskrcnn_benchmark.trainer INFO: eta: 18:37:35 iter: 100 loss: 1.8355 (2.5213) loss_box_reg: 0.1045 (0.1035) loss_classifier: 0.5610 (0.8732) loss_mask: 0.6875 (1.0307) loss_objectness: 0.2637 (0.3028) loss_rpn_box_reg: 0.1414 (0.2110) ftime: 0.2239 (0.2848) backbone_ftime: 0.0703 (0.1307) roi_heads_ftime: 0.0474 (0.0465) rpn_ftime: 0.1052 (0.1076) time: 0.6486 (0.7459) data: 0.0103 (0.0331) lr: 0.009307 max mem: 4027
2019-04-29 16:16:54,769 maskrcnn_benchmark.trainer INFO: eta: 18:16:24 iter: 120 loss: 2.8539 (2.8987) loss_box_reg: 0.1312 (0.1103) loss_classifier: 1.3366 (1.2354) loss_mask: 0.6863 (0.9734) loss_objectness: 0.3254 (0.3663) loss_rpn_box_reg: 0.1463 (0.2133) ftime: 0.2244 (0.2749) backbone_ftime: 0.0674 (0.1202) roi_heads_ftime: 0.0494 (0.0469) rpn_ftime: 0.1083 (0.1077) time: 0.6607 (0.7319) data: 0.0112 (0.0296) lr: 0.009840 max mem: 4092
2019-04-29 16:17:07,720 maskrcnn_benchmark.trainer INFO: eta: 17:58:07 iter: 140 loss: 2.9404 (3.3908) loss_box_reg: 0.1603 (0.1218) loss_classifier: 1.6348 (1.7406) loss_mask: 0.6878 (0.9326) loss_objectness: 0.5641 (0.3878) loss_rpn_box_reg: 0.1102 (0.2080) ftime: 0.2173 (0.2669) backbone_ftime: 0.0646 (0.1123) roi_heads_ftime: 0.0448 (0.0468) rpn_ftime: 0.1075 (0.1078) time: 0.6432 (0.7199) data: 0.0113 (0.0271) lr: 0.010373 max mem: 4092
......
2019-04-30 07:18:47,434 maskrcnn_benchmark.trainer INFO: eta: 0:01:24 iter: 89860 loss: 0.9836 (1.3513) loss_box_reg: 0.0075 (0.0120) loss_classifier: 0.2176 (0.4399) loss_mask: 0.3144 (0.3686) loss_objectness: 0.3680 (0.4110) loss_rpn_box_reg: 0.1062 (0.1197) ftime: 0.1991 (0.2021) backbone_ftime: 0.0668 (0.0694) roi_heads_ftime: 0.0225 (0.0238) rpn_ftime: 0.1078 (0.1089) time: 0.5898 (0.6032) data: 0.0129 (0.0131) lr: 0.000200 max mem: 4250
2019-04-30 07:18:59,354 maskrcnn_benchmark.trainer INFO: eta: 0:01:12 iter: 89880 loss: 1.1297 (1.3512) loss_box_reg: 0.0080 (0.0120) loss_classifier: 0.2681 (0.4399) loss_mask: 0.3165 (0.3686) loss_objectness: 0.4145 (0.4110) loss_rpn_box_reg: 0.1112 (0.1197) ftime: 0.1974 (0.2021) backbone_ftime: 0.0662 (0.0694) roi_heads_ftime: 0.0242 (0.0238) rpn_ftime: 0.1069 (0.1089) time: 0.5999 (0.6032) data: 0.0144 (0.0131) lr: 0.000200 max mem: 4250
2019-04-30 07:19:11,233 maskrcnn_benchmark.trainer INFO: eta: 0:01:00 iter: 89900 loss: 1.0263 (1.3512) loss_box_reg: 0.0039 (0.0120) loss_classifier: 0.2237 (0.4398) loss_mask: 0.3114 (0.3686) loss_objectness: 0.3918 (0.4110) loss_rpn_box_reg: 0.1096 (0.1197) ftime: 0.1980 (0.2021) backbone_ftime: 0.0665 (0.0694) roi_heads_ftime: 0.0231 (0.0238) rpn_ftime: 0.1074 (0.1089) time: 0.5881 (0.6032) data: 0.0113 (0.0131) lr: 0.000200 max mem: 4250
2019-04-30 07:19:22,989 maskrcnn_benchmark.trainer INFO: eta: 0:00:48 iter: 89920 loss: 1.0952 (1.3511) loss_box_reg: 0.0034 (0.0120) loss_classifier: 0.2197 (0.4398) loss_mask: 0.3247 (0.3686) loss_objectness: 0.4153 (0.4110) loss_rpn_box_reg: 0.1111 (0.1197) ftime: 0.1954 (0.2021) backbone_ftime: 0.0658 (0.0694) roi_heads_ftime: 0.0231 (0.0238) rpn_ftime: 0.1062 (0.1089) time: 0.5863 (0.6032) data: 0.0108 (0.0131) lr: 0.000200 max mem: 4250
2019-04-30 07:19:34,826 maskrcnn_benchmark.trainer INFO: eta: 0:00:36 iter: 89940 loss: 1.0778 (1.3511) loss_box_reg: 0.0088 (0.0120) loss_classifier: 0.2394 (0.4397) loss_mask: 0.3099 (0.3686) loss_objectness: 0.4080 (0.4110) loss_rpn_box_reg: 0.1153 (0.1197) ftime: 0.1986 (0.2021) backbone_ftime: 0.0659 (0.0694) roi_heads_ftime: 0.0233 (0.0238) rpn_ftime: 0.1074 (0.1089) time: 0.5890 (0.6032) data: 0.0108 (0.0131) lr: 0.000200 max mem: 4250
2019-04-30 07:19:46,781 maskrcnn_benchmark.trainer INFO: eta: 0:00:24 iter: 89960 loss: 0.9723 (1.3510) loss_box_reg: 0.0067 (0.0120) loss_classifier: 0.2280 (0.4397) loss_mask: 0.3218 (0.3686) loss_objectness: 0.3552 (0.4110) loss_rpn_box_reg: 0.0949 (0.1197) ftime: 0.1975 (0.2021) backbone_ftime: 0.0671 (0.0694) roi_heads_ftime: 0.0231 (0.0238) rpn_ftime: 0.1086 (0.1089) time: 0.5903 (0.6032) data: 0.0137 (0.0131) lr: 0.000200 max mem: 4250
2019-04-30 07:19:58,708 maskrcnn_benchmark.trainer INFO: eta: 0:00:12 iter: 89980 loss: 1.1414 (1.3509) loss_box_reg: 0.0117 (0.0120) loss_classifier: 0.2371 (0.4396) loss_mask: 0.3259 (0.3686) loss_objectness: 0.4193 (0.4110) loss_rpn_box_reg: 0.1314 (0.1197) ftime: 0.1980 (0.2021) backbone_ftime: 0.0681 (0.0694) roi_heads_ftime: 0.0236 (0.0238) rpn_ftime: 0.1060 (0.1089) time: 0.5894 (0.6032) data: 0.0100 (0.0131) lr: 0.000200 max mem: 4250
If I don't use clip_grad, training runs into the NaN problem.
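One way to localize such a NaN without masking it by clipping is to check the loss and the gradients for non-finite values before each optimizer step. A hedged sketch (the toy model and variable names are illustrative):

```python
import torch

model = torch.nn.Linear(4, 2)  # stands in for the detector
loss = model(torch.randn(8, 4)).sum()
loss.backward()

# Flag non-finite losses/gradients before optimizer.step(); logging the
# first offending iteration and parameter usually points at the bad layer.
loss_ok = torch.isfinite(loss).item()
grads_ok = all(
    torch.isfinite(p.grad).all().item()
    for p in model.parameters()
    if p.grad is not None
)
```

If the per-loss values in the log are finite but gradients are not, the explosion is happening in backward, which is consistent with clipping hiding it.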
Hi, my experiment could not run with PyTorch SyncBN or plain BatchNorm either. Did you remove broadcast_buffers in the DDP? I want to check where the problem is. Thank you!
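I'm not sure this is the cause, but for context: `broadcast_buffers` is an argument of `torch.nn.parallel.DistributedDataParallel`. With `broadcast_buffers=True` (the default), DDP overwrites every rank's buffers (including BN running stats) with rank 0's copies on each forward pass; with SyncBN the layer keeps buffers consistent itself, so `False` is the usual setting. A single-process sketch just to make the flag runnable (the gloo group and toy model are only for illustration):

```python
import tempfile

import torch
import torch.distributed as dist

# Single-process "gloo" group so the sketch runs without a launcher.
init_file = tempfile.NamedTemporaryFile(delete=False).name
dist.init_process_group(
    "gloo", init_method=f"file://{init_file}", rank=0, world_size=1
)

model = torch.nn.Linear(4, 2)  # stands in for the detector
# broadcast_buffers=False: do not re-broadcast BN running stats from
# rank 0 every forward; avoids conflicting with SyncBN's own syncing.
ddp = torch.nn.parallel.DistributedDataParallel(
    model, broadcast_buffers=False
)
out = ddp(torch.randn(3, 4))
dist.destroy_process_group()
```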