
Comments (7)

zimenglan-sysu-512 commented on May 28, 2024

I use torch.nn.utils.clip_grad_norm_ to clip the gradients, which resolves the NaN problem.


xvjiarui commented on May 28, 2024

The settings look fine to me. I suggest you first run Mask R-CNN without GC, using 16 images on 8 GPUs.
I don't think GC would cause gradient explosion.


zimenglan-sysu-512 commented on May 28, 2024

Yes, I run Mask R-CNN both with and without GC, but using Sync BN. If I don't clip gradients, both runs encounter NaN. For now I work around it by clipping gradients.
Thanks.


zimenglan-sysu-512 commented on May 28, 2024

Hi @xvjiarui,
sorry to bother you again.
When I finish training and start testing the final model, the performance is close to zero. How should Sync BN be handled at test time?


xvjiarui commented on May 28, 2024

> Hi @xvjiarui,
> sorry to bother you again.
> When I finish training and start testing the final model, the performance is close to zero. How should Sync BN be handled at test time?

Sync BN is completely fixed during testing, just like regular BN. What does your loss look like? I suspect the issue may be due to clip_grad_norm.
I also suggest you double-check your code; I don't think the maskrcnn-benchmark baseline would run into gradient explosion even with Sync BN.
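
For illustration, a minimal sketch (using a generic nn.BatchNorm2d, not the repo's modules) of what "fixed during testing" means: once the model is put into eval mode, BN/SyncBN layers normalize with their running statistics instead of batch statistics, so no extra handling is needed at test time:

import torch
import torch.nn as nn

# Minimal sketch: in eval mode, BatchNorm (and SyncBatchNorm) layers use the
# running statistics accumulated during training instead of batch statistics.
bn = nn.BatchNorm2d(16)
bn.train()
_ = bn(torch.randn(4, 16, 8, 8))        # training forward updates running_mean/running_var
bn.eval()                               # test mode: statistics are now fixed
with torch.no_grad():
    out = bn(torch.randn(4, 16, 8, 8))  # normalized with the fixed running statistics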


zimenglan-sysu-512 commented on May 28, 2024

Hi,
I use Sync BN from apex, like this:

# ap is assumed to be apex.parallel, which provides SyncBatchNorm
from apex import parallel as ap
# Bottleneck is assumed to come from the repo's ResNet backbone module
from maskrcnn_benchmark.modeling.backbone.resnet import Bottleneck

class BottleneckWithAPSyncBN(Bottleneck):
    """Bottleneck block that uses apex.parallel.SyncBatchNorm as its norm layer."""
    def __init__(
        self,
        in_channels,
        bottleneck_channels,
        out_channels,
        num_groups=1,
        stride_in_1x1=True,
        stride=1,
        dilation=1,
        configs={},
    ):
        super(BottleneckWithAPSyncBN, self).__init__(
            in_channels=in_channels,
            bottleneck_channels=bottleneck_channels,
            out_channels=out_channels,
            num_groups=num_groups,
            stride_in_1x1=stride_in_1x1,
            stride=stride,
            dilation=dilation,
            norm_func=ap.SyncBatchNorm,  # swap plain BN for apex SyncBatchNorm
            configs=configs
        )
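
(Side note, purely an assumption about the setup above: apex also ships a helper, apex.parallel.convert_syncbn_model, that swaps every nn.BatchNorm* layer in an existing model for SyncBatchNorm, which avoids subclassing the bottleneck at all. A minimal sketch with a toy model:)

import torch.nn as nn
from apex.parallel import convert_syncbn_model

# Hypothetical alternative to subclassing: replace every nn.BatchNorm* layer
# in an existing model with apex.parallel.SyncBatchNorm in one call.
toy = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
toy = convert_syncbn_model(toy)  # BatchNorm2d layers are now SyncBatchNorm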

and I apply clip_grad after losses.backward() as below, with max_norm set to 35 and norm_type set to 2:

from torch.nn.utils import clip_grad_norm_ as clip_grad_norm
clip_grad_norm(model.parameters(), max_norm, norm_type=norm_type)
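
For context, a minimal sketch of where this call would sit in a generic training step (the model, optimizer, and loss names are placeholders, not the repo's actual trainer):

import torch
from torch.nn.utils import clip_grad_norm_

# Generic placement of gradient clipping: after backward(), before step().
# `model`, `optimizer`, `images`, `targets` are placeholders for illustration.
def train_step(model, optimizer, images, targets, max_norm=35.0, norm_type=2):
    loss_dict = model(images, targets)                 # forward pass
    losses = sum(loss for loss in loss_dict.values())
    optimizer.zero_grad()
    losses.backward()                                  # compute gradients
    # clip_grad_norm_ returns the total gradient norm *before* clipping,
    # which is useful for spotting exploding gradients in the log.
    total_norm = clip_grad_norm_(model.parameters(), max_norm, norm_type=norm_type)
    optimizer.step()
    return losses, total_norm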

After training with these configs, the mAP on COCO val2017 is 0.0. The log is below:

2019-04-29 16:15:49,151 maskrcnn_benchmark.trainer INFO: eta: 1 day, 3:45:33  iter: 20  loss: 3.1540 (5.0863)  loss_box_reg: 0.0448 (0.0604)  loss_classifier: 0.9114 (1.8578)  loss_mask: 0.8628 (2.3964)  loss_objectness: 0.3248 (0.4229)  loss_rpn_box_reg: 0.2593 (0.3488)  ftime: 0.2230 (0.5223)  backbone_ftime: 0.0715 (0.3736)  roi_heads_ftime: 0.0337 (0.0381)  rpn_ftime: 0.1085 (0.1106)  time: 0.6449 (1.1106)  data: 0.0086 (0.1168)  lr: 0.007173  max mem: 4027
2019-04-29 16:16:02,300 maskrcnn_benchmark.trainer INFO: eta: 22:05:28  iter: 40  loss: 1.7453 (3.4724)  loss_box_reg: 0.1120 (0.0850)  loss_classifier: 0.5803 (1.2815)  loss_mask: 0.6930 (1.5453)  loss_objectness: 0.2294 (0.3306)  loss_rpn_box_reg: 0.1077 (0.2299)  ftime: 0.2254 (0.3735)  backbone_ftime: 0.0676 (0.2215)  roi_heads_ftime: 0.0475 (0.0427)  rpn_ftime: 0.1078 (0.1093)  time: 0.6584 (0.8840)  data: 0.0102 (0.0640)  lr: 0.007707  max mem: 4027
2019-04-29 16:16:15,600 maskrcnn_benchmark.trainer INFO: eta: 20:15:43  iter: 60  loss: 1.8415 (2.9447)  loss_box_reg: 0.1149 (0.0985)  loss_classifier: 0.5927 (1.0666)  loss_mask: 0.6890 (1.2598)  loss_objectness: 0.2689 (0.3151)  loss_rpn_box_reg: 0.1651 (0.2048)  ftime: 0.2295 (0.3256)  backbone_ftime: 0.0709 (0.1714)  roi_heads_ftime: 0.0507 (0.0455)  rpn_ftime: 0.1077 (0.1087)  time: 0.6604 (0.8110)  data: 0.0131 (0.0475)  lr: 0.008240  max mem: 4027
2019-04-29 16:16:28,584 maskrcnn_benchmark.trainer INFO: eta: 19:14:49  iter: 80  loss: 1.6740 (2.6838)  loss_box_reg: 0.1086 (0.1015)  loss_classifier: 0.4860 (0.9273)  loss_mask: 0.6849 (1.1166)  loss_objectness: 0.2614 (0.3094)  loss_rpn_box_reg: 0.1261 (0.2290)  ftime: 0.2221 (0.3000)  backbone_ftime: 0.0684 (0.1458)  roi_heads_ftime: 0.0497 (0.0461)  rpn_ftime: 0.1057 (0.1081)  time: 0.6513 (0.7706)  data: 0.0099 (0.0383)  lr: 0.008773  max mem: 4027
2019-04-29 16:16:41,528 maskrcnn_benchmark.trainer INFO: eta: 18:37:35  iter: 100  loss: 1.8355 (2.5213)  loss_box_reg: 0.1045 (0.1035)  loss_classifier: 0.5610 (0.8732)  loss_mask: 0.6875 (1.0307)  loss_objectness: 0.2637 (0.3028)  loss_rpn_box_reg: 0.1414 (0.2110)  ftime: 0.2239 (0.2848)  backbone_ftime: 0.0703 (0.1307)  roi_heads_ftime: 0.0474 (0.0465)  rpn_ftime: 0.1052 (0.1076)  time: 0.6486 (0.7459)  data: 0.0103 (0.0331)  lr: 0.009307  max mem: 4027
2019-04-29 16:16:54,769 maskrcnn_benchmark.trainer INFO: eta: 18:16:24  iter: 120  loss: 2.8539 (2.8987)  loss_box_reg: 0.1312 (0.1103)  loss_classifier: 1.3366 (1.2354)  loss_mask: 0.6863 (0.9734)  loss_objectness: 0.3254 (0.3663)  loss_rpn_box_reg: 0.1463 (0.2133)  ftime: 0.2244 (0.2749)  backbone_ftime: 0.0674 (0.1202)  roi_heads_ftime: 0.0494 (0.0469)  rpn_ftime: 0.1083 (0.1077)  time: 0.6607 (0.7319)  data: 0.0112 (0.0296)  lr: 0.009840  max mem: 4092
2019-04-29 16:17:07,720 maskrcnn_benchmark.trainer INFO: eta: 17:58:07  iter: 140  loss: 2.9404 (3.3908)  loss_box_reg: 0.1603 (0.1218)  loss_classifier: 1.6348 (1.7406)  loss_mask: 0.6878 (0.9326)  loss_objectness: 0.5641 (0.3878)  loss_rpn_box_reg: 0.1102 (0.2080)  ftime: 0.2173 (0.2669)  backbone_ftime: 0.0646 (0.1123)  roi_heads_ftime: 0.0448 (0.0468)  rpn_ftime: 0.1075 (0.1078)  time: 0.6432 (0.7199)  data: 0.0113 (0.0271)  lr: 0.010373  max mem: 4092

......

2019-04-30 07:18:47,434 maskrcnn_benchmark.trainer INFO: eta: 0:01:24  iter: 89860  loss: 0.9836 (1.3513)  loss_box_reg: 0.0075 (0.0120)  loss_classifier: 0.2176 (0.4399)  loss_mask: 0.3144 (0.3686)  loss_objectness: 0.3680 (0.4110)  loss_rpn_box_reg: 0.1062 (0.1197)  ftime: 0.1991 (0.2021)  backbone_ftime: 0.0668 (0.0694)  roi_heads_ftime: 0.0225 (0.0238)  rpn_ftime: 0.1078 (0.1089)  time: 0.5898 (0.6032)  data: 0.0129 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:18:59,354 maskrcnn_benchmark.trainer INFO: eta: 0:01:12  iter: 89880  loss: 1.1297 (1.3512)  loss_box_reg: 0.0080 (0.0120)  loss_classifier: 0.2681 (0.4399)  loss_mask: 0.3165 (0.3686)  loss_objectness: 0.4145 (0.4110)  loss_rpn_box_reg: 0.1112 (0.1197)  ftime: 0.1974 (0.2021)  backbone_ftime: 0.0662 (0.0694)  roi_heads_ftime: 0.0242 (0.0238)  rpn_ftime: 0.1069 (0.1089)  time: 0.5999 (0.6032)  data: 0.0144 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:19:11,233 maskrcnn_benchmark.trainer INFO: eta: 0:01:00  iter: 89900  loss: 1.0263 (1.3512)  loss_box_reg: 0.0039 (0.0120)  loss_classifier: 0.2237 (0.4398)  loss_mask: 0.3114 (0.3686)  loss_objectness: 0.3918 (0.4110)  loss_rpn_box_reg: 0.1096 (0.1197)  ftime: 0.1980 (0.2021)  backbone_ftime: 0.0665 (0.0694)  roi_heads_ftime: 0.0231 (0.0238)  rpn_ftime: 0.1074 (0.1089)  time: 0.5881 (0.6032)  data: 0.0113 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:19:22,989 maskrcnn_benchmark.trainer INFO: eta: 0:00:48  iter: 89920  loss: 1.0952 (1.3511)  loss_box_reg: 0.0034 (0.0120)  loss_classifier: 0.2197 (0.4398)  loss_mask: 0.3247 (0.3686)  loss_objectness: 0.4153 (0.4110)  loss_rpn_box_reg: 0.1111 (0.1197)  ftime: 0.1954 (0.2021)  backbone_ftime: 0.0658 (0.0694)  roi_heads_ftime: 0.0231 (0.0238)  rpn_ftime: 0.1062 (0.1089)  time: 0.5863 (0.6032)  data: 0.0108 (0.0131)  lr: 0.000200  max mem: 4250
2019-04-30 07:19:34,826 maskrcnn_benchmark.trainer INFO: eta: 0:00:36  iter: 89940  loss: 1.0778 (1.3511)  loss_box_reg: 0.0088 (0.0120)  loss_classifier: 0.2394 (0.4397)  loss_mask: 0.3099 (0.3686)  loss_objectness: 0.4080 (0.4110)  loss_rpn_box_reg: 0.1153 (0.1197)  ftime: 0.1986 (0.2021)  backbone_ftime: 0.0659 (0.0694)  roi_heads_ftime: 0.0233 (0.0238)  rpn_ftime: 0.1074 (0.1089)  time: 0.5890 (0.6032)  data: 0.0108 (0.0131)  lr: 0.000200  max mem: 4250  
2019-04-30 07:19:46,781 maskrcnn_benchmark.trainer INFO: eta: 0:00:24  iter: 89960  loss: 0.9723 (1.3510)  loss_box_reg: 0.0067 (0.0120)  loss_classifier: 0.2280 (0.4397)  loss_mask: 0.3218 (0.3686)  loss_objectness: 0.3552 (0.4110)  loss_rpn_box_reg: 0.0949 (0.1197)  ftime: 0.1975 (0.2021)  backbone_ftime: 0.0671 (0.0694)  roi_heads_ftime: 0.0231 (0.0238)  rpn_ftime: 0.1086 (0.1089)  time: 0.5903 (0.6032)  data: 0.0137 (0.0131)  lr: 0.000200  max mem: 4250  
2019-04-30 07:19:58,708 maskrcnn_benchmark.trainer INFO: eta: 0:00:12  iter: 89980  loss: 1.1414 (1.3509)  loss_box_reg: 0.0117 (0.0120)  loss_classifier: 0.2371 (0.4396)  loss_mask: 0.3259 (0.3686)  loss_objectness: 0.4193 (0.4110)  loss_rpn_box_reg: 0.1314 (0.1197)  ftime: 0.1980 (0.2021)  backbone_ftime: 0.0681 (0.0694)  roi_heads_ftime: 0.0236 (0.0238)  rpn_ftime: 0.1060 (0.1089)  time: 0.5894 (0.6032)  data: 0.0100 (0.0131)  lr: 0.000200  max mem: 4250 

If I don't use clip_grad, training runs into the NaN problem.


Iamal1 commented on May 28, 2024

Hi, my experiment does not run with PyTorch SyncBN or even plain BatchNorm. Did you remove broadcast_buffers in DDP? I want to check where the problem is, thank you!
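
For reference, a minimal sketch of the flag in question (the toy model and device id are placeholders, and this assumes torch.distributed is already initialized); broadcast_buffers controls whether DDP broadcasts buffers such as BN running statistics from rank 0 at every forward pass:

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal sketch, assuming the process group is already initialized and the
# model lives on the local GPU. broadcast_buffers=False stops DDP from
# broadcasting buffers (e.g. BN running_mean/running_var) each forward pass.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16)).cuda()
model = DDP(model, device_ids=[0], broadcast_buffers=False)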

