
bonnetal's Issues

Some skip layers have no grad in segmentation

In some segmentation tasks, why do some skip layers from early layers not have gradient information?
It seems that in the mobilenetv2 backbone this is because of x.detach().

Error with ROS while building the Docker image.

After running:
nvidia-docker build -t tano297/bonnetal:base -f docker/base/Dockerfile .

I get the following error message:
Err:12 http://packages.ros.org/ros/ubuntu bionic InRelease The following signatures couldn't be verified because the public key is not available: NO_PUBKEY F42ED6FBAB17C654
Reading package lists...
W: GPG error: http://packages.ros.org/ros/ubuntu bionic InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY F42ED6FBAB17C654
E: The repository 'http://packages.ros.org/ros/ubuntu bionic InRelease' is not signed.
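
(Not from this thread, but the usual fix for a NO_PUBKEY error like this is to refresh the ROS apt key before apt-get update runs, e.g. by adding a line like the following to the Dockerfile; the key ID is the one apt reports above, and the keyserver choice is an assumption:

apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys F42ED6FBAB17C654
)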

PyTorch 1.6 Issues with OneShot

Hi,

I am trying to use the OneShot scheduler with PyTorch 1.6; however, it gives me a warning:

UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate.

Do you know how I could solve this? Thank you
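
(For what it's worth, the warning is about the call order inside the training loop rather than the scheduler itself. A minimal, self-contained sketch of the order PyTorch >= 1.1 expects, using a toy model rather than bonnetal's trainer:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)

for epoch in range(3):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    optimizer.step()   # update the weights first...
    scheduler.step()   # ...then advance the learning-rate schedule

If the scheduler is stepped per batch, as one-cycle schedules usually are, the same ordering applies inside the batch loop.)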

Can't fix RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED?

I am hitting the same error but do not seem to be able to solve it. I have changed the labels and preprocessed the label files (changed labels.py and ran python createTrainIdLabelImgs.py), but the code still exits before completing:
File ../../tasks/segmentation/modules/trainer.py, line 488, in train_epoch loss.backward()

Do you have any idea what I could do to solve this issue?

My labels.py file in cityscapes:

labels = [
    #       name                     id    trainId   category            catId     hasInstances   ignoreInEval   color
    Label(  'unlabeled'            ,  0 ,       19 , 'void'            , 0       , False        , True         , (  0,  0,  0) ),
    Label(  'ego vehicle'          ,  1 ,       19 , 'void'            , 0       , False        , True         , (  0,  0,  0) ),
    Label(  'rectification border' ,  2 ,       19 , 'void'            , 0       , False        , True         , (  0,  0,  0) ),
    Label(  'out of roi'           ,  3 ,       19 , 'void'            , 0       , False        , True         , (  0,  0,  0) ),
    Label(  'static'               ,  4 ,       19 , 'void'            , 0       , False        , True         , (  0,  0,  0) ),
    Label(  'dynamic'              ,  5 ,       19 , 'void'            , 0       , False        , True         , (111, 74,  0) ),
    Label(  'ground'               ,  6 ,       19 , 'void'            , 0       , False        , True         , ( 81,  0, 81) ),
    Label(  'road'                 ,  7 ,        0 , 'flat'            , 1       , False        , False        , (128, 64,128) ),
    Label(  'sidewalk'             ,  8 ,        1 , 'flat'            , 1       , False        , False        , (244, 35,232) ),
    Label(  'parking'              ,  9 ,       19 , 'flat'            , 1       , False        , True         , (250,170,160) ),
    Label(  'rail track'           , 10 ,       19 , 'flat'            , 1       , False        , True         , (230,150,140) ),
    Label(  'building'             , 11 ,        2 , 'construction'    , 2       , False        , False        , ( 70, 70, 70) ),
    Label(  'wall'                 , 12 ,        3 , 'construction'    , 2       , False        , False        , (102,102,156) ),
    Label(  'fence'                , 13 ,        4 , 'construction'    , 2       , False        , False        , (190,153,153) ),
    Label(  'guard rail'           , 14 ,       19 , 'construction'    , 2       , False        , True         , (180,165,180) ),
    Label(  'bridge'               , 15 ,       19 , 'construction'    , 2       , False        , True         , (150,100,100) ),
    Label(  'tunnel'               , 16 ,       19 , 'construction'    , 2       , False        , True         , (150,120, 90) ),
    Label(  'pole'                 , 17 ,        5 , 'object'          , 3       , False        , False        , (153,153,153) ),
    Label(  'polegroup'            , 18 ,       19 , 'object'          , 3       , False        , True         , (153,153,153) ),
    Label(  'traffic light'        , 19 ,        6 , 'object'          , 3       , False        , False        , (250,170, 30) ),
    Label(  'traffic sign'         , 20 ,        7 , 'object'          , 3       , False        , False        , (220,220,  0) ),
    Label(  'vegetation'           , 21 ,        8 , 'nature'          , 4       , False        , False        , (107,142, 35) ),
    Label(  'terrain'              , 22 ,        9 , 'nature'          , 4       , False        , False        , (152,251,152) ),
    Label(  'sky'                  , 23 ,       10 , 'sky'             , 5       , False        , False        , ( 70,130,180) ),
    Label(  'person'               , 24 ,       11 , 'human'           , 6       , True         , False        , (220, 20, 60) ),
    Label(  'rider'                , 25 ,       12 , 'human'           , 6       , True         , False        , (255,  0,  0) ),
    Label(  'car'                  , 26 ,       13 , 'vehicle'         , 7       , True         , False        , (  0,  0,142) ),
    Label(  'truck'                , 27 ,       14 , 'vehicle'         , 7       , True         , False        , (  0,  0, 70) ),
    Label(  'bus'                  , 28 ,       15 , 'vehicle'         , 7       , True         , False        , (  0, 60,100) ),
    Label(  'caravan'              , 29 ,       19 , 'vehicle'         , 7       , True         , True         , (  0,  0, 90) ),
    Label(  'trailer'              , 30 ,       19 , 'vehicle'         , 7       , True         , True         , (  0,  0,110) ),
    Label(  'train'                , 31 ,       16 , 'vehicle'         , 7       , True         , False        , (  0, 80,100) ),
    Label(  'motorcycle'           , 32 ,       17 , 'vehicle'         , 7       , True         , False        , (  0,  0,230) ),
    Label(  'bicycle'              , 33 ,       18 , 'vehicle'         , 7       , True         , False        , (119, 11, 32) ),
    Label(  'license plate'        , -1 ,       19 , 'vehicle'         , 7       , False        , True         , (  0,  0,142) ),
] 

Traceback:

./train.py -c ~/bonnetal/train/tasks/segmentation/config/cityscapes/ERFNet.yaml -l ~/bonnetal/train/tasks/segmentation/log1
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/cris/.local/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
----------
INTERFACE:
config yaml:  /home/cris/bonnetal/train/tasks/segmentation/config/cityscapes/ERFNet.yaml
log dir /home/cris/bonnetal/train/tasks/segmentation/log1
model path None
eval only False
No batchnorm False
----------

Commit hash (training version):  b'5368eed'
----------

Opening config file /home/cris/bonnetal/train/tasks/segmentation/config/cityscapes/ERFNet.yaml
No pretrained directory found.
Copying files to /home/cris/bonnetal/train/tasks/segmentation/log1 for further reference.
WARNING:tensorflow:From ../../common/logger.py:16: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Images from:  ~/bonnetal/cityscapes/leftImg8bit/train
Labels from:  ~/bonnetal/cityscapes/gtFine/train
LENGTH 2975 2975
Inference batch size:  4
Images from:  ~/bonnetal/cityscapes/leftImg8bit/val
Labels from:  ~/bonnetal/cityscapes/gtFine/val
LENGTH 500 500
Original OS:  8
New OS:  8
Trying to get backbone weights online from Bonnetal server.
Using pretrained weights from bonnetal server for backbone
OS:  1 , channels:  16
OS:  2 , channels:  16
OS:  4 , channels:  64
[Decoder] os:  4 in:  128 skip: 64 out:  64
[Decoder] os:  2 in:  64 skip: 16 out:  16
[Decoder] os:  1 in:  16 skip: 3 out:  16
Using normalized weights as bias for head.
No path to pretrained, using bonnetal Imagenet backbone weights and random decoder.
Total number of parameters:  2252148
Total number of parameters requires_grad:  2252148
Param encoder  1913168
Param decoder  338640
Param head  340
Training in device:  cuda
/home/cris/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:100: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Ignoring class  19  in IoU evaluation
[IOU EVAL] IGNORE:  tensor([19])
[IOU EVAL] INCLUDE:  tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18])
Let's see if it finishes this
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:104: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [1,0,0], thread: [576,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:104: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [1,0,0], thread: [577,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:104: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [1,0,0], thread: [578,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:104: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [1,0,0], thread: [579,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "./train.py", line 117, in <module>
    trainer.train()
  File "../../tasks/segmentation/modules/trainer.py", line 302, in train
    scheduler=self.scheduler)
  File "../../tasks/segmentation/modules/trainer.py", line 488, in train_epoch
    loss.backward()
  File "/home/cris/.local/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/cris/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
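
(The repeated CUDA assertion `t >= 0 && t < n_classes` above means some label pixels are outside the range the loss expects, and that failure then typically surfaces as the misleading cuDNN error during backward. A quick sanity check over the generated trainId images may help; a sketch, assuming the usual Cityscapes layout and the *_labelTrainIds.png files written by createTrainIdLabelImgs.py:

import glob
import numpy as np
from PIL import Image

n_classes = 20  # 19 evaluated classes plus the ignored trainId 19 in this labels.py
for path in glob.glob('gtFine/train/**/*_labelTrainIds.png', recursive=True):
    ids = np.unique(np.array(Image.open(path)))
    bad = ids[(ids < 0) | (ids >= n_classes)]
    if bad.size:
        print(path, 'contains out-of-range trainIds:', bad)
)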

Platform DOESN'T HAVE fp16 support.

I tried the cityscapes_erfnet_1024_70 pretrained model and got an error on start.

I am running CUDA 10.2 with TensorRT 6 on Ubuntu 18.04.

Specs:
Intel® Xeon(R) CPU E5-1650 v4 @ 3.60GHz × 12
64GB DDR4-RAM
3x Titan X (12GB)

Trying to open model
Trying to deserialize previously stored: /home/tobias/src/bonnetal/cityscapes_erfnet_1024_70/model.trt
Could not deserialize TensorRT engine.
Generating from sratch... This may take a while...
Trying to generate trt engine from : /home/tobias/src/bonnetal/cityscapes_erfnet_1024_70/model.onnx
Platform DOESN'T HAVE fp16 support.
No DLA selected.
Could not open file /home/tobias/src/bonnetal/cityscapes_erfnet_1024_70/model.onnx
Could not open file /home/tobias/src/bonnetal/cityscapes_erfnet_1024_70/model.onnx
Failed to parse ONNX model from file/home/tobias/src/bonnetal/cityscapes_erfnet_1024_70/model.onnx
Success picking up ONNX model
Success adding argmax to trt model
[bonnetal_segmentation_node-2] process has died [pid 9628, exit code -11, cmd /home/tobias/src/bonnetal/deploy/devel/lib/bonnetal_segmentation_ros/bonnetal_segmentation_node __name:=bonnetal_segmentation_node __log:=/home/tobias/.ros/log/2eb40dc4-f663-11ea-893f-38d547c88646/bonnetal_segmentation_node-2.log].
log file: /home/tobias/.ros/log/2eb40dc4-f663-11ea-893f-38d547c88646/bonnetal_segmentation_node-2*.log

libcublas.so.10, needed by libnvinfer.so, not found

Hey,

I am trying to compile the packages without Docker (I don't like it, and it is also not working for me).

I am running Ubuntu 18.04 with CUDA 10.1, cuDNN 8, and TensorRT 5.1.5.

/usr/bin/ld: warning: libcublas.so.10, needed by /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/libnvinfer.so, not found (try using -rpath or -rpath-link)
/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/libnvinfer.so: undefined reference to `…@libcublas.so.10' (the same line repeats for about a dozen cuBLAS symbols)
collect2: error: ld returned 1 exit status
make[2]: *** [/home/tobias/src/bonnetal/deploy/devel/.private/bonnetal_segmentation_ros/lib/bonnetal_segmentation_ros/bonnetal_segmentation_node] Error 1
make[1]: *** [CMakeFiles/bonnetal_segmentation_node.dir/all] Error 2
make: *** [all] Error 2


How to configure cfg.yaml file?

Hello!
I am using bonnetal for a semantic segmentation task. It works, and I would like to ask how to configure the parameters in the cfg.yaml file to improve the results. I have only one class to segment, and the second is the background, as in the example you provided.

Is there any documentation on the meaning of each parameter?

Error building the base docker due to unspecified dependency versions

When building the base Docker image, it fails due to this error:
Depends: libnvinfer-dev (= 5.1.5-1+cuda10.1) but 8.2.3-1+cuda11.4 is to be installed
The solution was to replace the line:
apt install tensorrt python3-libnvinfer-dev -yqq
with
apt install tensorrt python3-libnvinfer=5.1.5-1+cuda10.1 python3-libnvinfer-dev=5.1.5-1+cuda10.1 -yqq

in the Dockerfile of the base image.

[Support] Need help reducing GPU memory usage.

Hello!

Nice looking library! I'd like to train mobilenetv2 for semantic segmentation using my coco-like dataset. I've copied the coco dataloader and updated things for my data.

But even though my GPU has 16 GB of RAM and I've set the batch size to 1, I'm still consuming all my GPU memory as soon as training begins, crashing the session.

My config is below, followed by a dump of my terminal output. I'm not sure what I'm doing wrong.

#training parameters
train:
  loss: "xentropy"       # must be either xentropy or iou
  max_epochs: 1000
  max_lr: 0.005          # sgd learning rate max
  min_lr: 0.001          # warmup initial learning rate
  up_epochs: 1           # warmup during first XX epochs (can be float)
  down_epochs:  20       # warmdown during second XX epochs  (can be float)
  max_momentum: 0.7      # sgd momentum max when lr is min
  min_momentum: 0.5      # sgd momentum min when lr is max
  final_decay: 0.95      # learning rate decay per epoch after initial cycle (from min lr)
  w_decay: 0.0001        # weight decay
  batch_size:  1         # batch size
  report_batch: 1        # every x batches, report loss
  report_epoch: 1        # every x epochs, report validation set
  save_summary: False    # Summary of weight histograms for tensorboard
  save_imgs: False        # False doesn't save anything, True saves some
                         # sample images (one per batch of the last calculated batch)
                         # in log folder
  avg_N: 3               # average the N best models
  crop_prop:
    width: 2560
    height: 1440

# backbone parameters
backbone:
  name: "mobilenetv2"
  dropout: 0.01
  bn_d: 0.01
  OS: 8  # output stride
  train: True # train backbone?
  extra:
    width_mult: 1.0
    shallow_feats: True # get features before the last layer (mn2)

decoder:
  name: "aspp_residual"
  dropout: 0.01
  bn_d: 0.01
  train: True # train decoder?
  extra:
    aspp_channels: 64
    skip_os: [4]
    last_channels: 32

# classification head parameters
head:
  name: "segmentation"
  dropout: 0.01

# dataset (to find parser)
dataset:
  name: "rover"
  location: "/home/taylor/datasets/rover"
  workers: 1  # number of threads to get data
  img_means: #rgb
    - 0.47037394
    - 0.44669544
    - 0.40731883
  img_stds: #rgb
    - 0.27876515
    - 0.27429348
    - 0.28861644
  img_prop:
    width: 2560
    height: 1440
    depth: 3
  labels:
    0: 'nothing'
    1: 'trail'
    2: 'terrain'
    3: 'sidewalk'
    4: 'person'
    5: 'traffic_cone'
    6: 'vehicle'
    7: 'private_road'
    8: 'dirt_road'
    9: 'drivable_ground'
    10: 'building'
    11: 'public_street'
  labels_w:
    0: 1.0
    1: 1.0
    2: 1.0
    3: 1.0
    4: 1.0
    5: 1.0
    6: 1.0
    7: 1.0
    8: 1.0
    9: 1.0
    10: 1.0
    11: 1.0
  color_map: # bgr
    0: [0, 0, 0]
    1: [220, 20, 60]
    2: [119, 11, 32]
    3: [0, 0, 142]
    4: [0, 0, 230]
    5: [106, 0, 228]
    6: [0, 60, 100]
    7: [0, 80, 100]
    8: [0, 0, 70]
    9: [0, 0, 192]
    10: [250, 170, 30]
    11: [100, 170, 30]

Here is my terminal output (there are a few extra lines printed due to some debugging):

developer@taylor-desktop:~/bonnetal/train/tasks/segmentation$ ./train.py --cfg config/rover/mobilenetv2_aspp_res.yaml 
----------
INTERFACE:
config yaml:  config/rover/mobilenetv2_aspp_res.yaml
log dir /home/developer/logs/2020-5-01-03:19/
model path None
eval only False
No batchnorm False
----------

Commit hash (training version):  b'5368eed'
----------

Opening config file config/rover/mobilenetv2_aspp_res.yaml
No pretrained directory found.
Copying files to /home/developer/logs/2020-5-01-03:19/ for further reference.
Images from:  /home/taylor/datasets/rover/images/rover_train
Labels from:  /home/taylor/datasets/rover/annotations/rover_train
Inference batch size:  1
Images from:  /home/taylor/datasets/rover/images/rover_test
Labels from:  /home/taylor/datasets/rover/annotations/rover_test
['__add__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'crop', 'crop_param', 'filenames', 'filenamesGt', 'h', 'h_flip', 'images_root', 'jitter', 'labels_root', 'means', 'norm', 'stds', 'subset', 'tensorize_img', 'tensorize_lbl', 'w']
dict_items([(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0), (9, 1.0), (10, 1.0), (11, 1.0)])
Original OS:  32
New OS:  8.0
Trying to get backbone weights online from Bonnetal server.
Using pretrained weights from bonnetal server for backbone
[Decoder] os:  4 in:  32 skip: 24 out:  24
[Decoder] os:  2 in:  24 skip: 16 out:  16
[Decoder] os:  1 in:  16 skip: 3 out:  16
Using normalized weights as bias for head.
No path to pretrained, using bonnetal Imagenet backbone weights and random decoder.
Total number of parameters:  2144900
Total number of parameters requires_grad:  2144900
Param encoder  1812800
Param decoder  331896
Param head  204
Training in device:  cuda
[IOU EVAL] IGNORE:  tensor([], dtype=torch.int64)
[IOU EVAL] INCLUDE:  tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
Traceback (most recent call last):
  File "./train.py", line 117, in <module>
    trainer.train()
  File "../../tasks/segmentation/modules/trainer.py", line 303, in train
    scheduler=self.scheduler)
  File "../../tasks/segmentation/modules/trainer.py", line 479, in train_epoch
    output = model(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../../tasks/segmentation/modules/segmentator.py", line 102, in forward
    x = self.decoder(x, skips)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../..//tasks/segmentation/decoders/aspp_residual.py", line 83, in forward
    features = mixconv(features)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../../common/layers.py", line 83, in forward
    return x + self.inv_dwise_conv(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 338.00 MiB (GPU 0; 15.89 GiB total capacity; 11.95 GiB already allocated; 324.06 MiB free; 558.73 MiB cached)

Here is nvidia-smi when the program is not running. Looks like plenty of GPU memory is available:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P5000        On   | 00000000:08:00.0  On |                  Off |
| 32%   52C    P0    45W / 180W |   2694MiB / 16275MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1635      G   /usr/lib/xorg/Xorg                            97MiB |
|    0      1872      G   /usr/bin/gnome-shell                          53MiB |
|    0      3769      G   /usr/lib/xorg/Xorg                           974MiB |
|    0      3910      G   /usr/bin/gnome-shell                         894MiB |
|    0      6807      G   ...quest-channel-token=7931726709970216186   113MiB |
|    0      7241      G   gnome-control-center                          94MiB |
|    0     10884      G   /usr/bin/vlc                                 110MiB |
|    0     12677      G   ...quest-channel-token=7615775727565811985    56MiB |
|    0     21292      G   ...-token=1B3DA049A377FA772C5604DC206A395E   234MiB |
|    0     23554      G   /usr/lib/firefox/firefox                       1MiB |
|    0     23638      G   /usr/lib/firefox/firefox                       1MiB |
|    0     23666      G   /usr/lib/firefox/firefox                       1MiB |
|    0     27000      G   /usr/lib/firefox/firefox                       1MiB |
|    0     30196      G   kicad                                         46MiB |
+-----------------------------------------------------------------------------+

Any help would be appreciated!
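
(One thing that stands out in the config, independent of batch size, is the 2560x1440 input and crop size: at that resolution the activations cached for backprop dominate GPU memory. A rough back-of-the-envelope in plain PyTorch, not bonnetal code:

import torch

x = torch.zeros(1, 3, 1440, 2560)        # one float32 input image
print(x.numel() * 4 / 2**20, 'MiB')      # ~42 MiB for the input alone

feat = torch.zeros(1, 32, 720, 1280)     # a single 32-channel feature map at stride 2
print(feat.numel() * 4 / 2**20, 'MiB')   # ~112 MiB per such map

Dozens of such maps are kept alive for the backward pass, so exhausting 16 GB at batch size 1 is plausible; training on much smaller crops and running full-resolution images only at inference is the usual trade-off.)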

FPS

Hello,

I'm using the PyTorch GPU backend with the "MobilenetsV2 ASPP Res - 512px" model from both Python and C++, but I get ~94 ms per frame on my PC. Is that expected?

OS: Win 10
GPU: GTX 1080 Ti
Video size: 2160x3840
Video FPS: 60

Thanks
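
(Whether ~94 ms is reasonable depends on what it includes; decoding and resizing 2160x3840 frames is expensive on its own. To time just the network, CUDA work has to be synchronized before reading the clock. A minimal sketch with a stand-in model, not the bonnetal pipeline:

import time
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3, padding=1).cuda().eval()   # stand-in for the real network
x = torch.randn(1, 3, 512, 512, device='cuda')

with torch.no_grad():
    for _ in range(10):                  # warm-up: CUDA context, cuDNN autotuning
        model(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()             # wait for the GPU before stopping the clock
    print((time.time() - t0) / 100 * 1000, 'ms per frame')
)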

Add depth channel

Hi,

I am currently thinking of using bonnetal for a project, and I saw that there is a benchmark using the ZED depth. Which encoder-decoder was used? Do you have any tips on how to feed RGB+depth into the currently available backbones and decoders in bonnetal, or would I need to create my own in this case? Thanks!
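
(Not an answer from the maintainers, but a common way to feed RGB+depth into an RGB backbone is to widen its first convolution to 4 input channels and seed the new depth filters from the pretrained RGB weights, leaving the rest of the encoder and decoder unchanged. A sketch with a stand-in stem convolution; bonnetal's actual layer names will differ:

import torch
import torch.nn as nn

old = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False)  # pretrained RGB stem
new = nn.Conv2d(4, 32, kernel_size=3, stride=2, padding=1, bias=False)  # RGB + depth

with torch.no_grad():
    new.weight[:, :3] = old.weight                             # keep the RGB filters
    new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)   # seed depth from the RGB mean
)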

Error using standalone image inference

Hey there,

Training worked fine, and making the inference models was no problem either. I then built for standalone use, which gave me no errors.

But when I try using standalone inference on images as described in deploy/segmentation, I get the following C++ error:

Predicting image: ../train/tasks/segmentation/dataset/504-896/rgb/rgb-cropped/
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 140) > this->size() (which is 0)
Aborted (core dumped)

Since I don't know any C++ whatsoever, I thought maybe someone could spot the problem directly. Below I'll add the whole console output of my infer_img execution, as well as the config of the model. I would appreciate any clue. Thanks for looking!

My example is called with TensorRT, but I get the same error with PyTorch as well. I first thought it was about the image dimensions and remade the inference models with new dimensions, etc., but that has not gotten me anywhere yet.

The full console output:

developer@olli:/home/olli/Code/XxxXxxx/bonnetal/deploy$ ./devel/lib/bonnetal_segmentation_standalone/infer_img -p ../train/tasks/segmentation/mystuff/deployed/best_darknet_leaf -i ../train/tasks/segmentation/dataset/504-896/rgb/rgb-cropped/ -b tensorrt
================================================================================
image: ../train/tasks/segmentation/dataset/504-896/rgb/rgb-cropped/
path: ../train/tasks/segmentation/mystuff/deployed/best_darknet_leaf/
backend: tensorrt
verbose: 0
================================================================================
Setting verbosity to: true
Trying to open model
Trying to deserialize previously stored: ../train/tasks/segmentation/mystuff/deployed/best_darknet_leaf//model.trt
Successfully found TensorRT engine file ../train/tasks/segmentation/mystuff/deployed/best_darknet_leaf//model.trt
Successfully created inference runtime
No DLA selected.
Successfully allocated 122542320 for model.
Successfully read 122542320 to modelmem.
INFO: Glob Size is 122257600 bytes.
INFO: Added linear block of size 173408256
INFO: Added linear block of size 173408256
INFO: Added linear block of size 28901376
INFO: Added linear block of size 28901376
INFO: Added linear block of size 14450688
INFO: Added linear block of size 7225344
INFO: Deserialize required 1790698 microseconds.
Created engine!
Successfully deserialized Engine from trt file
Binding: 0, type: 0
[Dim 3][Dim 504][Dim 896]
Binding: 1, type: 3
[Dim 1][Dim 504][Dim 896]
Successfully create binding buffer
================================================================================
Predicting image: ../train/tasks/segmentation/dataset/504-896/rgb/rgb-cropped/
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 140) > this->size() (which is 0)
Aborted (core dumped)

And the config:

backbone:
  OS: 8
  bn_d: 0.01
  dropout: 0.01
  extra:
    darknet: darknet53
  name: darknet
  train: true
dataset:
  color_map:
    0:
    - 0
    - 0
    - 0
    1:
    - 255
    - 0
    - 0
    2:
    - 0
    - 0
    - 255
  img_means:
  - 0.4148446751432425
  - 0.5053385609354691
  - 0.45907718096515426
  img_prop:
    depth: 3
    height: 504
    width: 896
  img_stds:
  - 0.14722864867881855
  - 0.16334236156069035
  - 0.17758600209156641
  labels:
    0: ground
    1: carrot
    2: weed
  labels_w:
    0: 1.0
    1: 1.0
    2: 1.0
  location: dataset/leaf_moreweed/
  name: leaf_moreweed
  workers: 12
decoder:
  bn_d: 0.01
  dropout: 0.01
  extra:
    aspp_channels: 256
    last_channels: 32
    skip_os:
    - 4
    - 2
  name: aspp_residual
  train: true
head:
  dropout: 0.01
  name: segmentation
train:
  avg_N: 3
  batch_size: 2
  crop_prop:
    height: 480
    width: 480
  down_epochs: 100
  final_decay: 0.995
  loss: xentropy
  max_epochs: 100
  max_lr: 0.0001
  max_momentum: 0.95
  min_lr: 1.0e-05
  min_momentum: 0.9
  report_batch: 1
  report_epoch: 1
  save_imgs: false
  save_summary: true
  up_epochs: 0.5
  w_decay: 0.0001

best regards, olli
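
(A hedged observation rather than a confirmed diagnosis: the path passed with -i is a directory ending in a slash, and the substr that throws is operating on an empty string, which would fit the binary trying to split an extension off an empty filename. If infer_img expects individual image files, invoking it once per file is an easy way to test that theory; the glob pattern below is hypothetical, so adjust the extension:

import glob
import subprocess

images = sorted(glob.glob('../train/tasks/segmentation/dataset/504-896/rgb/rgb-cropped/*.png'))
for img in images:
    subprocess.run(['./devel/lib/bonnetal_segmentation_standalone/infer_img',
                    '-p', '../train/tasks/segmentation/mystuff/deployed/best_darknet_leaf',
                    '-i', img, '-b', 'tensorrt'], check=True)
)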

GPU stops working when running inference

Hi,

I am using a GeForce RTX 2060 with bonnetal and it is crashing the GPU. I get the error:

Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU

In this case, I am using my own ROS code, which calls user.infer. This is the code:

#!/usr/bin/env python3
# Futures
from __future__ import print_function

# STD
import sys
import time
import argparse
import subprocess
import datetime
import os
import shutil

# ROS
import rospy
import roslib
from sensor_msgs.msg import CompressedImage

# numpy and scipy
import numpy as np
from scipy.ndimage import filters

# OpenCV
import cv2
from cv_bridge import CvBridge, CvBridgeError

# For overlaying images
from PIL import Image

import torch
# check if cuda is activated
cuda = torch.cuda.is_available()
if not cuda:
    print("Model is NOT using GPU")
print("Cuda:", cuda)

class BonnetalNode:
    """
    Encapsulates the bonnetal functionality into a ROS node.
    """
    # A ROS subscriber for input images
    img_sub = None
    labelled_img_pub = None
    overlaid_img_pub = None
    # Bonnetal interface
    user = None

    def __init__(self):
        """
        Initializes ROS (pubs and subs) and bonnetal.
        """
        # Initialize ROS
        rospy.init_node("bonnetal_node")
        init = rospy.Time.now()
        # Parameters Config 
        path_model = rospy.get_param("path_model")
        backend = rospy.get_param("backend")
        camera_topic = rospy.get_param("camera_topic")

        # Add path for bonnetal files
        abs_path = rospy.get_param("abs_path")
        print ("Abs path is: ", abs_path)
        sys.path.insert(0, abs_path + "bonnetal/train")

        # Initialize bonnetal
        self.initialize_bonnetal(path=path_model, backend=backend)

        # Initialize publishers and subscribers
        self.overlaid_img_pub = rospy.Publisher("/overlaid_image/compressed",
                CompressedImage, queue_size = 1)
        self.labelled_img_pub = rospy.Publisher("/output_labelled_img/compressed",
                CompressedImage, queue_size = 1)
        # buff size allows callback to get the latest msg instead of queueing them
        self.img_sub = rospy.Subscriber(camera_topic,
                    CompressedImage, self.image_callback,  queue_size = 1, buff_size=2**32)

        rospy.loginfo("Segmentation node initialized in {} seconds!".format(
            (rospy.Time.now()-init).to_sec()))

    def initialize_bonnetal(self, path, backend="native", workspace=8000000000, calib_images=None):
        """
        Initializes bonnetal

        :type path: string
        :param path: full path to pretrained model

        :type backend: string
        :param backend: framework for segmentation task

        :type workspace: int
        :param workspace: max workspace size (only for TensorRT framework)

        :type calib_images: list
        :param calib_images: calibration images, must be a list of images (only for TensorRT framework)
        """
        # create inference context for the desired backend
        if backend == "tensorrt":
            # import and use tensorRT
            try:
                print("Using tensorRT")
                from tasks.segmentation.modules.userTensorRT import UserTensorRT
                self.user = UserTensorRT(path, workspace, calib_images)
            except ImportError as e:
                print ("ERROR:", e)
                sys.exit(0)
            except:
                print('\nERROR:TensorRT needs to use inference model type .onnx. You can make one '
                    'using tasks/segmentation/make_deploy_model.py')
                sys.exit(0)
        elif backend == "caffe2":
            try:
                # import and use caffe2
                print("Using caffe2")
                from tasks.segmentation.modules.userCaffe2 import UserCaffe2
                self.user = UserCaffe2(path)
            except ImportError as e:
                print ("ERROR:", e)
                sys.exit(0)
            except:
                print('\nERROR:Caffe2 needs to use inference model type .onnx. You can make one '
                    'using tasks/segmentation/make_deploy_model.py')
                sys.exit(0)

        elif backend == "pytorch":
            # import and use pytorch
            try:
                print("Using PyTorch")
                from tasks.segmentation.modules.userPytorch import UserPytorch
                self.user = UserPytorch(path)
            except ImportError as e:
                print ("ERROR:", e)
                sys.exit(0)
            except:
                print('\nERROR:PyTorch needs to use inference model type .pytorch. You can make one '
                    'using tasks/segmentation/make_deploy_model.py')
                sys.exit(0)

        else:
            # default to native pytorch
            print("Using native PyTorch")
            from tasks.segmentation.modules.user import User
            self.user = User(path)

    def segment_image(self, cv_img):
        """
        Segments an OpenCV image.

        :type cv_img: numpy.ndarray
        :param cv_img: OpenCV color image from the camera

        :rtype: numpy.ndarray
        :returns: OpenCV color image with labels of fuel

        :rtype: numpy.ndarray
        :returns: OpenCV color image from the camera with overlay labels of fuel
        """
        # infer
        # print("Inferring ")
        _, lbl_img = self.user.infer(cv_img, False)
        overlay_img = Image.blend(Image.fromarray(cv_img), Image.fromarray(lbl_img), 0.5)

        return lbl_img, overlay_img

    def unpack_image_msg(self, msg):
        """
        Receives a sensor_msgs/CompressedImage and returns a cv image

        :type msg: CompressedImage
        :param msg: CompressedImage ROS message

        :rtype: numpy.ndarray
        :returns: OpenCV color image
        """
        np_arr = np.frombuffer(msg.data, np.uint8)  # np.fromstring is deprecated for binary data
        cv_img = cv2.imdecode(np_arr, cv2.IMREAD_COLOR)

        return cv_img

    def re_pack_image_msg(self, cv_img):
        """
        Packs an OpenCV image into a ROS CompressedImage message.

        :type cv_img: numpy.ndarray
        :param cv_img: OpenCV color image

        :rtype: CompressedImage
        :returns: CompressedImage ROS message in jpeg format
        """
        #img_msg = cv2_to_imgmsg(cv_img, encoding="bgr8")

        img_msg = CompressedImage()
        img_msg.header.stamp = rospy.Time.now()
        img_msg.format = "jpeg"
        img_msg.data = np.array(cv2.imencode('.jpg', np.asarray(cv_img))[1]).tobytes()  # tostring() is deprecated

        return img_msg

    def pub_lbl_img(self, cv_img):
        """
        Publishes the labelled (segmented) images.

        :type cv_img: numpy.ndarray
        :param cv_img: OpenCV color image to publish
        """
        img_msg = self.re_pack_image_msg(cv_img)
        self.labelled_img_pub.publish(img_msg)

    def pub_overlay_img(self, cv_img):
        """
        Publishes the overlaid images.

        :type cv_img: numpy.ndarray
        :param cv_img: OpenCV color image to publish
        """
        img_msg = self.re_pack_image_msg(cv_img)
        self.overlaid_img_pub.publish(img_msg)

    def image_callback(self, msg):
        """
        Receives sensor_msgs/CompressedImage and publishes labelled images.

        :type msg: CompressedImage
        :param msg: CompressedImage ROS message
        """
        cv_img = self.unpack_image_msg(msg)

        lbl_img, overlay_img = self.segment_image(cv_img)

        self.pub_lbl_img(lbl_img)
        self.pub_overlay_img(overlay_img)


    def run(self):
        """
        Enters the main loop for processing messages.
        """
        rospy.spin()


def main():
    node = BonnetalNode()
    node.run()


if __name__ == "__main__":
    main()

Do you know what the issue could be?

How to fix the RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED? Thank you!

INTERFACE:
config yaml:  config/cityscapes/darknet21_aspp.yaml
log dir /home/pc/logs/2019-8-20-16:13/
model path None
eval only False
No batchnorm False
----------

Commit hash (training version):  b'5368eed'
----------

Opening config file config/cityscapes/darknet21_aspp.yaml
No pretrained directory found.
Copying files to /home/pc/logs/2019-8-20-16:13/ for further reference.
WARNING: Logging before flag parsing goes to stderr.
W0820 16:13:16.396194 140436803987200 deprecation_wrapper.py:119] From ../../common/logger.py:16: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Images from:  /home3/data/city/city_selected/leftImg8bit/train
Labels from:  /home3/data/city/city_selected/gtFine/train
Inference batch size:  1
Images from:  /home3/data/city/city_selected/leftImg8bit/val
Labels from:  /home3/data/city/city_selected/gtFine/val
Original OS:  32
New OS:  8
Strides:  [2, 2, 2, 1, 1]
Dilations:  [1, 1, 1, 2, 4]
Trying to get backbone weights online from Bonnetal server.
Using pretrained weights from bonnetal server for backbone
[Decoder] os:  4 in:  128 skip: 128 out:  128
[Decoder] os:  2 in:  128 skip: 64 out:  64
[Decoder] os:  1 in:  64 skip: 32 out:  32
Using normalized weights as bias for head.
No path to pretrained, using bonnetal Imagenet backbone weights and random decoder.
Total number of parameters:  19239412
Total number of parameters requires_grad:  19239412
Param encoder  14920544
Param decoder  4318208
Param head  660
Training in device:  cuda
Ignoring class  19  in IoU evaluation
[IOU EVAL] IGNORE:  tensor([19])
[IOU EVAL] INCLUDE:  tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18])
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [3,0,0], thread: [576,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [352,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [353,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [354,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [355,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "train.py", line 117, in <module>
    trainer.train()
  File "../../tasks/segmentation/modules/trainer.py", line 302, in train
    scheduler=self.scheduler)
  File "../../tasks/segmentation/modules/trainer.py", line 487, in train_epoch
    loss.backward()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
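
(The assertion lines above are the same out-of-range-label failure as in the earlier "Can't fix RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED?" issue; the label-range sanity check sketched there applies here as well.)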

Inference uses too much GPU memory

Hi,

I have trained a model on my own data and am now trying to run inference. When running infer_img.py on one image (640x480), I see my GPU (GeForce GTX 1080) usage jump up to 7.5 GB, which seems excessive for a single image. Is this expected behaviour?

Do you have any suggestions on how to decrease GPU usage during inference? I only have 8 GB of GPU memory and need to run a simulation as well as a couple of other inference scripts at the same time, as the data comes in.

I would also just like to say thanks for open-sourcing your work. From what I've seen, this is one of the best detection/segmentation projects out there in terms of code quality and readability, as well as good explanations of how to get everything working.
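
(Two generic PyTorch points may be worth ruling out before anything bonnetal-specific; this is an assumption, not a diagnosis of infer_img.py. Inference should run under torch.no_grad(), otherwise autograd caches activations; and PyTorch's caching allocator holds on to freed memory, so nvidia-smi overstates what the model strictly needs. A minimal sketch with a stand-in model:

import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3, padding=1).cuda().eval()    # stand-in for the trained network
x = torch.randn(1, 3, 480, 640, device='cuda')

with torch.no_grad():        # no autograd graph, so activations are freed eagerly
    out = model(x)

print(torch.cuda.memory_allocated() / 2**20, 'MiB in live tensors')
)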

Error building pytorch in the base docker

When executing the PyTorch build in the base Dockerfile, it fails and outputs the following:
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/autograd/generated/VariableType_1.cpp.o
/usr/bin/c++ … -c ../torch/csrc/autograd/generated/VariableType_1.cpp
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report, with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
[3565/4925] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/autograd/generated/VariableType_4.cpp.o
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/autograd/generated/VariableType_4.cpp.o
/usr/bin/c++ … -c ../torch/csrc/autograd/generated/VariableType_4.cpp
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report, with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
[3566/4925] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/autograd/generated/VariableType_2.cpp.o
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/autograd/generated/VariableType_2.cpp.o
/usr/bin/c++ … -c ../torch/csrc/autograd/generated/VariableType_2.cpp
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report, with preprocessed source if appropriate.

The final summary is:
The command '/bin/sh -c cd pytorch && python3 setup.py install && cd ..' returned a non-zero code: 1

If someone ran into a similar issue, please let me know.

Thanks!
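
For what it's worth, "c++: internal compiler error: Killed (program cc1plus)" during a PyTorch source build usually means the compiler was killed by the kernel's OOM killer, since several multi-gigabyte translation units compile in parallel by default. Capping the parallelism is the usual way past it; a minimal sketch, assuming PyTorch's standard MAX_JOBS build variable (the value 2 is a conservative guess):

import os
import subprocess

# Limit parallel compile jobs so peak RAM stays bounded.
os.environ["MAX_JOBS"] = "2"
subprocess.run(["python3", "setup.py", "install"], cwd="pytorch", check=True)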

Training terminates after the first epoch due to excessive RAM usage

I am trying to train a semantic segmentation model from scratch on the COCO dataset, and every time I run the training script, the process is Killed at the validation step after epoch 0.

At first, I got RuntimeError: Dataloader worker (pid xxxx) is killed by signal: Killed. After looking online, I tried setting the number of workers to 0, which caused a similar error at the same stage, except the message just says Killed. Watching the memory usage, just before the process was killed the RAM usage went all the way up to 97%. I have 64 GB of RAM, which is enough to fit the entire training set if needed, so I don't really understand where the issue originates.

I have attached two screenshots showing the errors. The first one suggests that it failed when trying to colourise the images with colorizer.py.

Could you suggest a workaround? I am hoping to train a model on COCO data to understand how it works, and then train it on my own data which I will format to be COCO-like.

Screenshot 2019-12-06 at 10 05 44

Screenshot 2019-12-06 at 00 27 05
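
One way to narrow this down is to log resident memory around the validation loop and see whether it grows batch by batch (a leak, e.g. accumulated tensors or saved images) or jumps once (a single allocation that is simply too large). A small sketch, assuming psutil is available; log_mem and where to call it are hypothetical additions, not part of bonnetal:

import psutil

def log_mem(tag):
    # Print this process's resident set size in GiB.
    rss = psutil.Process().memory_info().rss / 2**30
    print("{}: {:.1f} GiB resident".format(tag, rss))

# e.g. log_mem("val batch {}".format(i)) inside the validation loop

If the growth tracks the image colourising, disabling save_imgs in the cfg is a cheap first experiment.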

Semantic Segmentation: Only 2 of 3 classes get trained

Hey there,

I am trying to train different backbone/decoder combinations in a semantic segmentation task. My data is similar to the sugarbeets from the old bonnet, with 3 classes: background, carrot, and weed.

Somehow I always end up training only the first two classes. As I understand it, the number of classes is defined entirely in the cfg file, right? Via labels, labels_w, and color_map. I would appreciate any help :)

I just double-checked before posting, and noticed that when I run calculate_segmentation_weights, the carrot and weed classes show no frequency at all, which is of course weird, but I can't see the reason.

Edit: to be exact, both are zero:
Num of pixels: 681984000
Frequency: [0.89665005 0. 0. ]
I can't figure out why.

I'll attach the cfg I'm using right now.
Best regards,
Olli

P.S. Any chance you have pretrained models for the sugarbeet data in this framework?

backbone:
  OS: 8
  bn_d: 0.001
  dropout: 0.0
  extra:
    shallow_feats: true
    width_mult: 1.0
  name: mobilenetv2
  train: true
dataset:
  color_map:
    0:
    - 0
    - 0
    - 0
    1:
    - 255
    - 0
    - 0
    2:
    - 0
    - 0
    - 255
  img_means:
  - 0.42437694635119927
  - 0.5040352731582203
  - 0.4624443929799188
  img_prop:
    depth: 3
    height: 720
    width: 1280
  img_stds:
  - 0.14995696217848253
  - 0.1564923538805844
  - 0.17123649653037434
  labels:
    0: ground
    1: carrot
    2: weed
  labels_w:
    0: 0.10334995
    1: 1.0
    2: 1.0
  location: dataset/leaf1280/
  name: leaf1280
  workers: 12
decoder:
  bn_d: 0.001
  dropout: 0.0
  extra:
    aspp_channels: 64
    last_channels: 32
    skip_os:
    - 4
    - 2
  name: aspp_residual_attention
  train: true
head:
  dropout: 0.0
  name: segmentation
train:
  avg_N: 3
  batch_size: 3
  crop_prop:
    height: 448
    width: 448
  down_epochs: 0
  final_decay: 0.99
  loss: xentropy
  max_epochs: 1000
  max_lr: 0.001
  max_momentum: 0.9
  min_lr: 0.001
  min_momentum: 0.9
  report_batch: 1
  report_epoch: 1
  save_imgs: false
  save_summary: false
  up_epochs: 0
  w_decay: 1.0e-05
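
For reference, the per-class frequencies behind the weights are just normalized pixel counts over the label images, so two classes coming back exactly zero usually means those class ids never occur in the labels as read from disk, e.g. because the label files are RGB color maps rather than single-channel index images. A sketch of the counting, where label_paths is a hypothetical list of label files:

import numpy as np
from PIL import Image

counts = np.zeros(3, dtype=np.int64)
for path in label_paths:
    lbl = np.array(Image.open(path))  # must be single-channel class ids 0..2
    counts += np.bincount(lbl.reshape(-1), minlength=3)[:3]
freq = counts / counts.sum()  # a zero here means that id never occurs
print("Frequency:", freq)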

Worse accuracy with ONNX version (inferenced in C++ by OpenCV)

Hello,

I used your script make_deploy_model.py to create ONNX versions of the person segmentation models (all three of them).
Then I ran inference with OpenCV:

cv::dnn::Net net = cv::dnn::readNetFromONNX("my_path/erf/model.onnx");
cv::Mat inpBlob = cv::dnn::blobFromImage(image, ...); // necessary blob image
net.setInput(inpBlob);
cv::Mat output = net.forward();
cv::Mat mask(H, W, CV_32F, output.ptr(0, 1));
Then I did some postprocessing on the obtained mask, following your source code (the necessary information, like img_means for extracting the proper pixels, I took from cfg.yaml).

I would like to ask whether the ONNX version can have worse accuracy, or whether I'm doing something wrong in the postprocessing step on the network output. I was also wondering whether the blobFromImage step could cause some information loss.

Please let me know your opinion.
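
An accuracy gap like this often comes from preprocessing rather than the export itself: blobFromImage has to reproduce exactly the normalization (img_means/img_stds scaling and RGB vs. BGR channel order) used in training. One way to rule the export in or out is to compare raw PyTorch and ONNX outputs on the identical normalized tensor; a sketch using onnxruntime, where pytorch_model stands for an already-loaded model (hypothetical name):

import numpy as np
import onnxruntime as ort
import torch

x = torch.randn(1, 3, 512, 512)  # stand-in for one correctly normalized image
with torch.no_grad():
    ref = pytorch_model(x).numpy()
sess = ort.InferenceSession("model.onnx")
out = sess.run(None, {sess.get_inputs()[0].name: x.numpy()})[0]
print(np.abs(ref - out).max())  # a large gap points at the export, not OpenCV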

sample results comparison:

/bin/sh: 1: cannot create /etc/passwd: Permission denied when running nvidia-docker build -t tano297/bonnetal:runtime -f Dockerfile .

Dear maintainers,
I am getting the following error when running nvidia-docker build -t tano297/bonnetal:runtime -f Dockerfile .

---> Running in 8399ad0e2141
/bin/sh: 1: cannot create /etc/passwd: Permission denied
The command '/bin/sh -c export uid=1000 gid=1000 && mkdir -p /home/developer && mkdir -p /etc/sudoers.d && echo "developer:x:${uid}:${gid}:Developer,,,:/home/developer:/bin/bash" >> /etc/passwd && echo "developer:x:${uid}:" >> /etc/group && echo "developer ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/developer && chmod 0440 /etc/sudoers.d/developer && chown ${uid}:${gid} -R /home/developer && adduser developer sudo' returned a non-zero code: 2

int64 support for some operations not supported

I have installed all the pip packages in a venv, and when I run pip list, everything matches up. I also installed pytorch from source. When I attempt to run

python3 train.py -c /tank/home/xury1/segmentation/bonnetal/train/tasks/segmentation/config/persons/mobilenetv2_test.yaml --log /tank/home/xury1/segmentation/bonnetal/train/tasks/segmentation/log -p /dev/null

INTERFACE:
config yaml: /tank/home/xury1/segmentation/bonnetal/train/tasks/segmentation/config/persons/mobilenetv2_test.yaml
log dir /tank/home/xury1/segmentation/bonnetal/train/tasks/segmentation/log
model path /dev/null
eval only False
No batchnorm False

Commit hash (training version): b'5368eed'

Opening config file /tank/home/xury1/segmentation/bonnetal/train/tasks/segmentation/config/persons/mobilenetv2_test.yaml
model folder doesnt exist! Start with random weights...
Copying files to /tank/home/xury1/segmentation/bonnetal/train/tasks/segmentation/log for further reference.
Images from: /tank/home/xury1/segmentation_data/persons/roads_annotated/ds1/train/img
Labels from: /tank/home/xury1/segmentation_data/persons/roads_annotated/ds1/train/lbl
Inference batch size: 3
Images from: /tank/home/xury1/segmentation_data/persons/roads_annotated/ds1/valid/img
Labels from: /tank/home/xury1/segmentation_data/persons/roads_annotated/ds1/valid/lbl
Original OS: 32
New OS: 16.0
[Decoder] os: 8 in: 32 skip: 32 out: 32
[Decoder] os: 4 in: 32 skip: 24 out: 24
[Decoder] os: 2 in: 24 skip: 16 out: 16
[Decoder] os: 1 in: 16 skip: 3 out: 16
Using normalized weights as bias for head.

Couldn't load backbone, using random weights. Error: [Errno 20] Not a directory: '/dev/null/backbone'
Couldn't load decoder, using random weights. Error: [Errno 20] Not a directory: '/dev/null/segmentation_decoder'
Couldn't load head, using random weights. Error: [Errno 20] Not a directory: '/dev/null/segmentation_head'
Total number of parameters: 2154794
Total number of parameters requires_grad: 2154794
Param encoder 1812800
Param decoder 341960
Param head 34
Training in device: cuda
/tank/home/xury1/segmentation/bonnetal/train/tasks/segmentation/bonnetal/lib/python3.5/site-packages/torch/optim/lr_scheduler.py:100: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
[IOU EVAL] IGNORE: tensor([], dtype=torch.int64)
[IOU EVAL] INCLUDE: tensor([0, 1])
Traceback (most recent call last):
  File "train.py", line 118, in <module>
    trainer.train()
  File "../../tasks/segmentation/modules/trainer.py", line 302, in train
    scheduler=self.scheduler)
  File "../../tasks/segmentation/modules/trainer.py", line 494, in train_epoch
    evaluator.addBatch(output.argmax(dim=1), target)
  File "../../tasks/segmentation/modules/ioueval.py", line 42, in addBatch
    tuple(idxs), self.ones, accumulate=True)
RuntimeError: "embedding_backward" not implemented for 'Long'
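
This error typically appears when the advanced-indexing accumulation (index_put_ with accumulate=True) is asked to differentiate through Long tensors; computing the same confusion-matrix histogram with torch.bincount on detached tensors sidesteps autograd entirely. A sketch of that workaround (add_batch, conf, n_classes are placeholder names, not bonnetal's actual API):

import torch

def add_batch(conf, pred, gt, n_classes):
    # Fuse each (gt, pred) pair into one bin id and histogram the ids;
    # detach() keeps autograd away from the Long-tensor indexing.
    idxs = (gt.reshape(-1) * n_classes + pred.reshape(-1)).detach()
    binc = torch.bincount(idxs, minlength=n_classes * n_classes)
    return conf + binc.reshape(n_classes, n_classes)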

terminate called after throwing an instance of 'c10::Error'

Hi,
I am trying to run segmentation using a pretrained model.
I am using docker on Ubuntu 18.04 with a GPU.
nvidia-smi works fine (but the whole GPU memory is already in use by some training running in the background).

nvidia-docker run -ti --rm -e DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -v $HOME/.Xauthority:/home/developer/.Xauthority -v /home/$USER:/home/$USER --net=host --pid=host -v /mnt/Data/dataset002mp4:/home/developer/dataset2 --ipc=host tano297/bonnetal:runtime /bin/bash

In docker:

cd deploy
catkin init
catkin build
cd ~/bonnetal/deploy/devel/lib/bonnetal_segmentation_standalone
./infer_img -p mapillary_darknet53_aspp_res_512_os8_40/ -i ~/dataset2/frames/00000001.jpg -v

I get:

================================================================================
image: /home/developer/dataset2/frames/00000001.jpg
path: mapillary_darknet53_aspp_res_512_os8_40//
backend: pytorch. Using default!
verbose: 1
================================================================================
Trying to open model
Could not send model to GPU, using CPU
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed, file path: mapillary_darknet53_aspp_res_512_os8_40///model.pytorch (FileAdapter at ../caffe2/serialize/file_adapter.cc:11)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7f824a5e845c in /usr/local/lib/libc10.so)
frame #1: caffe2::serialize::FileAdapter::FileAdapter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x208 (0x7f82c2382538 in /usr/local/lib/libcaffe2.so)
frame #2: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x40 (0x7f824b1f9250 in /usr/local/lib/libtorch.so.1)
frame #3: bonnetal::segmentation::NetPytorch::NetPytorch(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3e1 (0x7f82c4e73171 in /home/developer/bonnetal/deploy/devel/.private/bonnetal_segmentation_lib/lib/libbonnetal_segmentation_lib.so)
frame #4: bonnetal::segmentation::make_net(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3a6 (0x7f82c4e71926 in /home/developer/bonnetal/deploy/devel/.private/bonnetal_segmentation_lib/lib/libbonnetal_segmentation_lib.so)
frame #5: <unknown function> + 0x7dfe (0x55dc04edddfe in ./infer_img)
frame #6: __libc_start_main + 0xe7 (0x7f824bd55b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x87ea (0x55dc04ede7ea in ./infer_img)

Aborted (core dumped)

Anything obvious?
Is it related to there being no free memory on the GPU?
I also tried with CUDA_VISIBLE_DEVICES=''.
I was looking for an example of how to use the pretrained models, but haven't found any instructions.
I am finally going to use these models and present the results on YouTube.
I would be very grateful for any help.

BTW, I am using docker because I have ROS1 with catkin_make and no catkin command.
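
The trace itself is not GPU-related: torch::jit::load fails with "open file failed" on mapillary_darknet53_aspp_res_512_os8_40///model.pytorch, i.e. the traced model file is missing from (or misnamed in) that model directory; "Could not send model to GPU, using CPU" is only a warning. The same check can be done from Python before touching the C++ stack (a sketch):

import torch

# Fails with the same "open file failed" if model.pytorch is absent.
m = torch.jit.load("mapillary_darknet53_aspp_res_512_os8_40/model.pytorch",
                   map_location="cpu")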

bash: /home/developer/bonnetal/deploy/devel/setup.bash: No such file or directory

Hello everyone! I used bonnet for semantic segmentation before and am now switching to bonnetal.

When I try to run the example it gives me the following:

developer@my-pc:/bonnetal/train/tasks/segmentation$ ./train.py -c ./config/coco/mobilenetv2_aspp_res
mobilenetv2_aspp_res.yaml  mobilenetv2_aspp_res_attention.yaml
developer@my-pc:/bonnetal/train/tasks/segmentation$ ./train.py -c ./config/coco/mobilenetv2_aspp_res.yaml

INTERFACE:
config yaml: ./config/coco/mobilenetv2_aspp_res.yaml
log dir /home/developer/logs/2019-8-01-09:06/
model path None
eval only False
No batchnorm False

Commit hash (training version): b'5aed807'

Opening config file ./config/coco/mobilenetv2_aspp_res.yaml
./train.py:80: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
CFG = yaml.load(f)
No pretrained directory found.
Copying files to /home/developer/logs/2019-8-01-09:06/ for further reference.
Images from: /cache/datasets/coco/train2017
Labels from: /cache/datasets/coco/annotations/panoptic_train2017_remap
Traceback (most recent call last):
  File "./train.py", line 116, in <module>
    trainer = Trainer(CFG, FLAGS.log, FLAGS.path, FLAGS.eval, FLAGS.no_batchnorm)
  File "../../tasks/segmentation/modules/trainer.py", line 68, in __init__
    workers=self.CFG["dataset"]["workers"])
  File "../..//tasks/segmentation/dataset/coco/parser.py", line 377, in __init__
    drop_last=True)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 176, in __init__
    sampler = RandomSampler(dataset)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 66, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
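
num_samples=0 means the COCO parser matched zero image/label pairs at the paths it printed, so either the dataset is not mounted at /cache/datasets/coco or the remapped panoptic labels were never generated. A quick sanity check on the two directories from the log:

import os

for d in ("/cache/datasets/coco/train2017",
          "/cache/datasets/coco/annotations/panoptic_train2017_remap"):
    n = len(os.listdir(d)) if os.path.isdir(d) else 0
    print(d, "->", n, "files")  # 0 in either directory reproduces the error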

I also got the following error when running the docker image:

After the command:
sudo nvidia-docker run -ti --rm -e DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -v $HOME/.Xauthority:/home/developer/.Xauthority -v /home/$USER:/home/$USER --net=host --pid=host --ipc=host tano297/bonnetal:runtime /bin/bash

I got the following:
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
bash: /home/developer/bonnetal/deploy/devel/setup.bash: No such file or directory

Can you please help me to solve this problem? Many thanks!
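
As for the missing /home/developer/bonnetal/deploy/devel/setup.bash: the devel/ tree, including setup.bash, only exists after the catkin workspace has been built, so presumably running catkin init and catkin build inside ~/bonnetal/deploy (the same sequence shown in an issue above) will create it; the runtime image apparently does not ship a pre-built workspace.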
