
3d-boundingbox's Introduction

3D Bounding Box Estimation Using Deep Learning and Geometry

If interested, join the Slack workspace where the paper is discussed, issues are worked through, and more! Click this link to join.

Introduction

PyTorch implementation for this paper.

example-image

At the moment, it takes approximately 0.4 s per frame, depending on the number of objects detected. Speed improvements are planned. Here is the current fastest run: example-video

Requirements

  • PyTorch
  • CUDA
  • OpenCV >= 3.4.3

Usage

To download the weights:

cd weights/
./get_weights.sh

This will download pre-trained weights for the 3D BoundingBox net and also YOLOv3 weights from the official YOLO source.

If the script is not working: pre-trained weights and YOLO weights

To see all the options:

python Run.py --help

Run through all images in the default directory (eval/image_2/), optionally with the 2D bounding boxes also drawn. Press SPACE to proceed to the next image, and any other key to exit.

python Run.py [--show-yolo]

Note: see the Training section for where to download the data.

There is also a script provided to download the default video from Kitti in ./eval/video. Or, download any Kitti video and corresponding calibration and use --image-dir and --cal-dir to specify where to get the frames from.

python Run.py --video [--hide-debug]

Training

First, the data must be downloaded from Kitti. Download the left color images, the training labels, and the camera calibration matrices. The total is ~13 GB. Unzip the downloads into the Kitti/ directory.

python Train.py

By default, the model is saved every 10 epochs in weights/. The loss is printed every 10 batches. The loss should not converge to 0! The loss function for the orientation is driven to -1, so a negative loss is expected. The hyper-parameters to tune are alpha and w (see paper). I obtained good results after just 10 epochs, but the training script will run until 100.
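For reference, here is a minimal sketch of how a MultiBin-style loss can combine the three heads, which is why a well-fit orientation term sits near -1 rather than 0. This is an illustration, not the exact code in Train.py, and the default alpha and w values shown are assumptions:

    import torch
    import torch.nn.functional as F

    def multibin_loss(orient, conf, dim, gt_offset, gt_bin, gt_dim, alpha=0.6, w=0.4):
        """Illustrative sketch of the combined loss; see Train.py for the real one.
        gt_bin is a LongTensor of ground-truth angle-bin indices."""
        # Dimension residuals are regressed with plain MSE.
        dim_loss = F.mse_loss(dim, gt_dim)
        # Picking the correct angle bin is a classification problem.
        conf_loss = F.cross_entropy(conf, gt_bin)
        # Orientation: negative cosine between the ground-truth and predicted
        # angle offset in the ground-truth bin -- this term converges toward -1.
        idx = torch.arange(orient.shape[0], device=orient.device)
        pred = orient[idx, gt_bin]                        # (N, 2) = (cos, sin)
        pred_angle = torch.atan2(pred[:, 1], pred[:, 0])
        orient_loss = -torch.cos(gt_offset - pred_angle).mean()
        return alpha * dim_loss + conf_loss + w * orient_loss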

How it works

The PyTorch neural net takes in images of size 224x224 and predicts the orientation and the object's dimensions relative to the class average. Thus, another neural net must provide the 2D bounding box and object class; I chose to use YOLOv3 through OpenCV. Using the orientation, dimensions, and 2D bounding box, the 3D location is calculated and then back-projected onto the image.

There are 2 key assumptions made:

  1. The 2D bounding box fits very tightly around the object
  2. The object has ~0 pitch and ~0 roll (valid for cars on the road)
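To make the geometry concrete, here is a minimal sketch, assuming the standard KITTI conventions (location anchored at the bottom face of the box, camera y pointing down), of how the 8 corners of a 3D box are built from the predicted dimensions, location, and yaw and then back-projected into the image with the 3x4 calibration matrix P. Function and variable names are illustrative, not the repo's API:

    import numpy as np

    def box_corners(dims, loc, ry):
        """8 corners of a box with dimensions (h, w, l), yaw ry, bottom-center at loc."""
        h, w, l = dims
        x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
        y = np.array([ 0,  0,  0,  0, -h, -h, -h, -h], dtype=float)
        z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
        R = np.array([[ np.cos(ry), 0, np.sin(ry)],
                      [ 0,          1, 0         ],
                      [-np.sin(ry), 0, np.cos(ry)]])   # yaw only: pitch and roll ~0
        return (R @ np.vstack([x, y, z])).T + np.asarray(loc)

    def project_to_image(points_3d, P):
        """Project Nx3 camera-frame points through the 3x4 matrix P to pixel (u, v)."""
        homogeneous = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
        pixels = (P @ homogeneous.T).T
        return pixels[:, :2] / pixels[:, 2:3]            # divide by depth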

Future Goals

  • Train custom YOLO net on the Kitti dataset
  • Some type of Pose visualization (ROS?)

Credit

  1. I originally started from a fork of this repo, and some of the original code still exists in the training script.
  2. 2D-3D geometric conversion.

3d-boundingbox's People

Contributors

dangpnh2, fuenwang, skhadem


3d-boundingbox's Issues

Camera Calibration Parameters

Hi ... I am currently working on data generated using Unity 3D, and I am not able to get a proper 3D bounding box. My 2D bounding box output is correct, but the 3D box is going for a toss. My image size is 1024 x 1024. What changes do I need to make in order to map the 3D box onto my data?

error

yolo
Using previous model epoch_90.pkl
/home/zjut/anaconda3/envs/pytorch_GPU/lib/python3.7/site-packages/torchvision/models/_utils.py:209: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  f"The parameter '{pretrained_param}' is deprecated since 0.13 and will be removed in 0.15, "
/home/zjut/anaconda3/envs/pytorch_GPU/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=VGG19_BN_Weights.IMAGENET1K_V1. You can also use weights=VGG19_BN_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
Traceback (most recent call last):
  File "Run.py", line 203, in <module>
    main()
  File "Run.py", line 137, in main
    detections = yolo.detect(yolo_img)
  File "/home/zjut/code/3D-BoundingBox/yolo/yolo.py", line 31, in detect
    (H,W) = image.shape[:2]
ValueError: not enough values to unpack (expected 2, got 0)

if we have ground truth 2d box

I want to know: if we have ground-truth 2D bounding box annotations, to what level can the performance of this model be improved? Can someone give me an idea?

Convert to KITTI Format for Evaluation

Hello @skhadem, thank you so much for this implementation. I would like to ask how to convert the results back into KITTI format? I have a plan to reproduce the paper's results. Could you give me a hint which values should be put where for the KITTI format?
Also, I have a problem understanding the labels, as described in the development kit:

#Values    Name      Description
----------------------------------------------------------------------------
   1    type         Describes the type of object: 'Car', 'Van', 'Truck',
                     'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram',
                     'Misc' or 'DontCare'
   1    truncated    Float from 0 (non-truncated) to 1 (truncated), where
                     truncated refers to the object leaving image boundaries
   1    occluded     Integer (0,1,2,3) indicating occlusion state:
                     0 = fully visible, 1 = partly occluded
                     2 = largely occluded, 3 = unknown
   1    alpha        Observation angle of object, ranging [-pi..pi]
   4    bbox         2D bounding box of object in the image (0-based index):
                     contains left, top, right, bottom pixel coordinates
   3    dimensions   3D object dimensions: height, width, length (in meters)
   3    location     3D object location x,y,z in camera coordinates (in meters)
   1    rotation_y   Rotation ry around Y-axis in camera coordinates [-pi..pi]
   1    score        Only for results: Float, indicating confidence in
                     detection, needed for p/r curves, higher is better.

However, when I look at a sample label, say 000000.txt, the content is as follows:
Pedestrian 0.00 0 -0.20 712.40 143.00 810.73 307.92 1.89 0.48 1.20 1.84 1.47 8.41 0.01
As we can see, there are only 15 values instead of the 16 in the description. For evaluation, is it necessary to provide the score at the end?

Thank you so much
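For reference: ground-truth label files carry 15 values, while result files append a 16th score value that the evaluation uses for the precision/recall curves. A hedged sketch of formatting one result line (the -1 placeholders for truncated/occluded are a common convention when those fields are not estimated, not something this repo prescribes):

    def kitti_result_line(cls, alpha, box2d, dims, loc, ry, score):
        """Format one detection as a 16-value KITTI result line."""
        left, top, right, bottom = box2d   # pixels
        h, w, l = dims                     # meters
        x, y, z = loc                      # camera coordinates, meters
        # truncated / occluded are written as -1 because they are not predicted here.
        return (f"{cls} -1 -1 {alpha:.2f} "
                f"{left:.2f} {top:.2f} {right:.2f} {bottom:.2f} "
                f"{h:.2f} {w:.2f} {l:.2f} {x:.2f} {y:.2f} {z:.2f} "
                f"{ry:.2f} {score:.2f}")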

3D Bbox plotting and original image crop issue

Once a 3D bounding box is plotted, it is drawn on the original image (named 'img' in your code). However, the next crop is taken from that same 'img', which already contains one or more previously drawn 3D bounding boxes, i.e. cropped images with partial 3D bounding box plots are fed into the model for training or testing. So I changed the code to plot_img = np.copy(truth_img) and then plot the 3D bounding box on plot_img rather than on 'img'.
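A minimal sketch of that fix, reusing the variable names from the issue and the repo's plot_3d_box call (illustrative, not a verbatim patch):

    import numpy as np

    # Draw on a copy so the pristine frame is never modified ...
    plot_img = np.copy(truth_img)
    plot_3d_box(plot_img, cam_to_img, orient, dimensions, location)

    # ... and crop the network input from the untouched truth_img, so
    # previously drawn boxes never leak into the crops.
    crop = truth_img[ymin:ymax, xmin:xmax]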

Constraints in Math.py

I was unable to understand the formation of the constraints in the calculation of the translation vector in the Math.py file. Could you please give me a hint?

build config file for pytorch training model

Hi there,
I would like to transfer the PyTorch model to a caffemodel myself.
The cfg file, which stores the network structure, needs to be built.
However, the model we get after training is divided into 3 parts: dimension, orientation, and confidence,
so I am a bit confused about how to write the cfg file on my own.
I would appreciate it if you could provide a solution or suggestion.
Best regards.

YOLO does not use GPU

I think the YOLO implementation here is CPU-based while the PyTorch part is CUDA-based. How can I use CUDA for the YOLO detections as well?
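For the OpenCV DNN path used here, the backend and target can be switched to CUDA, provided OpenCV itself was built with CUDA support (DNN_BACKEND_CUDA exists since OpenCV 4.2). A hedged sketch; the repo's yolo/yolo.py may construct the net slightly differently:

    import cv2

    net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
    # Only takes effect if cv2 was compiled with CUDA DNN support; otherwise
    # OpenCV falls back to the CPU backend (usually with a warning).
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)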

How to generate Calibration file for my own dataset?

I want to know how KITTI generated their calib files, because when I generate one it has a totally different syntax and gives no fruitful results. I made mine with the OpenCV chessboard method. Kindly tell me how I can get my calib file in the same syntax as KITTI's. Thanks
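For context, a KITTI object-detection calib file is plain text where each line is a key (P0..P3, R0_rect, Tr_velo_to_cam, ...) followed by a row-major matrix. For a single camera, P2 is essentially the 3x3 intrinsic matrix K from cv2.calibrateCamera padded with a zero translation column. A hedged sketch (the file name, key, and numbers are illustrative; check which key this repo's calibration parser actually reads):

    import numpy as np

    # Intrinsics from the OpenCV chessboard calibration (illustrative values).
    fx, fy, cx, cy = 980.0, 980.0, 512.0, 512.0
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])

    # KITTI stores a 3x4 projection; with no stereo baseline the last column is zero.
    P2 = np.hstack([K, np.zeros((3, 1))])

    with open("000000.txt", "w") as f:   # KITTI-style: one calib file per image
        f.write("P2: " + " ".join(f"{v:.12e}" for v in P2.flatten()) + "\n")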

Dataset label "Location" property

Hi,

I'm wondering how you construct your dataset from Kitti, especially the "Location" keyword in the label. I don't quite understand how the following lines of code from torch_lib/Dataset.py work:

    Location = [line[11], line[12], line[13]] # x, y, z
    Location[1] -= Dimension[0] / 2 # bring the KITTI center up to the middle of the object

Why does the y component of "Location" correspond to the first component of "Dimension"?
This also appears in library/Math.py in the calc_location() function:

# using a different coord system
dx = dimension[2] / 2
dy = dimension[0] / 2
dz = dimension[1] / 2

Why are you switching the coordinate system, and can you please tell me how you parse the raw location information from the Kitti dataset?

p.s. I noticed this because I tried to read the ground truth labels directly from the generated dataset and plot them using your plot_3d_box(img, cam_to_img, orient, dimensions, location) function. However, there is a serious offset in the location, especially in the y coordinate. Can you please tell me how to read the ground truth location from the generated dataset and plot it correctly?

Thank you so much~
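For reference, a short sketch of the convention at play, assuming the standard KITTI label layout shown in the dev-kit excerpt above: dimensions are stored as (height, width, length), and the label location is the bottom center of the box in camera coordinates, where y points down, so subtracting h/2 moves it up to the geometric center:

    # Sample label line from 000000.txt quoted in the issue above:
    line = ("Pedestrian 0.00 0 -0.20 712.40 143.00 810.73 307.92 "
            "1.89 0.48 1.20 1.84 1.47 8.41 0.01").split()

    h, w, l = map(float, line[8:11])   # dimensions: height, width, length (m)
    x, y, z = map(float, line[11:14])  # bottom-center of the box, camera frame (m)

    # Camera y points down, so the geometric center is h/2 *above* the bottom face,
    # i.e. the same shift as Dataset.py's `Location[1] -= Dimension[0] / 2`.
    center = (x, y - h / 2.0, z)

Under this (h, w, l) ordering, Math.py's dx = dimension[2]/2, dy = dimension[0]/2, dz = dimension[1]/2 is just half-length, half-height, half-width expressed along the camera's x, y, z axes.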

assumption in the paper

Hi
I have a question. In the paper 3D Bounding Box Estimation Using Deep Learning and Geometry, there is an assumption that the 3D bounding box fits tightly into the 2D detection window, which requires each side of the 2D bounding box to be touched by the projection of at least one of the 3D box corners. I have tested your code, and it seems that you have not considered that. Could you please discuss this?

Whether it could be implemented using images from surveillance camera?

Does anybody know if this method could be used for vehicle 3D detection in residential scenes?
In that case, the images come from surveillance cameras mounted about 3-4 meters above the ground.
Thus, the roll angle of the cameras is definitely not 0.
If I directly use the pretrained model and the original camera calibration files, the predictions are really bad on my scenes.
So how could I change the camera calibration files to make them suitable for image datasets from surveillance cameras?
Or is that not possible?

how to train on custom datasets

I have my own dataset, whose format is like Pascal VOC with labeled images only, and I do not have calibration files. How can I train on my own dataset?

An Explanation of angles used in paper and used in code.

Hi,

I am attempting to use this method to train on my own dataset which I have generated in Unity using the Unity Perception Package, therefore this requires quite a few modifications of the Dataset class. Unity will generate the ground truth and provide me with the following:

X,Y,Z position of the 3D bounding box center wrt. the camera
Object dimensions
Object rotation wrt. global coordinate frame
2D bounding box coordinates within the image
Camera intrinsic matrix

In the corresponding paper, the three angles of interest are Theta Ray, Theta L, and Theta. I believe I understand what these are and the correspondence between them:

Theta ray is the ray angle of the object center (calculated as the angle between the camera principal point and 3D bounding box center).
Theta L is the local orientation i.e. orientation of object wrt. to the camera.
Theta is the global orientation of the object.
Theta = Theta Ray + Theta L

However, looking at the Dataset class, there are references to three different angles: Alpha, Ry, and theta_ray. As far as I understand it, Alpha is equivalent to Theta L (as this is what you are regressing), Ry is equivalent to Theta (global orientation), and theta_ray is self-explanatory.

As far as I am aware, theta_ray is calculated using the position of the 2D bounding box within the image, the model predicts Alpha, and using the correspondence between these we can find the global orientation of the object.

I would just like to confirm that all this is correct, as I have been having a hard time understanding this.
Your feedback is greatly appreciated :)
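For reference, the relationship described above (Theta = Theta Ray + Theta L, with Alpha ≈ Theta L and Ry ≈ Theta) can be sketched as follows. This is a simplified illustration, not the repo's exact calc_theta_ray; box_2d is assumed to be (left, top, right, bottom), and fx, cx come from the camera intrinsic matrix:

    import numpy as np

    def theta_ray_from_box(box_2d, fx, cx):
        """Ray angle of the 2D box center, from the camera intrinsics."""
        u_center = (box_2d[0] + box_2d[2]) / 2.0        # (left + right) / 2
        return np.arctan2(u_center - cx, fx)

    def global_yaw(alpha, theta_ray):
        """Recover the global yaw Ry from the regressed local angle alpha."""
        ry = alpha + theta_ray
        return (ry + np.pi) % (2 * np.pi) - np.pi        # wrap to [-pi, pi)

This is the same relation as KITTI's alpha = ry - theta, just rearranged.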

How to handle truncation?

The predictions are way off from the ground truth when the object is truncated, i.e. only part of the object is within the image boundary. Estimating the position uses the four bounding box corners, but when there is truncation the bounding box does not cover all of the object, only the part that is within the image.

Is there a way to overcome this problem? Or is this method simply just not suitable for cases where truncation occurs?

replace vgg with resnet

Hi
Thanks for your great work. I am now trying to replace the backbone of the second stage with ResNet, since ResNet usually performs better on the same task. However, after I replace the VGG with ResNet, the results are terrible. I wonder if you have done the same thing, and could you please tell me whether this idea is useful? Thank you!

Best regards
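For what it's worth, here is one hedged way to wire a ResNet-18 backbone into the same three heads; this is an illustration, not the repo's Model.py. Two common sources of trouble when swapping are the flattened feature size (VGG-19 features on a 224x224 crop are 512x7x7, while a pooled ResNet gives a 512-dim vector) and forgetting to re-normalize the per-bin (cos, sin) outputs:

    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision.models as models

    class ResNetRegressor(nn.Module):
        """Hypothetical backbone swap: ResNet-18 features feeding the three heads."""
        def __init__(self, bins=2):
            super().__init__()
            self.bins = bins
            resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            self.features = nn.Sequential(*list(resnet.children())[:-1])  # (N, 512, 1, 1)
            self.orientation = nn.Sequential(nn.Linear(512, 256), nn.ReLU(True), nn.Linear(256, bins * 2))
            self.confidence = nn.Sequential(nn.Linear(512, 256), nn.ReLU(True), nn.Linear(256, bins))
            self.dimension = nn.Sequential(nn.Linear(512, 256), nn.ReLU(True), nn.Linear(256, 3))

        def forward(self, x):
            f = self.features(x).flatten(1)
            orient = self.orientation(f).view(-1, self.bins, 2)
            orient = F.normalize(orient, dim=2)   # unit (cos, sin) per angle bin
            return orient, self.confidence(f), self.dimension(f)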

How to get the score (confidence) of a 3D Bounding Box?

Hi folks,

I have observed this part of the source code:

"""
det.type in self.classes and det.score > self.score_thres):

            intrinsics = ros_intrinsics(self.camera_info.P)
            input_tensor,theta_ray = preprocessing(image,det,intrinsics)
            [orient, conf, dim] = self.model(input_tensor) #Apply the model to get the estimation
            orient = orient.cpu().data.numpy()[0, :, :]
            conf = conf.cpu().data.numpy()[0, :]
            dim = dim.cpu().data.numpy()[0, :]
            # print("Conf:{}".format(conf))
            dim += self.averages.get_item(det.type)

            argmax = np.argmax(conf)
            orient = orient[argmax, :]
            cos = orient[0]
            sin = orient[1]
            alpha = np.arctan2(sin, cos)
            alpha += self.angle_bins[argmax]
            alpha -= np.pi

"""

But that conf is a tuple of two numbers, which is used to determine the best orientation, like this:

"""
Conf:[ 6.3896847 -6.5501723]
Conf:[ 6.496025 -6.7066655]
Conf:[ 5.410366 -5.5474744]
Conf:[ 7.092432 -7.3124714]
Conf:[ 9.061753 -9.251386]
Conf:[ 7.587371 -7.831802]
Conf:[ 2.149212 -2.1235662]
Conf:[-0.84504336 0.89392436]
Conf:[ 4.436549 -4.5268965]
Conf:[ 1.2938225 -1.4327605]
"""

How can I get the score of the final 3D Bounding Box? (0 to 1 value, like in every 2D or 3D object detector)

Thanks in advance.
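For reference, that conf output is a pair of raw logits over the two orientation bins, so a softmax turns it into a 0-1 value. A hedged sketch; keep in mind this measures how sure the net is about the orientation bin, not whether the 3D box itself is correct, so a common choice for the overall box score is to reuse (or multiply by) the 2D detector's det.score:

    import numpy as np

    def bin_confidence(conf_logits):
        """Softmax over the angle-bin logits -> a 0-1 value for the chosen bin."""
        e = np.exp(conf_logits - np.max(conf_logits))
        probs = e / e.sum()
        return probs[np.argmax(conf_logits)]

    # e.g. for Conf:[ 6.39 -6.55 ] above, bin_confidence returns ~1.0.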

regular image and camera

Hi
I was just wondering whether the training process can run on normal images, such as those taken with a regular camera, instead of Velodyne or stereo camera data?

some code I don't understand

Hi
Thanks for your great work, it has helped me a lot. However, there is some code that I don't understand, for example: dim += averages.get_item(label['Class']). Could you please explain it? Thank you very much!
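As described in the README's "How it works" section, the network regresses dimension residuals relative to the class-average dimensions, so the class average has to be added back at evaluation time. A small illustrative sketch (the numbers are made up, not the repo's actual averages):

    import numpy as np

    # Per-class mean dimensions (h, w, l) computed over the training set.
    class_averages = {"Car": np.array([1.52, 1.63, 3.88])}   # illustrative values

    dim_residual = np.array([0.05, -0.02, 0.10])     # what the network outputs
    dim = dim_residual + class_averages["Car"]       # dim += averages.get_item(label['Class'])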

error: IndexError: invalid index to scalar variable.

$ python Run.py

output:

Traceback (most recent call last):
  File "Run.py", line 201, in <module>
    main()
  File "Run.py", line 137, in main
    detections = yolo.detect(yolo_img)
  File "code/3D-BoundingBox-master/yolo/yolo.py", line 34, in detect
    ln = [ln[i[0] - 1] for i in self.net.getUnconnectedOutLayers()]
  File "code/3D-BoundingBox-master/yolo/yolo.py", line 34, in <listcomp>
    ln = [ln[i[0] - 1] for i in self.net.getUnconnectedOutLayers()]
IndexError: invalid index to scalar variable.
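This is the usual symptom of a newer OpenCV release, where getUnconnectedOutLayers() returns a flat array of ints instead of an Nx1 array, so indexing with i[0] fails. A hedged fix for the quoted line in yolo/yolo.py that works with both behaviors:

    import numpy as np

    # Flatten handles both the old Nx1 return shape and the new flat one.
    out_ids = np.asarray(self.net.getUnconnectedOutLayers()).flatten()
    ln = [ln[i - 1] for i in out_ids]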
