
3d-boundingbox's Introduction

3D Bounding Box Estimation Using Deep Learning and Geometry

If interested, join the Slack workspace where the paper is discussed, issues are worked through, and more! Click this link to join.

Introduction

PyTorch implementation for this paper.

example-image

At the moment, it takes approximately 0.4 s per frame, depending on the number of objects detected. Speed improvements are planned. Here is the current fastest run: example-video

Requirements

  • PyTorch
  • CUDA
  • OpenCV >= 3.4.3

Usage

To download the weights:

cd weights/
./get_weights.sh

This will download pre-trained weights for the 3D BoundingBox net and also YOLOv3 weights from the official YOLO source.

If the script is not working: pre-trained weights and YOLO weights

To see all the options:

python Run.py --help

Run through all images in the default directory (eval/image_2/), optionally with the 2D bounding boxes also drawn. Press SPACE to proceed to the next image, and any other key to exit.

python Run.py [--show-yolo]

Note: see the Training section for where to download the data.

There is also a script provided to download the default video from Kitti in ./eval/video. Or, download any Kitti video and corresponding calibration and use --image-dir and --cal-dir to specify where to get the frames from.

python Run.py --video [--hide-debug]

Training

First, the data must be downloaded from Kitti. Download the left color images, the training labels, and the camera calibration matrices. The total is ~13 GB. Unzip the downloads into the Kitti/ directory.

python Train.py

By default, the model is saved every 10 epochs in weights/. The loss is printed every 10 batches. The loss should not converge to 0! The loss function for the orientation is driven to -1, so a negative loss is expected. The hyper-parameters to tune are alpha and w (see paper). I obtained good results after just 10 epochs, but the training script will run until 100.
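For reference, here is a minimal sketch of how a MultiBin-style loss can combine the three heads, which is why a well-fit orientation term sits near -1 rather than 0. This is an illustration, not the exact code in Train.py, and the default alpha and w values shown are assumptions:

    import torch
    import torch.nn.functional as F

    def multibin_loss(orient, conf, dim, gt_offset, gt_bin, gt_dim, alpha=0.6, w=0.4):
        """Illustrative sketch of the combined loss; see Train.py for the real one.
        gt_bin is a LongTensor of ground-truth angle-bin indices."""
        # Dimension residuals are regressed with plain MSE.
        dim_loss = F.mse_loss(dim, gt_dim)
        # Picking the correct angle bin is a classification problem.
        conf_loss = F.cross_entropy(conf, gt_bin)
        # Orientation: negative cosine between the ground-truth and predicted
        # angle offset in the ground-truth bin -- this term converges toward -1.
        idx = torch.arange(orient.shape[0], device=orient.device)
        pred = orient[idx, gt_bin]                        # (N, 2) = (cos, sin)
        pred_angle = torch.atan2(pred[:, 1], pred[:, 0])
        orient_loss = -torch.cos(gt_offset - pred_angle).mean()
        return alpha * dim_loss + conf_loss + w * orient_loss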

How it works

The PyTorch neural net takes in images of size 224x224 and predicts the orientation and the object's dimensions relative to the class average. Thus, another neural net must provide the 2D bounding box and object class; I chose to use YOLOv3 through OpenCV. Using the orientation, dimensions, and 2D bounding box, the 3D location is calculated and then back-projected onto the image.

There are 2 key assumptions made:

  1. The 2D bounding box fits very tightly around the object
  2. The object has ~0 pitch and ~0 roll (valid for cars on the road)
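To make the geometry concrete, here is a minimal sketch, assuming the standard KITTI conventions (location anchored at the bottom face of the box, camera y pointing down), of how the 8 corners of a 3D box are built from the predicted dimensions, location, and yaw and then back-projected into the image with the 3x4 calibration matrix P. Function and variable names are illustrative, not the repo's API:

    import numpy as np

    def box_corners(dims, loc, ry):
        """8 corners of a box with dimensions (h, w, l), yaw ry, bottom-center at loc."""
        h, w, l = dims
        x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
        y = np.array([ 0,  0,  0,  0, -h, -h, -h, -h], dtype=float)
        z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
        R = np.array([[ np.cos(ry), 0, np.sin(ry)],
                      [ 0,          1, 0         ],
                      [-np.sin(ry), 0, np.cos(ry)]])   # yaw only: pitch and roll ~0
        return (R @ np.vstack([x, y, z])).T + np.asarray(loc)

    def project_to_image(points_3d, P):
        """Project Nx3 camera-frame points through the 3x4 matrix P to pixel (u, v)."""
        homogeneous = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
        pixels = (P @ homogeneous.T).T
        return pixels[:, :2] / pixels[:, 2:3]            # divide by depth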

Future Goals

  • Train custom YOLO net on the Kitti dataset
  • Some type of Pose visualization (ROS?)

Credit

  1. I originally started from a fork of this repo, and some of the original code still exists in the training script.
  2. 2D-3D geometric conversion.

3d-boundingbox's People

Contributors

dangpnh2, fuenwang, skhadem


3d-boundingbox's Issues

Camera Calibration Parameters

Hi ... I am currently working on data generated using Unity 3D, and I am not able to get a proper 3D bounding box. My 2D bounding box output is correct, but the 3D box is going for a toss. My image size is 1024 x 1024. What changes do I need to make in order to map the 3D box onto my data?

error

yolo
Using previous model epoch_90.pkl
/home/zjut/anaconda3/envs/pytorch_GPU/lib/python3.7/site-packages/torchvision/models/_utils.py:209: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  f"The parameter '{pretrained_param}' is deprecated since 0.13 and will be removed in 0.15, "
/home/zjut/anaconda3/envs/pytorch_GPU/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=VGG19_BN_Weights.IMAGENET1K_V1. You can also use weights=VGG19_BN_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
Traceback (most recent call last):
  File "Run.py", line 203, in <module>
    main()
  File "Run.py", line 137, in main
    detections = yolo.detect(yolo_img)
  File "/home/zjut/code/3D-BoundingBox/yolo/yolo.py", line 31, in detect
    (H,W) = image.shape[:2]
ValueError: not enough values to unpack (expected 2, got 0)

if we have ground truth 2d box

I want to know: if we have ground-truth 2D bounding box annotations, to what level can the performance of this model be improved? Can someone give me an idea?

Convert to KITTI Format for Evaluation

Hello @skhadem, thank you so much for this implementation. I would like to ask how to convert the results back into KITTI format? I have a plan to reproduce the paper's results. Could you give me a hint which values should be put where for the KITTI format?
Also, I have a problem understanding the labels, as described in the development kit:

#Values    Name      Description
----------------------------------------------------------------------------
   1    type         Describes the type of object: 'Car', 'Van', 'Truck',
                     'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram',
                     'Misc' or 'DontCare'
   1    truncated    Float from 0 (non-truncated) to 1 (truncated), where
                     truncated refers to the object leaving image boundaries
   1    occluded     Integer (0,1,2,3) indicating occlusion state:
                     0 = fully visible, 1 = partly occluded
                     2 = largely occluded, 3 = unknown
   1    alpha        Observation angle of object, ranging [-pi..pi]
   4    bbox         2D bounding box of object in the image (0-based index):
                     contains left, top, right, bottom pixel coordinates
   3    dimensions   3D object dimensions: height, width, length (in meters)
   3    location     3D object location x,y,z in camera coordinates (in meters)
   1    rotation_y   Rotation ry around Y-axis in camera coordinates [-pi..pi]
   1    score        Only for results: Float, indicating confidence in
                     detection, needed for p/r curves, higher is better.

However, when I look at a sample label, say 000000.txt, the content is as follows:
Pedestrian 0.00 0 -0.20 712.40 143.00 810.73 307.92 1.89 0.48 1.20 1.84 1.47 8.41 0.01
As we can see, there are only 15 values instead of the 16 in the description. For evaluation, is it necessary to provide the score at the end?

Thank you so much
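For reference: ground-truth label files carry 15 values, while result files append a 16th score value that the evaluation uses for the precision/recall curves. A hedged sketch of formatting one result line (the -1 placeholders for truncated/occluded are a common convention when those fields are not estimated, not something this repo prescribes):

    def kitti_result_line(cls, alpha, box2d, dims, loc, ry, score):
        """Format one detection as a 16-value KITTI result line."""
        left, top, right, bottom = box2d   # pixels
        h, w, l = dims                     # meters
        x, y, z = loc                      # camera coordinates, meters
        # truncated / occluded are written as -1 because they are not predicted here.
        return (f"{cls} -1 -1 {alpha:.2f} "
                f"{left:.2f} {top:.2f} {right:.2f} {bottom:.2f} "
                f"{h:.2f} {w:.2f} {l:.2f} {x:.2f} {y:.2f} {z:.2f} "
                f"{ry:.2f} {score:.2f}")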

3D Bbox plotting and original image crop issue

Once a 3D bounding box is plotted, it is drawn on the original image (named 'img' in your code). However, the next crop is taken from that same 'img', which already contains one or more previously drawn 3D bounding boxes, i.e. cropped images with partial 3D bounding box plots are fed into the model for training or testing. So I changed the code to plot_img = np.copy(truth_img) and then plot the 3D bounding box on plot_img rather than on 'img'.
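A minimal sketch of that fix, reusing the variable names from the issue and the repo's plot_3d_box call (illustrative, not a verbatim patch):

    import numpy as np

    # Draw on a copy so the pristine frame is never modified ...
    plot_img = np.copy(truth_img)
    plot_3d_box(plot_img, cam_to_img, orient, dimensions, location)

    # ... and crop the network input from the untouched truth_img, so
    # previously drawn boxes never leak into the crops.
    crop = truth_img[ymin:ymax, xmin:xmax]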

Constraints in Math.py

I was unable to understand the formation of the constraints in the calculation of the translation vector in the Math.py file. Could you please give me a hint?

build config file for pytorch training model

Hi there,
I would like to transfer the PyTorch model to a caffemodel myself.
The cfg file, which stores the network structure, needs to be built.
However, the model we get after training is divided into 3 parts: dimension, orientation, and confidence,
so I am a bit confused about how to write the cfg file on my own.
I would appreciate it if you could provide a solution or suggestion.
Best regards.

YOLO does not use GPU

I think the YOLO implementation here is CPU-based while the PyTorch part is CUDA-based. How can I use CUDA for the YOLO detections as well?
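For the OpenCV DNN path used here, the backend and target can be switched to CUDA, provided OpenCV itself was built with CUDA support (DNN_BACKEND_CUDA exists since OpenCV 4.2). A hedged sketch; the repo's yolo/yolo.py may construct the net slightly differently:

    import cv2

    net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
    # Only takes effect if cv2 was compiled with CUDA DNN support; otherwise
    # OpenCV falls back to the CPU backend (usually with a warning).
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)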

How to generate Calibration file for my own dataset?

I want to know how KITTI generated their calib files, because when I generate one it has a totally different syntax and gives no fruitful results. I made mine with the OpenCV chessboard method. Kindly tell me how I can get my calib file in the same syntax as KITTI's. Thanks
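For context, a KITTI object-detection calib file is plain text where each line is a key (P0..P3, R0_rect, Tr_velo_to_cam, ...) followed by a row-major matrix. For a single camera, P2 is essentially the 3x3 intrinsic matrix K from cv2.calibrateCamera padded with a zero translation column. A hedged sketch (the file name, key, and numbers are illustrative; check which key this repo's calibration parser actually reads):

    import numpy as np

    # Intrinsics from the OpenCV chessboard calibration (illustrative values).
    fx, fy, cx, cy = 980.0, 980.0, 512.0, 512.0
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])

    # KITTI stores a 3x4 projection; with no stereo baseline the last column is zero.
    P2 = np.hstack([K, np.zeros((3, 1))])

    with open("000000.txt", "w") as f:   # KITTI-style: one calib file per image
        f.write("P2: " + " ".join(f"{v:.12e}" for v in P2.flatten()) + "\n")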

Dataset label "Location" property

Hi,

I'm wondering how you construct your dataset from Kitti, especially the "Location" keyword in the label. I don't quite understand how the following lines of code from torch_lib/Dataset.py work:

    Location = [line[11], line[12], line[13]] # x, y, z
    Location[1] -= Dimension[0] / 2 # bring the KITTI center up to the middle of the object

Why does the y component of "Location" correspond to the first component of "Dimension"?
This also appears in library/Math.py in the calc_location() function:

# using a different coord system
dx = dimension[2] / 2
dy = dimension[0] / 2
dz = dimension[1] / 2

Why are you switching the coordinate system, and can you please tell me how you parse the raw location information from the Kitti dataset?

p.s. I noticed this because I tried to read the ground truth labels directly from the generated dataset and plot them using your plot_3d_box(img, cam_to_img, orient, dimensions, location) function. However, there is a serious offset in the location, especially in the y coordinate. Can you please tell me how to read the ground truth location from the generated dataset and plot it correctly?

Thank you so much~
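For reference, a short sketch of the convention at play, assuming the standard KITTI label layout shown in the dev-kit excerpt above: dimensions are stored as (height, width, length), and the label location is the bottom center of the box in camera coordinates, where y points down, so subtracting h/2 moves it up to the geometric center:

    # Sample label line from 000000.txt quoted in the issue above:
    line = ("Pedestrian 0.00 0 -0.20 712.40 143.00 810.73 307.92 "
            "1.89 0.48 1.20 1.84 1.47 8.41 0.01").split()

    h, w, l = map(float, line[8:11])   # dimensions: height, width, length (m)
    x, y, z = map(float, line[11:14])  # bottom-center of the box, camera frame (m)

    # Camera y points down, so the geometric center is h/2 *above* the bottom face,
    # i.e. the same shift as Dataset.py's `Location[1] -= Dimension[0] / 2`.
    center = (x, y - h / 2.0, z)

Under this (h, w, l) ordering, Math.py's dx = dimension[2]/2, dy = dimension[0]/2, dz = dimension[1]/2 is just half-length, half-height, half-width expressed along the camera's x, y, z axes.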

assumption in the paper

Hi
I have a question. In the paper 3D Bounding Box Estimation Using Deep Learning and Geometry, there is an assumption that the 3D bounding box fits tightly into the 2D detection window, which requires each side of the 2D bounding box to be touched by the projection of at least one of the 3D box corners. I have tested your code, and it seems that you have not considered that. Could you please discuss this?

Whether it could be implemented using images from surveillance camera?

Does anybody know if this method could be used for vehicle 3D detection in residential scenes?
In that case, the images come from surveillance cameras mounted about 3-4 meters above the ground.
Thus, the roll angle of the cameras is definitely not 0.
If I directly use the pretrained model and the original camera calibration files, the predictions are really bad on my scenes.
So how could I change the camera calibration files to make them suitable for image datasets from surveillance cameras?
Or is that not possible?

how to train on custom datasets

I have my own dataset, whose format is like Pascal VOC with labeled images only, and I do not have calibration files. How can I train on my own dataset?

An Explanation of angles used in paper and used in code.

Hi,

I am attempting to use this method to train on my own dataset which I have generated in Unity using the Unity Perception Package, therefore this requires quite a few modifications of the Dataset class. Unity will generate the ground truth and provide me with the following:

X,Y,Z position of the 3D bounding box center wrt. the camera
Object dimensions
Object rotation wrt. global coordinate frame
2D bounding box coordinates within the image
Camera intrinsic matrix

In the corresponding paper, the three angles of interest are Theta Ray, Theta L, and Theta. I believe I understand what these are and the correspondence between them:

Theta ray is the ray angle of the object center (calculated as the angle between the camera principal point and 3D bounding box center).
Theta L is the local orientation i.e. orientation of object wrt. to the camera.
Theta is the global orientation of the object.
Theta = Theta Ray + Theta L

However, looking at the Dataset class, there are references to three different angles: Alpha, Ry, and theta_ray. As far as I understand it, Alpha is equivalent to Theta L (as this is what you are regressing), Ry is equivalent to Theta (global orientation), and theta_ray is self-explanatory.

As far as I am aware, theta_ray is calculated using the position of the 2D bounding box within the image, the model predicts Alpha, and using the correspondence between these we can find the global orientation of the object.

I would just like to confirm that all this is correct, as I have been having a hard time understanding this.
Your feedback is greatly appreciated :)
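For reference, the relationship described above (Theta = Theta Ray + Theta L, with Alpha ≈ Theta L and Ry ≈ Theta) can be sketched as follows. This is a simplified illustration, not the repo's exact calc_theta_ray; box_2d is assumed to be (left, top, right, bottom), and fx, cx come from the camera intrinsic matrix:

    import numpy as np

    def theta_ray_from_box(box_2d, fx, cx):
        """Ray angle of the 2D box center, from the camera intrinsics."""
        u_center = (box_2d[0] + box_2d[2]) / 2.0        # (left + right) / 2
        return np.arctan2(u_center - cx, fx)

    def global_yaw(alpha, theta_ray):
        """Recover the global yaw Ry from the regressed local angle alpha."""
        ry = alpha + theta_ray
        return (ry + np.pi) % (2 * np.pi) - np.pi        # wrap to [-pi, pi)

This is the same relation as KITTI's alpha = ry - theta, just rearranged.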

How to handle truncation?

The predictions are way off from the ground truth when the object is truncated, i.e. only part of the object is within the image boundary. Estimating the position uses the four bounding box corners, but when there is truncation the bounding box does not cover all of the object, only the part that is within the image.

Is there a way to overcome this problem? Or is this method simply just not suitable for cases where truncation occurs?

replace vgg with resnet

Hi
Thanks for your great work. I am now trying to replace the backbone of the second stage with ResNet, since ResNet usually performs better on the same task. However, after I replace the VGG with ResNet, the results are terrible. I wonder if you have done the same thing, and could you please tell me whether this idea is useful? Thank you!

Best regards
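For what it's worth, here is one hedged way to wire a ResNet-18 backbone into the same three heads; this is an illustration, not the repo's Model.py. Two common sources of trouble when swapping are the flattened feature size (VGG-19 features on a 224x224 crop are 512x7x7, while a pooled ResNet gives a 512-dim vector) and forgetting to re-normalize the per-bin (cos, sin) outputs:

    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision.models as models

    class ResNetRegressor(nn.Module):
        """Hypothetical backbone swap: ResNet-18 features feeding the three heads."""
        def __init__(self, bins=2):
            super().__init__()
            self.bins = bins
            resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            self.features = nn.Sequential(*list(resnet.children())[:-1])  # (N, 512, 1, 1)
            self.orientation = nn.Sequential(nn.Linear(512, 256), nn.ReLU(True), nn.Linear(256, bins * 2))
            self.confidence = nn.Sequential(nn.Linear(512, 256), nn.ReLU(True), nn.Linear(256, bins))
            self.dimension = nn.Sequential(nn.Linear(512, 256), nn.ReLU(True), nn.Linear(256, 3))

        def forward(self, x):
            f = self.features(x).flatten(1)
            orient = self.orientation(f).view(-1, self.bins, 2)
            orient = F.normalize(orient, dim=2)   # unit (cos, sin) per angle bin
            return orient, self.confidence(f), self.dimension(f)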

How to get the score (confidence) of a 3D Bounding Box?

Hi folks,

I have observed this part of the source code:

"""
det.type in self.classes and det.score > self.score_thres):

            intrinsics = ros_intrinsics(self.camera_info.P)
            input_tensor,theta_ray = preprocessing(image,det,intrinsics)
            [orient, conf, dim] = self.model(input_tensor) #Apply the model to get the estimation
            orient = orient.cpu().data.numpy()[0, :, :]
            conf = conf.cpu().data.numpy()[0, :]
            dim = dim.cpu().data.numpy()[0, :]
            # print("Conf:{}".format(conf))
            dim += self.averages.get_item(det.type)

            argmax = np.argmax(conf)
            orient = orient[argmax, :]
            cos = orient[0]
            sin = orient[1]
            alpha = np.arctan2(sin, cos)
            alpha += self.angle_bins[argmax]
            alpha -= np.pi

"""

But that conf is a tuple of two numbers, which is used to determine the best orientation, like this:

"""
Conf:[ 6.3896847 -6.5501723]
Conf:[ 6.496025 -6.7066655]
Conf:[ 5.410366 -5.5474744]
Conf:[ 7.092432 -7.3124714]
Conf:[ 9.061753 -9.251386]
Conf:[ 7.587371 -7.831802]
Conf:[ 2.149212 -2.1235662]
Conf:[-0.84504336 0.89392436]
Conf:[ 4.436549 -4.5268965]
Conf:[ 1.2938225 -1.4327605]
"""

How can I get the score of the final 3D Bounding Box? (0 to 1 value, like in every 2D or 3D object detector)

Thanks in advance.
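For reference, that conf output is a pair of raw logits over the two orientation bins, so a softmax turns it into a 0-1 value. A hedged sketch; keep in mind this measures how sure the net is about the orientation bin, not whether the 3D box itself is correct, so a common choice for the overall box score is to reuse (or multiply by) the 2D detector's det.score:

    import numpy as np

    def bin_confidence(conf_logits):
        """Softmax over the angle-bin logits -> a 0-1 value for the chosen bin."""
        e = np.exp(conf_logits - np.max(conf_logits))
        probs = e / e.sum()
        return probs[np.argmax(conf_logits)]

    # e.g. for Conf:[ 6.39 -6.55 ] above, bin_confidence returns ~1.0.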

regular image and camera

Hi
I was just wondering whether the training process can run on normal images, such as those taken with a regular camera, instead of Velodyne or stereo camera data?

some code I don't understand

Hi
Thanks for your great work, it has helped me a lot. However, there is some code that I don't understand, for example: dim += averages.get_item(label['Class']). Could you please explain it? Thank you very much!
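As described in the README's "How it works" section, the network regresses dimension residuals relative to the class-average dimensions, so the class average has to be added back at evaluation time. A small illustrative sketch (the numbers are made up, not the repo's actual averages):

    import numpy as np

    # Per-class mean dimensions (h, w, l) computed over the training set.
    class_averages = {"Car": np.array([1.52, 1.63, 3.88])}   # illustrative values

    dim_residual = np.array([0.05, -0.02, 0.10])     # what the network outputs
    dim = dim_residual + class_averages["Car"]       # dim += averages.get_item(label['Class'])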

error: IndexError: invalid index to scalar variable.

$ python Run.py

output:

Traceback (most recent call last):
  File "Run.py", line 201, in <module>
    main()
  File "Run.py", line 137, in main
    detections = yolo.detect(yolo_img)
  File "code/3D-BoundingBox-master/yolo/yolo.py", line 34, in detect
    ln = [ln[i[0] - 1] for i in self.net.getUnconnectedOutLayers()]
  File "code/3D-BoundingBox-master/yolo/yolo.py", line 34, in <listcomp>
    ln = [ln[i[0] - 1] for i in self.net.getUnconnectedOutLayers()]
IndexError: invalid index to scalar variable.
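This is the usual symptom of a newer OpenCV release, where getUnconnectedOutLayers() returns a flat array of ints instead of an Nx1 array, so indexing with i[0] fails. A hedged fix for the quoted line in yolo/yolo.py that works with both behaviors:

    import numpy as np

    # Flatten handles both the old Nx1 return shape and the new flat one.
    out_ids = np.asarray(self.net.getUnconnectedOutLayers()).flatten()
    ln = [ln[i - 1] for i in out_ids]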
