
MonoDepth-FPN-PyTorch


A simple end-to-end model, implemented in PyTorch, that achieves state-of-the-art performance in depth prediction. We used a Feature Pyramid Network (FPN) backbone to estimate the depth map from a single input RGB image. We tested the performance of our model on the NYU Depth V2 Dataset (Official Split) and the KITTI Dataset (Eigen Split).

Requirements

  • Python 3
  • Jupyter Notebook (for visualization)
  • PyTorch
    • Tested with PyTorch 0.3.0.post4
  • CUDA 8 (if using CUDA)

To Run

python3 main_fpn.py --cuda --bs 6

To continue training from a saved model, use

python3 main_fpn.py --cuda --bs 6 --r True --checkepoch 10

To visualize the reconstructed data, run the Jupyter notebook vis.ipynb.

Data Processing

NYU Depth V2 Dataset

  • The NYU Depth V2 dataset contains a variety of indoor scenes, with 249 scenes for training and 215 scenes for testing. We used the official split for training and testing. This GitHub page provides a very detailed walkthrough and MATLAB code for data processing.
  • Following previous works, we used the official toolbox, which uses the colorization method proposed by Levin et al., to fill in the missing values of the depth maps in the training set.
  • For comparison with previous works, we evaluated our model on the official evaluation set of 654 densely labeled image pairs.
  • The images and depth maps in the NYU Depth V2 dataset are both of size 640x480. During training, we loaded the images as-is and downscaled the depth maps to 160x120. The proposed model produces depth maps of size 160x120, which are upsampled to the original size for evaluation.
  • We employed random crops, random rotations of up to 10 degrees, and color jitter on brightness, contrast, saturation, and hue with variation 0.1. To save time during training, we performed the data augmentation in advance by running dataset/augment.py.
  • PyTorch does not support applying a transformation to both the input and the target, so we implemented joint transforms for data augmentation (see the sketch after this list).
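For illustration, here is a minimal sketch of such a joint transform that applies the same random rotation and crop window to the image and the depth map. The class name and parameters are ours, not the repo's, and it assumes PIL inputs and a torchvision version with the functional API:

import random
import torchvision.transforms.functional as TF

class JointRandomRotateCrop:
    """Apply one sampled rotation and crop window to both image and depth."""
    def __init__(self, size, max_angle=10):
        self.size = size            # (height, width) of the output crop
        self.max_angle = max_angle  # degrees

    def __call__(self, img, depth):
        # Same random angle for both inputs.
        angle = random.uniform(-self.max_angle, self.max_angle)
        img = TF.rotate(img, angle)
        depth = TF.rotate(depth, angle)
        # Same random crop window for both inputs.
        w, h = img.size
        th, tw = self.size
        top = random.randint(0, h - th)
        left = random.randint(0, w - tw)
        img = TF.crop(img, top, left, th, tw)
        depth = TF.crop(depth, top, left, th, tw)
        return img, depth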

KITTI Dataset

  • The KITTI dataset consists of 61 outdoor scenes in the “city”, “road”, and “residential” categories. Following the official tutorial, we obtained about 22k image and depth map pairs for the training dataset.
  • Following previous works, we used the same NYU toolbox to fill in the missing values of the sparse depth maps in the training set.
  • To compare with previous studies, we evaluated on the test split of the KITTI dataset proposed by Eigen et al.
  • The images and depth maps in the KITTI dataset are both of size about 1280x384. During training, we downscaled the images to 640x192 and the depth maps to 160x48. The proposed model produces depth maps of size 160x48, which are upsampled to the original size for evaluation.
  • For better visualization in the quantitative evaluation, we filled in the missing depth values in the ground-truth depth maps.

Model

  • We used a Feature Pyramid Network (FPN) with ResNet-101 as the backbone, initialized with ImageNet-pretrained weights.
  • We used pixel shuffle for upsampling and fused feature maps with element-wise addition; bilinear interpolation is applied after pixel shuffle to handle inconsistent feature map sizes.
  • Two consecutive 3x3 convolutions are used for feature processing.
  • There is no non-linearity in the top-down branch of the FPN; ReLU is used in the other convolution layers, and a sigmoid in the prediction layer, for better stability.
  • We trained on a weighted sum of the depth loss, the gradient loss, and the normal loss for 20 epochs on the NYU Depth V2 dataset and 40 epochs on the KITTI dataset; the gradient loss is added after epoch 1 and the normal loss after epoch 10.
  • The model outputs predictions at 1/4 of the input resolution, which are evaluated after bilinear upsampling (see the sketch after this list).
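A minimal sketch of the upsampling and prediction blocks described above (pixel shuffle, additive fusion with bilinear size correction, two 3x3 convolutions, sigmoid output). The module names, channel counts, and layer arrangement are illustrative assumptions, not the repo's exact code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleAdd(nn.Module):
    """Upsample a coarser feature map with pixel shuffle and add it to a finer one."""
    def __init__(self, channels):
        super().__init__()
        # Pixel shuffle with factor 2 consumes 4x the channels.
        self.expand = nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, coarse, fine):
        x = self.shuffle(self.expand(coarse))
        # Bilinear interpolation fixes any remaining size mismatch after pixel shuffle.
        if x.shape[-2:] != fine.shape[-2:]:
            x = F.interpolate(x, size=fine.shape[-2:], mode='bilinear', align_corners=False)
        return x + fine  # fuse by element-wise addition (no non-linearity here)

class DepthHead(nn.Module):
    """Two consecutive 3x3 convolutions with ReLU, then a sigmoid prediction layer."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.predict = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return torch.sigmoid(self.predict(x))  # depth in (0, 1), at 1/4 input resolution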

Loss Function

The loss function in our model is a weighted sum of three parts: the depth loss, the gradient loss, and the surface normal loss.

Depth Loss


The depth loss is the RMSE in log scale, which we found converges better than the L1 and L2 norms. Supervising in log scale makes the network focus more on closer objects.
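In symbols, our reading of this loss, with T valid pixels, ground-truth depth $d_i$, and predicted depth $\hat{d}_i$:

$L_{depth} = \sqrt{ \frac{1}{T} \sum_{i=1}^{T} \left( \log d_i - \log \hat{d}_i \right)^2 }$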

Gradient Loss


The gradients of the depth maps are obtained with a Sobel filter; the gradient loss is the L1 norm of the difference between the predicted and ground-truth gradients.
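Our reading of this loss, where $\nabla_x$ and $\nabla_y$ denote the Sobel gradients of the depth map:

$L_{grad} = \frac{1}{T} \sum_{i=1}^{T} \left( \left| \nabla_x d_i - \nabla_x \hat{d}_i \right| + \left| \nabla_y d_i - \nabla_y \hat{d}_i \right| \right)$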

Surface Normal Loss

We also employed the normal vector loss proposed by Hu et al., which helps refine details.
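Following Hu et al., the surface normal at each pixel can be formed from the depth gradients, $n_i = (-\nabla_x d_i, -\nabla_y d_i, 1)$, and the loss penalizes one minus the cosine similarity between predicted and ground-truth normals (our transcription):

$L_{normal} = \frac{1}{T} \sum_{i=1}^{T} \left( 1 - \frac{\langle n_i, \hat{n}_i \rangle}{\lVert n_i \rVert \, \lVert \hat{n}_i \rVert} \right)$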

The weight ratio among the three losses was set to 1:1:1.

Qualitative Evaluation

KITTI

Comparison with state-of-the-art methods:

More comparison:

Quantitative Evaluation

KITTI

We use the depth evaluation metrics from Eigen et al., where T is the number of valid pixels in the test set. Our transcription of the standard formulations is given below.
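With predicted depth $d_i$ and ground truth $d_i^*$ over the T valid pixels (standard definitions from Eigen et al., restated here as an assumption about the evaluation code):

  • Threshold accuracy: fraction of pixels with $\max(d_i / d_i^*, \; d_i^* / d_i) = \delta < thr$, for $thr \in \{1.25, 1.25^2, 1.25^3\}$
  • Abs relative difference (ARD): $\frac{1}{T} \sum_i |d_i - d_i^*| / d_i^*$
  • Squared relative difference (SRD): $\frac{1}{T} \sum_i (d_i - d_i^*)^2 / d_i^*$
  • RMSE (linear): $\sqrt{ \frac{1}{T} \sum_i (d_i - d_i^*)^2 }$
  • RMSE (log): $\sqrt{ \frac{1}{T} \sum_i (\log d_i - \log d_i^*)^2 }$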

Discussion

  • FPN is an effective backbone for monocular depth estimation because of its ability to extract features and semantics at different scales. It can achieve its potential if guided by proper loss functions.
  • The gradient and normal losses help prevent the model from getting stuck in a local optimum and guide it toward better convergence, as shown in the ablation study.
  • Some existing state-of-the-art methods outperform ours on certain metrics, and we believe this is due to their adaptive BerHu loss.

Related Work

  • Eigen et al. were the first to use CNNs for depth estimation, predicting a coarse global output and then refining it with a finer local network.
  • Laina et al. explored CNNs pre-trained with AlexNet, VGG16 and ResNet.
  • Godard et al. and Kuznietsov et al. introduced unsupervised and semi-supervised methods that rely on the 3D geometric assumption of left-right consistency. They trained the network using binocular image data.
  • Fu et al. proposed a framework based on classification of discretized depth ranges and regression. They supervised depth prediction at different resolutions and used a fusion network to produce the final output.
  • Hu et al. proposed a novel loss on normal vectors in addition to the conventional depth and gradient losses. They first trained a base network without skip connections, and then trained the refinement network using the novel loss after freezing the weights of the base network.
  • This project: fully convolutional with no FC layers needed; the FPN provides a straightforward and unified backbone that extracts feature maps incorporating strong, localized semantic information. We employ an easy-to-follow curriculum that adds the gradient loss and the normal loss during training, and all losses are calculated on the output feature maps instead of intermediate ones.

References

  • Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multiscale deep network. In: Advances in neural information processing systems (NIPS). (2014) 2366–2374
  • Fu, Huan, Mingming Gong, Chaohui Wang, and Dacheng Tao. "A Compromise Principle in Deep Monocular Depth Estimation." CoRR abs/1708.08267 (2017).
  • R. Garg, V. Kumar, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In Proc. of the European Conf. on Computer Vision (ECCV), 2016.
  • Geiger, Andreas, et al. "Vision meets robotics: The KITTI dataset." The International Journal of Robotics Research 32.11 (2013): 1231-1237.
  • C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. arXiv:1609.03677v2, 2016.
  • Hu, Junjie, et al. "Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps with Accurate Object Boundaries." arXiv preprint arXiv:1803.08673 (2018).
  • Kuznietsov, Yevhen, Jörg Stückler, and Bastian Leibe. "Semi-supervised deep learning for monocular depth map prediction." Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
  • I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In Proc. of the Int. Conf. on 3D Vision (3DV), 2016.
  • Levin, Anat, Dani Lischinski, and Yair Weiss. "Colorization using optimization." ACM Transactions on Graphics (ToG). Vol. 23. No. 3. ACM, 2004.
  • Silberman, Nathan, et al. "Indoor segmentation and support inference from rgbd images." European Conference on Computer Vision. Springer, Berlin, Heidelberg, 2012.


monodepth-fpn-pytorch's Issues

Why must fake and real be multiplied by 10 when computing the RMSE loss?

You do transforms.ToTensor() on the depth map after loading it. I found that PyTorch will actually do depth = depth/255.
I guess that, when computing the loss, it would be better to multiply by 255 in order to get the correct loss. I haven't read your code thoroughly and don't know what you did when preprocessing the data. I am confused about why fake and real must be multiplied by 10 when computing the loss.
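For reference, a minimal check of the torchvision behaviour referred to above: ToTensor scales 8-bit image data into [0, 1]. This snippet is independent of the repo's actual preprocessing:

import numpy as np
from PIL import Image
from torchvision import transforms

depth_uint8 = Image.fromarray(np.full((4, 4), 255, dtype=np.uint8))  # stand-in 8-bit depth map
t = transforms.ToTensor()(depth_uint8)
print(t.max())  # tensor(1.) -- uint8 values are divided by 255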

I'm a beginner in this field. Thanks for your great work.

Lack of clarity over dataset

I can read from the KITTI dataset code that a kitti_training_images.txt file is required, but I can't see any such file in the KITTI depth or raw dataset downloads. Can you make it clearer how we need to procure the KITTI dataset in order to use the code directly?

Lack of instructions

Now I have faced a couple of problems: there are no such modules as "constants", "utils", etc., but that is not a major problem. I'm trying to run main_fpn.py with python main_fpn.py --epochs 40 --cuda --bs 4 --num_workers 3 --output_dir output_dir/, but I only get NaNs as the network output and as the loss values.

Could you please write more detailed instructions on how to run your code?

Metrics code

There is a lot of information about the evaluation of this network and its benchmarks, but could you explain how you calculate these metrics (ARD, SRD, the threshold metrics, etc.) and publish your code for them?

Problem with the NYU dataset Matlab code

Hi, I just want to know if you made any modifications to the file process_raw.m from https://github.com/janivanecky/Depth-Estimation/tree/master/dataset or to any other file in the official toolbox for the NYU dataset. I'm getting an error about get_synched_frames; it seems that it cannot go through the scene subfolders (like basement_0001a) and returns 0 files for the scene, even though all the files are there. I ran it in Octave on Linux Mint and used the Raw dataset, Single File (~428 GB), from https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html.

Thanks a lot for your help.

Problem with choosing the ground truth form: filled depth or projected depth

Hi @xanderchf

Thanks to your post, I tried to fill in the missing values in the ground truth by using the NYU toolbox (fill_depth_colorization in MATLAB) and applied the colorized map (image attached).

But I'm curious about one thing: when you compute errors between the ground truth and the prediction, did you use the filled depth or the projected depth? In other words, does your network output filled depth or projected depth?
Thanks

How to avoid output saturation?

I'm training your model on the NYU Depth V2 dataset, but I can't avoid the model's predictions quickly degenerating to a single depth value for the entire image. How did you avoid this? The model doesn't appear to learn at all: the loss doesn't decrease, it just oscillates around the same value, and very little gradient, if any, reaches the parameters. I've tried different learning rates with the same result.

Filling in the missing values of the sparse depth maps of the KITTI dataset

Hi,

Thanks for posting your code!

I have a question about the depth maps in the KITTI dataset. You mentioned that before training you use the NYU toolbox to fill in the sparse depth maps of the KITTI dataset. I am wondering:

  1. Are the sparse depth images the ones from the official website under the "depth completion / depth prediction" section (14 GB)?

  2. Can you explain more specifically how to use the toolbox to fill in the sparse depth maps? And how accurate is it to use the filled-in depth maps as ground-truth labels?

Thank you very much!

Endoscope images

Hi,

Thanks for the code.
I am about to use this repo to train a model that estimates depth for images acquired from a stereo endoscope underwater. As far as I can see, this and most monocular depth methods are applied to street scenes and cars. Is there anything I should do, or avoid doing, when training the model, given that my aim is to estimate depth for underwater scenes and small, distant objects?

Also, I noticed that the overlap between my stereo images is smaller than in typical street-view images, so reduced overlap is an additional issue for me.

Thanks for reading

Code

d5, d4, d3, d2 = self.up1(self.agg1(p5)), self.up2(self.agg2(p4)), self.up3(self.agg3(p3)), self.agg4(p2)
_, _, H, W = d2.size()
vol = torch.cat( [ F.upsample(d, size=(H,W), mode='bilinear') for d in [d5,d4,d3,d2] ], dim=1 )
Could you explain what these lines mean and what they do?

I can't find Utils.py

Hi @xanderchf
Thanks for your code!
But I can't find utils.py; it would be helpful if you could release this part of the code.

Thanks!

How to set grad loss factor?

In your repo, the grad loss factor is set to 10.
Is the reason to bring the depth loss and the grad loss into the same range, for example 0.0x?

Pretrained model

Hi, is a pretrained model available? It would greatly help with development times. Thank you in advance!

The code is incomplete

Hi @xanderchf,

Thanks for your great work. It would be a great help if you could release the full code.

constants.py

Hi, I can't find constants.py in the repository; however, it is present in the imports. What can I do about that?

Problem in the training phase

I am training your code on the Stanford 2D-3D-S dataset, but the predicted depth map is between 0 and 1, while the real depth can reach 128. Do you know why?

Angle Based loss versus cosine inverse loss

Thanks for your great work!

Have you experimented with an arccos-based normal loss? Does the performance vary when using arccos instead of (1 - normalized inner product)?

The code I am referring to is -

prod = ( grad_fake[:,:,None,:] @ grad_real[:,:,:,None] ).squeeze(-1).squeeze(-1)
fake_norm = torch.sqrt( torch.sum( grad_fake**2, dim=-1 ) )
real_norm = torch.sqrt( torch.sum( grad_real**2, dim=-1 ) )
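For comparison, a minimal sketch of the two variants the question contrasts, assuming grad_fake and grad_real hold per-pixel vectors in their last dimension (this is not the repo's code):

import torch
import torch.nn.functional as F

def cosine_normal_loss(grad_fake, grad_real):
    # 1 - normalized inner product, averaged over pixels.
    cos = F.cosine_similarity(grad_fake, grad_real, dim=-1)
    return (1.0 - cos).mean()

def arccos_normal_loss(grad_fake, grad_real, eps=1e-7):
    # Angle between the vectors; the clamp keeps acos numerically stable near +/-1.
    cos = F.cosine_similarity(grad_fake, grad_real, dim=-1)
    return torch.acos(cos.clamp(-1.0 + eps, 1.0 - eps)).mean()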

What are the settings of the params DOUBLE_BIAS and WEIGHT_DECAY?

First of all, thanks for your code. I don't know what to set for the params DOUBLE_BIAS and WEIGHT_DECAY. It shows me this:

[epoch 0][iter 10] loss: nan RMSElog: nan grad_loss: nan normal_loss: nan
[epoch 0][iter 20] loss: nan RMSElog: nan grad_loss: nan normal_loss: nan
[epoch 0][iter 30] loss: nan RMSElog: nan grad_loss: nan normal_loss: nan
[epoch 0][iter 40] loss: nan RMSElog: nan grad_loss: nan normal_loss: nan
[epoch 0][iter 50] loss: nan RMSElog: nan grad_loss: nan normal_loss: nan
[epoch 0][iter 60] loss: nan RMSElog: nan grad_loss: nan normal_loss: nan
[epoch 0][iter 70] loss: nan RMSElog: nan grad_loss: nan normal_loss: nan
