Depth Estimation using Data Driven Approaches

Introduction

Time-of-flight, structured-light, and stereo technologies have been widely used for depth map estimation. Each comes with its own pros and cons in terms of image capture speed, structural description, and ambient light performance. Monocular cues such as texture and gradient variation, shading, color/haze, and defocus also aid accurate depth estimation. The methods that exploit these cues are complex statistical models that are susceptible to noise. Recently, data-driven approaches based on deep learning have been employed for depth estimation. These approaches are less prone to noise when given enough data to learn both coarse and fine details.

Convolutional Neural Networks (CNN)

In deep learning, CNNs are widely used in image processing applications. Convolution layers are the basic building block of a CNN and are combined with pooling and ReLU activation layers. The kernels of each layer are learned through backpropagation. The CNN learns features from the input images by applying different filters across the image, generating feature maps at each layer. Deeper in the network, the feature maps capture increasingly complex features and objects. ConvNets have been very successful for image classification, and more recently they have also been used for image prediction and other applications. Adding upscaling and deconvolution layers makes it possible to expand the compressed feature map, so the network can predict data (such as an image) rather than a class label.
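
The convolution, ReLU, and pooling steps described above can be sketched in a few lines. This is a minimal illustration in plain Python lists, not the code used in this project: the 5x5 image and the hand-picked `edge_kernel` are hypothetical, and in a real CNN the kernel values would be learned by backpropagation rather than fixed by hand.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map.

    As in most deep learning libraries, the kernel is not flipped
    (strictly speaking this is a cross-correlation).
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_h) for j in range(k_w))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Zero out negative responses."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; an odd trailing row or column is dropped."""
    rows = range(0, len(feature_map) - 1, 2)
    cols = range(0, len(feature_map[0]) - 1, 2)
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in cols]
        for r in rows
    ]


# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# Hand-picked kernel that responds to dark-to-bright transitions from left to right.
edge_kernel = [[-1.0, 1.0],
               [-1.0, 1.0]]

feature_map = relu(conv2d_valid(image, edge_kernel))  # 4x4 map, non-zero only at the edge
pooled = max_pool_2x2(feature_map)                    # 2x2 summary of the feature map
```

Each row of `feature_map` is non-zero only where the kernel straddles the edge, and pooling halves the resolution while keeping that response. Stacking several such stages is what lets deeper layers respond to more complex structures.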

![image](https://cloud.githubusercontent.com/assets/11435669/20927466/c186f656-bb8f-11e6-86a8-2d6661db827c.png)

Related Work

Deep3D [1] is a fully automatic 2D-to-3D conversion algorithm that takes 2D images or video frames as input and outputs 3D stereo image pairs. David Eigen et al. from NYU proposed the Multi Scale Network [2], an architecture for a single monocular image that employs two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. It is trained on a real-world dataset. “FlowNet: Learning Optical Flow with Convolutional Networks” [3] uses synthetically generated video to make the network learn motion parameters and, from them, extract optical flow. “Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches” [4] presents a method for extracting depth information from stereo data by comparing patches of the two images. Similarly to [4], “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs” [5] uses image patches at different scales to extract depth information.

![image](https://cloud.githubusercontent.com/assets/11435669/20927710/c855abca-bb90-11e6-9dd1-3fe86007c398.png)

Multi Scale network

![image](https://cloud.githubusercontent.com/assets/11435669/20927750/f034fe20-bb90-11e6-9cb8-262d661d205a.png) ![image](https://cloud.githubusercontent.com/assets/11435669/20927757/f4e5f3b6-bb90-11e6-91c3-ba2bf66dacb0.png)

FlowNet

Methods

A. Stereo ConvNet Architecture

The images and ground-truth depth maps used for training, validation, and testing are produced by rendering a 3D model at varying orientations with the Blender software tool. As a first step, we use StereoConvNet [6]; the first half of the network is shown below. The second half mirrors the first about the last convolution layer, replacing convolution with deconvolution and pooling with upscaling. Although the input image consists of a concatenated left and right image pair, the network treats it as two separate images. The reference output label is the ground-truth depth map generated with Blender's "Mist" function.

![image](https://cloud.githubusercontent.com/assets/11435669/20928225/cc7f1b58-bb92-11e6-9217-fa0811db36bd.png)

Stereo ConvNet Architecture
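
The mirrored encoder/decoder layout described above can be sketched as a small layer-list transformation. This is only an illustration under stated assumptions: the layer list `encoder`, its kernel sizes and pooling factors, and the 64x64 input are hypothetical (the actual configuration is in [6]), every encoder layer is assumed to be mirrored, and convolutions are assumed to preserve spatial size.

```python
def mirror_decoder(encoder):
    """Build the decoder as the mirror image of the encoder.

    Layers are visited in reverse order; each convolution becomes a
    deconvolution and each pooling step becomes an upscaling step.
    """
    swap = {"conv": "deconv", "pool": "upscale"}
    return [(swap[kind], size) for kind, size in reversed(encoder)]


def trace_resolution(layers, height, width):
    """Follow the spatial resolution, assuming size-preserving (de)convolutions."""
    shapes = [(height, width)]
    for kind, size in layers:
        if kind == "pool":
            height, width = height // size, width // size
        elif kind == "upscale":
            height, width = height * size, width * size
        shapes.append((height, width))
    return shapes


# Hypothetical encoder: (layer kind, kernel size or pooling factor).
encoder = [("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2), ("conv", 3)]
decoder = mirror_decoder(encoder)

# Resolution after each layer for a hypothetical 64x64 input.
shapes = trace_resolution(encoder + decoder, 64, 64)
```

With these assumed layers the resolution shrinks from 64x64 to 16x16 in the first half and is restored to 64x64 in the second, which is what allows the output to be a dense depth map of the same size as the input.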

B. Deeper Stereo ConvNet Architecture

In the Deeper Stereo ConvNet, the input remains the same, but the architecture is extended with an extra convolution layer and an extra deconvolution layer. In addition, the depth of the filters is increased, following [3], in order to capture more detail.
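
To get a feel for what an extra layer and more filters cost, the number of learnable parameters can be estimated as below. This is a rough sketch, not a description of the actual networks: the filter counts in `base_filters` and `deeper_filters`, the 3x3 kernel size, and the two-channel input (left and right image stacked as channels) are all assumptions, and only the convolution half of the network is counted.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel for a square 2D convolution."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def total_params(input_channels, filters, kernel_size=3):
    """Sum parameters over a chain of convolutions with the given filter counts."""
    total, in_ch = 0, input_channels
    for out_ch in filters:
        total += conv_params(in_ch, out_ch, kernel_size)
        in_ch = out_ch
    return total


# Hypothetical filter counts; the real values are in the network definitions.
base_filters = [16, 32, 64]
deeper_filters = [32, 64, 128, 256]  # one extra layer, more filters per layer

# Assumption: 2 input channels, one for the left and one for the right image.
base = total_params(2, base_filters)
deeper = total_params(2, deeper_filters)
```

Under these assumptions the deeper configuration has more than ten times as many convolution parameters, which is consistent with the longer training and test times reported in the Results section.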

C. Patched Deeper Stereo ConvNet Architecture

Following [4] and [5], the number of input streams is increased to 6 for the Patched Deeper Stereo ConvNet by decomposing the left image into 4 scaled parts. As in the referenced papers, this is expected to yield a more accurate depth map.

Patched Deeper Stereo ConvNet Architecture
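
One plausible reading of the six input streams is sketched below: the left image, the right image, and the four quadrants of the left image, each rescaled to the full resolution. This interpretation is an assumption, as are the helper names `crop`, `upscale_nearest`, and `build_input_streams` and the choice of nearest-neighbour rescaling; the actual decomposition may differ.

```python
def crop(image, top, left, height, width):
    """Return the `height` x `width` window starting at (`top`, `left`)."""
    return [row[left:left + width] for row in image[top:top + height]]


def upscale_nearest(image, factor):
    """Nearest-neighbour upscaling: repeat each pixel `factor` times per axis."""
    out = []
    for row in image:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def build_input_streams(left, right):
    """Return 6 streams: left, right, and the 4 left-image quadrants at full size.

    Assumes even height and width so the quadrants tile the image exactly.
    """
    half_h, half_w = len(left) // 2, len(left[0]) // 2
    quadrants = [
        crop(left, top, col, half_h, half_w)
        for top in (0, half_h)
        for col in (0, half_w)
    ]
    return [left, right] + [upscale_nearest(q, 2) for q in quadrants]


# Hypothetical 4x4 stereo pair with easily recognisable pixel values.
left = [[r * 4 + c for c in range(4)] for r in range(4)]
right = [[v + 100 for v in row] for row in left]

streams = build_input_streams(left, right)
```

Every stream has the same resolution as the original image, so the four patch streams show the network the same scene content at twice the magnification of the full left view.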

Results

      Stereo ConvNet Architecture
          + Smooth, without holes
          + Coarse structure preserved
          - Blurred at edges
          - Sharp structures lost
          - Fine objects smeared or lost
          Time to test = 20 s

      Deeper Stereo ConvNet Architecture
          + Smooth, without holes
          + Coarse structure preserved
          + Edges are sharper
          - Still noisy at the edges
          - Fine details/objects smeared or lost
          Note: The increased depth lets the network learn more detail about the scene.
          Time to test = 70 s

      Patched Deeper Stereo ConvNet Architecture
          + Smooth, without holes
          + Fine structure preserved
          + Image predicted with less noise
          - Time to train and test increases
          Note: The increased depth and increased input data resolution let the network
          learn more detail about the scene.
          Time to test = 145 s
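
The test times listed above can be put side by side as slowdown factors relative to the original Stereo ConvNet. The snippet only restates the three reported numbers; the size of the test set and the hardware are not given here, so the ratios are the meaningful part.

```python
# Test times in seconds, as reported in the list above.
test_time_s = {
    "Stereo ConvNet": 20,
    "Deeper Stereo ConvNet": 70,
    "Patched Deeper Stereo ConvNet": 145,
}

baseline = test_time_s["Stereo ConvNet"]
slowdown = {name: t / baseline for name, t in test_time_s.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.2f}x")
# prints:
# Stereo ConvNet: 1.00x
# Deeper Stereo ConvNet: 3.50x
# Patched Deeper Stereo ConvNet: 7.25x
```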

3D modeling for Patched Deeper Stereo ConvNet Architecture:

Image	Expected output	Derived output
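
The comparison between expected and derived output is visual here. One common way to put a number on it, shown purely as an illustration and not as a metric used in this project, is the root-mean-square error between the two depth maps. The 2x2 maps below are hypothetical values.

```python
import math


def rmse(predicted, expected):
    """Root-mean-square error over all pixels of two equally sized depth maps."""
    if len(predicted) != len(expected) or any(
        len(p_row) != len(e_row) for p_row, e_row in zip(predicted, expected)
    ):
        raise ValueError("depth maps must have the same shape")
    squared = [
        (p - e) ** 2
        for p_row, e_row in zip(predicted, expected)
        for p, e in zip(p_row, e_row)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical 2x2 depth maps: one pixel is off by 2 depth units.
expected = [[1.0, 2.0], [3.0, 4.0]]
derived = [[1.0, 2.0], [3.0, 2.0]]

error = rmse(derived, expected)  # sqrt((0 + 0 + 0 + 4) / 4) = 1.0
```

A lower value means the derived depth map is closer to the ground truth, and a value of 0 means the two maps are identical.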

Conclusion

Data-driven depth estimation approaches can be effective when a sufficiently large, descriptive, labelled dataset is available. The Patched Deeper Stereo ConvNet predicts depth maps very similar to the ground truth. The time to train the network grows with the depth and complexity of the CNN architecture. In future work, we plan to combine the architecture of our Patched Deeper Stereo ConvNet with the Multi-Scale Deep Network and observe the results on real-world images.

References

[1] “Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks”, Junyuan Xie, Ross Girshick, Ali Farhadi, University of Washington.

[2] “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network”, David Eigen, Christian Puhrsch, Rob Fergus, Dept. of Computer Science, Courant Institute, New York University.

[3] “FlowNet: Learning Optical Flow with Convolutional Networks”, A. Dosovitskiy, P. Fischer, et al., ICCV, 2015.

[4] “Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches”, Jure Zbontar (University of Ljubljana), Yann LeCun, Journal of Machine Learning Research 17 (2016).

[5] “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs”, Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, Mingyi He, IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).

[6] https://github.com/LouisFoucard/StereoConvNet
