
deep-video-mvs's Introduction

DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion

Paper (CVPR 2021): arXiv - CVF

Presentation (5 min.): YouTube


DeepVideoMVS is a learning-based online multi-view depth prediction approach on posed video streams, where the scene geometry information computed in the previous time steps is propagated to the current time step. The backbone of the approach is a real-time capable, lightweight encoder-decoder that relies on cost volumes computed from pairs of images. We extend it with a ConvLSTM cell at the bottleneck layer, and a hidden state propagation scheme where we partially account for the viewpoint changes between time steps.

This extension brings only a small overhead in computation time and memory consumption over the backbone, while improving the depth predictions significantly. As a result, DeepVideoMVS achieves highly accurate depth maps with real-time performance and low memory consumption. It produces noticeably more consistent depth predictions than our backbone and the existing methods throughout a sequence, which is reflected in less noisy reconstructions.
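As a rough mental model of the recurrent part, the sketch below shows a generic ConvLSTM cell in PyTorch. It is purely illustrative and is not the exact cell used in this repository, which additionally warps the hidden state between viewpoints as described above.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic ConvLSTM cell: a standard LSTM with convolutions instead of matrix products."""

    def __init__(self, input_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # One convolution computes all four gates at once.
        self.gates = nn.Conv2d(input_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell state, each (B, hidden_channels, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c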


dvmvs-fusionnet-in-scannet-756.mp4
dvmvs-fusionnet-vs-pairnet-in-7scenes-chess.mp4


Citation


If you find this project useful for your research, please cite:

@inproceedings{Duzceker_2021_CVPR,
    author    = {Duzceker, Arda and Galliani, Silvano and Vogel, Christoph and 
                 Speciale, Pablo and Dusmanu, Mihai and Pollefeys, Marc},
    title     = {DeepVideoMVS: Multi-View Stereo on Video With Recurrent Spatio-Temporal Fusion},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {15324-15333}
}


Dependencies / Installation


conda create -n dvmvs-env
conda activate dvmvs-env
conda install -y -c conda-forge -c pytorch -c fvcore -c pytorch3d \
    python=3.8 \
    pytorch=1.5.1 \
    torchvision=0.6.1 \
    cudatoolkit=10.2 \
    opencv=4.4.0 \
    tqdm=4.50.2 \
    scipy=1.5.2 \
    fvcore=0.1.2 \
    pytorch3d=0.2.5
pip install \
    pytools==2020.4 \
    kornia==0.3.2 \
    path==15.0.0 \
    protobuf==3.13.0 \
    tensorboardx==2.1

git clone https://github.com/ardaduz/deep-video-mvs.git
pip install -e deep-video-mvs


Data Structure


The scripts for parsing the datasets are provided in the dataset folder. Not all of the scripts may work out of the box due to differences in naming and folder conventions when downloading the datasets, but they should reduce the required effort. Exporting ScanNet .sens files, both for training and testing, should work with minimal effort. The script provided here is a modified version of the official code and, like the original, requires Python 2.

During testing, the system expects the data structure for a particular scene as provided in sample-data/hololens-dataset/000 (a minimal loading sketch follows the list below). We assume PNG format for all images.

  • images folder contains the input images that will be used by the model; the naming convention is not important, as the system determines the sequential order alphabetically.
  • depth folder contains the groundtruth depth maps that are used for metric evaluation; the filenames must match those of the color images. The depth images must be uint16 PNGs with depth values in millimeters. For example, if the depth at a pixel is 1.12 meters, the groundtruth depth image should read 1120 at that location.
  • poses.txt contains the CAMERA-TO-WORLD pose corresponding to each color and depth image. Each line is one flattened pose in homogeneous coordinates.
  • K.txt is the intrinsic matrix for a given sequence after the images are undistorted.
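A minimal sketch of loading a scene in this layout is given below. It assumes each line of poses.txt stores a 4x4 camera-to-world matrix flattened into 16 values (row-major order is assumed here) and that the image filenames end with .png.

import numpy as np
import cv2
from pathlib import Path

scene = Path("sample-data/hololens-dataset/000")

K = np.loadtxt(scene / "K.txt")                              # 3x3 intrinsic matrix
poses = np.loadtxt(scene / "poses.txt").reshape(-1, 4, 4)    # camera-to-world poses

image_paths = sorted((scene / "images").glob("*.png"))       # alphabetical = sequential order
depth_paths = sorted((scene / "depth").glob("*.png"))        # names match the color images

image = cv2.imread(str(image_paths[0]), cv2.IMREAD_COLOR)
depth_mm = cv2.imread(str(depth_paths[0]), cv2.IMREAD_ANYDEPTH)  # uint16, millimeters
depth_m = depth_mm.astype(np.float32) / 1000.0               # e.g. 1120 -> 1.12 meters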

During training, the system expects each scene to be placed in a folder, with the color image and depth image for each time step packed inside a zipped numpy archive (.npz). See the code here. We use frame_skip=4 while exporting the ScanNet training and validation scenes due to the large amount of data. The training/validation split of unique scenes used in this work is also provided here; one may replace the randomly generated splits with these two files.
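As a hypothetical sketch of what one such archive could look like, the snippet below packs a color image and a depth map into an .npz file; the key names ("color", "depth") and dtypes are placeholders, and the exact format expected by the training loader is defined in the dataset code referenced above.

import numpy as np

color_image = np.zeros((480, 640, 3), dtype=np.uint8)   # dummy color frame
depth_map = np.zeros((480, 640), dtype=np.float32)      # dummy depth map

np.savez_compressed("000000.npz", color=color_image, depth=depth_map)

archive = np.load("000000.npz")
color_image, depth_map = archive["color"], archive["depth"]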



Training and Testing:


  • The pre-trained weights are provided. They are placed here and automatically loaded during testing.

  • There are no command line arguments for the system. Instead, many general parameters are controlled from config.py through the Config class.

  • Please adjust the input and output folder locations (and/or other settings) inside config.py; an illustrative example follows.
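As an illustration only, such adjustments could look like the following; the two attribute names below are the ones referenced later in this README, while the remaining fields and their exact names may differ in the actual config.py.

class Config:
    # folder where bulk-testing predictions and errors are written
    test_result_folder = "/path/to/output/results"
    # single scene used by run-testing-online.py
    test_online_scene_path = "sample-data/hololens-dataset/000"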

Training:

In addition to the general Config, training-specific hyperparameters such as the subsequence length and learning rate are controlled directly inside the training scripts via the TrainingHyperparameters class.

To train the networks from scratch, please refer to the detailed explanation of the procedure in the supplementary material of the paper. In summary, we first train the pairnet independently and use some of its modules' weights to partially initialize our fusionnet. For fusionnet, we start by training the cell and the decoder, which are randomly initialized, and then gradually unfreeze the other modules. Finally, we finetune only the cell while warping the hidden states with the predictions instead of the groundtruth depths.
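A generic sketch of this freeze/unfreeze pattern in PyTorch is shown below; the module names are placeholders rather than the repository's actual attribute names, and the exact schedule is the one given in the supplementary material.

import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for parameter in module.parameters():
        parameter.requires_grad = trainable

# Toy two-part model standing in for fusionnet:
model = nn.ModuleDict({
    "encoder": nn.Conv2d(3, 8, 3, padding=1),   # stands in for the pairnet-initialized modules
    "decoder": nn.Conv2d(8, 1, 3, padding=1),   # randomly initialized, trained first
})

set_trainable(model["encoder"], False)   # start with the pretrained parts frozen
set_trainable(model["decoder"], True)
# ... after some iterations, gradually unfreeze the remaining modules:
set_trainable(model["encoder"], True)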

  • pairnet training script:
    cd deep-video-mvs/dvmvs/pairnet
    python run-training.py
    
  • fusionnet training script:
    cd deep-video-mvs/dvmvs/fusionnet
    python run-training.py
    

Testing:

We provide two scripts for running the inference:

1. Bulk Testing

The first is run-testing.py, for evaluating on multiple datasets and/or sequences in one run. This script requires pre-selected keyframe filenames for the desired sequences, similar to the ones provided in sample-data/indices. In a keyframe file, each row represents a time step: the entry in the first column is the reference frame, and the entries in the second, third, ... columns are the measurement frames used for the cost volume computation. One can generate keyframe files with a custom keyframe selection approach, or use the simulation of our keyframe selection heuristic provided in simulate_keyframe_buffer.py. The predictions and errors of bulk testing are saved to Config.test_result_folder.
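A minimal parser for a keyframe file in this format could look as follows; whitespace-separated columns are assumed, and the example filename is a placeholder.

def read_keyframe_file(path):
    timesteps = []
    with open(path) as f:
        for line in f:
            entries = line.split()
            if not entries:
                continue
            # first column: reference frame, remaining columns: measurement frames
            reference_frame, measurement_frames = entries[0], entries[1:]
            timesteps.append((reference_frame, measurement_frames))
    return timesteps

# example usage (placeholder filename):
timesteps = read_keyframe_file("sample-data/indices/keyframes-example.txt")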

2. Single Scene Online Testing

The second is run-testing-online.py, which runs the testing in an online fashion. One can specify a single scene in Config.test_online_scene_path and then run the online inference to evaluate on that scene. In this script, we use our keyframe selection heuristic on the fly and predict depth maps only for the selected keyframes (attention: we do not predict depth maps for all images). The predictions and errors of single-scene online testing are saved to the working directory. To run the online testing:

cd deep-video-mvs/dvmvs/fusionnet
python run-testing-online.py

Predicted depth maps for a scene and the average error of each frame are saved in .npz format. The errors contain 8 different metrics for each frame, in order: abs_error, abs_relative_error, abs_inverse_error, squared_relative_error, rmse, ratio_125, ratio_125_2, ratio_125_3. They can be accessed with:

predictions = numpy.load(prediction_filename)['arr_0']
errors = numpy.load(error_filename)['arr_0']
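Assuming errors holds one row per frame with the 8 metrics in the order listed above, per-sequence averages can then be computed from the loaded array, for example:

metric_names = ["abs_error", "abs_relative_error", "abs_inverse_error",
                "squared_relative_error", "rmse",
                "ratio_125", "ratio_125_2", "ratio_125_3"]
for name, value in zip(metric_names, errors.mean(axis=0)):
    print(name, value)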


Comparison with the Existing Methods:


In this work, our method is compared with DELTAS, GP-MVS, DPSNet, MVDepthNet and Neural RGBD. For ease of evaluation, we slightly modified the inference codes of the first four methods to make them compatible with the data structure and the keyframe selection files. For Neural RGBD, in contrast, we adjusted the data structure and used the original code. The modified inference codes (and the finetuned weights, if necessary) are provided in the dvmvs/baselines directory. Please refer to the paper for the comparison results.



TSDF Reconstructions:


TSDF reconstructions demonstrated in the paper and in the videos are obtained with the implementation from https://github.com/andyzeng/tsdf-fusion-python. A modified version of this code is provided as sample-data/run-tsdf-reconstruction.py. As with the original implementation, additional packages are required to run the script; they can be installed into the existing environment with:

conda activate dvmvs-env
conda install -c conda-forge numba scikit-image pycuda

We strongly recommend installing the CUDA Toolkit (nvcc is required) and pycuda to get reasonable runtimes.

Default arguments for sample-data/run-tsdf-reconstruction.py are already set. In addition to the input/output locations, the reconstruction resolution (--voxel_size) and the maximum depth value in a depth map to be backprojected and fused (--max_depth) can be controlled. There are three additional flags (--use_groundtruth_to_anchor, --save_progressive, --save_groundtruth); please refer to the script or use the --help flag for their functionality.
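An illustrative invocation with the defaults overridden might look like the following; the values are purely illustrative, and the defaults already set in the script may be preferable.

cd deep-video-mvs/sample-data
python run-tsdf-reconstruction.py --voxel_size 0.05 --max_depth 3.0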

For convenience, example predictions from the sample scene are also provided in the sample-data/predictions folder. Finally, a couple of low resolution 3D reconstruction results are given in the sample-data/reconstructions folder.

deep-video-mvs's People

Contributors

ardaduz, kysucix


deep-video-mvs's Issues

TUM-dataset

Thank you for sharing this wonderful work.

I have a question about the usage of the TUM-dataset.
The paper describes that 13 scenes from the TUM dataset are selectively used, but the code parses 15 scenes:

input_directories = [
        input_folder / "rgbd_dataset_freiburg1_desk",
        input_folder / "rgbd_dataset_freiburg1_plant",
        input_folder / "rgbd_dataset_freiburg1_room",
        input_folder / "rgbd_dataset_freiburg1_teddy",
        input_folder / "rgbd_dataset_freiburg1_xyz",
        input_folder / "rgbd_dataset_freiburg2_desk",
        input_folder / "rgbd_dataset_freiburg2_metallic_sphere2",
        input_folder / "rgbd_dataset_freiburg2_xyz",
        input_folder / "rgbd_dataset_freiburg3_cabinet",
        input_folder / "rgbd_dataset_freiburg3_long_office_household",
        input_folder / "rgbd_dataset_freiburg3_nostructure_notexture_far",
        input_folder / "rgbd_dataset_freiburg3_nostructure_texture_far",
        input_folder / "rgbd_dataset_freiburg3_structure_notexture_far",
        input_folder / "rgbd_dataset_freiburg3_structure_texture_far",
        input_folder / "rgbd_dataset_freiburg3_teddy"]

I am confused by this inconsistency.

Results after 200k iterations training on Pairnet

Hi, thank you for your contribution to the MVS community.

I am training the pairnet on another dataset, and I am not sure whether my results after the first training stage are correct. Could you provide the results after 200k iterations of pairnet training? Thanks a lot.

Scannet test results

Hi,

Thanks for sharing the code for your work!

I have a few questions regarding the evaluation on the ScanNet dataset:

  • How can I reproduce the results in Table 1 of the paper?
  • How do you average the results?
    1. Do you compute the results per ScanNet scene and then average these across all scenes?
    2. Or do you bulk test across all groups of keyframes for the whole ScanNet test set and average per-frame?

If the latter, could you please share the keyframe file used for the ScanNet test?

Thank you for your help!

Missing Citation

Hi Arda,

I just came across this work. It is a very nice paper.

However, a very closely related work (Open-World Stereo Video Matching with Deep RNN, ECCV 2018) is missing from the references, as both methods use a ConvLSTM at the bottleneck of the matching network.

There are some differences between the two, and it would be nice to clarify them in the paper.

Thanks,
Yiran

Possibly incorrect ConvLSTM warping

Hello, thanks for releasing your work. I had a question regarding the implementation of the depth warping:

As I understand it, this is the current flow of the code during training:

  1. Use current depth D_t as the current estimate. As mentioned in the paper, this is done for stabilization of training.
  2. Compute the transformation from the previous pose to the current pose
  3. Project the current depth to a point cloud, transform it using the above transformation, and then sample the hidden state

This does not seem correct to me. If we are using the current depth D_t, there is no need to transform the point cloud; we can directly sample the hidden state. Additionally, why transform the current depth using a transformation from previous to current?

If we used D_t-1 as the depth estimate i.e. depth_estimation = depths_cuda[measurement_index] instead of depth_estimation = depths_cuda[reference_index] over here, then it would make sense.

It would be great if you could shed some light on this and correct me if I've gotten anything wrong!

Hidden State Warping - GT vs Prediction

Hello,

I'm looking at your description for how to train the fusion model in the supplemental:

Finally, we load the best checkpoint and finetune only the cell for another 25K iterations with a learning rate of 5e-5 while warping the hidden states with the predicted depth maps.


The current training script at fusionnet/run-training.py doesn't have a flag for this. I can see that the GT depth is used for warping the current state at line 249.

What should I use as a depth estimator for this step? Should I borrow from this line at fusionnet/run-testing.py? Or (more likely) this differentiable estimator at line 157 in utils.py?

Thanks.

Code to generate mesh as the mp4 shows

Hi,

Thank you for sharing the code of your great work.

I wonder if there is any code that can help generate a mesh from the depth maps, as shown in the attached mp4 files. It would be even better if it directly provided such an interface and fused the depths into a mesh in an online fashion.

How long does it take to train the networks?

Hi, thanks for your code and paper.
I want to ask one thing: how many days does it take to train pairnet and fusionnet?
In the supplementary, you train pairnet for 600K iterations with a batch size of 14 and fusionnet for more than 1000K iterations.
It seems this would require many days with a single GTX 1080 Ti GPU (as mentioned in the paper).

Data preparation

Hi,
Thanks for your code! I think the results are very good.

I want to test performance on the ScanNet dataset using the training code. Could you provide a script to prepare the ScanNet dataset for training?

Testing Custom video sequences

Hi authors,
Thanks for providing the code and all the information. The online testing script works great on the provided sample HoloLens dataset and on the TUM RGB-D SLAM dataset, giving great results. However, I now want to run it on custom videos taken with a smartphone. I have one question regarding this:

  1. I am using ORB-SLAM to predict camera poses, but it takes around 35-45 minutes of runtime on the GPU. Do you have any advice on a faster algorithm for computing camera poses?

Thanks and hoping for a reply
Aakash Rajpal

7Scenes Depth and RGB Intrinsics/focal lengths

Hello,

How do you handle the difference in focal lengths between the RGB and depth maps for 7Scenes? Are you cropping the RGB images down to match depth or reprojecting depths given the intrinsics of both?

Thanks!

Why isn't the cell state 'C' in the ConvLSTM cell warped?

Hi,
Thanks for your nice work. After reading the paper, I am wondering why the cell state 'C' is not warped to the next viewpoint. I did not find experiments or ablation studies on this. Could you please explain?
Thanks!

Running on Colab

Hello, I'm trying to run this code in Google Colab, and I get the error below.
I tried to use imageio and skimage instead of cv2.imread, but it doesn't work.
Thanks

!python run-testing-online.py

qt.qpa.xcb: could not connect to display
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: eglfs, minimal, minimalegl, offscreen, vnc, webgl, xcb.

Cameras with Fixed Position

Hello,

Really great work!

I wanted to ask, would this model work with video input from 2 static cameras?
Let's say I have two fixed-in-place cameras, and the only things moving are the contents in front of the cameras.

Fusion network training. Pretrained PairNet Decoder?

Hello,

I just need to confirm the training strategy for the decoder when reproducing your results. You mention in the paper that pretrained pairnet weights at 100k iterations are loaded for the feature extractor, the FPN, and the cost volume encoder. Are the decoder's weights initialized from scratch?


Thanks!

Depth scale of RGBD Scene V2

Thanks for your great work!

I'm currently working on MVS and evaluating methods on the RGB-D Scenes V2 dataset. However, I found that the depth images, saved as 16-bit images, are not in millimeter (1000x) scale. May I ask what the right scale value is to recover the real-world depth?

Best,
Wang

Error when parsing the ScanNet data using scannet-export.py

Hi, I get the following error when parsing the ScanNet data. Do you have any clue what causes it?

Traceback (most recent call last):
File "./scannet-export.py", line 300, in
pool.join()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 474, in join
assert self._state in (CLOSE, TERMINATE)

Question for frame selection

Thank you for sharing this precious work.
I have a question about the frame selection.

In the code below,

train_maximum_pose_distance = 0.325

You set the hyper-parameters as

train_minimum_pose_distance = 0.1250
train_maximum_pose_distance = 0.3250

I found that, while training, there is no overlapping region between frames, especially because of the large motion between views.

Is there any reason you chose such large-motion views? This seems a bit different from a typical video setting, where adjacent frames have relatively small motion.

Why is the reported Neural RGBD performance so low?

Your work is great!
However, looking at Table 1, I am surprised that the performance of Neural RGBD is so low, especially for the key metric (\sigma < 1.25). Neural RGBD simply takes frames at an interval of 5 as inputs, without any frame selection.
Do you do frame selection before running the inferences in Table 1? And what is your test split of the testing files?

Thx
