idisc's Issues

Internal Discretization Figure

Hi, thanks for your great work!
I am wondering how (d) Internal discretization in Figure 1 of the paper is generated.

I infer that the internal discretization corresponds to the maximum of (QiKi), based on the statement "(QiKi) is the spatial location for which each specific IDR is responsible", given at the end of the fourth page of the paper.

Could you provide me with the concrete computation process?
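
For reference, this is the computation I currently have in mind, as a minimal sketch. The tensor names and shapes are my assumptions, not the released code:

import torch

# Hypothetical tensor: attn[i] is the softmaxed attention map between IDR i and the feature map.
num_idrs, H, W = 32, 60, 80
attn = torch.rand(num_idrs, H, W).softmax(dim=0)  # placeholder values

# Assign every spatial location to the IDR with the highest weight,
# giving a segmentation-like "internal discretization" map.
id_map = attn.argmax(dim=0)  # (H, W), integer IDs in [0, num_idrs)

# id_map could then be colorized (one color per IDR) for a figure like 1(d).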

I want to verify your NYU normals results, but the results I obtained differ from yours. I would like to know what went wrong and how to reproduce your numbers.

My command is the following:

python ./scripts/test.py --model-file ../model/nyunormals_swinlarge.pt --config-file ../configs/nyunorm/nyunorm_swinl.json --base-path $WorkSpace/idisc-main/tmp/

The result I got was:
Test/AngularLoss: -0.14168441648872662
Error in best. a1 (0.2765)
Error in best. a2 (0.379)
Error in best. a3 (0.4847)
Error in best. a4 (0.649)
Error in best. a5 (0.7218)
Error in best. rmse_angular (33.0139)
Error in best. mean (22.66)
Error in best. median (12.8913)

Prediction Size vs Input Size

Hello,
I was going through the predicted depth maps provided as part of the repo.
It looks like the resolution of those depth maps (for NYUv2) is 160x120, whereas the input is 640x480. Does that mean the network is trained to output depth only at 160x120? I did see the disclaimer in the README about resizing to the actual resolution, but wouldn't simply resizing cause issues at depth boundaries?
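
For reference, this is the kind of resizing I mean, as a minimal sketch with placeholder file names and bilinear interpolation via OpenCV (not your evaluation code):

import cv2
import numpy as np

# Load a saved 160x120 16-bit depth map (placeholder path) and upsample it to the 640x480 input size.
pred = cv2.imread("pred_160x120.png", cv2.IMREAD_UNCHANGED).astype(np.float32)
pred_full = cv2.resize(pred, (640, 480), interpolation=cv2.INTER_LINEAR)

# Interpolation averages neighboring depth values, which is exactly what I worry
# about at object boundaries (foreground and background depths get mixed).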

Thanks!

Evaluating on KITTI Improved Ground Truth

Hi, First of all, congratulations on this great work!

I'm evaluating recent Depth Estimation techniques and I'm wondering if you could help me to validate the results.

I downloaded your SwinLarge predictions and wanted to compare them with the KITTI Improved Ground Truth [1] directly by comparing your output map with the GT.

I followed your instructions by dividing by 256 (as for the GT data), and I interpolated just like your code does on the model output, using F.interpolate with mode=bicubic and align_corners=True.

I'm following the Monodepth2 evaluation procedure, therefore not using Garg's crop here.

The results are the following:

abs_rel | sq_rel | rmse  | rmse_log | a1    | a2    | a3
0.086   | 0.539  | 4.228 | 0.153    | 0.913 | 0.979 | 0.991

I was expecting considerably lower errors. Can you validate these steps, please? Are the SwinLarge predictions giving the correct outcome?

The code is quite simple, and I'll share it below so you can check it (if you want).

import cv2
import numpy as np
import torch
import torch.nn.functional as F


def compute_errors(gt, pred):
    # Standard depth metrics computed over the valid pixels.
    thresh = np.maximum((gt / pred), (pred / gt))
    a1 = (thresh < 1.25).mean()
    a2 = (thresh < 1.25 ** 2).mean()
    a3 = (thresh < 1.25 ** 3).mean()

    rmse = np.sqrt(((gt - pred) ** 2).mean())
    rmse_log = np.sqrt(((np.log(gt) - np.log(pred)) ** 2).mean())

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)

    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3


MIN_DEPTH = 1e-3
MAX_DEPTH = 80

# pred_path / gt_path point to one prediction / improved-GT pair,
# both stored as 16-bit PNGs scaled by 256.
pred = cv2.imread(pred_path, -1).astype(np.float32) / 256
gt_depth = cv2.imread(gt_path, -1).astype(np.float32) / 256
gt_height, gt_width = gt_depth.shape[:2]

# Valid-pixel mask (the improved GT is still sparse, so zeros are excluded).
mask = np.logical_and(gt_depth > MIN_DEPTH, gt_depth < MAX_DEPTH)

# Upsample the prediction to GT resolution, as in your test code.
pred_depth = F.interpolate(
    torch.from_numpy(pred).unsqueeze(0).unsqueeze(0),
    (gt_height, gt_width),
    mode="bicubic",
    align_corners=True,
).squeeze().numpy()

pred_depth = np.clip(pred_depth, MIN_DEPTH, MAX_DEPTH)
print(compute_errors(gt_depth[mask], pred_depth[mask]))

I was expecting lower values than those reported in the paper (Abs Rel probably below 0.05), but I actually got much higher ones (Abs Rel 0.086).

Thanks again for your work!

References

[1] Uhrig, Jonas, et al. "Sparsity Invariant CNNs." 2017 International Conference on 3D Vision (3DV). IEEE, 2017.

"ModuleNotFoundError: No module named 'idisc'" when running test.py

Hi, thank you for your wonderful work and the model zoo. I am performing benchmark evaluations of SOTA MDE models for my dissertation.

I tried to replicate your work in Google Colab (instead of conda) and when I run !python ./scripts/test.py --model-file ./idisc_nyu_resnet101.pt --config-file ./configs/nyu/nyu_r101.json --base-path ../temp/datasets I get an error.

Traceback (most recent call last):
  File "/content/idisc/./scripts/test.py", line 14, in <module>
    import idisc.dataloders as custom_dataset
ModuleNotFoundError: No module named 'idisc'

I don't know why this is happening. My suspicion is !bash ./make.sh, because it threw a lot of exceptions but still ended with "Finished processing dependencies for MultiScaleDeformableAttention==1.0".

Your help is greatly appreciated.
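
For completeness, the workaround I plan to try in the meantime (my own guess, not from the repo's instructions) is to expose the cloned repository root to Python before re-running the script:

import os
import sys

# Assuming the repo was cloned to /content/idisc in Colab (my assumption).
sys.path.insert(0, "/content/idisc")          # for `import idisc` inside the notebook
os.environ["PYTHONPATH"] = "/content/idisc"   # for `!python ./scripts/test.py ...` subprocesses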

Abnormal Training Phenomena and Bad Performance

Dear Luigi Piccinelli,
I hope this message finds you well. I want to express my sincere appreciation for your excellent paper. Inspired by your work, I attempted to train your model on the KITTI Eigen split.

However, during my training process, I encountered several abnormal phenomena that I would like to bring to your attention:

  1. The loss curve consistently showed a downward trend, but the evaluation metrics plateaued very early on.
  2. I observed poor, and even abnormal, performance across various evaluation metrics.

Here is a screenshot depicting the issue (attached).

To accommodate the equipment I am using (a single machine with four RTX 3090s and no SLURM), I modified the distributed training setup from SLURM to standard DDP (DistributedDataParallel); a minimal sketch of what I mean is shown below.
Additionally, I made some modifications in the dataloader directory to align with the directory structure of my existing KITTI dataset. I believe these changes should not be the cause of the undesirable results, as the code correctly outputs messages such as "Loaded 23158 images. Totally 0 invalid pairs are filtered" and "Loaded 652 images. Totally 45 invalid pairs are filtered."
Furthermore, in order to track the training process using TensorBoard, I incorporated some code in the training section to generate and save log information.
Apart from these adjustments, I have not made any additional modifications to the code. Specifically, the config file remains the same as the one you provided.
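
For reference, by "standard DDP" I mean roughly the following setup. This is a simplified sketch, not the exact code of my modified training script; build_model() is a placeholder standing in for IDisc.build(config):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with: torchrun --nproc_per_node=4 scripts/train_DDP.py ...
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)       # build_model() is a placeholder for IDisc.build(config)
model = DDP(model, device_ids=[local_rank])  # gradients synchronized across the four 3090s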

I would greatly appreciate your valuable insights and guidance regarding these issues. If there are any specific details or additional information I can provide to assist in troubleshooting, please let me know. Thank you once again for your remarkable contribution to the field. Best regards

Question about outdoor zeroshot datasets

Hello, I have a few questions regarding the outdoor datasets DDAD and Argoverse that you suggested for testing my methodology for outdoor zero-shot generalization.

The paper mentions cropping the images to 1920 x 870, but I would like to change the height to 864 to avoid a size mismatch. This seems fine to me, but is there any potential issue from the authors' perspective?

After fixing the size to 1920 x 870 through cropping, do we need to apply eigen_crop or garg_crop separately? If so, should eigen_crop be used, as specified in argo_swin.json?
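
For clarity, by eigen_crop / garg_crop I mean the standard KITTI-style evaluation masks; the ratios below are the commonly used ones from prior evaluation code, not values taken from your repository:

import numpy as np

def crop_mask(gt_height, gt_width, crop="eigen"):
    # Valid-region mask with the crop ratios commonly used for KITTI-style evaluation.
    mask = np.zeros((gt_height, gt_width), dtype=bool)
    if crop == "garg":
        mask[int(0.40810811 * gt_height):int(0.99189189 * gt_height),
             int(0.03594771 * gt_width):int(0.96405229 * gt_width)] = True
    else:  # "eigen"
        mask[int(0.3324324 * gt_height):int(0.91351351 * gt_height),
             int(0.0359477 * gt_width):int(0.96405229 * gt_width)] = True
    return mask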

Thank you for your excellent work.

Surface Normal Estimation procedure

Hi,
I am performing zero-shot testing for surface normal estimation on the KITTI dataset using your iDisc model. To understand the results better, I would like to know how iDisc computes surface normals and what ground-truth data it uses for surface normal estimation. Since I could not find much information in the paper, could you please point me to where this is described, or answer here?

Thanks in advance

Saved depth seems wrong

Hello, thanks for the great work!

I am running your model on my custom dataset. However, it seems that the depth saved from the NYUv2 model is wrong. I think this might be due to my misuse of your model's output. My script looks like this:

import os
import shutil
import torch
import numpy as np
import cv2
from tqdm import tqdm
from pathlib import Path
import sys, json
from PIL import Image
import torchvision.transforms.functional as TF
# I cloned your repo into a location from which I can import it directly
sys.path.insert(0, str(Path(__file__).parent.resolve() / "idisc"))
from idisc.models.idisc import IDisc
from idisc.utils import (DICT_METRICS_DEPTH, DICT_METRICS_NORMALS,
                         RunningMetric, validate)
model = IDisc.build(json.load(open('idisc/configs/nyu/nyu_swinl.json')))
model.load_pretrained("idisc/nyu_swinlarge.pt")
model = model.to("cuda")
model.eval()


# read in image
image = np.asarray(Image.open(image_path))
image = TF.normalize(TF.to_tensor(image), **{"mean": [0.5, 0.5, 0.5], "std": [0.5, 0.5, 0.5]})
image = image.unsqueeze(0).to("cuda")

with torch.inference_mode():
    depth, *_ = model(image)

TF.to_pil_image(depth[0].cpu()).save(save_path)

I am using the Swin-Large model. The image_path points to the attached 224x224 image (file 00001; I uploaded the exact image in case you need it for debugging). DPT generates a depth map like the one in the first attached screenshot, however the output of iDisc Swin-Large is the one in the second attached screenshot.
I believe I made some mistakes somewhere. I wonder if you can help me debug this.
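
For reference, the alternative way of saving the output that I am considering (my own sketch, continuing from the script above; it stores metric depth as a 16-bit PNG in millimeters, analogous to the /1000 convention mentioned for the released NYU predictions, instead of handing a float tensor to TF.to_pil_image):

import numpy as np
from PIL import Image

# `depth` and `save_path` as in the script above.
depth_np = depth[0, 0].cpu().numpy()               # (H, W) float depth in meters
depth_png = (depth_np * 1000.0).astype(np.uint16)  # millimeter scale
Image.fromarray(depth_png).save(save_path)         # 16-bit PNG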

Thanks!

_IncompatibleKeys failure loading weights.

Hi,

Thank you for sharing the code and weights. I am trying to load the surface normal estimator using the config and weights linked in https://github.com/SysCV/idisc?tab=readme-ov-file#normals:

import json
from idisc.models.idisc import IDisc

NORMALS_CONFIG_FILE = "models/nyunorm_swinl.json"
NORMALS_MODEL = "models/nyunormals_swinlarge.pt"
with open(NORMALS_CONFIG_FILE, "r") as f:
    config = json.load(f)

model = IDisc.build(config=config)
model.load_pretrained(NORMALS_MODEL)

-> Encoder is pretrained from: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window7_224_22k.pth Loading pretrained info: _IncompatibleKeys(missing_keys=['norm0.weight', 'norm0.bias', 'norm1.weight', 'norm1.bias', 'norm2.weight', 'norm2.bias', 'norm3.weight', 'norm3.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.weight', 'head.bias', 'layers.0.blocks.1.attn_mask', 'layers.1.blocks.1.attn_mask', 'layers.2.blocks.1.attn_mask', 'layers.2.blocks.3.attn_mask', 'layers.2.blocks.5.attn_mask', 'layers.2.blocks.7.attn_mask', 'layers.2.blocks.9.attn_mask', 'layers.2.blocks.11.attn_mask', 'layers.2.blocks.13.attn_mask', 'layers.2.blocks.15.attn_mask', 'layers.2.blocks.17.attn_mask'])

It also happens with Swin-Large for NYUv2 depth estimation.
Am I missing something?

Thank you.

When will the code be made public?

Hi, I have read your paper. Thanks for your work. Can you release your code? I would like to test the results on my own pictures. Thanks!

How to test the depth of a picture

Thank you very much for your work. I have already configured the environment, but I don’t know how to use your code to find the depth of a picture. How can I get the depth map?
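
In case it helps others with the same question, here is a minimal single-image sketch pieced together from the usage shown in other issues; the config/checkpoint paths, the image path, and the normalization statistics are placeholders that have to be adapted to the downloaded model:

import json
import numpy as np
import torch
import torchvision.transforms.functional as TF
from PIL import Image
from idisc.models.idisc import IDisc

# Build the model from a config and load the released weights (paths are placeholders).
model = IDisc.build(json.load(open("configs/nyu/nyu_swinl.json")))
model.load_pretrained("nyu_swinlarge.pt")
model = model.to("cuda").eval()

# Preprocess one RGB image (the normalization statistics must match the training config).
image = TF.to_tensor(np.asarray(Image.open("my_image.png")))
image = TF.normalize(image, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]).unsqueeze(0).to("cuda")

with torch.inference_mode():
    depth, *_ = model(image)  # depth: (1, 1, H, W) metric depth

# Save as a 16-bit PNG in millimeters for inspection.
Image.fromarray((depth[0, 0].cpu().numpy() * 1000).astype(np.uint16)).save("depth.png")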

NYU dataset normal annotation missing

Hi, thanks for your wonderful work! I have downloaded nyu.zip from google drive following your instructions, but I cannot find the annotation for normals. The annotation in train.txt and test.txt points to '518.8579' but I cannot find the file in the .zip. Could you please let me know how to find these normal annotations? Many thanks in advance.

Question about Diode indoor dataset

Hello, thank you for writing an excellent paper.
Is it correct that zero-shot testing for the mentioned Diode was conducted only on the 325 images within diode_indoor_val.txt? If so, is it also correct that diode_indoor_train.txt was not used in the paper?

question about geometrical data augmentation

Thanks for your great work! I have some naive questions about the geometric data augmentation. First, to the best of my knowledge, the geometric augmentation, in particular "random_scale", is not present in previous works; however, I did not see ablation experiments on data augmentation in your paper. Could you tell me whether you followed previous works for the data augmentation, or provide some ablation results? Second, I notice that the camera intrinsics are concatenated with the images and changed with the image scale, but I am not sure how you use the camera intrinsics during training or testing.
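
To make my second question concrete, this is how I currently understand the intrinsics update under a random scale. This is my own sketch, not taken from your code, and the example intrinsics are only illustrative:

import numpy as np

def scale_intrinsics(K, scale):
    # When the image is resized by `scale`, focal lengths and principal point
    # scale by the same factor (my understanding of how random_scale would affect K).
    K_scaled = K.copy()
    K_scaled[0, 0] *= scale  # fx
    K_scaled[1, 1] *= scale  # fy
    K_scaled[0, 2] *= scale  # cx
    K_scaled[1, 2] *= scale  # cy
    return K_scaled

K = np.array([[518.8579, 0.0, 320.0],
              [0.0, 518.8579, 240.0],
              [0.0, 0.0, 1.0]])  # illustrative intrinsics, principal point assumed at image center
print(scale_intrinsics(K, 1.25))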

Visual Comparison b/w Different Networks

Hi,
I was trying to run a visual comparison between the predictions you saved for the different networks (ResNet101, EfficientNet-B5, Swin-T, Swin-B, and Swin-L) on the NYUv2 dataset. I loaded the saved .png files as 16-bit unsigned integers, converted them to float32, and divided by 1000 as suggested, but the maximum value then differs between networks. For example, for the image 'bathroom/sync_depth_00045.png', the max for each network after scaling by 1000 is noted below:

ResNet101: 2.224
EB-5: 3.009
Swin-T: 2.991
Swin-B: 2.654
Swin-L: 2.974

Can you please advise whether I should clip the values to [0, 1] or use a different scale factor? Scaling by the max of each output would make the outputs look visually different.
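
For reference, this is the kind of shared-scale visualization I have in mind. It is only a sketch; the colormap, the paths, and the common maximum are my choices, not from your code:

import cv2
import matplotlib.cm as cm
import numpy as np
from PIL import Image

paths = {"resnet101": "r101/sync_depth_00045.png",
         "swinl": "swinl/sync_depth_00045.png"}  # placeholder paths to the saved predictions

depths = {name: cv2.imread(p, cv2.IMREAD_UNCHANGED).astype(np.float32) / 1000.0
          for name, p in paths.items()}          # 16-bit PNG -> meters

vmax = max(d.max() for d in depths.values())     # one shared scale for all networks
for name, d in depths.items():
    colored = (cm.magma(np.clip(d / vmax, 0.0, 1.0))[..., :3] * 255).astype(np.uint8)
    Image.fromarray(colored).save(f"{name}_vis.png")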
Thanks!

Question : How is GT-Based Depth Rescaling for Diode Indoor Dataset performed ?

Hi @lpiccinelli-eth, the paper mentions GT-based depth rescaling for the Diode Indoor dataset.

  1. Could you please explain the exact step and point to the corresponding code? (A sketch of my current guess follows this list.)
  2. Also, the Diode indoor split shared in the repo lists depths as '.png' files, but Diode ships its depth maps as '.npy' files. How did you generate the '.png' files, and would you be able to share them?
  3. Finally, a depth scale of 256 is used for Diode when running inference with an NYUv2-trained model. I noticed that Diode stores pixel values in metres, with a min of 0 and a max of ~300. Given this, is the scale still correct?
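
My current guess for question 1, sketched below, is the common median-based GT rescaling; this is my assumption, not necessarily the exact procedure used in the paper:

import numpy as np

def gt_rescale(pred, gt, min_depth=1e-3, max_depth=10.0):
    # Rescale the prediction by the ratio of medians over valid GT pixels
    # before computing the metrics (common practice; my guess, not the paper's code).
    mask = (gt > min_depth) & (gt < max_depth)
    scale = np.median(gt[mask]) / np.median(pred[mask])
    return pred * scale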

AttributeError: 'IDisc' object has no attribute 'no_sync'

Hi,

Firstly, thanks a lot for your wonderful work. I'm facing a problem when training the model with 4 GPUs. The DataLoader worked well, but I got an error like this:

Any suggestions will be appreciated!

Loaded 22441 images. Totally 717 invalid pairs are filtered
Loaded 491 images. Totally 206 invalid pairs are filtered
-> Local random sampler
Start training:
Loaded 22441 images. Totally 717 invalid pairs are filtered
Loaded 491 images. Totally 206 invalid pairs are filtered
-> Local random sampler
Start training:
Loaded 22441 images. Totally 717 invalid pairs are filtered
Loaded 491 images. Totally 206 invalid pairs are filtered
-> Local random sampler
Start training:
Loaded 22441 images. Totally 717 invalid pairs are filtered
Loaded 491 images. Totally 206 invalid pairs are filtered
-> Local random sampler
Start training:
Traceback (most recent call last):
  File "/project/6064028/tmp/code/idisc/scripts/train_DDP.py", line 288, in <module>
    main_worker(config, args)
  File "/project/6064028/tmp/code/idisc/scripts/train_DDP.py", line 168, in main_worker
    with context as fp, model.no_sync() as no_sync:
  File "/project/6064028/tmp/idisc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'IDisc' object has no attribute 'no_sync'
Traceback (most recent call last):
  File "/project/6064028/tmp/code/idisc/scripts/train_DDP.py", line 288, in <module>
    main_worker(config, args)
  File "/project/6064028/tmp/code/idisc/scripts/train_DDP.py", line 168, in main_worker
    with context as fp, model.no_sync() as no_sync:
  File "/project/6064028/tmp/idisc/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'IDisc' object has no attribute 'no_sync'
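
For context, my understanding is that no_sync() only exists on torch.nn.parallel.DistributedDataParallel, so the bare IDisc module has no such attribute. The guard I am considering as a workaround (my own sketch, not the repo's code; model and context are the variables from train_DDP.py):

import contextlib
from torch.nn.parallel import DistributedDataParallel as DDP

# Use DDP's no_sync() only when the model is actually wrapped in DDP;
# otherwise fall back to a no-op context manager.
no_sync_ctx = model.no_sync() if isinstance(model, DDP) else contextlib.nullcontext()
with context as fp, no_sync_ctx:
    ...  # gradient-accumulation step as in train_DDP.py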

About the results of my own test picture are inconsistent with the paper

Hello,
Based on the original code, I added a function for outputting a color depth map to the project.
However, on the KITTI dataset, using the same picture (from the first row of Figure 13 in the original paper), whether I use resnet101, effnetb5, swint, or swinl, I cannot obtain results as good as those shown in Figure 13 of the paper.
The following three pictures are, in order, my result with resnet101, my result with swinl, and the original paper's result.
Apart from that, the main part of my test code is as follows:

import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# `model`, `device` and the `visulization` colorize helper are set up earlier in my script.
img = Image.open(
    ".\\2011_09_26\\2011_09_26_drive_0002_sync\\image_02\\data\\0000000021.png")
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                     std=[0.229, 0.224, 0.225])])
img = transform(img).unsqueeze(0)

model.eval()
with torch.no_grad():
    preds, losses, _ = model(img.to(device), None, None)
preds = preds.cpu().numpy()[0, 0, :, :]

# Color depth map via the colorize helper.
img = visulization.colorize(preds)
img = Image.fromarray(img, mode='RGB')

# Grayscale depth map, min-max normalized to [0, 255].
min_val = np.min(preds)
max_val = np.max(preds)
scaled_array = (preds - min_val) / (max_val - min_val)
preds = scaled_array * 255
preds = Image.fromarray(preds.astype("uint8"), mode='L')

preds.save("gray.png")
img.save("RGB.png")
print("Done")

[Attached images: my resnet101 result, my swinl result, and Figure 13 from the paper]
