Blank output after training on KITTI about sfmlearner-pytorch HOT 29 CLOSED

clementpinard commented on September 3, 2024

Blank output after training on KITTI

from sfmlearner-pytorch.

Comments (29)

ClementPinard commented on September 3, 2024

Are you sure you are on master?
Can you post the exact training command you used?
I'm going to try it myself on the latest version of the code to see what's going on.

mathmanu commented on September 3, 2024

I am using the master branch. The only change I made was adding an option to specify the checkpoint directory, and I am reasonably sure that it won't change the output.
python3 train.py ./data/kitti/kitti_rawdata_formatted/ -b4 -m0.2 -s0.1 --epoch-size 3000 --sequence-length 3 --log-output --checkpoint_dir ./training/checkpoints/

To check whether this problem could be caused by multi-GPU issues, I also made the following change in train.py:
cudnn.benchmark = True
disp_net = torch.nn.DataParallel(disp_net, device_ids=[0])
pose_exp_net = torch.nn.DataParallel(pose_exp_net, device_ids=[0])
and then ran the training again. I still get blank images during training (training has not finished yet, so I cannot run the inference test).

ClementPinard commented on September 3, 2024

Thanks for the info. I'm launching a training run to see how it goes. In case the problem is caused by multi-GPU (I only have one GPU), can you try to completely mask all GPUs other than your first one?

CUDA_VISIBLE_DEVICES=0 python3 train.py ./data/kitti/kitti_rawdata_formatted/ -b4 -m0.2 -s0.1 --epoch-size 3000 --sequence-length 3 --log-output --checkpoint_dir ./training/checkpoints/
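
For reference, the same masking can be done from inside a Python script, as long as the variable is set before anything initializes CUDA. This is a minimal sketch and is not taken from the repository's train.py:

```python
import os

# Expose only the first physical GPU to this process. This has to happen
# before CUDA is initialized, so the safest place is before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print("GPUs visible to this process:", visible)
```

Setting the variable on the command line, as above, has the same effect and avoids any ordering concerns inside the script.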

mathmanu commented on September 3, 2024

It's going on - seeing blank images on tensorboard (training).

mathmanu commented on September 3, 2024

This problem was there even before you added the multi-gpu training.

ClementPinard commented on September 3, 2024

I have the same problem, which is odd since I have a very similar repo where it worked. I'm working on it

mathmanu commented on September 3, 2024

Can you also update the pretrained model that is obtained using download_model.sh?
There was an error when I tried to use this model. I did not try download_model_5frame.sh, so I don't know if it works.

ClementPinard commented on September 3, 2024

The problem seems to be with the mask loss weight: 0.1 is not enough, meaning that the mask will be 0 everywhere, leading to a photometric error of 0. If you keep -m 1, you should not have the problem. I just checked on my machine and it worked. Does it work for you?
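
To see why a small mask weight can push the mask toward zero, here is a toy per-pixel model. It assumes the total loss has the form m * e + w * (-log m), where e is the photometric error, m is the explainability mask value in (0, 1] and w is the mask loss weight (-m). That form follows the cross-entropy regularizer described in the SfMLearner paper; it is an assumption about the code, not a copy of loss_functions.py, and the function names are hypothetical.

```python
def best_mask(photometric_error, mask_weight):
    """Mask value m in (0, 1] that minimizes m * e + w * (-log m).

    Setting the derivative e - w / m to zero gives m = w / e, clipped to 1.
    """
    return min(1.0, mask_weight / photometric_error)


def masked_photo_loss(photometric_error, mask_weight):
    """Photometric term that remains once the mask has adapted."""
    return best_mask(photometric_error, mask_weight) * photometric_error


e = 0.5  # hypothetical per-pixel photometric error
for w in (0.1, 0.2, 1.0):
    print(w, best_mask(e, w), masked_photo_loss(e, w))
```

In this toy model the masked photometric term can never exceed w, so with a small weight the network can lower the loss by shrinking the mask instead of improving depth and pose. The real training dynamics may collapse more abruptly than this sketch suggests.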

Also, download_model is actually a file from the repo I initially worked on, so you are basically getting tensorflow models. Although it could be possible to convert them to pytorch, I did not take the time to do it. I'm working on a test script, and once I have satisfying scores I'll release a pretrained pytorch network.

mathmanu commented on September 3, 2024

With this change, I can see some meaningful outputs on tensorboard. Hopefully I'll be able to try inference tomorrow after the training finishes.

In the documentation, you suggest running with -b4 --epoch-size 3000, but 3000 is not the default epoch-size used for -b4. Suspecting that 3000 is not the correct value, I am not specifying it now.

I am now using a batch size of 48 to speed up the training - not sure if this is okay.

mathmanu commented on September 3, 2024

Finished 72 epochs - I am able to see reasonably good depth output now.

Thank you for your help. Looking forward to the test script.

Note: I also added batch norm layers to the models, which made the convergence faster.

ClementPinard commented on September 3, 2024

Test script available! Please let me know how it goes for you, with and without your posenet (see README for info).

As a reminder, the performance claimed by the paper is:

abs_rel	sq_rel	rms	log_rms	d1_all	a1	a2	a3
0.1978	1.8363	6.5645	0.2750	0.0000	0.7176	0.9010	0.9606

mathmanu commented on September 3, 2024

I got the error:
Traceback (most recent call last):
  File "test_disp.py", line 137, in <module>
    main()
  File "test_disp.py", line 56, in main
    test_files =[file.relpathto(dataset_dir) for file in sum([dataset_dir.files('*.{}'.format(ext)) for ext in args.img_exts], [])]
AttributeError: 'Namespace' object has no attribute 'img_exts'

And then I added this line to test_disp.py to fix it:
parser.add_argument("--img-exts", default=['png', 'jpg', 'bmp'], nargs='*', type=str, help="images extensions to glob")
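
The AttributeError arises because argparse only creates attributes for arguments that were registered on the parser. A minimal standalone sketch of the fix above (the parser here is a stand-in, not the full test_disp.py parser):

```python
import argparse

parser = argparse.ArgumentParser(description="stand-in for the test_disp.py parser")
# argparse turns the dashes in "--img-exts" into underscores: args.img_exts
parser.add_argument("--img-exts", default=["png", "jpg", "bmp"], nargs="*",
                    type=str, help="image extensions to glob")

args = parser.parse_args([])                      # no flag: default list is used
print(args.img_exts)
args = parser.parse_args(["--img-exts", "png"])   # explicit override
print(args.img_exts)
```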

My results are not as good as what you showed in the table, but I used a different set of hyperparameters. I'll re-run training with the recommended settings and see if I get better results.

mathmanu commented on September 3, 2024

The training command that I used is:
CUDA_VISIBLE_DEVICES=0 python3 train.py ./data/kitti/kitti_rawdata_formatted/ -b 4 -m 1.0 -s 0.1 --sequence-length 3 --log-output --checkpoint_dir ./training/checkpoints/

And the result that I got is:
abs_rel, sq_rel, rms, log_rms, a1, a2, a3
12.3183, 3329.0864, 127.8886, 2.2357, 0.0389, 0.0788, 0.1223

This seems to be much worse than what you showed.

ClementPinard commented on September 3, 2024

What command did you use for testing? I will check whether the testing code is right (most of it is actually copied from the original tensorflow code).

mathmanu commented on September 3, 2024

Test command:

dispnet_weights="./training/checkpoints/200epochs,b4,lr0.0002/11-17-09:58/dispnet_model_best.pth.tar"
posenet_weights="./training/checkpoints/200epochs,b4,lr0.0002/11-17-09:58/exp_pose_model_best.pth.tar"

python3 test_disp.py --pretrained-dispnet=$dispnet_weights --pretrained-posenet=$posenet_weights --dataset-dir="/data/hdd/datasets/common/other/kitti/kitti_rawdata/data/" --dataset-list="./kitti_eval/test_files_eigen.txt" --output-dir="training/output/"

ClementPinard commented on September 3, 2024

Hello, the code has been updated, you might want to try it now !
For the record, my results are not as good as the paper yet, but we'll get there :D
My results :

Results with scale factor determined by PoseNet :

abs_rel	sq_rel	rms	log_rms	a1	a2	a3
0.4685	29.6911	12.7131	0.4897	0.4586	0.6964	0.8236

Results with scale factor determined by GT/prediction ratio (like the original paper) :

abs_rel	sq_rel	rms	log_rms	a1	a2	a3
0.2646	16.8013	10.2972	0.3406	0.6075	0.8520	0.9358
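
The second table rescales each prediction before the metrics are computed. Here is a minimal sketch of that idea, assuming the scale factor is the ratio between the ground-truth median and the predicted median, and using made-up depth values. The function names are hypothetical and this is not the code from test_disp.py.

```python
from statistics import median


def rescale_to_gt(pred, gt):
    """Scale predictions so that their median matches the ground-truth median."""
    scale = median(gt) / median(pred)
    return [p * scale for p in pred]


def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|gt - pred| / gt)."""
    return sum(abs(g - p) / g for p, g in zip(pred, gt)) / len(gt)


def a1(pred, gt, threshold=1.25):
    """Fraction of pixels where max(gt / pred, pred / gt) < threshold."""
    ratios = [max(g / p, p / g) for p, g in zip(pred, gt)]
    return sum(r < threshold for r in ratios) / len(ratios)


gt = [10.0, 20.0, 40.0]    # hypothetical ground-truth depths
pred = [1.0, 2.0, 4.5]     # hypothetical predictions, roughly right up to scale
scaled = rescale_to_gt(pred, gt)
print(abs_rel(pred, gt), abs_rel(scaled, gt), a1(scaled, gt))
```

Because monocular training only recovers depth up to an unknown scale, the unscaled error in this example is large even though the relative structure is almost correct, which is consistent with the gap between the two tables.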

mathmanu commented on September 3, 2024

I tried training with the latest code, but got the following error:

=> will save everything to checkpoints/100epochs,b4,lr0.0002,p1,m1.0,s0.1/11-29-12:01
=> fetching scenes in './data/kitti/kitti_rawdata_formatted/'
39856 samples found in 65 train scenes
4830 samples found in 13 valid scenes
=> creating model
=> setting adam solver

Traceback (most recent call last):
  File "train.py", line 398, in <module>
    main()
  File "train.py", line 183, in main
    train_loss = train(train_loader, disp_net, pose_exp_net, optimizer, args.epoch_size, logger, train_writer)
  File "train.py", line 244, in train
    loss_1 = photometric_reconstruction_loss(tgt_img_var, ref_imgs_var, intrinsics_var, intrinsics_inv_var, depth, explainability_mask, pose)
  File "/home/a0393608/files/work/code/vision/github/ClementPinard/SfmLearner-Pytorch/loss_functions.py", line 43, in photometric_reconstruction_loss
    loss += one_scale(tgt_img, ref_imgs, intrinsics, intrinsics_inv, d, mask, pose)
  File "/home/a0393608/files/work/code/vision/github/ClementPinard/SfmLearner-Pytorch/loss_functions.py", line 29, in one_scale
    diff *= explainability_mask[:,i:i+1].expand_as(diff)
  File "/user/a0393608/files/apps/anaconda2/envs/pytorch/lib/python3.5/site-packages/torch/autograd/variable.py", line 833, in __imul__
    return self.mul_(other)
  File "/user/a0393608/files/apps/anaconda2/envs/pytorch/lib/python3.5/site-packages/torch/autograd/variable.py", line 347, in mul_
    raise RuntimeError("mul_ only supports scalar multiplication")
RuntimeError: mul_ only supports scalar multiplication

I used the following command for training:
python3 train.py ./data/kitti/kitti_rawdata_formatted/ -b 4 -m 1.0 -s 0.1 --sequence-length 3 --log-output --checkpoint_dir ./training/checkpoints/ --epochs 100

mathmanu commented on September 3, 2024

Replaced the lines

        if explainability_mask is not None:
            diff *= explainability_mask[:,i:i+1].expand_as(diff)

with

        if explainability_mask is not None:
            mask_i = explainability_mask[:,i:i+1].expand_as(diff)
            diff = torch.mul(diff, mask_i)

and it worked. Is this correct?

ClementPinard commented on September 3, 2024

Yes, I saw the problem, and your solution solves it. The actual problem is the in-place *= expression. You could also have written diff = diff*mask_i, which is equivalent to torch.mul.
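
The two spellings differ in how Python dispatches them: diff *= x calls __imul__ (in-place), while diff = diff * x calls __mul__ and rebinds the name to a new object. A toy class, unrelated to PyTorch's actual Variable implementation, that mimics the behaviour seen in the traceback:

```python
class ToyTensor:
    """Hypothetical stand-in whose in-place multiply only accepts scalars."""

    def __init__(self, values):
        self.values = list(values)

    def __mul__(self, other):
        # Out-of-place: builds and returns a new object.
        if isinstance(other, ToyTensor):
            return ToyTensor(a * b for a, b in zip(self.values, other.values))
        return ToyTensor(a * other for a in self.values)

    def __imul__(self, other):
        # In-place: mimics the restriction reported in the traceback.
        if isinstance(other, ToyTensor):
            raise RuntimeError("mul_ only supports scalar multiplication")
        self.values = [a * other for a in self.values]
        return self


diff = ToyTensor([1.0, 2.0])
mask = ToyTensor([0.5, 0.0])

try:
    diff *= mask            # dispatches to __imul__ and fails
except RuntimeError as err:
    print("in-place failed:", err)

diff = diff * mask          # dispatches to __mul__ and succeeds
print(diff.values)
```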

mathmanu commented on September 3, 2024

I get a crash when I run the test script:

dispnet_weights="/user/a0393608/files/work/code/vision/github/ClementPinard/SfmLearner-Pytorch/training/backup/100epochs,b60,lr0.0002,p1,m0.2,s0.1/11-29-13:20/dispnet_model_best.pth.tar"
posenet_weights="/user/a0393608/files/work/code/vision/github/ClementPinard/SfmLearner-Pytorch/training/backup/100epochs,b60,lr0.0002,p1,m0.2,s0.1/11-29-13:20/exp_pose_model_best.pth.tar"

python3 test_disp.py --pretrained-dispnet=$dispnet_weights --pretrained-posenet=$posenet_weights --dataset-dir="/data/hdd/datasets/common/other/kitti/kitti_rawdata/data/" --dataset-list="/user/a0393608/files/work/code/vision/github/ClementPinard/SfmLearner-Pytorch/kitti_eval/test_files_eigen.txt" --output-dir="training/output/"

The error reported is:
Traceback (most recent call last):
  File "test_disp.py", line 144, in <module>
    main()
  File "test_disp.py", line 50, in main
    pose_net.load_state_dict(weights['state_dict'])
  File "/user/a0393608/files/apps/anaconda2/envs/pytorch/lib/python3.5/site-packages/torch/nn/modules/module.py", line 355, in load_state_dict
    .format(name))
KeyError: 'unexpected key "upconv5.0.weight" in state_dict'

The inference script runs fine - but I am not sure if the output is correct.

mathmanu commented on September 3, 2024

Updated to the latest code today and got a crash:
        if i % args.print_freq == 0:
            logger.train_writer.write('Train: Time {} Data {} Loss {}'.format(
                batch_time, data_time=data_time, loss=losses))

should be changed to:
        if i % args.print_freq == 0:
            logger.train_writer.write('Train: Time {} Data {} Loss {}'.format(
                batch_time, data_time, losses))
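
The crash comes from mixing auto-numbered {} placeholders with keyword arguments: each {} consumes a positional argument, so the keyword arguments are never used to fill them. A small sketch with placeholder values:

```python
template = 'Train: Time {} Data {} Loss {}'
batch_time, data_time, losses = 0.12, 0.03, 1.5   # placeholder values

try:
    # Only one positional argument is supplied for three {} fields.
    template.format(batch_time, data_time=data_time, loss=losses)
except IndexError as err:
    print("format failed:", err)

# Either pass everything positionally ...
print(template.format(batch_time, data_time, losses))

# ... or name the fields so that the keywords are actually used.
named = 'Train: Time {} Data {data_time} Loss {loss}'
print(named.format(batch_time, data_time=data_time, loss=losses))
```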

mathmanu commented on September 3, 2024

There is another error during training, in the validation phase. Please help.

=> will save everything to ./training/checkpoints/100epochs,epochSize1,seq3,b48,lr0.0002,p1,m0.25,s0.1/11-30-14:45
=> fetching scenes in './data/kitti/kitti_rawdata_formatted/'
39856 samples found in 65 train scenes
4830 samples found in 13 valid scenes
=> creating model
=> setting adam solver

Traceback (most recent call last):
  File "train.py", line 387, in <module>
    main()
  File "train.py", line 189, in main
    valid_photo_loss, valid_exp_loss, valid_total_loss = validate(val_loader, disp_net, pose_exp_net, epoch, logger, output_writers)
  File "train.py", line 374, in validate
    logger.valid_writer.write('valid: Time {} Loss {}'.format(batch_time, losses))
  File "/home/a0393608/files/work/code/vision/github/ClementPinard/SfmLearner-Pytorch/logger.py", line 87, in __repr__
    val = ' '.join(['{:.{}f}'.format(v, self.precision) for v in self.val])
  File "/home/a0393608/files/work/code/vision/github/ClementPinard/SfmLearner-Pytorch/logger.py", line 87, in <listcomp>
    val = ' '.join(['{:.{}f}'.format(v, self.precision) for v in self.val])
TypeError: unsupported format string passed to Variable.__format__

mathmanu commented on September 3, 2024

@ClementPinard I am stuck because of the above error - not able to proceed. Kindly help quickly.

ClementPinard commented on September 3, 2024

At L339, append .data[0] to the end of the line. A fix is coming.
EDIT: Done
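
The TypeError occurs because a format spec such as .4f is only understood by objects that implement __format__ for it. A wrapper object falls back to object.__format__, which rejects any non-empty spec. Below is a sketch with a hypothetical Wrapped class standing in for the Variable. Extracting the plain Python number first, which is what appending .data[0] did in the PyTorch version used in this thread, avoids the error.

```python
class Wrapped:
    """Hypothetical container holding one number, like a 1-element Variable."""

    def __init__(self, value):
        self.value = value


loss = Wrapped(0.123456)
precision = 4

try:
    '{:.{}f}'.format(loss, precision)     # same format string as in logger.py
except TypeError as err:
    print("format failed:", err)

# Unwrap to a plain float before handing the value to the logger.
text = '{:.{}f}'.format(loss.value, precision)
print(text)
```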

mathmanu commented on September 3, 2024

Thanks. It worked.

Also kindly check the problem during test and inference that I reported earlier.

ClementPinard commented on September 3, 2024

For next time, can you submit a new issue when the problem is different? It helps me keep track of what there is to solve, because this issue was flagged as "solved" some time ago and, for example, I forgot about your test inference problem.

Your problem is now fixed. load_state_dict is now called with the strict=False option, because we don't need the explainability masks to compute displacement magnitude.
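
Conceptually, non-strict loading copies only the entries whose names the model knows and ignores the rest. A plain-dictionary sketch of that idea, which can also serve as a workaround on releases where the strict option is not available. The key names other than upconv5.0.weight are made up, and load_matching_keys is a hypothetical helper, not a PyTorch function.

```python
def load_matching_keys(model_state, checkpoint_state):
    """Copy checkpoint entries whose key exists in the model; report the others."""
    unexpected = [k for k in checkpoint_state if k not in model_state]
    missing = [k for k in model_state if k not in checkpoint_state]
    for key, value in checkpoint_state.items():
        if key in model_state:
            model_state[key] = value
    return missing, unexpected


# Hypothetical pose-only model: it has no explainability-mask layers.
model_state = {"conv1.0.weight": None, "pose_pred.weight": None}
checkpoint = {
    "conv1.0.weight": [0.1],
    "pose_pred.weight": [0.2],
    "upconv5.0.weight": [0.3],   # presumably part of the mask branch
}

missing, unexpected = load_matching_keys(model_state, checkpoint)
print(missing, unexpected)
```

With real modules, the filtered dictionary would then be passed to load_state_dict in the usual way.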

mathmanu commented on September 3, 2024

Sure. I'll do that. Where did you learn so much about Pytorch?

ClementPinard commented on September 3, 2024

Haha thanks, I've been working on it for some time now, and before that I worked on torch.

Also, the pytorch forum and slack channels are quite good resources for learning it, in addition to the docs:

https://discuss.pytorch.org/
http://pytorch.org/support/
http://pytorch.org/docs/master/ # Careful, some features in master are not available in the latest release, e.g. lr_scheduler and the strict option for load_state_dict

mathmanu commented on September 3, 2024

Thanks.
