Comments (6)

lpiccinelli-eth commented on September 20, 2024

Thank you for your appreciation.

In my experience, the training loss is quite high, too. I would double-check whether the model is using the backbone pretrained on ImageNet: does it print something like "Encoder is pretrained from..." at the beginning of training?
Another thing to check might be a mismatch between the validation and training GT (for instance the depth_scale, which for KITTI is usually 256.0).
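
For reference, a minimal sketch of how KITTI GT depth maps are typically decoded; the 256.0 factor is the depth_scale mentioned above, and the helper name is illustrative only:

import numpy as np
from PIL import Image

def load_kitti_depth(path, depth_scale=256.0):
    # KITTI stores ground-truth depth as 16-bit PNGs: pixel_value / 256.0 = metres,
    # and a value of 0 marks pixels without a valid measurement.
    depth_png = np.asarray(Image.open(path), dtype=np.float32)
    depth = depth_png / depth_scale
    valid = depth_png > 0
    return depth, valid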

Any additional information may be helpful in understanding where the problem lies.
Best.

sunpihai-up commented on September 20, 2024

Thank you for your appreciation.

In my experience, the training loss is quite high, too. I would double-check whether the model is using the backbone pretrained on ImageNet: does it print something like "Encoder is pretrained from..." at the beginning of training? Another thing to check might be a mismatch between the validation and training GT (for instance the depth_scale, which for KITTI is usually 256.0).

Any additional information may be helpful in understanding where the problem lies. Best.

Thanks for your reply; I have investigated the code as per your suggestions.
First, I verified that the program correctly loads the swin_large_22k model pretrained on ImageNet.

Since I already have the pretrained model locally, I modified the code so that, instead of loading the checkpoint from the URL, it loads it from the local file path.

# Before Modification
if pretrained:
    print(f"\t-> Encoder is pretrained from: {pretrained}")
    pretrained_state = load_state_dict_from_url(pretrained, map_location="cpu")[
        "model"
    ]
    info = self.load_state_dict(deepcopy(pretrained_state), strict=False)
    print("Loading pretrained info:", info)

# After Modification
if pretrained:
    from urllib.parse import urlparse

    def is_url(path):
        # Check whether `pretrained` is a URL or a local path
        result = urlparse(path)
        return all([result.scheme, result.netloc])

    print(f"\t-> Encoder is pretrained from: {pretrained}")
    if is_url(pretrained):
        # Original behaviour: download the checkpoint from the URL
        pretrained_state = load_state_dict_from_url(pretrained, map_location="cpu")[
            "model"
        ]
    else:
        # Load the checkpoint from the local file system instead
        pretrained_state = torch.load(pretrained, map_location="cpu")["model"]
    info = self.load_state_dict(deepcopy(pretrained_state), strict=False)
    print("Loading pretrained info:", info)

Therefore, when I run the training program, four messages (one from each of the four processes) are printed: Encoder is pretrained from: /home/sph/data/swin_transformer/swin_large_patch4_window7_224_22k.pth.
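
As an additional sanity check (not part of the repository code), one can inspect the result that load_state_dict returns with strict=False to confirm the local checkpoint actually matched the backbone; the helper below is only a sketch with a hypothetical name:

import torch

def report_loading(model, checkpoint_path):
    # Load the checkpoint on CPU; the Swin checkpoints keep the weights under "model".
    state = torch.load(checkpoint_path, map_location="cpu")
    state = state.get("model", state)
    # With strict=False, load_state_dict returns the keys it could not match.
    info = model.load_state_dict(state, strict=False)
    print("missing keys:   ", len(info.missing_keys))
    print("unexpected keys:", len(info.unexpected_keys))
    return info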

Secondly, I looked into the alignment issue you mentioned between the training set and the test set. I used the KITTI Eigen splits for both. However, I couldn't find anything in the program that could cause a mismatch between them: the training set and the test set are loaded by the same code module (class KITTIDataset), and their depth_scale values are both set to 256.

Additionally, I ran evaluations on both the training set and the test set, using both the weights you provided and the weights I trained myself (with test.py from the repository).

Training set: 600 images randomly selected from the Eigen-split training set.

Test set: all 652 valid images from the Eigen-split test set.

Training set (600 random images, Eigen split):

Metric        Weights you provided     Weights I trained
Test/SILog    0.38459012309710183      0.3662095022201537
d05           0.9829                   0.9858
d1            0.997                    0.9973
d2            0.9995                   0.9995
d3            0.9999                   0.9999
rmse          1.1298                   0.9945
rmse_log      0.0408                   0.0371
abs_rel       0.0259                   0.022
sq_rel        0.0404                   0.0308
log10         0.0111                   0.0095
silog         3.7325                   3.6079

Test set (652 valid images, Eigen split):

Metric        Weights you provided     Weights I trained
Test/SILog    0.7632721244561964       1.2326601183995969
d05           0.8968                   0.7809
d1            0.9771                   0.9256
d2            0.9973                   0.9846
d3            0.9993                   0.9958
rmse          2.0665                   3.177
rmse_log      0.0772                   0.1249
abs_rel       0.0504                   0.081
sq_rel        0.1455                   0.3827
log10         0.0218                   0.0351
silog         7.0735                   11.2095

Both models perform similarly on the training set (my trained model even performs slightly better). However, there is a significant difference between the two models on the test set. This suggests overfitting, yet during training the evaluation metrics never first improved and then deteriorated.

I look forward to hearing your further suggestions. Thank you once again for your reply. Best wishes to you!

lpiccinelli-eth commented on September 20, 2024

You could try using the provided checkpoint, test it on your data/code, and see whether the results match the ones provided.
If they match, the problem is the training; if not, the problem might be the data.
Best.

sunpihai-up commented on September 20, 2024

You could try using the provided checkpoint, test it on your data/code, and see whether the results match the ones provided. If they match, the problem is the training; if not, the problem might be the data. Best.

Yes, I did exactly that; the table I provided describes this experiment.
I wanted to evaluate the effectiveness of my training, so I tested both the checkpoint you provided and the one I trained on the training set.
I also wanted to check my data, so I tested both checkpoints on the test set.
However, the results were peculiar. The checkpoint you provided performed well on both the training and test sets. On the other hand, the checkpoint I trained outperformed yours on the training set but performed poorly on the test set.

So, the fact that the checkpoint you provided performs well on both my training and test sets suggests that there is probably no problem with my dataset.
The checkpoint I trained myself performs very well on the training set, which indicates that my training process is effective.
However, the strange thing is that my own trained checkpoint performs even better than yours on the training set, yet performs very poorly on the test set. This looks like overfitting, but based on the evaluation-metric trends during training, it doesn't appear that overfitting occurred.

lpiccinelli-eth commented on September 20, 2024

Honestly, I do not know: you are not seeing any overfitting, but the model does not generalize either, since the training metrics are good while the validation ones are not. Moreover, KITTI validation and training are pretty similar, so I wonder why there is such a drop.
In addition, I was able to reproduce the results with Swin-Tiny: I checked the validation after the first 1k steps and it matched my original training.

Either the training set is different from the one I used (the "new" Eigen split, namely the one released after 2019), or something in the configs (augmentations, training schedule/lr, etc.) is different.
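
If it helps, here is a minimal sketch of how the two configurations could be compared side by side; it assumes JSON config files as in the repository, and the file paths are placeholders:

import json

def diff_configs(a, b, prefix=""):
    # Recursively print every key whose value differs between the two configs.
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            diff_configs(va, vb, prefix + key + ".")
        elif va != vb:
            print(f"{prefix}{key}: {va!r} != {vb!r}")

# Placeholder paths: the config shipped in the repo vs. the one used for the run.
with open("configs/kitti/kitti_swinl.json") as f:
    repo_cfg = json.load(f)
with open("my_run_config.json") as f:
    run_cfg = json.load(f)

diff_configs(repo_cfg, run_cfg)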

sunpihai-up commented on September 20, 2024

Honestly, I do not know: you are not seeing any overfitting, but the model does not generalize either, since the training metrics are good while the validation ones are not. Moreover, KITTI validation and training are pretty similar, so I wonder why there is such a drop. In addition, I was able to reproduce the results with Swin-Tiny: I checked the validation after the first 1k steps and it matched my original training.

Either the training set is different from the one I used (the "new" Eigen split, namely the one released after 2019), or something in the configs (augmentations, training schedule/lr, etc.) is different.

Thank you for your assistance! This situation is indeed perplexing. I believe we can rule out differences in the dataset and configuration, since I used the kitti_eigen_test.txt and kitti_eigen_train.txt files provided in the repository and verified that the file paths in the loaded dataset exactly match the split. Furthermore, I haven't made any modifications to the configuration file.
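
For completeness, this is roughly how that path check can be done. It is only a sketch: it assumes each line of the split file lists an image path and a GT path relative to the data root (entries without GT are often marked "None"), and the paths below are placeholders.

import os

data_root = "/path/to/kitti"             # placeholder
split_file = "kitti_eigen_train.txt"     # placeholder

missing = []
with open(split_file) as f:
    for line in f:
        # First two fields: RGB image path and ground-truth depth path.
        for rel in line.split()[:2]:
            if rel.lower() == "none":    # entries without GT
                continue
            if not os.path.exists(os.path.join(data_root, rel)):
                missing.append(rel)

print(f"{len(missing)} files listed in the split were not found under {data_root}")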

I would like to build on your work, so I will continue trying to debug the issue.

Once again, thank you for your help, and I wish you a pleasant day!
