
Comments (16)

Liang-Sen avatar Liang-Sen commented on May 22, 2024 5

@windj007
Thanks for your reply.
I just removed the "loss_segm_pl" keys from the checkpoint and it worked.

Sharing the stripped checkpoint here:
https://drive.google.com/file/d/1YTiKZ1hQnKvTEbXIxFXjGg61pBAch_N7/view?usp=sharing
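
For anyone who prefers to strip the keys locally rather than download the file, a minimal sketch follows. It assumes the checkpoint is a regular PyTorch Lightning checkpoint with the weights stored under `state_dict`; the function names and file paths are hypothetical and not part of the LaMa code base.

```python
from typing import Any, Dict


def strip_prefix(state_dict: Dict[str, Any], prefix: str = "loss_segm_pl.") -> Dict[str, Any]:
    """Return a copy of state_dict without the entries whose key starts with prefix."""
    return {key: value for key, value in state_dict.items() if not key.startswith(prefix)}


def strip_checkpoint(src_path: str, dst_path: str, prefix: str = "loss_segm_pl.") -> None:
    """Load a checkpoint, drop the prefixed weights, and save the result to dst_path."""
    import torch  # imported lazily so strip_prefix can be used without PyTorch installed

    checkpoint = torch.load(src_path, map_location="cpu")
    # Assumption: Lightning keeps the module weights under the "state_dict" key.
    checkpoint["state_dict"] = strip_prefix(checkpoint["state_dict"], prefix)
    torch.save(checkpoint, dst_path)


# Hypothetical usage (paths are placeholders):
# strip_checkpoint("big-lama-with-discr/best.ckpt", "big-lama-stripped.ckpt")
```

Every other entry in the checkpoint is left untouched, so only the perceptual-loss weights are dropped.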

from lama.

AchoWu avatar AchoWu commented on May 22, 2024 5

I summed up the experience above and trained big-lama like this. If I made any mistakes, please correct me.
1. Modified pytorch_lightning/trainer/connectors/checkpoint_connector.py, line 106:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f9f4853f3663404362c7de8614a504b0403c25b8/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L106

        # restore training state
        self.restore_training_state(checkpoint)

to

        # restore training state
        try:
            self.restore_training_state(checkpoint)
        except KeyError:
            rank_zero_warn(
                "File at `resume_from_checkpoint` Trying to restore training state but checkpoint contains only the model."
            )

2. Modified lama-main/saicinpainting/training/trainers/base.py, line 109:

            if self.config.losses.get("resnet_pl", {"weight": 0})['weight'] > 0:
                self.loss_resnet_pl = ResNetPL(**self.config.losses.resnet_pl)

to

            if self.config.losses.get("sege_pl", {"weight": 0})['weight'] > 0:
                self.loss_sege_pl = ResNetPL(**self.config.losses.sege_pl)

3. Run:

python bin/train.py -cn big-lama location=my_dataset data.batch_size=10 +trainer.kwargs.resume_from_checkpoint=abspath\\to\\big-lama-with-discr-remove-loss_segm_pl.ckpt

https://drive.google.com/file/d/1YTiKZ1hQnKvTEbXIxFXjGg61pBAch_N7/view?usp=sharing
model shared by @Liang-Sen


yzhouas avatar yzhouas commented on May 22, 2024

Could you also share the training log or time of big-lama? Thanks so much.


windj007 avatar windj007 commented on May 22, 2024

Is the big-lama model trained on places-challenge dataset?

Not exactly Places Challenge - it was trained on a subset of 157 categories from Places Challenge. Please refer to supp.mat for exact list of these categories.

Does it perform much better than a big-lama trained on places2-standard?

The difference is pretty noticeable to the naked eye, but the improvement from standard -> subset-of-challenge is smaller than the most important contributions from our paper (e.g. masks, architecture and segm-pl).


windj007 avatar windj007 commented on May 22, 2024

Could you also share the training log or time of big-lama? Thanks so much.

It took approximately 12 days to train this big-lama on 8xV100 32GB with total batch size of 120 (8 gpus x 15 samples).


windj007 avatar windj007 commented on May 22, 2024

Is it possible to release the full checkpoints of the big-lama model, so we can finetune it on other data?

I've just uploaded the full checkpoint to https://disk.yandex.ru/d/wJ2Ee0f1HvasDQ, subfolder big-lama-with-discr. Unlike the other checkpoints, this one has the discriminator and SegmPL weights included.

Please share your experience with fine-tuning: does it help, and by how much?


yzhouas avatar yzhouas commented on May 22, 2024

Thanks so much! That is super helpful!


windj007 avatar windj007 commented on May 22, 2024

I'll close this issue for now - feel free to reopen it if you have any issues with fine-tuning.


affromero avatar affromero commented on May 22, 2024

Hello,
I am having some issues loading the big-lama-with-discr for finetuning. Please correct me if I am wrong but I notice that the SegmPL weights are loss_segm_pl.impl... in the .ckpt, but the current trainer loads it as loss_resnet_pl.impl... https://github.com/saic-mdal/lama/blob/ede702b19b027ad2c0380419b2b71a90fe90a14f/saicinpainting/training/trainers/base.py#L110

After modifying this, I get the following error:

    'Trying to restore training state but checkpoint contains only the model.'
KeyError: 'Trying to restore training state but checkpoint contains only the model. This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`.'

@yzhouas did you have any success with this? I am wondering if it is just me.
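
One way to reconcile the two names without editing the trainer is to rename the keys in the checkpoint instead. The sketch below assumes that only the prefix differs (`loss_segm_pl.` in the file, `loss_resnet_pl.` in the current code) and that the rest of each key is identical; `rename_prefix` is a hypothetical helper, and the approach has not been verified against the actual checkpoint.

```python
from typing import Any, Dict


def rename_prefix(
    state_dict: Dict[str, Any],
    old: str = "loss_segm_pl.",
    new: str = "loss_resnet_pl.",
) -> Dict[str, Any]:
    """Return a copy of state_dict where keys starting with old start with new instead."""
    renamed = {}
    for key, value in state_dict.items():
        if key.startswith(old):
            # Replace only the leading prefix; the remainder of the key is kept as-is.
            key = new + key[len(old):]
        renamed[key] = value
    return renamed
```

The helper works on the dictionary stored under `state_dict` in the loaded checkpoint; the result would then be written back and saved with `torch.save`.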


affromero avatar affromero commented on May 22, 2024

Apparently, this is a known issue in PyTorch Lightning, and for the suggested PyTorch Lightning 1.2.9 the problem seems to be here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/f9f4853f3663404362c7de8614a504b0403c25b8/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L106

        # restore training state
        self.restore_training_state(checkpoint)

So, a very ugly hack would be to bypass it as:

        # restore training state
        try:
            self.restore_training_state(checkpoint)
        except KeyError:
            rank_zero_warn(
                "File at `resume_from_checkpoint` Trying to restore training state but checkpoint contains only the model."
            )
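
Before patching the library, it may help to confirm that the checkpoint really lacks the training state. The sketch below assumes that Lightning 1.2.x raises this `KeyError` when `optimizer_states` or `lr_schedulers` is absent from the top level of the checkpoint dictionary; those key names and the helper name are assumptions based on the error message, not something stated in this thread.

```python
from typing import Any, Dict, List

# Assumption: these are the top-level keys Lightning 1.2.x expects when resuming training.
TRAINING_STATE_KEYS = ("optimizer_states", "lr_schedulers")


def missing_training_state(checkpoint: Dict[str, Any]) -> List[str]:
    """Return the expected training-state keys that are absent from the checkpoint."""
    return [key for key in TRAINING_STATE_KEYS if key not in checkpoint]


# Hypothetical usage (requires PyTorch; the path is a placeholder):
# import torch
# checkpoint = torch.load("big-lama-with-discr/best.ckpt", map_location="cpu")
# print(missing_training_state(checkpoint))
```

An empty list would suggest that the error comes from somewhere else; a non-empty one would be consistent with a weights-only checkpoint.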


windj007 avatar windj007 commented on May 22, 2024

Hi @affromero !

Yeah, I forgot that we renamed this variable after training big-lama... Another possible solution is to just strip loss_segm_pl.impl... from the checkpoint altogether - it is initialized from a fixed ade20k checkpoint anyway.

Trying to restore training state but checkpoint contains only the model.

I have not faced this issue yet. Have you resolved it?


marcelsan avatar marcelsan commented on May 22, 2024

Hi @windj007,

I looked into the Supplementary Material but I was not able to find what categories from Places Challenge were used for training Big-Lama. Could you please list these categories? Also, why haven't you used the entire Places Challenge for training Big-Lama?

Thank you


Liang-Sen avatar Liang-Sen commented on May 22, 2024

Hi @windj007 ,

I am having the same issue loading big-lama-with-discr for fine-tuning; please correct me if I made any mistake.

I ran this command:
python bin/train.py -cn big-lama location=my_dataset data.batch_size=10 +trainer.kwargs.resume_from_checkpoint=path\\to\\big-lama-with-discr\\best.ckpt

and got this error message:
RuntimeError: Error(s) in loading state_dict for DefaultInpaintingTrainingModule:
Missing key(s) in state_dict: "loss_resnet_pl.impl.conv1.weight", "loss_resnet_pl......
Unexpected key(s) in state_dict: "loss_segm_pl.impl.conv1.weight", "loss_segm_pl.impl....

I modified base.py Line 109:
From:
if self.config.losses.get("resnet_pl", {"weight": 0})['weight'] > 0: self.loss_resnet_pl = ResNetPL(**self.config.losses.resnet_pl)

To:
if self.config.losses.get("sege_pl", {"weight": 0})['weight'] > 0: self.loss_sege_pl = ResNetPL(**self.config.losses.sege_pl)

The Missing key error disappeared, but I still get the Unexpected key error message:
Unexpected key(s) in state_dict: "loss_segm_pl.impl.conv1.weight", "loss_segm_pl.impl...._

Do you have any suggestions for this?
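
To see exactly which modules disagree, the key names of the model and of the checkpoint can be compared directly. This is a minimal sketch with hypothetical helper names; it reproduces the "missing" and "unexpected" lists from the error message and groups them by top-level module name.

```python
from typing import Iterable, List, Tuple


def diff_keys(model_keys: Iterable[str], checkpoint_keys: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected): keys only in the model, and keys only in the checkpoint."""
    model_set, checkpoint_set = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_set - checkpoint_set)
    unexpected = sorted(checkpoint_set - model_set)
    return missing, unexpected


def top_level_modules(keys: Iterable[str]) -> List[str]:
    """Collapse parameter names such as 'a.b.weight' to their first component, 'a'."""
    return sorted({key.split(".", 1)[0] for key in keys})


# Hypothetical usage, given a model instance and a loaded checkpoint dict:
# missing, unexpected = diff_keys(model.state_dict().keys(), checkpoint["state_dict"].keys())
# print(top_level_modules(missing), top_level_modules(unexpected))
```

If the unexpected list collapses to a single module name, only that module's weights need to be stripped or renamed in the checkpoint.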


windj007 avatar windj007 commented on May 22, 2024

@marcelsan The list is there, on page 5.

why haven't you used the entire Places Challenge for training Big-Lama?

Bigger datasets need bigger models - and smaller models work better when the dataset is more focused. And Big-LaMa is not that big in terms of number of trainable parameters.


windj007 avatar windj007 commented on May 22, 2024

@Liang-Sen

The Missing key error disappeared, but I still get the Unexpected key error message:

The quick solution is a couple of comments above:

Another possible solution is to just strip loss_segm_pl.impl... from the checkpoint altogether - anyway it is initialized from a fixed ade20k checkpoint.

I should have fixed and re-uploaded the checkpoint, but I have not found the time yet...


windj007 avatar windj007 commented on May 22, 2024

@Liang-Sen thank you!

