
cassle's Introduction

Self-Supervised Models are Continual Learners

This is the official repository for the paper:

Self-Supervised Models are Continual Learners
Enrico Fini*, Victor Turrisi*, Xavier Alameda-Pineda, Elisa Ricci, Karteek Alahari, Julien Mairal
CVPR 2022

Abstract: Self-supervised models have been shown to produce comparable or better visual representations than their supervised counterparts when trained offline on unlabeled data at scale. However, their efficacy is catastrophically reduced in a Continual Learning (CL) scenario where data is presented to the model sequentially. In this paper, we show that self-supervised loss functions can be seamlessly converted into distillation mechanisms for CL by adding a predictor network that maps the current state of the representations to their past state. This enables us to devise a framework for Continual self-supervised visual representation Learning that (i) significantly improves the quality of the learned representations, (ii) is compatible with several state-of-the-art self-supervised objectives, and (iii) needs little to no hyperparameter tuning. We demonstrate the effectiveness of our approach empirically by training six popular self-supervised models in various CL settings.
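
As a rough illustration of the idea (a minimal PyTorch sketch with illustrative names and dimensions, not the exact implementation in this repository): a frozen copy of the backbone saved at the end of the previous task provides the past state of the representations, and a predictor network g maps the current representations into that space, so the same SSL loss can be reused as a distillation term.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def ssl_loss(a, b):
    # stand-in for any self-supervised objective (SimCLR, BYOL, Barlow Twins, ...);
    # negative cosine similarity is used here purely for illustration
    return -F.cosine_similarity(a, b, dim=-1).mean()

# current (trainable) backbone and a frozen snapshot from the end of the previous task
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
frozen_encoder = copy.deepcopy(encoder).eval()
for p in frozen_encoder.parameters():
    p.requires_grad = False

# predictor g maps the new feature space to the old one
g = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)  # two augmented views of a batch
z1, z2 = encoder(x1), encoder(x2)
with torch.no_grad():
    z1_bar = frozen_encoder(x1)  # past state of the representations

# usual SSL loss between the two views, plus the same loss reused as distillation through g
loss = ssl_loss(z1, z2) + ssl_loss(g(z1), z1_bar)
loss.backward()

In the CaSSLe commands further below, the checkpoint passed via PRETRAINED_PATH roughly plays the role of this frozen snapshot of the previous task.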


Overview of our method and results

NOTE: most of the code in this repository is borrowed from solo-learn

Installation

Use the following commands to create an environment and install the required packages (needs conda):

conda create --name cassle python=3.8
conda activate cassle
conda install pytorch=1.10.2 torchvision cudatoolkit=11.3 -c pytorch
pip install pytorch-lightning==1.5.4 lightning-bolts wandb sklearn einops
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110

Remember to check your CUDA version and modify the install commands accordingly.

OPTIONAL: consider installing pillow-SIMD for faster data loading:

pip uninstall pillow
CC="cc -mavx2" pip install -U --force-reinstall pillow-simd

Commands

Below you can find a few example commands for running our code. The bash scripts with full training configurations for our continual and linear evaluation experiments can be found in the bash_files folder. Use our job_launcher.py to launch continual self-supervised learning experiments. We also provide example code for launching jobs with SLURM, where you can pass the desired configuration for your job (bash script, data directory, number of GPUs, walltime, etc.).

NOTE: each experiment uses a different number of GPUs (1 for CIFAR100, 2 for ImageNet100, and 4 for DomainNet). You can change this setting directly in the bash scripts.

Fine-tuning

CIFAR100

E.g. running Barlow Twins:

DATA_DIR=/path/to/data/dir/ CUDA_VISIBLE_DEVICES=0 python job_launcher.py --script bash_files/continual/cifar/barlow.sh

ImageNet100

Class-incremental

E.g. running BYOL:

DATA_DIR=/path/to/data/dir/ CUDA_VISIBLE_DEVICES=0,1 python job_launcher.py --script bash_files/continual/imagenet-100/class/byol.sh

Data-incremental

E.g. running SimCLR:

DATA_DIR=/path/to/data/dir/ CUDA_VISIBLE_DEVICES=0,1 python job_launcher.py --script bash_files/continual/imagenet-100/data/simclr.sh

DomainNet

E.g. running SwAV:

DATA_DIR=/path/to/data/dir/ CUDA_VISIBLE_DEVICES=0,1,2,3 python job_launcher.py --script bash_files/continual/domainnet/swav.sh

CaSSLe

After running fine-tuning, you can also run CaSSLe by just loading the checkpoint of the first task. You will find all the checkpoints in your experiment directory (defaults to "./experiments"). Check the ID of your run on WandB to make sure you are loading the correct checkpoint.

CIFAR100

E.g. running Barlow Twins + CaSSLe:

PRETRAINED_PATH=/path/to/task0/checkpoint/ DATA_DIR=/path/to/data/dir/ CUDA_VISIBLE_DEVICES=0 python job_launcher.py --script bash_files/continual/cifar/barlow_distill.sh

ImageNet100

Class-incremental

E.g. running BYOL + CaSSLe:

PRETRAINED_PATH=/path/to/task0/checkpoint/ DATA_DIR=/path/to/data/dir/ CUDA_VISIBLE_DEVICES=0,1 python job_launcher.py --script bash_files/continual/imagenet-100/class/byol_distill.sh

Data-incremental

E.g. running SimCLR + CaSSLe:

PRETRAINED_PATH=/path/to/task0/checkpoint/ DATA_DIR=/path/to/data/dir/ CUDA_VISIBLE_DEVICES=0,1 python job_launcher.py --script bash_files/continual/imagenet-100/data/simclr_distill.sh

DomainNet

E.g. running SwAV + CaSSLe:

PRETRAINED_PATH=/path/to/task0/checkpoint/ DATA_DIR=/path/to/data/dir/ CUDA_VISIBLE_DEVICES=0,1,2,3 python job_launcher.py --script bash_files/continual/domainnet/swav_distill.sh

Linear Evaluation

For linear evaluation you do not need the job launcher. You can simply run the scripts from bash_files/linear, e.g., for VICReg:

PRETRAINED_PATH=/path/to/last/checkpoint/ DATA_DIR=/path/to/data/dir/ bash bash_files/linear/imagenet-100/class/vicreg_linear.sh

Logging

Logging is performed with WandB. Please create an account and specify your --entity YOUR_ENTITY and --project YOUR_PROJECT in the bash scripts. For debugging, or if you do not want all the perks of WandB, you can disable logging by passing --offline in your bash scripts. After training you can always sync an offline run with the following command: wandb sync your/wandb/run/folder.

Citation

If you like our work, please cite our paper:

@inproceedings{fini2021self,
  title={Self-Supervised Models are Continual Learners},
  author={Fini, Enrico and da Costa, Victor G Turrisi and Alameda-Pineda, Xavier and Ricci, Elisa and Alahari, Karteek and Mairal, Julien},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}

cassle's People

Contributors

donkeyshot21, vturrisi


cassle's Issues

Some questions about training and evaluation process

Hello,

Thank you for your fantastic project! I have some questions regarding model evaluation.
1) Taking CIFAR10 as an example, if there are 2 tasks, each with 5 classes, is the process shown in the following figure correct?

[screenshot]

2) If it is correct, after the self-supervised continual learning part is completed, a 10-class classifier will be trained. When training this 10-class classifier, will all the data from all categories be used simultaneously?

3) Additionally, what is the overall process for Fine-tuning (using Table 2 as an example, Strategy 1 Fine-tuning)? Is it to replace CaSSLe with a non-continual-learning SSL method?

[screenshot]

Thanks!

KNN Classifier Issue

Hi,
I found this work very interesting and plan to work on similar topics. However, I encountered some issues:
(1) For the fine-tuning example with Barlow Twins and CIFAR-100, should it be barlow.sh instead of barlow_distill.sh? Otherwise, we need to provide the pretrained model in order to successfully run the code.
(2) If I enable the KNN online evaluation by setting disable_knn_eval = False, there is an issue about empty test features and an expected argument in base.py line 432. I saw a previous closed issue reporting something similar, but it still appears even if I set a meaningful online_eval_batch_size = 256.
Thanks for your help!

About the Forward Transfer

Hi,
Thanks for your excellent work!
I'm curious about how to calculate the "Forward Transfer" after training. For example, I have successfully reproduced the class-incremental results for Fine-tuning and CaSSLe (with BYOL) on CIFAR-100 but don't know how to directly check the FT results. Does it need a separate run to obtain the "linear evaluation accuracy of a random network", as stated in the paper?
By the way, just to be sure, is it right to directly read the "val_acc1" values from the WandB dashboard as the final linear evaluation accuracy?

Why did you load the checkpoint of task 0 before training CaSSLe?

Hello,
Congratulations on your excellent work. I have a question about the training setting.

Why did you load the checkpoint of task 0 before training CaSSLe? I see that the first task of CaSSLe is trained without distillers, so the setting is the same as the first task of fine-tuning. I think loading the checkpoint is unnecessary.

Looking forward to your reply,
Thanks.

Some question about lower and upper bounds

Hi,

[screenshot]

I have some questions regarding the calculation of upper and lower bounds, taking class incremental learning as an example:

In supervised learning, the lower bound (Fine-tuning) is performed in a task-specific manner, i.e., Task 1 fine-tuning -> Task 2 fine-tuning ...; whereas the upper bound (offline) involves training a model by integrating all the data together.

Regarding SimCLR, my understanding is that the lower bound (Fine-tuning) corresponds to the SSL (self-supervised learning) stage, where it undergoes Task 1 SSL -> Task 2 SSL ..., followed by Linear Evaluation. The upper bound (Offline) involves performing SSL on the entire dataset and then conducting Linear Evaluation. Is my understanding correct?

Bug in online KNN eval?

Hello,

Congrats on your paper! It touches on very interesting questions, and I'd love to further study the problem of CSSL!

I am trying to execute your script for training Barlow Twins (python job_launcher.py --script bash_files/continual/cifar/barlow_distill.sh), but I might have encountered a bug: if I train with the WeightedKNNClassifier for performance monitoring, your code calls its forward here with only the train_features and target_features provided.
After that, the compute function breaks down here at line 89 because self.test_features is an empty list.

Am I getting something wrong? I am working in a new conda env with setup as specified in your README file.

Many thanks!

The data for Linear Evaluation Accuracy.

Hi,

This is an exciting and enlightening work.

I wonder where the data for training the classifier comes from for the linear evaluation accuracy.
The training data of the current task?

Question about contrastive distillation loss

Hi,

I have a few questions about the simclr code.

  1. logits = torch.einsum("if, jf -> ij", p, z) / temperature

    It seems that the predicted features (p) are not in the negatives, which is different from what is suggested in the paper (appendix B). I understand that you switch p and z here (for a symmetric loss?)

    distill_loss = (
        simclr_distill_loss_func(p1, p2, frozen_z1, frozen_z2, self.distill_temperature)
        + simclr_distill_loss_func(frozen_z1, frozen_z2, p1, p2, self.distill_temperature)
    ) / 2

    but there are still no comparisons between different samples in p.

  2. In the paper, the distillation loss is applied to the two views independently. Based on the code above, does that mean we should use them jointly to reproduce the results?

  3. logit_mask = torch.ones_like(pos_mask, device=device)
    logit_mask.fill_diagonal_(True)
    logit_mask[:, b:].fill_diagonal_(True)
    logit_mask[b:, :].fill_diagonal_(True)

    The four lines of code here seem to make logit_mask an all-ones matrix. In my understanding, we should assign the diagonals to False (a quick standalone check of this is sketched below). Am I missing something?
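
    For context, a quick standalone check (illustrative only, using a small dummy pos_mask) shows that filling the diagonals of an all-ones boolean matrix with True leaves every entry True:

    import torch

    b = 4
    pos_mask = torch.zeros(2 * b, 2 * b, dtype=torch.bool)  # dummy stand-in for the real pos_mask

    logit_mask = torch.ones_like(pos_mask)
    logit_mask.fill_diagonal_(True)
    logit_mask[:, b:].fill_diagonal_(True)
    logit_mask[b:, :].fill_diagonal_(True)

    print(logit_mask.all())  # tensor(True): the mask is still all ones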

TIA

Forward Transfer Issue

Hi! We are following your excellent work.

We would like to know more clearly the details of your experiments on CIFAR-100 to calculate Forward Transfer, such as how the accuracy of the random model on each task is obtained.

If we understand correctly, since the random seed is fixed, the accuracy of the random model should be fixed as well. Would it be possible to provide the accuracy of the random model on the five tasks for reference?

Thanks!

train data on DomainNet

I'm wondering whether the training procedure provided for DomainNet is correct: from main_pretrain.py, it looks like trainer.fit() is only used on the validation data, so it seems to be validation rather than training. Moreover, is the DomainNet data handled the same way as the DALI data?

DomainNet dataset version

Hi,

Very interesting work.
Did you use the cleaned version of DomainNet or the original one?
The cleaned version excludes a lot of duplicate images.

Thanks

Need for checkpoints: Barlow Twins and VICReg

Hi all,

Is there any link where I can access the checkpoints of the models trained using Barlow Twins and VICReg?

I would like to evaluate this approach using different models and need the last trained checkpoints of these models.

Thanks.

The classifier for Linear Evaluation Accuracy

Hi,

This is an exciting and enlightening work.

I am confused by the number of classifiers for Linear Evaluation Accuracy.

In the paper, you said, "For class-incremental and data-incremental, we use the task-agnostic setting, meaning that at evaluation time we do not assume to know the task ID". As I understand it, this means that you only maintain one classifier and continuously optimize it after learning each task for linear evaluation accuracy.

However, I found in #1 that you said, "as we operate in the class-incremental setting we train one linear classifier per task."

I would appreciate a clearer explanation.

Thanks.

problem with reproducibility

Hi,

thanks for your interesting work.

I have problems reproducing the results.

  1. Did you use DALI for all your experiments? Can we trust the results of the regular data loader? I'm getting 6-7% accuracy drops on ImageNet when switching from DALI to the regular data loader (I needed to run the regular one for fair comparisons with my method).
  2. Also I think there is a problem here
    args.lr = args.lr * args.batch_size * len(args.gpus) / 256

    why is 256 hardcoded here? (see the sketch after this list)
  3. It would be nice to mention in the README that the batch size needs to be modified based on the number of GPUs.
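
For reference, the quoted line looks like the common linear learning-rate scaling rule, where 256 is a reference batch size and the base LR is scaled by the effective batch size (per-GPU batch size times the number of GPUs); a minimal illustration with hypothetical values:

    # illustrative only: linear LR scaling relative to a reference batch size of 256
    base_lr = 0.3                                # hypothetical base LR defined for a batch of 256
    batch_size, num_gpus = 128, 2                # hypothetical per-GPU batch size and GPU count
    effective_batch = batch_size * num_gpus      # 256
    lr = base_lr * effective_batch / 256         # 0.3: unchanged when the effective batch is 256

    # with 4 GPUs the effective batch doubles, so the LR doubles as well
    lr_4gpus = base_lr * (batch_size * 4) / 256  # 0.6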

Thanks

Clarification on the Role of the Predictor Network g in the CaSSLe Framework

Hi,

Thank you very much for your excellent work. I have a theoretical question regarding the role of the predictor network g as described in section 5, "The CaSSLe Framework":

Now, our goal is to ensure that z contains at least as much information as (and ideally more than) z̄. Instead of enforcing the two feature vectors to be similar, and hence discouraging the new model from learning new concepts, we propose to use a predictor network g to project the representations from the new feature space to the old one. If the predictor is able to perfectly map from one space to the other, then it implies that z is at least as powerful as z̄.

Could you please clarify how being able to project from the new feature space to the old one using g ensures that z will prevent the loss of information from the old task? Additionally, if this is the case, how do you enforce this without having access to some old data (replay), especially considering the significant visual differences in some datasets, such as DomainNet, where visual features can vary greatly between domains?

Thank you.

How did you train the classifier?

Hello,

I have read your paper. It is very impressive. I got a question for class incremental setting and am wondering to know if you can answer.

Did you train a classifier for each task only during the embedding training process, or did you re-train all classifiers after the embedding training for all tasks finished? I see that the embedding of a previous task may change after the next task is trained, so how can the old classifier, trained on the old embedding, handle this changed embedding? Your paper mentions "a subset, e.g., 10% of the data"; does this mean using 10% of the data to retrain the classifier at the very end?

Looking forward to your kind reply.

Thanks.

Difficulty in Reproducing the Paper Results (e.g., Table 4: BYOL reported 66%, measured <<60%)

Hi,

Thanks a lot for your amazing work and for releasing the code. I have been trying to reproduce your Table 4 for some time. I directly use the code and the scripts with NO modification.

For example, in this table, the reported performance of BYOL fine-tuning on ImageNet-100 in the 5-task class-incremental setting is 66.0. Instead, I measured well below 60.0, at least 6% lower. Please see the full results table below if interested (a 5 x 5 table).

results.pdf

Any idea what may be causing the gap? Are there any nuances in the evaluation method? For example, for average accuracy, I simply take the mean of the table below across all rows and columns (as also suggested by GEM, which you referenced).

Thanks a lot again for your response and your eye-opening work.
