vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by PyTorch Lightning

License: MIT License

Shell 3.00% Python 97.00%
simclr nvidia-dali contrastive-learning pytorch pytorch-lightning barlow-twins self-supervised-learning swav byol moco simsiam vicreg nnclr dino deepcluster ressl vibcreg transformer-models mae masked-input-prediction

solo-learn's Introduction


solo-learn

A library of self-supervised methods for unsupervised visual representation learning powered by PyTorch Lightning. We aim to provide SOTA self-supervised methods in a comparable environment while, at the same time, implementing training tricks. The library is self-contained, but it is possible to use the models outside of solo-learn. More details are in our paper.


News

  • [Jan 14 2024]: πŸ‘ Bunch of stability improvements during 2023 :) Also added All4One.
  • [Jan 07 2023]: 🀿 Added results, checkpoints and configs for MAE on ImageNet. Thanks to HuangChiEn.
  • [Dec 31 2022]: 🌠 Shiny new logo! Huge thanks to Luiz!
  • [Sep 27 2022]: πŸ“ Brand new config system using OmegaConf/Hydra. Adds more clarity and flexibility. New tutorials will follow soon!
  • [Aug 04 2022]: πŸ–ŒοΈ Added MAE and support for finetuning the backbone with main_linear.py, mixup, cutmix and random augment.
  • [Jul 13 2022]: πŸ’– Added support for H5 data, improved scripts and data handling.
  • [Jun 26 2022]: πŸ”₯ Added MoCo V3.
  • [Jun 10 2022]: πŸ’£ Improved LARS.
  • [Jun 09 2022]: 🍭 Added support for WideResnet, multicrop for SwAV and equalization data augmentation.
  • [May 02 2022]: πŸ’  Wrapped Dali with a DataModule, added auto resume for linear eval and Wandb run resume.
  • [Apr 12 2022]: 🌈 Improved design of models and added support to train with a fraction of data.
  • [Apr 01 2022]: πŸ” Added the option to use channel last conversion which considerably decreases training times.
  • [Feb 04 2022]: πŸ₯³ Paper got accepted to JMLR.
  • [Jan 31 2022]: πŸ‘οΈ Added ConvNeXt support with timm.
  • [Dec 20 2021]: 🌑️ Added ImageNet results, scripts and checkpoints for MoCo V2+.
  • [Dec 05 2021]: 🎢 Separated SupCon from SimCLR and added runs.
  • [Dec 01 2021]: β›² Added PoolFormer.
  • [Nov 29 2021]: ‼️ Breaking changes! Update your versions!!!
  • [Nov 29 2021]: πŸ“– New tutorials!
  • [Nov 29 2021]: 🏘️ Added offline K-NN and offline UMAP.
  • [Nov 29 2021]: 🚨 Updated PyTorch and PyTorch Lightning versions. 10% faster.
  • [Nov 29 2021]: 🍻 Added code of conduct, contribution instructions, issue templates and UMAP tutorial.
  • [Nov 23 2021]: πŸ‘Ύ Added VIbCReg.
  • [Oct 21 2021]: 😀 Added support for object recognition via Detectron v2 and auto resume functionality that automatically tries to resume an experiment that crashed or reached a timeout.
  • [Oct 10 2021]: πŸ‘Ή Restructured augmentation pipelines to allow more flexibility and multicrop. Also added multicrop for BYOL.
  • [Sep 27 2021]: πŸ• Added NNSiam, NNBYOL, new tutorials for implementing new methods 1 and 2, more testing and fixed issues with custom data and linear evaluation.
  • [Sep 19 2021]: 🦘 Added online k-NN evaluation.
  • [Sep 17 2021]: πŸ€– Added ViT and Swin.
  • [Sep 13 2021]: πŸ“– Improved Docs and added tutorials for pretraining and offline linear eval.
  • [Aug 13 2021]: 🐳 DeepCluster V2 is now available.

Roadmap and help needed

  • Redoing the documentation to improve clarity.
  • Better and up-to-date tutorials.
  • Add performance-related testing to ensure that methods perform the same across updates.
  • Adding new methods (continuous effort).

Methods available


Extra flavor

Backbones

Data

  • Increased data processing speed by up to 100% using Nvidia Dali.
  • Flexible augmentations.

Evaluation

  • Online linear evaluation via stop-gradient for easier debugging and prototyping (optionally available for the momentum backbone as well); see the sketch after this list.
  • Standard offline linear evaluation.
  • Online and offline K-NN evaluation.
  • Automatic feature space visualization with UMAP.
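
A minimal sketch of the stop-gradient idea in plain PyTorch (illustrative only, not the library's exact implementation): the linear classifier is trained on detached backbone features, so its supervised loss never reaches the self-supervised backbone.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))  # stand-in backbone
    classifier = nn.Linear(512, 10)

    x = torch.randn(8, 3, 32, 32)
    targets = torch.randint(0, 10, (8,))

    feats = backbone(x)
    logits = classifier(feats.detach())  # stop-gradient: no signal flows back to the backbone
    class_loss = F.cross_entropy(logits, targets)
    # total_loss = ssl_loss + class_loss -> only `classifier` receives gradients from class_loss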

Training tricks

  • All the perks of PyTorch Lightning (mixed precision, gradient accumulation, clipping, and much more).
  • Channel last conversion (see the sketch after this list, which also covers the weight-decay exclusion).
  • Multi-cropping dataloading following SwAV:
    • Note: currently, only SimCLR, BYOL and SwAV support this.
  • Exclude batchnorm and biases from weight decay and LARS.
  • No LR scheduler for the projection head (as in SimSiam).
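
Minimal sketches of two of the tricks above in plain PyTorch (illustrative only, not the library's exact code): channel-last memory format and excluding biases/normalization parameters from weight decay.

    import torch
    import torchvision

    # channel last conversion for model and inputs
    model = torchvision.models.resnet18().to(memory_format=torch.channels_last)
    images = torch.randn(8, 3, 224, 224).to(memory_format=torch.channels_last)
    _ = model(images)

    # exclude biases and 1-D (normalization) parameters from weight decay
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        (no_decay if p.ndim == 1 or name.endswith(".bias") else decay).append(p)
    optimizer = torch.optim.SGD(
        [{"params": decay, "weight_decay": 1e-4}, {"params": no_decay, "weight_decay": 0.0}],
        lr=0.3,
        momentum=0.9,
    )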

Logging

  • Metric logging on the cloud with WandB
  • Custom model checkpointing with a simple file organization.

Requirements

  • torch
  • torchvision
  • tqdm
  • einops
  • wandb
  • pytorch-lightning
  • lightning-bolts
  • torchmetrics
  • scipy
  • timm

Optional:

  • nvidia-dali
  • matplotlib
  • seaborn
  • pandas
  • umap-learn

Installation

First clone the repo.

Then, to install solo-learn with Dali and/or UMAP support, use:

pip3 install .[dali,umap,h5] --extra-index-url https://developer.download.nvidia.com/compute/redist

If no Dali/UMAP/H5 support is needed, the repository can be installed as:

pip3 install .

For local development:

pip3 install -e .[umap,h5]
# Make sure you have pre-commit hooks installed
pre-commit install

NOTE: if you are having trouble with dali, install it following their guide.

NOTE 2: consider installing Pillow-SIMD for better loading times when not using Dali.

NOTE 3: Soon to be on pip.


Training

For pretraining the backbone, follow one of the many bash files in scripts/pretrain/. We are now using Hydra to handle the config files, so the common syntax is something like:

# --config-path: path to the folder with the training configs
# --config-name: name of the training config
# Extra arguments (e.g., ones not defined in the yaml files) can be appended
# with ++new_argument=VALUE; PyTorch Lightning's arguments can be added here as well.
python3 main_pretrain.py \
    --config-path scripts/pretrain/imagenet-100/ \
    --config-name barlow.yaml

After that, for offline linear evaluation, follow the examples in scripts/linear, or those in scripts/finetune for finetuning the whole backbone.

For k-NN evaluation and UMAP visualization check the scripts in scripts/{knn,umap}.

NOTE: the config files aim to stay up to date and follow the recommended parameters of each paper as closely as possible, but check them before running.


Tutorials

Please check out our documentation and tutorials:

If you want to contribute to solo-learn, make sure you take a look at how to contribute and follow the code of conduct.


Model Zoo

All available pretrained models can be downloaded directly via the tables below or programmatically by running one of the following scripts: zoo/cifar10.sh, zoo/cifar100.sh, zoo/imagenet100.sh and zoo/imagenet.sh.
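
The snippet below is a hedged sketch of loading one of these checkpoints into a plain torchvision backbone for downstream use. It assumes a ResNet-18 ImageNet-100 checkpoint with a Lightning-style state_dict whose backbone weights are stored under keys prefixed with "backbone."; the filename and prefix are assumptions, so inspect the keys of the file you actually download (older checkpoints may use a different prefix, and CIFAR checkpoints use a modified 3x3 stem).

    import torch
    from torchvision.models import resnet18

    ckpt = torch.load("byol-400ep-imagenet100.ckpt", map_location="cpu")  # hypothetical filename
    state = ckpt.get("state_dict", ckpt)
    prefix = "backbone."
    backbone_state = {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}

    backbone = resnet18()
    backbone.fc = torch.nn.Identity()  # the self-supervised checkpoint has no supervised head
    missing, unexpected = backbone.load_state_dict(backbone_state, strict=False)
    print("missing:", missing, "unexpected:", unexpected)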


Results

Note: hyperparameters may not be optimal; we will eventually re-run the methods that currently show lower performance.

CIFAR-10

Method Backbone Epochs Dali Acc@1 Acc@5 Checkpoint
All4One ResNet18 1000 ❌ 93.24 99.88 πŸ”—
Barlow Twins ResNet18 1000 ❌ 92.10 99.73 πŸ”—
BYOL ResNet18 1000 ❌ 92.58 99.79 πŸ”—
DeepCluster V2 ResNet18 1000 ❌ 88.85 99.58 πŸ”—
DINO ResNet18 1000 ❌ 89.52 99.71 πŸ”—
MoCo V2+ ResNet18 1000 ❌ 92.94 99.79 πŸ”—
MoCo V3 ResNet18 1000 ❌ 93.10 99.80 πŸ”—
NNCLR ResNet18 1000 ❌ 91.88 99.78 πŸ”—
ReSSL ResNet18 1000 ❌ 90.63 99.62 πŸ”—
SimCLR ResNet18 1000 ❌ 90.74 99.75 πŸ”—
SimSiam ResNet18 1000 ❌ 90.51 99.72 πŸ”—
SupCon ResNet18 1000 ❌ 93.82 99.65 πŸ”—
SwAV ResNet18 1000 ❌ 89.17 99.68 πŸ”—
VIbCReg ResNet18 1000 ❌ 91.18 99.74 πŸ”—
VICReg ResNet18 1000 ❌ 92.07 99.74 πŸ”—
W-MSE ResNet18 1000 ❌ 88.67 99.68 πŸ”—

CIFAR-100

Method Backbone Epochs Dali Acc@1 Acc@5 Checkpoint
All4One ResNet18 1000 ❌ 72.17 93.35 πŸ”—
Barlow Twins ResNet18 1000 ❌ 70.90 91.91 πŸ”—
BYOL ResNet18 1000 ❌ 70.46 91.96 πŸ”—
DeepCluster V2 ResNet18 1000 ❌ 63.61 88.09 πŸ”—
DINO ResNet18 1000 ❌ 66.76 90.34 πŸ”—
MoCo V2+ ResNet18 1000 ❌ 69.89 91.65 πŸ”—
MoCo V3 ResNet18 1000 ❌ 68.83 90.57 πŸ”—
NNCLR ResNet18 1000 ❌ 69.62 91.52 πŸ”—
ReSSL ResNet18 1000 ❌ 65.92 89.73 πŸ”—
SimCLR ResNet18 1000 ❌ 65.78 89.04 πŸ”—
SimSiam ResNet18 1000 ❌ 66.04 89.62 πŸ”—
SupCon ResNet18 1000 ❌ 70.38 89.57 πŸ”—
SwAV ResNet18 1000 ❌ 64.88 88.78 πŸ”—
VIbCReg ResNet18 1000 ❌ 67.37 90.07 πŸ”—
VICReg ResNet18 1000 ❌ 68.54 90.83 πŸ”—
W-MSE ResNet18 1000 ❌ 61.33 87.26 πŸ”—

ImageNet-100

Method Backbone Epochs Dali Acc@1 (online) Acc@1 (offline) Acc@5 (online) Acc@5 (offline) Checkpoint
All4One ResNet18 400 βœ”οΈ 81.93 - 96.23 - πŸ”—
Barlow Twins πŸš€ ResNet18 400 βœ”οΈ 80.38 80.16 95.28 95.14 πŸ”—
BYOL πŸš€ ResNet18 400 βœ”οΈ 80.16 80.32 95.02 94.94 πŸ”—
DeepCluster V2 ResNet18 400 ❌ 75.36 75.4 93.22 93.10 πŸ”—
DINO ResNet18 400 βœ”οΈ 74.84 74.92 92.92 92.78 πŸ”—
DINO πŸ˜ͺ ViT Tiny 400 ❌ 63.04 TODO 87.72 TODO πŸ”—
MoCo V2+ πŸš€ ResNet18 400 βœ”οΈ 78.20 79.28 95.50 95.18 πŸ”—
MoCo V3 πŸš€ ResNet18 400 βœ”οΈ 80.36 80.36 95.18 94.96 πŸ”—
MoCo V3 πŸš€ ResNet50 400 βœ”οΈ 85.48 84.58 96.82 96.70 πŸ”—
NNCLR πŸš€ ResNet18 400 βœ”οΈ 79.80 80.16 95.28 95.30 πŸ”—
ReSSL ResNet18 400 βœ”οΈ 76.92 78.48 94.20 94.24 πŸ”—
SimCLR πŸš€ ResNet18 400 βœ”οΈ 77.64 TODO 94.06 TODO πŸ”—
SimSiam ResNet18 400 βœ”οΈ 74.54 78.72 93.16 94.78 πŸ”—
SupCon ResNet18 400 βœ”οΈ 84.40 TODO 95.72 TODO πŸ”—
SwAV ResNet18 400 βœ”οΈ 74.04 74.28 92.70 92.84 πŸ”—
VIbCReg ResNet18 400 βœ”οΈ 79.86 79.38 94.98 94.60 πŸ”—
VICReg πŸš€ ResNet18 400 βœ”οΈ 79.22 79.40 95.06 95.02 πŸ”—
W-MSE ResNet18 400 βœ”οΈ 67.60 69.06 90.94 91.22 πŸ”—

πŸš€ methods where hyperparameters were heavily tuned.

πŸ˜ͺ ViT is very compute intensive and unstable, so we are slowly scaling up to larger architectures and larger batch sizes. At the moment, the total batch size is 128 and we needed to use float32 precision. If you want to contribute by running it, let us know!

ImageNet

Method Backbone Epochs Dali Acc@1 (online) Acc@1 (offline) Acc@5 (online) Acc@5 (offline) Checkpoint Finetuned Checkpoint
Barlow Twins ResNet50 100 βœ”οΈ 67.18 67.23 87.69 87.98 πŸ”—
BYOL ResNet50 100 βœ”οΈ 68.63 68.37 88.80 88.66 πŸ”—
MoCo V2+ ResNet50 100 βœ”οΈ 62.61 66.84 85.40 87.60 πŸ”—
MAE ViT-B/16 100 ❌ ~ 81.60 (finetuned) ~ 95.50 (finetuned) πŸ”— πŸ”—

Training efficiency for DALI

We report the training efficiency of some methods using a ResNet18, with and without DALI (4 workers per GPU), on a server with an Intel i9-9820X and two RTX 2080 Ti GPUs.

Method Dali Total time for 20 epochs Time for 1 epoch GPU memory (per GPU)
Barlow Twins ❌ 1h 38m 27s 4m 55s 5097 MB
Barlow Twins βœ”οΈ 43m 2s 2m 10s (56% faster) 9292 MB
BYOL ❌ 1h 38m 46s 4m 56s 5409 MB
BYOL βœ”οΈ 50m 33s 2m 31s (49% faster) 9521 MB
NNCLR ❌ 1h 38m 30s 4m 55s 5060 MB
NNCLR βœ”οΈ 42m 3s 2m 6s (64% faster) 9244 MB

Note: the GPU memory increase doesn't scale with the model; rather, it scales with the number of DALI workers.


Citation

If you use solo-learn, please cite our paper:

@article{JMLR:v23:21-1155,
  author  = {Victor Guilherme Turrisi da Costa and Enrico Fini and Moin Nabi and Nicu Sebe and Elisa Ricci},
  title   = {solo-learn: A Library of Self-supervised Methods for Visual Representation Learning},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {56},
  pages   = {1-6},
  url     = {http://jmlr.org/papers/v23/21-1155.html}
}

solo-learn's People

Contributors

borda, bryant1410, donkeyshot21, froskekongen, imagones, kaland313, ojss, peppesaccardi, pre-commit-ci[bot], sauravmaheshkar, turian, vturrisi


solo-learn's Issues

Method VICReg outputs ungrouped runs based on classes during training

Wandb could not find a way to manage all classes in one run, as shown in the screenshot below. Is there any way to fix this?
I have tried the Barlow Twins method and everything looks good.
[screenshot omitted]
My config file looks like this:
python3 main_pretrain.py \
    --dataset $1 \
    --encoder resnet18 \
    --data_dir ./datasets \
    --max_epochs 1000 \
    --gpus 0,1,2,3 \
    --precision 16 \
    --optimizer sgd \
    --lars \
    --grad_clip_lars \
    --eta_lars 0.02 \
    --exclude_bias_n_norm \
    --scheduler warmup_cosine \
    --lr 0.3 \
    --weight_decay 1e-4 \
    --batch_size 256 \
    --num_workers 4 \
    --crop_size 32 \
    --min_scale 0.2 \
    --brightness 0.4 \
    --contrast 0.4 \
    --saturation 0.2 \
    --hue 0.1 \
    --solarization_prob 0.1 \
    --gaussian_prob 0.0 0.0 \
    --crop_size 32 \
    --num_crops_per_aug 1 1 \
    --name vicreg-$1 \
    --project solo-learn \
    --entity qiuyanxin \
    --wandb \
    --save_checkpoint \
    --method vicreg \
    --proj_hidden_dim 2048 \
    --proj_output_dim 2048 \
    --sim_loss_weight 25.0 \
    --var_loss_weight 25.0 \
    --cov_loss_weight 1.0 \
    --accelerator ddp

Why is main_pretrain.py using classification_dataloader and not pretrain_dataloader?

Trying to load a custom dataset without labels only for pretraining (not linear evaluation).

I implemented the necessary changes to the pipeline in pretrain_dataloader.py, but upon running main_pretrain.py I realized that it (line 88) calls classification_dataloader.py and not pretrain_dataloader.py as I was expecting. Please advise how to address this for loading such a custom dataset.

Use of the target in the base model

Hello, great work on these models. I have some questions about the implementation. In the base model of all self-supervised methods, there is a [_shared_step function](https://github.com/vturrisi/solo-learn/blob/main/solo/methods/base.py#:~:text=def%20_shared_step(self,1%20and%20acc) used in the forward pass to compute the 'class_loss' of the network, and this 'class_loss' is later added to the other contrastive loss to backprop and update the network. The 'class_loss' is computed using labels from the data and the cross-entropy function, but in self-supervised learning we are not allowed to use labels during pretraining. Maybe I misunderstand your implementation; if that is the case, please let me know where I made the mistake.

Some problems when using the well trained model

Hello!

Thank you for your previous answer.
However, some errors occurred when I load the pretrained SimSiam model for CIFAR-10:
ModuleNotFoundError: No module named 'solo.methods'
and moco for cifar100:
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

When I use simclr, byol, moco for cifar10, there is no such problem.

Configuration for BYOL - Imagenet-1000

Hello!

For the BYOL run on Imagenet-1000, I could not find any bash file for the same in the 'main' branch. Can you please provide the configuration for this setting? I see the results table filled up for BYOL on Imagenet-1000 in the readme.

Thanks a lot!

Separate train and validation datasets without and with labels

Hi there! First of all nice work on the library, looks great.

I am in the special situation where I have a large dataset for pretraining but without any labels. I would still like to do linear and k-NN eval online on multiple smaller, related labelled datasets to measure transfer performance. This is different from the standard benchmarks where we train and evaluate on splits of the same (usually labelled) dataset, but I think should be a very common problem. It seems like this is not possible in solo-learn at the moment, is this correct? If yes, what changes would be necessary?

Thank you.

How to use these well trained model

I appreciate your work, but I have some small questions.
Is the pretrained model you provide on the homepage the encoder only, or the encoder plus the linear classifier?
How should I use these models?

saving model checkpoints and initializing using pretrained models

  1. Currently, the model checkpoints are saved at the end of the validation epoch (if epoch % self.frequency == 0: self.save(trainer)) and are skipped if only training data is used without validation data. Can the model checkpoint be saved at the end of the training epoch when val data is not used?

  2. How can we initialize from pretrained (ImageNet / other datasets) ResNet-18 and ResNet-50 models at the start of training?
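
For question 2, a minimal sketch (not solo-learn's built-in API) of one way to warm-start from ImageNet-pretrained torchvision weights before training:

    import torch
    from torchvision.models import ResNet18_Weights, resnet18

    backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)  # older torchvision: resnet18(pretrained=True)
    backbone.fc = torch.nn.Identity()  # drop the supervised classification head
    # hand this backbone (or its state_dict) to the method you want to pretrain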

Problem about using repo

I tried to install the repo and run the provided scripts, but I ran into package dependency errors. I also tried a fresh conda env with the given requirements file and it still failed. Any idea how to fix this?

Thanks so much!
Best.

[New method suggestion] Augmentation-Augmented Variational Autoencoders

Hi guys!

I was wondering whether the new method AAVAEs proposed by William Falcon et al. can interest you and thus be implemented in your library or not.

The authors of the paper claim that autoencoding is a viable third family of self-supervised learning approaches, in addition to contrastive and non-contrastive learning, although their method failed to outperform or perform comparably to the existing families of self-supervised learning algorithms.

I really enjoyed the idea behind this method and I wanted to strongly suggest it to you!

Pass data augmentation hyperparameters as arguments

Is it possible to add these augmentation hyperparameters (color_jitter_prob: float = 0.8, gray_scale_prob: float = 0.8, horizontal_flip_prob: float = 0.5) as arguments that can be passed to the code, similar to the other parameters (gaussian_prob and solarization_prob)?

SimSiam results on ImageNet.

Can you share your results on the full ImageNet-1k dataset? I think the results on ImageNet-100 are not sufficient to prove that the implementation is right.

My implementation here can reach 65% top-1 acc on ImageNet-1k (using MoCo v2's linear evaluation protocol):
https://github.com/poodarchu/SelfSup/blob/master/examples/simsiam/SimSiam.res50.imagenet.256bs.224size.100e.lin_cls/README.md. As reported in the paper, this number should be ~66.7%.

Moreover, my results on CIFAR match the performance in the paper exactly.

tutorial

Hi

Is there any tutorial for this repo that one can use?

Thanks a lot :)

Loss curves for VICReg

Hi,

Thank you very much for the excellent work on bringing all the different self-supervised models together in one place, and for the high code quality.

I am currently working on the VICReg loss and was wondering if you have any loss curves for the invariance/covariance/variance losses that you could please share? It would be really helpful for me to understand the behaviour of these losses.

Looking forward to your help.

Thank You.

Best Regards,
Anuj

Pretrain on custom dataset without labels

I am looking to pretrain a model on images for which labels are not available/defined. For that, I'm trying to replicate the DataLoader of other datasets in the catalog (like CIFAR-10), but it seems the methods are implemented to return labels along with the images. Could you please guide me on how/what to tweak in order to make the training pipeline work without labels?
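
A minimal sketch (not solo-learn's API; the class name and file-extension set are assumptions) of one way to wrap an unlabeled image folder so that it returns a dummy target of -1, which keeps label-consuming code paths working:

    from pathlib import Path
    from PIL import Image
    from torch.utils.data import Dataset

    class UnlabeledImageFolder(Dataset):
        def __init__(self, root, transform=None):
            exts = {".jpg", ".jpeg", ".png"}
            self.paths = sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in exts)
            self.transform = transform

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, index):
            img = Image.open(self.paths[index]).convert("RGB")
            if self.transform is not None:
                img = self.transform(img)
            return img, -1  # dummy target so label-consuming code still runs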

barlowtwins train input question?

I'm very glad to see this amazing project, and I found that it supports Barlow Twins.
So I read the code and have one question:

why are the output feats used as the Barlow Twins input?

feats1, feats2 = out["feats"]

and why use feats from different times?
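
For context, a self-contained sketch (independent of solo-learn's code, with made-up dimensions) of the Barlow Twins objective: the backbone feats of the two augmented views are mapped through a projector, and the loss is computed on those projected embeddings, which is why the feats of both views are the method's input.

    import torch
    import torch.nn as nn

    def barlow_twins_loss(z1, z2, lamb=5e-3):
        # standardize each embedding dimension across the batch
        z1 = (z1 - z1.mean(0)) / z1.std(0)
        z2 = (z2 - z2.mean(0)) / z2.std(0)
        n = z1.size(0)
        c = (z1.T @ z2) / n  # cross-correlation matrix (d x d)
        diag = torch.diagonal(c)
        on_diag = (diag - 1).pow(2).sum()
        off_diag = c.pow(2).sum() - diag.pow(2).sum()
        return on_diag + lamb * off_diag

    projector = nn.Sequential(nn.Linear(512, 2048), nn.BatchNorm1d(2048), nn.ReLU(), nn.Linear(2048, 2048))
    feats1, feats2 = torch.randn(32, 512), torch.randn(32, 512)  # backbone feats of the two augmented views
    loss = barlow_twins_loss(projector(feats1), projector(feats2))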

Configuration for linear eval of BYOL on cifar-10/100

Hi!

I tried a slightly modified version of BYOL on CIFAR-10/100. The results of the additional class classifier used during pretraining are 1% (CIFAR-10) and 6% (CIFAR-100) better than those of the classifier trained with the final backbone during linear evaluation. This might be because we used the IN-100 configuration of BYOL for linear evaluation. The CIFAR-10/100 configuration for linear evaluation is not provided. It would be very helpful if you could share the respective configurations for CIFAR.

Thanks!

Which ImageNet-100?

Hi all,

First of all, thank you so much for creating this library, I have found it to be super useful for my own research!

I was wondering if you could provide some details on the ImageNet-100 dataset that you used? I cannot seem to find any "standard" ImageNet-100 dataset for downloading on the internet and the papers that use this dataset (eg. 1 and 2) seem to randomly select 100 classes from the dataset.

Any of your help would be much appreciated!

targets = batch[-1]

Hi, thanks for your sharing.
What is the meaning of the batch?

Thanks

train on custom dataset

Hi, I'm trying to train on a custom dataset (~150k images)
with bash_files/pretrain/custom/byol.sh.
I change

--train_dir path_to_dir_with_all_images

and add the --no_labels flag.
The run fails with CUDA out of memory,
and in the training printout I see

1 | classifier         | Linear                | 306 M 

Debugging the code, I see in the line

self.classifier = nn.Linear(self.features_dim, num_classes)

that num_classes = #num_of_images (~150k).
Is that what it is supposed to be?
In case we pass the no_labels flag, don't we want to remove the classifier layer?
Thanks for the help.

How to continue the training?

My training process crashed due to a power problem, and now I need to continue the training.
But I don't know how to do it, since this is the first time I am using pytorch-lightning.

Thank you for the help.
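
A self-contained sketch of resuming from a checkpoint with generic PyTorch Lightning (the toy module is illustrative only, not solo-learn's code; recent Lightning versions take ckpt_path in trainer.fit, older ones use Trainer(resume_from_checkpoint=...)):

    import torch
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, TensorDataset

    class ToyModule(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.mse_loss(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)

    trainer = pl.Trainer(max_epochs=2)  # the run that gets interrupted
    trainer.fit(ToyModule(), loader)
    ckpt_path = trainer.checkpoint_callback.best_model_path  # path of the saved checkpoint

    pl.Trainer(max_epochs=4).fit(ToyModule(), loader, ckpt_path=ckpt_path)  # resume and keep training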

queue size

Hi, thanks for your work.
I have a question about how to set the queue size properly: for example, if a dataset has 13,000 samples, can the queue size be set to 65536?

Handling of custom datasets

Can I use this code base to run the methods on multispectral images?
I have to handle the transformations differently.
The data will be .tif files rather than .png.
The backbones will be slightly different, since the first conv layer will have n in_channels compared to 3 in plain RGB images.
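
A minimal sketch (not solo-learn's API; the band count is an illustrative assumption) of adapting a torchvision ResNet-18 to n-channel input by swapping its first convolution:

    import torch.nn as nn
    from torchvision.models import resnet18

    n_channels = 13  # e.g., multispectral bands; illustrative value
    backbone = resnet18()
    backbone.conv1 = nn.Conv2d(n_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)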

Fail to run on custom dataset

@vturrisi Hi vturrisi, I have run the BYOL method on ImageNet-100. However, when I try to run it on a custom dataset, it throws the following error:

ValueError: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch
Traceback (most recent call last):

I use the script custom/byol.sh. The DataLoader seems to have some problem with the custom dataset.

The parameters are as follows:

python3 ../../../main_pretrain.py \
    --dataset custom \
    --encoder resnet18 \
    --data_dir /raid/yuanyong/datasets \
    --train_dir imagenet100_data/train \
    --no_labels \
    --max_epochs 400 \
    --gpus 0,1,2,3,4,5,6,7 \
    --distributed_backend ddp \
    --sync_batchnorm \
    --precision 16 \
    --optimizer sgd \
    --lars \
    --grad_clip_lars \
    --eta_lars 0.02 \
    --exclude_bias_n_norm \
    --scheduler warmup_cosine \
    --lr 1.0 \
    --classifier_lr 0.1 \
    --weight_decay 1e-5 \
    --batch_size 128 \
    --num_workers 8 \
    --brightness 0.4 \
    --contrast 0.4 \
    --saturation 0.2 \
    --hue 0.1 \
    --gaussian_prob 1.0 0.1 \
    --solarization_prob 0.0 0.2 \
    --name byol-400ep-custom \
    --entity unitn-mhug \
    --project solo-learn-custom \
    --wandb \
    --method byol \
    --output_dim 256 \
    --proj_hidden_dim 4096 \
    --pred_hidden_dim 8192 \
    --base_tau_momentum 0.99 \
    --final_tau_momentum 1.0

I use ImageNet-100 as the custom dataset to validate the training process.

Query

This may be a very stupid question, but I found that you always use a class loss along with the main loss function mentioned in the paper. The class loss is a CE loss between targets and predicted logits.
Is this some form of supervised contrastive learning, or am I reading the code wrong?

nvjpeg memory allocation failure

Hi,

I ran into an issue that the pretraining script crashes after 8.5 epochs due to an allocation failure. I am guessing there might be a memory leak somewhere.

Details:

  • Nvidia Titan V GPU (12GB)
  • Using Nvidia Dali
  • commit 85b888a (I will do another pull, rerun and post the result, just to be sure.)
  • arguments: python3 main_pretrain.py --dataset imagenet --encoder resnet50 --data_dir /data --train_dir imagenet/train --val_dir imagenet/val --max_epochs 100 --gpus 0 --distributed_backend ddp --sync_batchnorm --precision 16 --optimizer sgd --scheduler warmup_cosine --lr 0.5 --classifier_lr 0.1 --weight_decay 1e-5 --batch_size 48 --num_workers 12 --brightness 0.4 --contrast 0.4 --saturation 0.4 --hue 0.1 --zero_init_residual --name simsiam-resnet50-100ep-imagenet --dali --entity tomsal --project solo-learn --wandb --method simsiam --proj_hidden_dim 2048 --pred_hidden_dim 512 --output_dim 2048 --amp_level O2 --log_gpu_memory all
  • I disabled the val_loader for unrelated reasons (by setting val_loader = None just before line 159). No other changes were made.

The error I get is the following:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/configuration_vali
dator.py:101: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
  rank_zero_warn(f'you defined a {step_name} but have no {loader_name}. Skipping {stage} loop')
Global seed set to 5
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name       | Type       | Params
------------------------------------------
0 | encoder    | ResNet     | 23.5 M 
1 | classifier | Linear     | 2.0 M
2 | projector  | Sequential | 12.6 M
3 | predictor  | Sequential | 2.1 M 
------------------------------------------
40.2 M    Trainable params
2.0 K     Non-trainable params
40.3 M    Total params
161.002   Total estimated model params size (MB)
Global seed set to 5
read 1281167 files from 1000 directories
Epoch 8:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                | 13369/26690 [1:55:18<1:54:53,  1.93it/s, loss=3.67, v_num=ok1z]
Traceback (most recent call last):
  File "main_pretrain.py", line 136, in <module>
    main()
  File "main_pretrain.py", line 130, in main
    trainer.fit(model, val_dataloaders=val_loader)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py
", line 460, in fit
    self._run(model)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py
", line 758, in _run
    self.dispatch()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py
", line 799, in dispatch
    self.accelerator.start_training(self)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/accelerators/accel
erator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/plugins/training_t
ype/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py
", line 809, in run_stage
    return self.run_train()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py
  ", line 871, in run_train           
    self.train_loop.run_training_epoch() 
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/training_l
oop.py", line 491, in run_training_epoch                                                                       
    for batch_idx, (batch, is_last_batch) in train_dataloader:   
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/profiler/profilers
.py", line 112, in profile_iterable                                                                            
    value = next(iterator)                      
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters
.py", line 534, in prefetch_iterator        
    for val in it:                                                                                             
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters
.py", line 464, in __next__                                                                                    
    return self.request_next_batch(self.loader_iters)     
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters
.py", line 478, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)              
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/utilities/apply_fu
nc.py", line 85, in apply_to_collection
    return function(data, *args, **kwargs) 
  File "~/Code/solo-learn/solo/methods/dali.py", line 59, in __next__                                
    batch = super().__next__()                     
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line
 194, in __next__                                                                                              
    outputs = self._get_outputs()                                                                              
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/plugin/base_iterator.py"
, line 255, in _get_outputs                                                                                    
    outputs.append(p.share_outputs())
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 863,
in share_outputs
    return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline:
Error when executing Mixed operator decoders__Image encountered:                                               
Error in thread 2: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_decoder_decoupled_api.h:917] NVJPEG error "5
" : NVJPEG_STATUS_ALLOCATOR_FAILURE n02447366/n02447366_33293.jpg
Stacktrace (7 entries):                                                                                        
[frame 0]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali_operators.so(
+0x4cbbee) [0x7efc6c55dbee]                     
[frame 1]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali_operators.so(
+0x87a63b) [0x7efc6c90c63b]                 
[frame 2]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali_operators.so(
+0x87aa2e) [0x7efc6c90ca2e]                                                                                    
[frame 3]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali.so(dali::Thre
adPool::ThreadMain(int, int, bool)+0x1f0) [0x7efc6b5ed330]
[frame 4]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0x70718f)
 [0x7efc6bb9f18f]                    
[frame 5]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7efd013b96db]
[frame 6]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7efd010e2a3f]                                        
                                       
Current pipeline object is no longer valid.  

After I ran into this the first time, I reran it with GPU memory logging. This is the plot I get:
[GPU memory plot omitted]

I am a bit confused that there is an increase after 3.5k steps (from 11979 GB to 1201GB). Let me know in case I should provide more logs.

P.S.: Great work! It is a pleasure to work with! :)

A bug that occurs while using more than 2 num_crops and DALI

I'm trying to train an old method, 'Rotation Prediction', with the library, which needs 4 crops of images, and I found a bug

in solo.methods.dali , around line 219

The original code is
output_map = ["large1", "large2", "label"]

In fact, the following code is correct:

output_map = []
for i in range(self.num_crops):
    output_map.append(f"large{i+1}")
output_map.append('label')

Implementation of Barlow Twins

Hello, thanks for your great work!
I have some questions about Barlow Twins. I used the official code on CIFAR-10 and modified conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=2, bias=False) and maxpool = nn.Identity(). Then, I got a top-1 accuracy of 89.57. The other parameters follow https://github.com/facebookresearch/barlowtwins.
Could you give some suggestions? Many thanks!

Linear Model missing "patch_size" parameter

When I run the following script

python main_linear.py \
    --dataset cifar10 \
    --encoder resnet18 \
    --data_dir ./datasets \
    --max_epochs 100 \
    --gpus 0 \
    --precision 16 \
    --optimizer sgd \
    --scheduler step \
    --lr 1.0 \
    --lr_decay_steps 60 80 \
    --weight_decay 0 \
    --batch_size 256 \
    --num_workers 5 \
    --name simclr-cifar10-linear-eval \
    --pretrained_feature_extractor  EXTRATOR_PATH \
    --project selfsupervised  \
    --wandb

The program raises the error: 'Namespace' object has no attribute 'patch_size'

It seems the linear model is missing some required parameters.

Performance degeneration when batch size is increased(BYOL)

Hi, thanks for your impressive work!

During the BYOL experiment on your code (ImageNet-100, ResNet18, 200 epochs, same hyperparameters as in bash_files), it seems that the performance degrades a lot when the batch size is increased ([val acc1] bsz 128: 75.5%, bsz 256: 70.54%).
Is there anything I need to modify when the batch size is increased?
I don't think it's because of the learning rate.
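
For reference, a hedged sketch of the linear learning-rate scaling rule commonly applied in BYOL/SimCLR-style training when the batch size changes (the base values are illustrative assumptions, not the repository's tuned hyperparameters, and scaling alone may not explain the reported gap):

    base_lr = 1.0          # hypothetical lr tuned for base_batch_size
    base_batch_size = 256
    batch_size = 128
    lr = base_lr * batch_size / base_batch_size  # -> 0.5 when halving the batch size
    print(f"scaled lr: {lr}")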
