
moco-v3's Introduction

MoCo v3 for Self-supervised ResNet and ViT

Introduction

This is a PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.

The original MoCo v3 was implemented in TensorFlow and run on TPUs. This repo re-implements it in PyTorch for GPUs. Despite the library and numerical differences, it reproduces the results and observations reported in the paper.

Main Results

The following results are based on ImageNet-1k self-supervised pre-training, followed by ImageNet-1k supervised training for linear evaluation or end-to-end fine-tuning. All results in these tables are based on a batch size of 4096.

Pre-trained models and configs can be found at CONFIG.md.

ResNet-50, linear classification

pretrain epochs | pretrain crops | linear acc
100             | 2x224          | 68.9
300             | 2x224          | 72.8
1000            | 2x224          | 74.6

ViT, linear classification

model     | pretrain epochs | pretrain crops | linear acc
ViT-Small | 300             | 2x224          | 73.2
ViT-Base  | 300             | 2x224          | 76.7

ViT, end-to-end fine-tuning

model     | pretrain epochs | pretrain crops | e2e acc
ViT-Small | 300             | 2x224          | 81.4
ViT-Base  | 300             | 2x224          | 83.2

The end-to-end fine-tuning results are obtained with the DeiT repo, using all the default DeiT configs. ViT-B is fine-tuned for 150 epochs (vs. DeiT-B's 300 epochs, which reaches 81.8% accuracy).

Usage: Preparation

Install PyTorch and download the ImageNet dataset following the official PyTorch ImageNet training code. Similar to MoCo v1/2, this repo contains minimal modifications to the official PyTorch ImageNet code. We assume the user can successfully run the official PyTorch ImageNet code. For ViT models, install timm (timm==0.4.9).

The code has been tested with CUDA 10.2/cuDNN 7.6.5, PyTorch 1.9.0 and timm 0.4.9.

Usage: Self-supervised Pre-Training

Below are three examples for MoCo v3 pre-training.

ResNet-50 with 2-node (16-GPU) training, batch 4096

On the first node, run:

python main_moco.py \
  --moco-m-cos --crop-min=.2 \
  --dist-url 'tcp://[your first node address]:[specified port]' \
  --multiprocessing-distributed --world-size 2 --rank 0 \
  [your imagenet-folder with train and val folders]

On the second node, run the same command with --rank 1. With a batch size of 4096, the training can fit into 2 nodes with a total of 16 Volta 32G GPUs.

ViT-Small with 1-node (8-GPU) training, batch 1024

python main_moco.py \
  -a vit_small -b 1024 \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

ViT-Base with 8-node training, batch 4096

With a batch size of 4096, ViT-Base is trained with 8 nodes:

python main_moco.py \
  -a vit_base \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://[your first node address]:[specified port]' \
  --multiprocessing-distributed --world-size 8 --rank 0 \
  [your imagenet-folder with train and val folders]

On other nodes, run the same command with --rank 1, ..., --rank 7 respectively.

Notes:

  1. The batch size specified by -b is the total batch size across all GPUs.
  2. The learning rate specified by --lr is the base lr, and is adjusted by the linear lr scaling rule in this line (see the sketch after these notes).
  3. Using a smaller batch size gives a more stable result (see paper), but is slower. A large batch size is critical for good speed on TPUs (as used in the paper).
  4. In this repo, only multi-GPU DistributedDataParallel training is supported; single-GPU or DataParallel training is not supported. The code has been improved to better suit the multi-node setting, and by default uses automatic mixed precision for pre-training.
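
For reference, a minimal sketch of the linear lr scaling rule mentioned in note 2 (assuming the common convention of a reference batch size of 256; the helper name scale_lr is ours):

def scale_lr(base_lr, batch_size, reference_batch=256):
    # Linear lr scaling: the effective lr grows linearly with the total
    # batch size, relative to a reference batch of 256.
    return base_lr * batch_size / reference_batch

# e.g. the ViT-Small recipe above: 1.5e-4 * 1024 / 256 = 6e-4
effective_lr = scale_lr(1.5e-4, 1024)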

Usage: Linear Classification

By default, we use momentum-SGD and a batch size of 1024 for linear classification on frozen features/weights. This can be done with a single 8-GPU node.

python main_lincls.py \
  -a [architecture] --lr [learning rate] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
  [your imagenet-folder with train and val folders]

Usage: End-to-End Fine-tuning ViT

To perform end-to-end fine-tuning for ViT, use our script to convert the pre-trained ViT checkpoint to DeiT format:

python convert_to_deit.py \
  --input [your checkpoint path]/[your checkpoint file].pth.tar \
  --output [target checkpoint file].pth

Then run the training (in the DeiT repo) with the converted checkpoint:

python $DEIT_DIR/main.py \
  --resume [target checkpoint file].pth \
  --epochs 150

This gives us 83.2% accuracy for ViT-Base with 150-epoch fine-tuning.

Note:

  1. We use --resume rather than --finetune in the DeiT repo, as its --finetune option trains under eval mode. When loading the pre-trained model, revise model_without_ddp.load_state_dict(checkpoint['model']) with strict=False (see the sketch after these notes).
  2. Our ViT-Small uses heads=12 in the Transformer block, while DeiT's default is heads=6. Please modify the DeiT code accordingly when fine-tuning our ViT-Small model.
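
For reference, a minimal sketch of the revision described in note 1 (model_without_ddp and checkpoint follow the naming used in DeiT's main.py; the surrounding fine-tuning code is omitted):

import torch

# Load the converted MoCo v3 checkpoint; strict=False skips keys that do not
# match the fine-tuning model (e.g. the pre-training projection/prediction heads).
checkpoint = torch.load(args.resume, map_location='cpu')
model_without_ddp.load_state_dict(checkpoint['model'], strict=False)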

Model Configs

See the commands listed in CONFIG.md for specific model configs, including our recommended hyper-parameters and pre-trained reference models.

Transfer Learning

See the instructions in the transfer dir.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@Article{chen2021mocov3,
  author  = {Xinlei Chen* and Saining Xie* and Kaiming He},
  title   = {An Empirical Study of Training Self-Supervised Vision Transformers},
  journal = {arXiv preprint arXiv:2104.02057},
  year    = {2021},
}

moco-v3's People

Contributors

endernewton, s9xie, scottclowe

moco-v3's Issues

Question about the loss temperature in MoCo v3

The loss in MoCo v2 looks like:
loss = nn.CrossEntropyLoss()(logits / self.T, labels)
but the loss in MoCo v3 looks like:
loss = nn.CrossEntropyLoss()(logits / self.T, labels) * (2 * self.T)

I don't know why the loss should be multiplied by 2 * temperature; it really confuses me. Can anyone explain?

Does this implementation support non-distributed training?

I found that if I don't use distributed training, i.e. set --multiprocessing-distributed=False and use a single GPU, there seems to be no problem in main_moco.py with

   torch.cuda.set_device(args.gpu)
   model = model.cuda(args.gpu)

However, this error occurred when training started

AssertionError: Default process group is not initialized

This error can be traced back to

File "~/moco-v3/moco/builder.py", line 68, in contrastive_loss
k = concat_all_gather(k)

and

File "~/moco-v3/moco/builder.py", line 178, in concat_all_gather
for _ in range(torch.distributed.get_world_size())]

This error is caused by the computation of contrastive_loss, which still relies on distributed training. So I wonder whether non-distributed training is unsupported even when multiprocessing-distributed=False.
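
One possible workaround, offered here only as a sketch (not part of this repo): let concat_all_gather fall back to the local tensor when no process group has been initialized:

import torch

@torch.no_grad()
def concat_all_gather(tensor):
    # Single-process fallback: with no initialized process group there is
    # nothing to gather, so return the local tensor unchanged.
    if not (torch.distributed.is_available() and torch.distributed.is_initialized()):
        return tensor
    tensors_gather = [torch.ones_like(tensor)
                      for _ in range(torch.distributed.get_world_size())]
    torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
    return torch.cat(tensors_gather, dim=0)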

The linear-probe acc1 of ViT-Tiny on ImageNet is bad

Hi, thanks for your great work. I found a problem in our experiment:
first, I train a ViT-Tiny on ImageNet with MoCo v3;
second, I fine-tune the ViT-Tiny on ImageNet, training only a classifier (linear probe).

I found the top-1 acc is only 32%. Is that right? Does anyone have MoCo v3 results for ViT-Tiny on ImageNet?

MoCo v3 vit_small error: object has no attribute 'num_tokens'

When I attempt to pre-train moco v3's vit_small model, I run into the following bug:

    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'VisionTransformerMoCo' object has no attribute 'num_tokens'

After changing the line in vits.py (line 66) to
assert self.num_prefix_tokens == 1, 'Assuming one and only one token, [cls]'
I don't see the bug anymore. It seems the base class timm.models.vision_transformer has an attribute named num_prefix_tokens but not num_tokens, and hence vit_small errors out at the above-mentioned line.

The command I used to run the code is:
python main_moco.py \
  -a vit_small -b 1024 \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=400 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://localhost:8080' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  /data/

Please let me know if this is an accurate fix, or if I missed something. Thanks in advance!
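
A hypothetical compatibility shim (not from the repo) that would work across timm versions, as a drop-in replacement for the assert in vits.py, where self is the VisionTransformerMoCo instance:

# timm renamed num_tokens to num_prefix_tokens in later releases; read
# whichever attribute the installed timm provides before asserting.
num_prefix_tokens = getattr(self, 'num_prefix_tokens', getattr(self, 'num_tokens', None))
assert num_prefix_tokens == 1, 'Assuming one and only one token, [cls]'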

ViT-Base fine-tuned checkpoints

Hey,
Thank you for providing the code and the checkpoints. I may have missed it, but I couldn't find checkpoints for the fine-tuned ViT-Base model. Could you please provide them?

Thanks,
Eliahu

Training with multi-crop

Thank you for your great work! I'm wondering whether MoCo v3 could be further improved by the multi-crop trick. Is there any recommended configuration? Thank you very much!

Question about linear probe

Hi, I see in your linear probing code that validation acc is also monitored during training. I wonder which val acc you reported: the best val acc, or the val acc from the last epoch? Thank you.

How to fine-tune?

Hello, I would like to ask how to fine-tune according to my own dataset; it seems the provided pre-training weights are missing some content.
Thanks!

main_moco.py, line 247, in main_worker
    optimizer.load_state_dict(checkpoint['optimizer'])
optimizer.py, line 137, in load_state_dict
    saved_groups = state_dict['param_groups']
TypeError: 'NoneType' object is not subscriptable

Any hyper-parameter suggestions for other model architectures?

I noticed that this repository only provides the results and experiment settings for ResNet-50 and the ViT series.

When I tried to reproduce the results, I found that the final linear probing accuracy is very sensitive to the hyper-parameters, such as learning rate, optimizer, and augmentations.

Are there any suggestions for training MoCo v3 on other models, such as EfficientNet or ResNet-101? And how should the hyper-parameters be adjusted for different model architectures?

KNN curve code

Thanks @endernewton for your work! I was wondering if you could kindly share your KNN classifier curve code somewhere, either here or in some other repo/gist?

Thanks again!
Kashif
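
Not the authors' code, but for context, a minimal weighted-vote kNN monitor sketch in the style commonly used for self-supervised evaluation (all names here are ours; features are assumed to be precomputed with the frozen encoder):

import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_accuracy(train_feats, train_labels, val_feats, val_labels,
                 k=200, temperature=0.07, num_classes=1000):
    # Cosine-similarity kNN with temperature-weighted voting.
    train_feats = F.normalize(train_feats, dim=1)
    val_feats = F.normalize(val_feats, dim=1)
    sim = val_feats @ train_feats.t()              # [N_val, N_train]
    weights, idx = sim.topk(k, dim=1)              # k nearest neighbors
    weights = (weights / temperature).exp()
    neighbor_labels = train_labels[idx]            # [N_val, k]
    one_hot = F.one_hot(neighbor_labels, num_classes).float()
    scores = (one_hot * weights.unsqueeze(-1)).sum(dim=1)
    return (scores.argmax(dim=1) == val_labels).float().mean().item()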

About the Linear Probe Accuracy of ResNet-50

I ran the code using the parameters specified in CONFIG.md for ResNet-50 with 1000-epoch pre-training, and then fine-tuned using the linear probe method. All parameters were kept the same as in CONFIG.md. However, after linear probing my accuracy reaches only 74.36%, not 74.6%. I am not sure what I might be missing here.

Could you help me out?

Additionally, I evaluated the official checkpoint provided, and it does achieve 74.6%.

An error occurred in torch.multiprocessing.spawn

[libprotobuf FATAL google/protobuf/stubs/common.cc:87] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.17.3). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
terminate called after throwing an instance of 'google::protobuf::FatalException'
what(): This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.17.3). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
Traceback (most recent call last):
  File "train.py", line 413, in <module>
    main()
  File "train.py", line 140, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/wxq/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/wxq/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/wxq/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT

I don't know what error has occurred; any help is appreciated, thank you.
My device: 4 NVIDIA 1080 Ti GPUs
CUDA version: 11.0

BUG

In main_moco.py, the code

optimizer.load_state_dict(checkpoint['optimizer'])
scaler.load_state_dict(checkpoint['scaler'])

reports an error.

That means the 'optimizer' and 'scaler' entries are missing from the ViT-Small pre-trained checkpoint files.
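
A hypothetical guard (not in the repo) that would let --resume tolerate such checkpoints by restoring optimizer/scaler state only when present:

# The released pre-trained weights ship without optimizer/scaler state,
# so only restore those entries when the checkpoint actually has them.
if checkpoint.get('optimizer') is not None:
    optimizer.load_state_dict(checkpoint['optimizer'])
if checkpoint.get('scaler') is not None:
    scaler.load_state_dict(checkpoint['scaler'])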

Queue

Is the queue no longer updated here?

Transfer learning performance of MoCo v3 on more challenging downstream dense prediction tasks.

Thanks for your great work!

I believe a goal of un-/self-supervised learning is to learn transferable feature representations. I notice that MoCo v3 conducts a study on some smaller image classification datasets such as CIFAR-10/-100, and the performance is quite impressive.

But it seems that the performance of modern neural nets on these image classification datasets is somewhat saturated. I believe the community is more interested in more challenging downstream dense prediction tasks such as object detection and scene parsing. Task-specific decoder layers such as DETR (for object detection) and SETR (for semantic segmentation or scene parsing) can almost be used out of the box. I wonder whether there is a plan to study the transfer learning performance of MoCo v3 on downstream dense prediction tasks in the future?

How does the loss converge during training?

During training, I find that the training loss is not monotonically decreasing; is that expected? Does the loss value indicate how training is going? If not, what should the sample-matching accuracy be when pre-training is finished?

Question about batch size

Hi,

For ResNet-50, the training batch size is 4096. However, I cannot afford to train with such a large batch size. Can I expect results similar to batch 4096 if I train with a batch size of 512 or 256?

What parameters are used for linear classification in the ResNet-50 experiment?

Specifically, what are the parameters for

python main_lincls.py \
  -a [architecture] --lr [learning rate] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
  [your imagenet-folder with train and val folders]

in the ResNet-50 experiment?

About the learning rate for ResNet-50

I met an issue training ResNet-50 with MoCo v3. Under the distributed training setting with 16 V100 GPUs (each process has one GPU, batch size 4096), the training loss is about 27.2 at the 100th epoch. When I lower the learning rate to 1.5e-4 (the default is 0.6), the loss decreases more reasonably and reaches 27.0 at the 100th epoch. Could you please verify whether this is reasonable?

TensorFlow version

Thank you for open-sourcing the PyTorch implementation. I wonder whether the original TensorFlow implementation has been released, for the purpose of training on TPUs.

Fine-tuning vs Linear probing

Hi,

I am wondering why there is a significant performance gap between fine-tuning and linear probing. Additionally, why is fine-tuning not used for the ResNet model?

Thank you in advance!

How many TPUs?

Hi,
In the MoCo v3 paper there is a section about computation time. It says that for ViT-B, 100 ImageNet epochs take 2.1 hours. It is not clear whether this means 512 TPU devices or 512 TPU cores. To be precise, there are two types of TPUs available on Google Cloud: v2-[32,512] and v3-[32-2048]. Which one was used in the experiments, and how many for each instance?
