
vit-pytorch's Introduction

Vision Transformer

PyTorch reimplementation of Google's repository for the ViT model released with the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby.

This paper shows that Transformers applied directly to image patches and pre-trained on large datasets work very well on image recognition tasks.

fig1

Vision Transformer achieves state-of-the-art results on image recognition tasks with a standard Transformer encoder and fixed-size patches. To perform classification, the authors use the standard approach of adding an extra learnable "classification token" to the sequence.
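As a rough sketch of that input pipeline (illustrative only, not the repository's modeling.py), the image is split into fixed-size patches, each patch is linearly embedded, the learnable classification token is prepended, and learnable position embeddings are added before the sequence enters the Transformer encoder:

import torch
import torch.nn as nn

class PatchEmbeddings(nn.Module):
    # Sketch of the ViT input embeddings: patchify + [CLS] token + position embeddings.
    def __init__(self, img_size=224, patch_size=16, in_chans=3, hidden_size=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to a linear projection of flattened patches.
        self.proj = nn.Conv2d(in_chans, hidden_size, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches + 1, hidden_size))

    def forward(self, x):                        # x: [B, 3, H, W]
        x = self.proj(x)                         # [B, hidden, H/patch, W/patch]
        x = x.flatten(2).transpose(1, 2)         # [B, n_patches, hidden]
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend the classification token
        return x + self.pos_emb                  # add learnable position embeddings

The classification head then reads off the output representation of the classification token.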

fig2

Usage

1. Download Pre-trained model (Google's Official Checkpoint)

  • Available models: ViT-B_16 (85.8M), R50+ViT-B_16 (97.96M), ViT-B_32 (87.5M), ViT-L_16 (303.4M), ViT-L_32 (305.5M), ViT-H_14 (630.8M)
    • imagenet21k pre-trained models
      • ViT-B_16, ViT-B_32, ViT-L_16, ViT-L_32, ViT-H_14
    • imagenet21k pre-trained + imagenet2012 fine-tuned models
      • ViT-B_16-224, ViT-B_16, ViT-B_32, ViT-L_16-224, ViT-L_16, ViT-L_32
    • Hybrid model (ResNet-50 + Transformer)
      • R50-ViT-B_16
# imagenet21k pre-train
wget https://storage.googleapis.com/vit_models/imagenet21k/{MODEL_NAME}.npz

# imagenet21k pre-train + imagenet2012 fine-tuning
wget https://storage.googleapis.com/vit_models/imagenet21k+imagenet2012/{MODEL_NAME}.npz
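
For example, to fetch the imagenet21k ViT-B_16 checkpoint into the checkpoint/ directory used by the training command below:

mkdir -p checkpoint
wget -O checkpoint/ViT-B_16.npz https://storage.googleapis.com/vit_models/imagenet21k/ViT-B_16.npz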

2. Train Model

python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz

CIFAR-10 and CIFAR-100 are downloaded automatically and used for training. To use a different dataset, you need to customize data_utils.py.

The default batch size is 512. When GPU memory is insufficient, you can still train by adjusting the value of --gradient_accumulation_steps, as in the example below.
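
For example (a sketch; it assumes train.py exposes --train_batch_size and splits the batch across accumulation steps, so the exact flags may differ), this keeps the effective batch size at 512 while holding only 64 images in GPU memory per forward pass:

python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz --train_batch_size 512 --gradient_accumulation_steps 8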

You can also use Automatic Mixed Precision (AMP) to reduce memory usage and train faster:

python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz --fp16 --fp16_opt_level O2

Results

To verify that the converted model weights are correct, we simply compare them with the authors' reported results. We trained with mixed precision, and --fp16_opt_level was set to O2.

imagenet-21k

model | dataset | resolution | acc (official) | acc (this repo) | time
ViT-B_16 | CIFAR-10 | 224x224 | - | 0.9908 | 3h 13m
ViT-B_16 | CIFAR-10 | 384x384 | 0.9903 | 0.9906 | 12h 25m
ViT-B_16 | CIFAR-100 | 224x224 | - | 0.923 | 3h 9m
ViT-B_16 | CIFAR-100 | 384x384 | 0.9264 | 0.9228 | 12h 31m
R50-ViT-B_16 | CIFAR-10 | 224x224 | - | 0.9892 | 4h 23m
R50-ViT-B_16 | CIFAR-10 | 384x384 | 0.99 | 0.9904 | 15h 40m
R50-ViT-B_16 | CIFAR-100 | 224x224 | - | 0.9231 | 4h 18m
R50-ViT-B_16 | CIFAR-100 | 384x384 | 0.9231 | 0.9197 | 15h 53m
ViT-L_32 | CIFAR-10 | 224x224 | - | 0.9903 | 2h 11m
ViT-L_32 | CIFAR-100 | 224x224 | - | 0.9276 | 2h 9m
ViT-H_14 | CIFAR-100 | 224x224 | - | WIP | -

imagenet-21k + imagenet2012

model | dataset | resolution | acc
ViT-B_16-224 | CIFAR-10 | 224x224 | 0.99
ViT-B_16-224 | CIFAR-100 | 224x224 | 0.9245
ViT-L_32 | CIFAR-10 | 224x224 | 0.9903
ViT-L_32 | CIFAR-100 | 224x224 | 0.9285

Shorter training

  • In the experiments below, we used a resolution of 224x224.
  • tensorboard
upstream | model | dataset | total_steps / warmup_steps | acc (official) | acc (this repo)
imagenet21k | ViT-B_16 | CIFAR-10 | 500/100 | 0.9859 | 0.9859
imagenet21k | ViT-B_16 | CIFAR-10 | 1000/100 | 0.9886 | 0.9878
imagenet21k | ViT-B_16 | CIFAR-100 | 500/100 | 0.8917 | 0.9072
imagenet21k | ViT-B_16 | CIFAR-100 | 1000/100 | 0.9115 | 0.9216

Visualization

The ViT consists of a standard Transformer encoder, and the encoder consists of Self-Attention and MLP modules. The attention map for an input image can be visualized from the self-attention scores.

Visualization code can be found at visualize_attention_map.
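
As a minimal sketch of the idea (not the notebook's exact code), assuming the model was built with vis=True so that a forward pass returns the logits together with a list of per-layer attention tensors of shape [1, heads, tokens, tokens], the layers can be combined with attention rollout and mapped back onto the patch grid:

import numpy as np
import torch

def attention_rollout(model, x):
    # x: preprocessed image batch of shape [1, 3, 224, 224]; model built with vis=True.
    logits, att_list = model(x)                   # att_list: one [1, heads, T, T] tensor per layer
    att = torch.stack(att_list).squeeze(1)        # [layers, heads, T, T]
    att = att.mean(dim=1)                         # average over heads
    att = att + torch.eye(att.size(-1))           # account for residual connections
    att = att / att.sum(dim=-1, keepdim=True)     # re-normalize rows

    rollout = att[0]
    for layer_att in att[1:]:                     # multiply the maps through the layers
        rollout = layer_att @ rollout

    grid = int(np.sqrt(att.size(-1) - 1))         # patch grid size (tokens minus [CLS])
    mask = rollout[0, 1:].reshape(grid, grid)     # attention of [CLS] to each patch
    return (mask / mask.max()).detach().numpy()

The returned grid (14x14 for ViT-B_16 at 224x224) can then be resized to the input resolution and overlaid on the image.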

fig3

Reference

Citations

@article{dosovitskiy2020,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={arXiv preprint arXiv:2010.11929},
  year={2020}
}

vit-pytorch's People

Contributors

jeonsworld


vit-pytorch's Issues

Hybrid ViT fails in the constructor for image size = 200

vit = VisionTransformer(CONFIGS['R50-ViT-B_16'], zero_head=False, img_size=200)

leads to "float division by zero" exception:


ZeroDivisionError Traceback (most recent call last)
in
----> 1 vit = VisionTransformer(CONFIGS['R50-ViT-B_16'], zero_head=False, img_size=200)

ViT-pytorch/models/modeling.py in __init__(self, config, img_size, num_classes, zero_head, vis)
267 self.classifier = config.classifier
268
--> 269 self.transformer = Transformer(config, img_size, vis)
270 self.head = Linear(config.hidden_size, num_classes)
271

ViT-pytorch/models/modeling.py in __init__(self, config, img_size, vis)
251 def __init__(self, config, img_size, vis):
252 super(Transformer, self).__init__()
--> 253 self.embeddings = Embeddings(config, img_size=img_size)
254 self.encoder = Encoder(config, vis)
255

ViT-pytorch/models/modeling.py in __init__(self, config, img_size, in_channels)
144 width_factor=config.resnet.width_factor)
145 in_channels = self.hybrid_model.width * 16
--> 146 self.patch_embeddings = Conv2d(in_channels=in_channels,
147 out_channels=config.hidden_size,
148 kernel_size=patch_size,

~/.conda/envs/ml-devenv2/lib/python3.8/site-packages/torch/nn/modules/conv.py in __init__(self, in_channels, out_channels, kernel_size, stride, padding, dilation, groups, bias, padding_mode)
408 padding = _pair(padding)
409 dilation = _pair(dilation)
--> 410 super(Conv2d, self).__init__(
411 in_channels, out_channels, kernel_size, stride, padding, dilation,
412 False, _pair(0), groups, bias, padding_mode)

~/.conda/envs/ml-devenv2/lib/python3.8/site-packages/torch/nn/modules/conv.py in __init__(self, in_channels, out_channels, kernel_size, stride, padding, dilation, transposed, output_padding, groups, bias, padding_mode)
81 else:
82 self.register_parameter('bias', None)
---> 83 self.reset_parameters()
84
85 def reset_parameters(self) -> None:

~/.conda/envs/ml-devenv2/lib/python3.8/site-packages/torch/nn/modules/conv.py in reset_parameters(self)
84
85 def reset_parameters(self) -> None:
---> 86 init.kaiming_uniform_(self.weight, a=math.sqrt(5))
87 if self.bias is not None:
88 fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)

~/.conda/envs/ml-devenv2/lib/python3.8/site-packages/torch/nn/init.py in kaiming_uniform_(tensor, a, mode, nonlinearity)
379 fan = _calculate_correct_fan(tensor, mode)
380 gain = calculate_gain(nonlinearity, a)
--> 381 std = gain / math.sqrt(fan)
382 bound = math.sqrt(3.0) * std # Calculate uniform bounds from standard deviation
383 with torch.no_grad():

ZeroDivisionError: float division by zero

apex version?

Getting this error with apex==0.9.10.dev0. What is your apex version?


HTTP Error 403: Forbidden

Hi

I tried your notebook, but I think the link is dead; I get a Forbidden error.

HTTPError                                 Traceback (most recent call last)
<ipython-input-4-14f159ea9fa3> in <module>
      1 # Test Image
      2 img_url = "https://images.mypetlife.co.kr/content/uploads/2019/04/09192811/welsh-corgi-1581119_960_720.jpg"
----> 3 urlretrieve(img_url, "attention_data/img.jpg")
      4 
      5 # Prepare Model
...
HTTP Error 403: Forbidden

[ Softmax() missing ]

Thanks for sharing the ViT implementation, wonderful work.

I'm wondering why you do not apply Softmax() after the head component (from features to classes), as you did in the Jupyter notebook example?

Thanks

Not able to load ViT-H_14

I was testing using the provided visualize_attention_map.ipynb

ViT-B_16-224 loads fine, but when I downloaded ViT-H_14 and tried to load it, I got the following error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-36-0b02f0ab326a> in <module>
      2 config = CONFIGS["ViT-H_14"]
      3 model = VisionTransformer(config, num_classes=1000, zero_head=False, img_size=224, vis=True)
----> 4 model.load_from(np.load("imagenet21k_ViT-H_14.npz"))
      5 model.eval()

~/Documents/clones/ViT-pytorch/models/modeling.py in load_from(self, weights)
    287                 nn.init.zeros_(self.head.bias)
    288             else:
--> 289                 self.head.weight.copy_(np2th(weights["head/kernel"]).t())
    290                 self.head.bias.copy_(np2th(weights["head/bias"]).t())
    291 

RuntimeError: The size of tensor a (1000) must match the size of tensor b (21843) at non-singleton dimension 0

What do you think might be the error?

--gradient_accumulation_steps

Hello, I would like to ask about this step in the README: "The default batch size is 512. When GPU memory is insufficient, you can proceed with training by adjusting the value of --gradient_accumulation_steps." How do I do this specifically?

Multi-GPU

How to do multi-GPU training?
Right now only one GPU gets utilised.

Model Architecture For Fine-tuning

In the original paper, the authors state that "we remove the whole head (two linear layers) and replace it by a single, zero-initialized linear layer outputting the number of classes required by the target dataset. We found this to be a little more robust than simply re-initializing the very last layer."

May I know which code snippet is related to this?

Request for pre-trained weights only on Imagenet2012.

Thanks for your hard work! I wonder if there are some pre-trained weights only using Imagenet2012? I found that the pre-trained ResNet provided by torchvision may be pre-trained only on Imagenet2012 so I want to take ViT and ResNet for a fair comparison.

Key error when loading pre-trained weights

Hi, Thank you for your nice implementation. I get the following error when loading the pre-trained weights:

KeyError: 'Transformer/encoderblock_0\MultiHeadDotProductAttention_1/query\kernel is not a file in the archive'

Would you please help me with this?

Parnian

how you save tensorboard?

@jeonsworld
Okay, this is a completely different question, but I should ask it because I have not seen it answered anywhere else.
How did you save the TensorBoard logs so we can just click on them and view them? I want to do that as well. Should I save them in some format or do anything special? Please point me to any link/material that can help me with that.
Thanks :)

ImportError: cannot import name 'UnencryptedCookieSessionFactoryConfig' from 'pyramid.session' (unknown location)

Hi, by executing this

python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz

I encounter the error:

Traceback (most recent call last):
  File "train.py", line 17, in <module>
    from apex import amp
  File "/home/tiger/.local/lib/python3.7/site-packages/apaex/__init__.py", line 13, in <module>
    from pyramid.session import UnencryptedCookieSessionFactoryConfig
ImportError: cannot import name 'UnencryptedCookieSessionFactoryConfig' from 'pyramid.session' (unknown location)

About imagenet-21k

Thanks for your great repo !

I cannot find a link to download the imagenet-21k dataset. Is there any way to download ImageNet-21k now? Thanks a lot~

Docker

Would you like a docker file?

Will send it :)

Errors when use custom data to retrain the Vit-transformer

When using my custom dataset, which contains 6 classes, I modified data_utils.py and changed 'num_classes = 6' in train.py. But I got these errors:

Training (X / X Steps) (loss=X.X): 0%|| 0/33 [00:00<?, ?it/s]/opt/conda/conda-bld/pytorch_1595629416375/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed.
Training (X / X Steps) (loss=X.X): 0%|| 0/33 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train_trash.py", line 335, in
main()
File "train_trash.py", line 331, in main
train(args, model)
File "train_trash.py", line 211, in train
loss.backward()
File "/root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/autograd/init.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Exception raised from createCublasHandle at /opt/conda/conda-bld/pytorch_1595629416375/work/aten/src/ATen/cuda/CublasHandlePool.cpp:8 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f533ff7077d in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0xcfc185 (0x7f53410d2185 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::cuda::getCurrentCUDABlasHandle() + 0xb75 (0x7f53410d3065 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xcef217 (0x7f53410c5217 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::(anonymous namespace)::addmm_out_cuda_impl(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar, c10::Scalar) + 0xf7e (0x7f534242985e in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::mm_cuda(at::Tensor const&, at::Tensor const&) + 0xb3 (0x7f534242b353 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xd14ea0 (0x7f53410eaea0 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x7b1990 (0x7f5372b9b990 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x7f5373383c7c in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::mm(at::Tensor const&, at::Tensor const&) + 0x4b (0x7f53732d4b0b in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: + 0x2c2be8f (0x7f5375015e8f in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: + 0x7b1990 (0x7f5372b9b990 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x7f5373383c7c in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::Tensor::mm(at::Tensor const&) const + 0x4b (0x7f537346a10b in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x2a6d094 (0x7f5374e57094 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::generated::AddmmBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x2d5 (0x7f5374e5d055 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x30d1017 (0x7f53754bb017 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f53754b6860 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f53754b7401 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f53754af579 in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f53797de13a in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #21: + 0xc819d (0x7f537c30f19d in /root/anaconda3/envs/agr/lib/python3.6/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #22: + 0x76db (0x7f53a0e6c6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x3f (0x7f53a01e8a3f in /lib/x86_64-linux-gnu/libc.so.6)

I guess this error is caused by the labels crossing the boundary, but I can't find where to modify it. Could you please help me fix this problem?

Thank you!
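
For what it's worth, a quick generic check (not code from this repository) that the labels really lie in [0, num_classes):

import torch
from torch.utils.data import DataLoader

# Hypothetical check: train_dataset is whatever your customized data_utils.py returns.
NUM_CLASSES = 6
loader = DataLoader(train_dataset, batch_size=256)
for _, labels in loader:
    assert 0 <= labels.min() and labels.max() < NUM_CLASSES, \
        f"label out of range: min={labels.min().item()}, max={labels.max().item()}"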

Train from scratch

Thanks for your work.

I have a question concerning training from scratch. I checked the source code, and it seems that there is no implementation of position embedding. One can only load position embedding from the pretrained models. If I want to train from scratch, should I implement position embedding by myself, or is there something I overlooked? Any other things I should be careful with if training from scratch?

Which GPU did you use?

Sorry, I see training times shown in your experiments. I wonder which GPU you used, and how many of them?

How to adapt arbitrary image size?

The length of learnable position embedding should be specified when it is initialized, so it is impossible to process images of other sizes. Is there any way to solve this problem?
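
For what it's worth, the checkpoint's position embeddings can be resized to a new patch grid; the repository's load_from appears to do something like this when img_size differs from the checkpoint (see the "load_pretrained: resized variant: torch.Size([1, 257, 1280]) to torch.Size([1, 730, 1280])" log quoted in a later issue). A minimal sketch of the idea with hypothetical names, using bilinear interpolation:

import numpy as np
from scipy import ndimage

def resize_posemb(posemb, new_grid):
    # posemb: numpy array of shape [1, 1 + old_grid**2, dim] with the [CLS] entry first.
    # Illustrative helper only, not code copied from the repository.
    cls_tok, grid_tok = posemb[:, :1], posemb[:, 1:]
    old_grid = int(np.sqrt(grid_tok.shape[1]))
    grid_tok = grid_tok.reshape(1, old_grid, old_grid, -1)
    zoom = (1, new_grid / old_grid, new_grid / old_grid, 1)
    grid_tok = ndimage.zoom(grid_tok, zoom, order=1)       # bilinear resize of the patch grid
    grid_tok = grid_tok.reshape(1, new_grid * new_grid, -1)
    return np.concatenate([cls_tok, grid_tok], axis=1)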

Training accuracy much lower than validation accuracy

Thanks for creating and uploading this easily usable repo!

In addition to the validation accuracy on the entire validation set that is printed out by default, we printed out the training accuracies of the model and we observe that the training accuracy is 6-8% lower than the validation accuracy. Is that reasonable/accurate since we usually expect the training accuracy to be higher than the validation accuracy?

This was for a ViT-B_16 model, pretrained on ImageNet-21k and during the fine-tuning phase on CIFAR 10. To get the training accuracies, we used model(x)[0] to get the logits, loss and predictions for each batch and used the AverageMeter() to calculate the running accuracies. Additionally, to get the accurate training accuracy over the entire training set, we passed the training set to a copy valid() (with only changes to print statements). Both the running training accuracy and the training accuracy over the entire training set was lower than the validation accuracy by 6-8%. For instance, after 10k steps, training accuracy was 92.9% (over entire train set) and validation accuracy was 98.7%. We used most of the default hyperparameters (besides batch size and fp_16) and did not make other changes to the code.

Please let us know if this lower training accuracy is expected or if its calculation is incorrect. Thanks in advance.

pre-trained weight

thanks for your nice work!
I have a question about the suffix in the pre-trained weight filenames.
What does the suffix "-224" mean?
For example, ViT-B_16-224 vs. ViT-B_16: were they trained with different input sizes?

Another question: could the pre-trained weights trained on the JFT dataset be provided?

Why the kernel is normalized in StdConv2d?

I noticed that you used

import torch
import torch.nn as nn
import torch.nn.functional as F

class StdConv2d(nn.Conv2d):

    def forward(self, x):
        w = self.weight
        v, m = torch.var_mean(w, dim=[1, 2, 3], keepdim=True, unbiased=False)
        w = (w - m) / torch.sqrt(v + 1e-5)
        return F.conv2d(x, w, self.bias, self.stride, self.padding,
                        self.dilation, self.groups)

Why is 'w' normalized here? Is there any special consideration behind implementing it this way? Thanks.

Tensors do not match?

File "/Users/chaoyanghe/sourcecode/FedML/fedml_api/model/cv/transformer/vit/vision_transformer.py", line 258, in forward
x, attn_weights = self.transformer(x)
File "/Users/chaoyanghe/opt/anaconda3/envs/fedml/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/chaoyanghe/sourcecode/FedML/fedml_api/model/cv/transformer/vit/vision_transformer.py", line 242, in forward
embedding_output = self.embeddings(input_ids)
File "/Users/chaoyanghe/opt/anaconda3/envs/fedml/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/chaoyanghe/sourcecode/FedML/fedml_api/model/cv/transformer/vit/vision_transformer.py", line 151, in forward
embeddings = x + self.position_embeddings
RuntimeError: The size of tensor a (5) must match the size of tensor b (197) at non-singleton dimension 1

I am training on CIFAR-10 but got the above issue. May I know why?

Loss can't drop

Thank you so much for sharing your code. I tried to use ViT as the encoder, followed by a common decoder, to build a segmentation network. I train it from scratch but found that the loss does not drop from the beginning of training, and the results stay near 0. Is there any trick for training ViT correctly? Is it very important to load the pre-trained model and fine-tune?
Here is my configuration:
patch_size=16, hidden_size=16*16*3, mlp_dim=3072, dropout_rate=0.1, num_heads=12, num_layers=12, lr=3e-4, opt=Adam, weight_decay=0.0

A bug when using Apex DDP

Hi jeonsworld, thank you for providing this awesome repo of Vision Transformer! I tried to use it but hit a problem with distributed training. The problem seems to be around Apex, but do you know the reason? I would appreciate it a lot if you could help me with it.

This is the command that I used:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz

And this is the Error information:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Killing subprocess 37950
Killing subprocess 37951
Killing subprocess 37952
Killing subprocess 37953
Traceback (most recent call last):
File "/opt/miniconda/envs/vit/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/miniconda/envs/vit/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/miniconda/envs/vit/lib/python3.9/site-packages/torch/distributed/launch.py", line 340, in
main()
File "/opt/miniconda/envs/vit/lib/python3.9/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/opt/miniconda/envs/vit/lib/python3.9/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/miniconda/envs/vit/bin/python', '-u', 'train.py', '--local_rank=3', '--name', 'cifar10-100_500', '--dataset', 'cifar10', '--model_type', 'ViT-B_16', '--pretrained_dir', 'checkpoint/ViT-B_16.npz']' returned non-zero exit status 1.

Get Attention weights

Hi, is there an easy way to get/extract the attention weights in order to visualise the attention map ?

Thanks !
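
A hedged pointer, based only on the snippets quoted elsewhere in these issues: constructing the model with vis=True makes the forward pass return the attention weights alongside the logits, e.g.

# Sketch; x is assumed to be a preprocessed batch of shape [1, 3, 224, 224].
from models.modeling import VisionTransformer, CONFIGS

config = CONFIGS["ViT-B_16"]
model = VisionTransformer(config, img_size=224, num_classes=1000, zero_head=False, vis=True)
logits, attn_weights = model(x)   # attn_weights: one tensor per encoder layer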

Loss doesn't drop in the example

Hi, thanks for releasing this code.

I have tried to run the CIFAR-10 (as well as CIFAR-100) example, but in both cases the validation (and training) loss does not decrease, and the validation accuracy gets stuck at 0.01. Is there any hyper-parameter that I need to change from the example code?

Thanks!


Why is the addition of convolution useless?

I added data augmentation such as translation, rotation, and scaling to the test data samples, hoping to exploit the inductive bias of the CNN, but R50+ViT did not achieve the expected effect. Under what circumstances will R50+ViT be better than plain ViT?

Why the model gives the same logits for both the classes?

Hi,
I am using ViT-H_14 pre-trained to perform binary classification of biomedical images. The dataset I have available is very small: I use about 300 images to perform fine tuning and about 30 images for validation. The goal is to classify the images based on the aggressiveness of the tumor represented (Low grade (0) - High grade(1)).
However, I noticed that during prediction every image is always assigned the label 0, and looking at the logits I found that identical-magnitude pairs are always produced (e.g. [[ 6.877057e-10 -6.877057e-10]]), which translate into probability pairs of about (0.49, 0.51).

Searching the various forums I found many different tips: vary the learning rate (which I decreased to 1e-8), decrease the batch size (from 8 to 2), etc.. Unfortunately none of this works. The last thing I want to try is to increase considerably the number of epochs (at the moment I have trained for only 100 epochs), but before doing so I wanted to see if someone had a more specific suggestion, or even if someone can tell me if this architecture is too much for a dataset so small.

Thanks a lot in advance

Low training speed on RTX 3090

Training on the 3090 gets slower and slower as time goes on but the 2080ti doesn't have this problem

torch 1.8.0.dev20201130+cu110
torchvision 0.9.0.dev20201130+cu110
NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1

How to convert Pytorch model checkpoint in .bin -> .npz ?

Hello,

In "visualize_attention_map.ipynb", the trained model is loaded in the following line:
model.load_from(np.load("attention_data/Vit-B_16-224.npz"))

I used your train.py to fine-tune ViT-B_16-224.npz on my custom data, which produced a PyTorch model checkpoint, my-model.bin.
How do I perform model.load_from with the checkpoint my-model.bin? Do I need to convert my-model.bin to a .npz-format model? How can I convert it?

Thanks in advance!
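
For reference, a hedged sketch: assuming train.py saved the checkpoint with torch.save(model.state_dict(), ...) (the usual PyTorch convention), the .bin file can be loaded directly with load_state_dict instead of load_from, so no conversion to .npz is required:

import torch
from models.modeling import VisionTransformer, CONFIGS

# Hypothetical values: set num_classes and img_size to match your fine-tuning run.
config = CONFIGS["ViT-B_16"]
model = VisionTransformer(config, img_size=224, num_classes=6, zero_head=False)
state_dict = torch.load("my-model.bin", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()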

KeyError: 'Transformer/encoderblock_0\\MultiHeadDotProductAttention_1/query\\kernel is not a file in the archive'

When I use the code, the error occurs.
Error location:

 models\modeling.py", line 195, in load_from
query_weight = np2th(weights[pjoin(ROOT, ATTENTION_Q, "kernel")]).view(self.hidden_size, self.hidden_size).t()
File "d:\Anaconda3\lib\site-packages\numpy\lib\npyio.py", line 259, in __getitem__
raise KeyError("%s is not a file in the archive" % key)
KeyError: 'Transformer/encoderblock_0\\MultiHeadDotProductAttention_1/query\\kernel is not a file in the archive'

I would like to ask where I should put this ViT-H_14.npz.
I created a checkpoint folder and just put ViT-H_14.npz in there, but I got this error.
The INFO log: 01/12/2021 19:51:55 - INFO - models.modeling - load_pretrained: resized variant: torch.Size([1, 257, 1280]) to torch.Size([1, 730, 1280])
My input: img_size (384x384), batch size 64 (train batch = eval batch).
Is there anything I haven't modified?
