br-idl / paddlevit

:robot: PaddleViT: State-of-the-art Visual Transformer and MLP Models for PaddlePaddle 2.0+

Home Page: https://github.com/BR-IDL/PaddleViT

License: Apache License 2.0

Python 99.54% Shell 0.46%
cv computer-vision paddlepaddle vit mlp transformer encoder-decoder classification detection segmentation gan deep-learning semantic-segmentation object-detection

paddlevit's Introduction

English | 简体中文

PaddlePaddle Vision Transformers


State-of-the-art Visual Transformer and MLP Models for PaddlePaddle

🤖 PaddlePaddle Visual Transformers (PaddleViT or PPViT) is a collection of vision models beyond convolution. Most of the models are based on Visual Transformers, Visual Attention, and MLPs. PaddleViT also integrates popular layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.1+. The aim is to reproduce a wide variety of state-of-the-art ViT and MLP models with full training/validation procedures. We are passionate about making cutting-edge CV techniques easier to use for everyone.

🤖 PaddleViT provides models and tools for multiple vision tasks, such as classification, object detection, semantic segmentation, GAN, and more. Each model architecture is defined in a standalone Python module and can be modified to enable quick research experiments. At the same time, pretrained weights can be downloaded and used to finetune on your own datasets. PaddleViT also integrates popular tools and modules for customized datasets, data preprocessing, performance metrics, DDP, and more.

🤖 PaddleViT is backed by the popular deep learning framework PaddlePaddle; we also provide tutorials and projects on Paddle AI Studio. It is intuitive and straightforward for new users to get started.

Quick Links

PaddleViT implements model architectures and tools for multiple vision tasks; see the following links for detailed information.

We also provide tutorials:

Features

  1. State-of-the-art

    • State-of-the-art transformer models for multiple CV tasks
    • State-of-the-art data processing and training methods
    • We keep pushing it forward.
  2. Easy-to-use tools

    • Easy configs for model variants
    • Modular design for utility functions and tools
    • Low barrier for educators and practitioners
    • Unified framework for all the models
  3. Easily customizable to your needs

    • Examples for each model to reproduce the results
    • Model implementations are exposed for you to customize
    • Model files can be used independently for quick experiments
  4. High Performance

    • DDP (multiprocess training/validation where each process runs on a single GPU).
    • Mixed-precision support (AMP); see the sketch below
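
As a rough illustration of the AMP support mentioned above, here is a minimal training-step sketch using PaddlePaddle's mixed-precision utilities (paddle.amp.auto_cast and paddle.amp.GradScaler); the model, optimizer, and data below are placeholders for illustration, not PaddleViT code:

import paddle

# Placeholder model/optimizer; in PaddleViT these come from the model zoo and config.
model = paddle.nn.Linear(10, 10)
optimizer = paddle.optimizer.AdamW(parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

x = paddle.rand([4, 10])
label = paddle.randint(0, 10, [4])

with paddle.amp.auto_cast():           # run the forward pass in mixed precision
    logits = model(x)
    loss = paddle.nn.functional.cross_entropy(logits, label)

scaled = scaler.scale(loss)            # scale the loss to avoid fp16 underflow
scaled.backward()
scaler.minimize(optimizer, scaled)     # unscale gradients and update parameters
optimizer.clear_grad()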

Model architectures

Image Classification (Transformers)

  1. ViT (from Google), released with paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.

  2. DeiT (from Facebook and Sorbonne), released with paper Training data-efficient image transformers & distillation through attention, by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.

  3. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

  4. VOLO (from Sea AI Lab and NUS), released with paper VOLO: Vision Outlooker for Visual Recognition, by Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan.

  5. CSwin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows , by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo.

  6. CaiT (from Facebook and Sorbonne), released with paper Going deeper with Image Transformers, by Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou.

  7. PVTv2 (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper PVTv2: Improved Baselines with Pyramid Vision Transformer, by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.

  8. Shuffle Transformer (from Tencent), released with paper Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu.

  9. T2T-ViT (from NUS and YITU), released with paper Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet , by Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan.

  10. CrossViT (from IBM), released with paper CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, by Chun-Fu Chen, Quanfu Fan, Rameswar Panda.

  11. BEiT (from Microsoft Research), released with paper BEiT: BERT Pre-Training of Image Transformers, by Hangbo Bao, Li Dong, Furu Wei.

  12. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.

  13. Mobile-ViT (from Apple), released with paper MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer, by Sachin Mehta, Mohammad Rastegari.

  14. ViP (from National University of Singapore), released with paper Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition, by Qibin Hou and Zihang Jiang and Li Yuan and Ming-Ming Cheng and Shuicheng Yan and Jiashi Feng.

  15. XCiT (from Facebook/Inria/Sorbonne), released with paper XCiT: Cross-Covariance Image Transformers, by Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou.

  16. PiT (from NAVER/Sogang University), released with paper Rethinking Spatial Dimensions of Vision Transformers, by Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh.

  17. HaloNet, (from Google), released with paper Scaling Local Self-Attention for Parameter Efficient Visual Backbones, by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens.

  18. PoolFormer, (from Sea AI Lab/NUS), released with paper MetaFormer is Actually What You Need for Vision, by Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan.

  19. BoTNet, (from UC Berkeley/Google), released with paper Bottleneck Transformers for Visual Recognition, by Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani.

  20. CvT (from McGill/Microsoft), released with paper CvT: Introducing Convolutions to Vision Transformers, by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang

  21. HvT (from Monash University), released with paper Scalable Vision Transformers with Hierarchical Pooling, by Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai.

  22. TopFormer (from HUST/Tencent/Fudan/ZJU), released with paper TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation, by Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen, Xinggang Wang, Wenyu Liu, Gang Yu, Chunhua Shen.

  23. ConvNeXt (from FAIR/UCBerkeley), released with paper A ConvNet for the 2020s, by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.

  24. CoaT (from UCSD), released with paper Co-Scale Conv-Attentional Image Transformers, by Weijian Xu, Yifan Xu, Tyler Chang, Zhuowen Tu.

  25. ResT (from NJU), released with paper ResT: An Efficient Transformer for Visual Recognition, by Qinglong Zhang, Yubin Yang.

  26. ResTV2 (from NJU), released with paper ResT V2: Simpler, Faster and Stronger, by Qinglong Zhang, Yubin Yang.

Image Classification (MLP & others)

  1. MLP-Mixer (from Google), released with paper MLP-Mixer: An all-MLP Architecture for Vision, by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
  2. ResMLP (from Facebook/Sorbonne/Inria/Valeo), released with paper ResMLP: Feedforward networks for image classification with data-efficient training, by Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou.
  3. gMLP (from Google), released with paper Pay Attention to MLPs, by Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le.
  4. FF Only (from Oxford), released with paper Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet, by Luke Melas-Kyriazi.
  5. RepMLP (from BNRist/Tsinghua/MEGVII/Aberystwyth), released with paper RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition, by Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding.
  6. CycleMLP (from HKU/SenseTime), released with paper CycleMLP: A MLP-like Architecture for Dense Prediction, by Shoufa Chen, Enze Xie, Chongjian Ge, Ding Liang, Ping Luo.
  7. ConvMixer (from Anonymous), released with Patches Are All You Need?, by Anonymous.
  8. ConvMLP (from UO/UIUC/PAIR), released with ConvMLP: Hierarchical Convolutional MLPs for Vision, by Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi.
  9. RepLKNet (from Tsinghua/MEGVII/Aberystwyth), released with Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs , by Xiaohan Ding, Xiangyu Zhang, Yizhuang Zhou, Jungong Han, Guiguang Ding, Jian Sun.
  10. MobileOne (from Apple), released with An Improved One millisecond Mobile Backbone, by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan.

Detection

  1. DETR (from Facebook), released with paper End-to-End Object Detection with Transformers, by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
  2. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  3. PVTv2 (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper PVTv2: Improved Baselines with Pyramid Vision Transformer, by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.

Coming Soon:

  1. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  2. UP-DETR (from Tencent), released with paper UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, by Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen.

Semantic Segmentation

Now:

  1. SETR (from Fudan/Oxford/Surrey/Tencent/Facebook), released with paper Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, by Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang.
  2. DPT (from Intel), released with paper Vision Transformers for Dense Prediction, by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
  3. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  4. Segmenter (from Inria), released with paper Segmenter: Transformer for Semantic Segmentation, by Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid.
  5. Trans2seg (from HKU/Sensetime/NJU), released with paper Segmenting Transparent Object in the Wild with Transformer, by Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, Ping Luo.
  6. SegFormer (from HKU/NJU/NVIDIA/Caltech), released with paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
  7. CSwin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
  8. TopFormer (from HUST/Tencent/Fudan/ZJU), released with paper TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation

Coming Soon:

  1. FTN (from Baidu), released with paper Fully Transformer Networks for Semantic Image Segmentation, by Sitong Wu, Tianyi Wu, Fangjian Lin, Shengwei Tian, Guodong Guo.
  2. Shuffle Transformer (from Tencent), released with paper Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu
  3. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  4. CSwin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo.

GAN

  1. TransGAN (from Seoul National University and NUUA), released with paper TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up, by Yifan Jiang, Shiyu Chang, Zhangyang Wang.
  2. Styleformer (from Facebook and Sorbonne), released with paper Styleformer: Transformer based Generative Adversarial Networks with Style Vector, by Jeeseung Park, Younggeun Kim.

Coming Soon:

  1. ViTGAN (from UCSD/Google), released with paper ViTGAN: Training GANs with Vision Transformers, by Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu.

Installation

Prerequisites

  • Linux/MacOS/Windows
  • Python 3.6/3.7
  • PaddlePaddle 2.1.0+
  • CUDA10.2+

Note: It is recommended to install the latest version of PaddlePaddle to avoid some CUDA errors for PaddleViT training. For PaddlePaddle, please refer to this link for stable version installation and this link for develop version installation.

Installation

  1. Create a conda virtual environment and activate it.

    conda create -n paddlevit python=3.7 -y
    conda activate paddlevit
  2. Install PaddlePaddle following the official instructions, e.g.,

    conda install paddlepaddle-gpu==2.1.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/

    Note: please change the PaddlePaddle and CUDA versions according to your environment.

  3. Install dependency packages

    • General dependencies:
      pip install yacs pyyaml
      
    • Packages for Segmentation:
      pip install cityscapesScripts
      
      Install detail package:
      git clone https://github.com/ccvl/detail-api
      cd detail-api/PythonAPI
      make
      make install
    • Packages for GAN:
      pip install lmdb
      
  4. Clone project from GitHub

    git clone https://github.com/BR-IDL/PaddleViT.git 
    

Results (Model Zoo)

Image Classification

Model Acc@1 Acc@5 #Params FLOPs Image Size Crop pct Interp Link
vit_base_patch32_224 80.68 95.61 88.2M 4.4G 224 0.875 bicubic google/baidu(ubyr)
vit_base_patch32_384 83.35 96.84 88.2M 12.7G 384 1.0 bicubic google/baidu(3c2f)
vit_base_patch16_224 84.58 97.30 86.4M 17.0G 224 0.875 bicubic google/baidu(qv4n)
vit_base_patch16_384 85.99 98.00 86.4M 49.8G 384 1.0 bicubic google/baidu(wsum)
vit_large_patch16_224 85.81 97.82 304.1M 59.9G 224 0.875 bicubic google/baidu(1bgk)
vit_large_patch16_384 87.08 98.30 304.1M 175.9G 384 1.0 bicubic google/baidu(5t91)
vit_large_patch32_384 81.51 96.09 306.5M 44.4G 384 1.0 bicubic google/baidu(ieg3)
swin_t_224 81.37 95.54 28.3M 4.4G 224 0.9 bicubic google/baidu(h2ac)
swin_s_224 83.21 96.32 49.6M 8.6G 224 0.9 bicubic google/baidu(ydyx)
swin_b_224 83.60 96.46 87.7M 15.3G 224 0.9 bicubic google/baidu(h4y6)
swin_b_384 84.48 96.89 87.7M 45.5G 384 1.0 bicubic google/baidu(7nym)
swin_b_224_22kto1k 85.27 97.56 87.7M 15.3G 224 0.9 bicubic google/baidu(6ur8)
swin_b_384_22kto1k 86.43 98.07 87.7M 45.5G 384 1.0 bicubic google/baidu(9squ)
swin_l_224_22kto1k 86.32 97.90 196.4M 34.3G 224 0.9 bicubic google/baidu(nd2f)
swin_l_384_22kto1k 87.14 98.23 196.4M 100.9G 384 1.0 bicubic google/baidu(5g5e)
deit_tiny_distilled_224 74.52 91.90 5.9M 1.1G 224 0.875 bicubic google/baidu(rhda)
deit_small_distilled_224 81.17 95.41 22.4M 4.3G 224 0.875 bicubic google/baidu(pv28)
deit_base_distilled_224 83.32 96.49 87.2M 17.0G 224 0.875 bicubic google/baidu(5f2g)
deit_base_distilled_384 85.43 97.33 87.2M 49.9G 384 1.0 bicubic google/baidu(qgj2)
volo_d1_224 84.12 96.78 26.6M 6.6G 224 1.0 bicubic google/baidu(xaim)
volo_d1_384 85.24 97.21 26.6M 19.5G 384 1.0 bicubic google/baidu(rr7p)
volo_d2_224 85.11 97.19 58.6M 13.7G 224 1.0 bicubic google/baidu(d82f)
volo_d2_384 86.04 97.57 58.6M 40.7G 384 1.0 bicubic google/baidu(9cf3)
volo_d3_224 85.41 97.26 86.2M 19.8G 224 1.0 bicubic google/baidu(a5a4)
volo_d3_448 86.50 97.71 86.2M 80.3G 448 1.0 bicubic google/baidu(uudu)
volo_d4_224 85.89 97.54 192.8M 42.9G 224 1.0 bicubic google/baidu(vcf2)
volo_d4_448 86.70 97.85 192.8M 172.5G 448 1.0 bicubic google/baidu(nd4n)
volo_d5_224 86.08 97.58 295.3M 70.6G 224 1.0 bicubic google/baidu(ymdg)
volo_d5_448 86.92 97.88 295.3M 283.8G 448 1.0 bicubic google/baidu(qfcc)
volo_d5_512 87.05 97.97 295.3M 371.3G 512 1.15 bicubic google/baidu(353h)
cswin_tiny_224 82.81 96.30 22.3M 4.2G 224 0.9 bicubic google/baidu(4q3h)
cswin_small_224 83.60 96.58 34.6M 6.5G 224 0.9 bicubic google/baidu(gt1a)
cswin_base_224 84.23 96.91 77.4M 14.6G 224 0.9 bicubic google/baidu(wj8p)
cswin_base_384 85.51 97.48 77.4M 43.1G 384 1.0 bicubic google/baidu(rkf5)
cswin_large_224 86.52 97.99 173.3M 32.5G 224 0.9 bicubic google/baidu(b5fs)
cswin_large_384 87.49 98.35 173.3M 96.1G 384 1.0 bicubic google/baidu(6235)
cait_xxs24_224 78.38 94.32 11.9M 2.2G 224 1.0 bicubic google/baidu(j9m8)
cait_xxs36_224 79.75 94.88 17.2M 33.1G 224 1.0 bicubic google/baidu(nebg)
cait_xxs24_384 80.97 95.64 11.9M 6.8G 384 1.0 bicubic google/baidu(2j95)
cait_xxs36_384 82.20 96.15 17.2M 10.1G 384 1.0 bicubic google/baidu(wx5d)
cait_s24_224 83.45 96.57 46.8M 8.7G 224 1.0 bicubic google/baidu(m4pn)
cait_xs24_384 84.06 96.89 26.5M 15.1G 384 1.0 bicubic google/baidu(scsv)
cait_s24_384 85.05 97.34 46.8M 26.5G 384 1.0 bicubic google/baidu(dnp7)
cait_s36_384 85.45 97.48 68.1M 39.5G 384 1.0 bicubic google/baidu(e3ui)
cait_m36_384 86.06 97.73 270.7M 156.2G 384 1.0 bicubic google/baidu(r4hu)
cait_m48_448 86.49 97.75 355.8M 287.3G 448 1.0 bicubic google/baidu(imk5)
pvtv2_b0 70.47 90.16 3.7M 0.6G 224 0.875 bicubic google/baidu(dxgb)
pvtv2_b1 78.70 94.49 14.0M 2.1G 224 0.875 bicubic google/baidu(2e5m)
pvtv2_b2 82.02 95.99 25.4M 4.0G 224 0.875 bicubic google/baidu(are2)
pvtv2_b2_linear 82.06 96.04 22.6M 3.9G 224 0.875 bicubic google/baidu(a4c8)
pvtv2_b3 83.14 96.47 45.2M 6.8G 224 0.875 bicubic google/baidu(nc21)
pvtv2_b4 83.61 96.69 62.6M 10.0G 224 0.875 bicubic google/baidu(tthf)
pvtv2_b5 83.77 96.61 82.0M 11.5G 224 0.875 bicubic google/baidu(9v6n)
shuffle_vit_tiny 82.39 96.05 28.5M 4.6G 224 0.875 bicubic google/baidu(8a1i)
shuffle_vit_small 83.53 96.57 50.1M 8.8G 224 0.875 bicubic google/baidu(xwh3)
shuffle_vit_base 83.95 96.91 88.4M 15.5G 224 0.875 bicubic google/baidu(1gsr)
t2t_vit_7 71.68 90.89 4.3M 1.0G 224 0.9 bicubic google/baidu(1hpa)
t2t_vit_10 75.15 92.80 5.8M 1.3G 224 0.9 bicubic google/baidu(ixug)
t2t_vit_12 76.48 93.49 6.9M 1.5G 224 0.9 bicubic google/baidu(qpbb)
t2t_vit_14 81.50 95.67 21.5M 4.4G 224 0.9 bicubic google/baidu(c2u8)
t2t_vit_19 81.93 95.74 39.1M 7.8G 224 0.9 bicubic google/baidu(4in3)
t2t_vit_24 82.28 95.89 64.0M 12.8G 224 0.9 bicubic google/baidu(4in3)
t2t_vit_t_14 81.69 95.85 21.5M 4.4G 224 0.9 bicubic google/baidu(4in3)
t2t_vit_t_19 82.44 96.08 39.1M 7.9G 224 0.9 bicubic google/baidu(mier)
t2t_vit_t_24 82.55 96.07 64.0M 12.9G 224 0.9 bicubic google/baidu(6vxc)
t2t_vit_14_384 83.34 96.50 21.5M 13.0G 384 1.0 bicubic google/baidu(r685)
cross_vit_tiny_224 73.20 91.90 6.9M 1.3G 224 0.875 bicubic google/baidu(scvb)
cross_vit_small_224 81.01 95.33 26.7M 5.2G 224 0.875 bicubic google/baidu(32us)
cross_vit_base_224 82.12 95.87 104.7M 20.2G 224 0.875 bicubic google/baidu(jj2q)
cross_vit_9_224 73.78 91.93 8.5M 1.6G 224 0.875 bicubic google/baidu(mjcb)
cross_vit_15_224 81.51 95.72 27.4M 5.2G 224 0.875 bicubic google/baidu(n55b)
cross_vit_18_224 82.29 96.00 43.1M 8.3G 224 0.875 bicubic google/baidu(xese)
cross_vit_9_dagger_224 76.92 93.61 8.7M 1.7G 224 0.875 bicubic google/baidu(58ah)
cross_vit_15_dagger_224 82.23 95.93 28.1M 5.6G 224 0.875 bicubic google/baidu(qwup)
cross_vit_18_dagger_224 82.51 96.03 44.1M 8.7G 224 0.875 bicubic google/baidu(qtw4)
cross_vit_15_dagger_384 83.75 96.75 28.1M 16.4G 384 1.0 bicubic google/baidu(w71e)
cross_vit_18_dagger_384 84.17 96.82 44.1M 25.8G 384 1.0 bicubic google/baidu(99b6)
beit_base_patch16_224_pt22k 85.21 97.66 87M 12.7G 224 0.9 bicubic google/baidu(fshn)
beit_base_patch16_384_pt22k 86.81 98.14 87M 37.3G 384 1.0 bicubic google/baidu(arvc)
beit_large_patch16_224_pt22k 87.48 98.30 304M 45.0G 224 0.9 bicubic google/baidu(2ya2)
beit_large_patch16_384_pt22k 88.40 98.60 304M 131.7G 384 1.0 bicubic google/baidu(qtrn)
beit_large_patch16_512_pt22k 88.60 98.66 304M 234.0G 512 1.0 bicubic google/baidu(567v)
Focal-T 82.03 95.86 28.9M 4.9G 224 0.875 bicubic google/baidu(i8c2)
Focal-T (use conv) 82.70 96.14 30.8M 4.9G 224 0.875 bicubic google/baidu(smrk)
Focal-S 83.55 96.29 51.1M 9.4G 224 0.875 bicubic google/baidu(dwd8)
Focal-S (use conv) 83.85 96.47 53.1M 9.4G 224 0.875 bicubic google/baidu(nr7n)
Focal-B 83.98 96.48 89.8M 16.4G 224 0.875 bicubic google/baidu(8akn)
Focal-B (use conv) 84.18 96.61 93.3M 16.4G 224 0.875 bicubic google/baidu(5nfi)
mobilevit_xxs 70.31 89.68 1.32M 0.44G 256 1.0 bicubic google/baidu(axpc)
mobilevit_xs 74.47 92.02 2.33M 0.95G 256 1.0 bicubic google/baidu(hfhm)
mobilevit_s 76.74 93.08 5.59M 1.88G 256 1.0 bicubic google/baidu(34bg)
mobilevit_s $\dag$ 77.83 93.83 5.59M 1.88G 256 1.0 bicubic google/baidu(92ic)
vip_s7 81.50 95.76 25.1M 7.0G 224 0.875 bicubic google/baidu(mh9b)
vip_m7 82.75 96.05 55.3M 16.4G 224 0.875 bicubic google/baidu(hvm8)
vip_l7 83.18 96.37 87.8M 24.5G 224 0.875 bicubic google/baidu(tjvh)
xcit_nano_12_p16_224_dist 72.32 90.86 3.1M 0.6G 224 1.0 bicubic google/baidu(7qvz)
xcit_nano_12_p16_384_dist 75.46 92.70 3.1M 1.6G 384 1.0 bicubic google/baidu(1y2j)
xcit_large_24_p16_224_dist 84.92 97.13 189.1M 35.9G 224 1.0 bicubic google/baidu(kfv8)
xcit_large_24_p16_384_dist 85.76 97.54 189.1M 105.5G 384 1.0 bicubic google/baidu(ffq3)
xcit_nano_12_p8_224_dist 76.33 93.10 3.0M 2.2G 224 1.0 bicubic google/baidu(jjs7)
xcit_nano_12_p8_384_dist 77.82 94.04 3.0M 6.3G 384 1.0 bicubic google/baidu(dmc1)
xcit_large_24_p8_224_dist 85.40 97.40 188.9M 141.4G 224 1.0 bicubic google/baidu(y7gw)
xcit_large_24_p8_384_dist 85.99 97.69 188.9M 415.5G 384 1.0 bicubic google/baidu(9xww)
pit_ti 72.91 91.40 4.8M 0.5G 224 0.9 bicubic google/baidu(ydmi)
pit_ti_distill 74.54 92.10 5.1M 0.5G 224 0.9 bicubic google/baidu(7k4s)
pit_xs 78.18 94.16 10.5M 1.1G 224 0.9 bicubic google/baidu(gytu)
pit_xs_distill 79.31 94.36 10.9M 1.1G 224 0.9 bicubic google/baidu(ie7s)
pit_s 81.08 95.33 23.4M 2.4G 224 0.9 bicubic google/baidu(kt1n)
pit_s_distill 81.99 95.79 24.0M 2.5G 224 0.9 bicubic google/baidu(hhyc)
pit_b 82.44 95.71 73.5M 10.6G 224 0.9 bicubic google/baidu(uh2v)
pit_b_distill 84.14 96.86 74.5M 10.7G 224 0.9 bicubic google/baidu(3e6g)
halonet26t 79.10 94.31 12.5M 3.2G 256 0.95 bicubic google/baidu(ednv)
halonet50ts 81.65 95.61 22.8M 5.1G 256 0.94 bicubic google/baidu(3j9e)
poolformer_s12 77.24 93.51 11.9M 1.8G 224 0.9 bicubic google/baidu(zcv4)
poolformer_s24 80.33 95.05 21.3M 3.4G 224 0.9 bicubic google/baidu(nedr)
poolformer_s36 81.43 95.45 30.8M 5.0G 224 0.9 bicubic google/baidu(fvpm)
poolformer_m36 82.11 95.69 56.1M 8.9G 224 0.95 bicubic google/baidu(whfp)
poolformer_m48 82.46 95.96 73.4M 11.8G 224 0.95 bicubic google/baidu(374f)
botnet50 77.38 93.56 20.9M 5.3G 224 0.875 bicubic google/baidu(wh13)
CvT-13-224 81.59 95.67 20M 4.5G 224 0.875 bicubic google/baidu(vev9)
CvT-21-224 82.46 96.00 32M 7.1G 224 0.875 bicubic google/baidu(t2rv)
CvT-13-384 83.00 96.36 20M 16.3G 384 1.0 bicubic google/baidu(wswt)
CvT-21-384 83.27 96.16 32M 24.9G 384 1.0 bicubic google/baidu(hcem)
CvT-13-384-22k 83.26 97.09 20M 16.3G 384 1.0 bicubic google/baidu(c7m9)
CvT-21-384-22k 84.91 97.62 32M 24.9G 384 1.0 bicubic google/baidu(9jxe)
CvT-w24-384-22k 87.58 98.47 277M 193.2G 384 1.0 bicubic google/baidu(bbj2)
HVT-Ti-1 69.45 89.28 5.7M 0.6G 224 0.875 bicubic google/baidu(egds)
HVT-S-0 80.30 95.15 22.0M 4.6G 224 0.875 bicubic google/baidu(hj7a)
HVT-S-1 78.06 93.84 22.1M 2.4G 224 0.875 bicubic google/baidu(tva8)
HVT-S-2 77.41 93.48 22.1M 1.9G 224 0.875 bicubic google/baidu(bajp)
HVT-S-3 76.30 92.88 22.1M 1.6G 224 0.875 bicubic google/baidu(rjch)
HVT-S-4 75.21 92.34 22.1M 1.6G 224 0.875 bicubic google/baidu(ki4j)
mlp_mixer_b16_224 76.60 92.23 60.0M 12.7G 224 0.875 bicubic google/baidu(xh8x)
mlp_mixer_l16_224 72.06 87.67 208.2M 44.9G 224 0.875 bicubic google/baidu(8q7r)
resmlp_24_224 79.38 94.55 30.0M 6.0G 224 0.875 bicubic google/baidu(jdcx)
resmlp_36_224 79.77 94.89 44.7M 9.0G 224 0.875 bicubic google/baidu(33w3)
resmlp_big_24_224 81.04 95.02 129.1M 100.7G 224 0.875 bicubic google/baidu(r9kb)
resmlp_12_distilled_224 77.95 93.56 15.3M 3.0G 224 0.875 bicubic google/baidu(ghyp)
resmlp_24_distilled_224 80.76 95.22 30.0M 6.0G 224 0.875 bicubic google/baidu(sxnx)
resmlp_36_distilled_224 81.15 95.48 44.7M 9.0G 224 0.875 bicubic google/baidu(vt85)
resmlp_big_24_distilled_224 83.59 96.65 129.1M 100.7G 224 0.875 bicubic google/baidu(4jk5)
resmlp_big_24_22k_224 84.40 97.11 129.1M 100.7G 224 0.875 bicubic google/baidu(ve7i)
gmlp_s16_224 79.64 94.63 19.4M 4.5G 224 0.875 bicubic google/baidu(bcth)
ff_only_tiny (linear_tiny) 61.28 84.06 224 0.875 bicubic google/baidu(mjgd)
ff_only_base (linear_base) 74.82 91.71 224 0.875 bicubic google/baidu(m1jc)
repmlp_res50_light_224 77.01 93.46 87.1M 3.3G 224 0.875 bicubic google/baidu(b4fg)
cyclemlp_b1 78.85 94.60 15.1M 224 0.9 bicubic google/baidu(mnbr)
cyclemlp_b2 81.58 95.81 26.8M 224 0.9 bicubic google/baidu(jwj9)
cyclemlp_b3 82.42 96.07 38.3M 224 0.9 bicubic google/baidu(v2fy)
cyclemlp_b4 82.96 96.33 51.8M 224 0.875 bicubic google/baidu(fnqd)
cyclemlp_b5 83.25 96.44 75.7M 224 0.875 bicubic google/baidu(s55c)
convmixer_1024_20 76.94 93.35 24.5M 9.5G 224 0.96 bicubic google/baidu(qpn9)
convmixer_768_32 80.16 95.08 21.2M 20.8G 224 0.96 bicubic google/baidu(m5s5)
convmixer_1536_20 81.37 95.62 51.8M 72.4G 224 0.96 bicubic google/baidu(xqty)
convmlp_s 76.76 93.40 9.0M 2.4G 224 0.875 bicubic google/baidu(3jz3)
convmlp_m 79.03 94.53 17.4M 4.0G 224 0.875 bicubic google/baidu(vyp1)
convmlp_l 80.15 95.00 42.7M 10.0G 224 0.875 bicubic google/baidu(ne5x)
topformer_tiny 65.98 87.32 1.5M 0.13G 224 0.875 bicubic google/baidu
topformer_small 72.44 91.17 3.1M 0.24G 224 0.875 bicubic google/baidu
topformer_base 75.25 92.67 5.1M 0.37G 224 0.875 bicubic google/baidu

Object Detection

Model backbone box_mAP Model
DETR ResNet50 42.0 google/baidu(n5gk)
DETR ResNet101 43.5 google/baidu(bxz2)
Mask R-CNN Swin-T 1x 43.7 google/baidu(qev7)
Mask R-CNN Swin-T 3x 46.0 google/baidu(m8fg)
Mask R-CNN Swin-S 3x 48.4 google/baidu(hdw5)
Mask R-CNN pvtv2_b0 38.3 google/baidu(3kqb)
Mask R-CNN pvtv2_b1 41.8 google/baidu(k5aq)
Mask R-CNN pvtv2_b2 45.2 google/baidu(jh8b)
Mask R-CNN pvtv2_b2_linear 44.1 google/baidu(8ipt)
Mask R-CNN pvtv2_b3 46.9 google/baidu(je4y)
Mask R-CNN pvtv2_b4 47.5 google/baidu(n3ay)
Mask R-CNN pvtv2_b5 47.4 google/baidu(jzq1)

Semantic Segmentation

Pascal Context

Model Backbone Batch_size mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_large 16 52.06 52.57 google/baidu(owoj) google/baidu(xdb8) config
SETR_PUP ViT_large 16 53.90 54.53 google/baidu(owoj) google/baidu(6sji) config
SETR_MLA ViT_Large 8 54.39 55.16 google/baidu(owoj) google/baidu(wora) config
SETR_MLA ViT_large 16 55.01 55.87 google/baidu(owoj) google/baidu(76h2) config

Cityscapes

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 8 40k 76.71 79.03 google/baidu(owoj) google/baidu(g7ro) config
SETR_Naive ViT_Large 8 80k 77.31 79.43 google/baidu(owoj) google/baidu(wn6q) config
SETR_PUP ViT_Large 8 40k 77.92 79.63 google/baidu(owoj) google/baidu(zmoi) config
SETR_PUP ViT_Large 8 80k 78.81 80.43 google/baidu(owoj) baidu(f793) config
SETR_MLA ViT_Large 8 40k 76.70 78.96 google/baidu(owoj) baidu(qaiw) config
SETR_MLA ViT_Large 8 80k 77.26 79.27 google/baidu(owoj) baidu(6bgj) config

ADE20K

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 16 160k 47.57 48.12 google/baidu(owoj) baidu(lugq) config
SETR_PUP ViT_Large 16 160k 49.12 49.51 google/baidu(owoj) baidu(udgs) config
SETR_MLA ViT_Large 8 160k 47.80 49.34 google/baidu(owoj) baidu(mrrv) config
DPT ViT_Large 16 160k 47.21 - google/baidu(owoj) baidu(ts7h) config
Segmenter ViT_Tiny 16 160k 38.45 - TODO baidu(1k97) config
Segmenter ViT_Small 16 160k 46.07 - TODO baidu(i8nv) config
Segmenter ViT_Base 16 160k 49.08 - TODO baidu(hxrl) config
Segmenter ViT_Large 16 160k 51.82 - TODO baidu(wdz6) config
Segmenter_Linear DeiT_Base 16 160k 47.34 - TODO baidu(5dpv) config
Segmenter DeiT_Base 16 160k 49.27 - TODO baidu(3kim) config
Segformer MIT-B0 16 160k 38.37 - TODO baidu(ges9) config
Segformer MIT-B1 16 160k 42.20 - TODO baidu(t4n4) config
Segformer MIT-B2 16 160k 46.38 - TODO baidu(h5ar) config
Segformer MIT-B3 16 160k 48.35 - TODO baidu(g9n4) config
Segformer MIT-B4 16 160k 49.01 - TODO baidu(e4xw) config
Segformer MIT-B5 16 160k 49.73 - TODO baidu(uczo) config
UperNet Swin_Tiny 16 160k 44.90 45.37 - baidu(lkhg) config
UperNet Swin_Small 16 160k 47.88 48.90 - baidu(vvy1) config
UperNet Swin_Base 16 160k 48.59 49.04 - baidu(y040) config
UperNet CSwin_Tiny 16 160k 49.46 baidu(l1cp) baidu(y1eq) config
UperNet CSwin_Small 16 160k 50.88 baidu(6vwk) baidu(fz2e) config
UperNet CSwin_Base 16 160k 50.64 baidu(0ys7) baidu(83w3) config
TopFormer TopFormer_Base 16 160k 38.3 - google/baidu google/baidu(ufxt) config
TopFormer TopFormer_Base 32 160k 39.2 - google/baidu google/baidu(ufxt) config
TopFormer TopFormer_Small 16 160k 36.5 - google/baidu google/baidu(ufxt) config
TopFormer TopFormer_Small 32 160k 37.0 - google/baidu google/baidu(ufxt) config
TopFormer TopFormer_Tiny 16 160k 33.6 - google/baidu google/baidu(ufxt) config
TopFormer TopFormer_Tiny 32 160k 34.6 - google/baidu google/baidu(ufxt) config
TopFormer TopFormer_Tiny 16 160k 32.5 - google/baidu google/baidu(ufxt) config
TopFormer TopFormer_Tiny 32 160k 33.4 - google/baidu google/baidu(ufxt) config
Trans2seg_Medium Resnet50c 32 160k 36.81 - google/baidu(4dd5) google/baidu(i2nt) config

Trans10kV2

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
Trans2seg_Medium Resnet50c 16 16k 75.97 - google/baidu(4dd5) google/baidu(w25r) config

GAN

Model FID Image Size Crop_pct Interpolation Model
styleformer_cifar10 2.73 32 1.0 lanczos google/baidu(ztky)
styleformer_stl10 15.65 48 1.0 lanczos google/baidu(i973)
styleformer_celeba 3.32 64 1.0 lanczos google/baidu(fh5s)
styleformer_lsun 9.68 128 1.0 lanczos google/baidu(158t)

*The results are evaluated on the CIFAR-10, STL-10, CelebA, and LSUN-church datasets, using the fid50k_full metric.

Quick Demo for Image Classification

To use a model with pretrained weights, go to the corresponding subfolder, e.g., /image_classification/ViT/, then download the .pdparams weight file and change the related file paths in the following Python script. The model config files are located in ./configs.

Assuming the downloaded weight file is stored as ./vit_base_patch16_224.pdparams, the vit_base_patch16_224 model can be used in Python as follows:

import paddle

from config import get_config
from visual_transformer import build_vit as build_model
# config files in ./configs/
config = get_config('./configs/vit_base_patch16_224.yaml')
# build model
model = build_model(config)
# load pretrained weights (the .pdparams extension is required by paddle.load)
model_state_dict = paddle.load('./vit_base_patch16_224.pdparams')
model.set_dict(model_state_dict)
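
Once the weights are loaded, inference is a matter of feeding a normalized image tensor through the model. The snippet below is a minimal sketch: the random input stands in for a real image resized to 224x224 and normalized with the ImageNet mean/std, and it is not part of the official demo.

model.eval()
# Placeholder input: one 224x224 RGB image tensor, already preprocessed.
image = paddle.rand([1, 3, 224, 224])
with paddle.no_grad():
    logits = model(image)  # [1, 1000] class scores for ImageNet
    probs = paddle.nn.functional.softmax(logits, axis=-1)
    top1 = paddle.argmax(probs, axis=-1)
print('predicted class id:', int(top1[0]))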

🤖 See the README file in each model folder for detailed usages.

Evaluation

To evaluate ViT model performance on ImageNet2012 with a single GPU, run the following script from the command line:

sh run_eval.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg=./configs/vit_base_patch16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=16 \
    -data_path=/path/to/dataset/imagenet/val \
    -eval \
    -pretrained=/path/to/pretrained/model/vit_base_patch16_224  # .pdparams is NOT needed
Run evaluation on multiple GPUs:
sh run_eval_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg=./configs/vit_base_patch16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=16 \
    -data_path=/path/to/dataset/imagenet/val \
    -eval \
    -pretrained=/path/to/pretrained/model/vit_base_patch16_224   # .pdparams is NOT needed

Training

To train the ViT model on ImageNet2012 with a single GPU, run the following script from the command line:

sh run_train.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
  -cfg=./configs/vit_base_patch16_224.yaml \
  -dataset=imagenet2012 \
  -batch_size=32 \
  -data_path=/path/to/dataset/imagenet/train
Run training on multiple GPUs:
sh run_train_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg=./configs/vit_base_patch16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=16 \
    -data_path=/path/to/dataset/imagenet/train

Contributing

  • We encourage and appreciate contributions to the PaddleViT project; please refer to our workflow and coding style in CONTRIBUTING.md.

Licenses

  • This repo is under the Apache-2.0 license.

Contact

  • Please raise an issue on GitHub.

paddlevit's People

Contributors

178vit, br-idl, chuliut, cjh3020889729, ddxdaniel, defensetongxue, emiyaning, fl77n, guoquanhao, h1063135843, jarygrace, jonny4929, libertatis, liuhui0401, lmk123568, rangeking, siguremo, skpig, toscanagogithub, wflrz123, wutianyirosun, xperzy


paddlevit's Issues

Environment installation fails when following the semantic_segmentation/readme instructions

At the last step,
cd PaddleViT/semantic_segmentation pip3 install -r requirements.txt
an error is reported:
(paddlevit) D:\PyWorkspace\PaddleViT\PaddleViT\semantic_segmentation>pip3 install -r requirements.txt Collecting cityscapesScripts==2.2.0 Using cached cityscapesScripts-2.2.0-py3-none-any.whl (472 kB) ERROR: Could not find a version that satisfies the requirement detail==4.0 (from versions: none) ERROR: No matching distribution found for detail==4.0
Trying to install the detail package directly:
(paddlevit) D:\PyWorkspace\PaddleViT\PaddleViT\semantic_segmentation>pip3 install detail==4.0 ERROR: Could not find a version that satisfies the requirement detail==4.0 (from versions: none) ERROR: No matching distribution found for detail==4.0 (paddlevit) D:\PyWorkspace\PaddleViT\PaddleViT\semantic_segmentation>pip3 install detail ERROR: Could not find a version that satisfies the requirement detail (from versions: none) ERROR: No matching distribution found for detail
After upgrading pip:
(paddlevit) D:\PyWorkspace\PaddleViT\PaddleViT\semantic_segmentation>pip install --user --upgrade pip Requirement already satisfied: pip in d:\anaconda\envs\paddlevit\lib\site-packages (21.3.1)
it still cannot be installed:
(paddlevit) D:\PyWorkspace\PaddleViT\PaddleViT\semantic_segmentation>pip3 install detail ERROR: Could not find a version that satisfies the requirement detail (from versions: none) ERROR: No matching distribution found for detail
Searching Baidu did not turn up the package.
In the end I installed the details-0.2.0 package, but I do not know how to use it.

Training loss does not decrease

I trained on the landcover dataset with the image segmentation model you provide, but the loss does not decrease.
image

Question about the deep copy of encoder_layer in the ViT Transformer Encoder

In PaddleViT/image_classification/ViT/transformer.py, when the Encoder creates encoder_layer during initialization, what is the purpose of the deep copy?

class Encoder(nn.Layer):

    def __init__(self,
                 embed_dim,
                 num_heads,
                 depth,
                 qkv_bias=True,
                 mlp_ratio=4.0,
                 dropout=0.,
                 attention_dropout=0.,
                 droppath=0.):
        super(Encoder, self).__init__()
        # stochatic depth decay
        depth_decay = [x.item() for x in paddle.linspace(0, droppath, depth)]
        layer_list = []
        for i in range(depth):
            encoder_layer = EncoderLayer(embed_dim,
                                         num_heads,
                                         qkv_bias=qkv_bias,
                                         mlp_ratio=mlp_ratio,
                                         dropout=dropout,
                                         attention_dropout=attention_dropout,
                                         droppath=depth_decay[i])
            layer_list.append(copy.deepcopy(encoder_layer))  # what is the purpose of deep-copying encoder_layer here?
        self.layers = nn.LayerList(layer_list)
……

The for loop creates a new encoder_layer in each iteration, so every encoder_layer is already a distinct object and there is no parameter sharing. What is the rationale for deep-copying encoder_layer here?
I personally think the deep copy of encoder_layer is unnecessary, and the line could simply be:

layer_list.append(encoder_layer)

Maybe I have not thought about it deeply enough; I look forward to an official clarification.
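
To illustrate the point above, here is a quick self-contained check (not PaddleViT code) showing that layers created in separate loop iterations are already distinct objects with distinct parameters, so the deepcopy adds no extra isolation:

import paddle
import paddle.nn as nn

# Build two layers in a loop, as the Encoder does, but without deepcopy.
layers = [nn.Linear(4, 4) for _ in range(2)]

print(layers[0] is layers[1])                # False: different objects
print(layers[0].weight is layers[1].weight)  # False: no shared parameters

# Modifying one layer's weight does not affect the other.
with paddle.no_grad():
    layers[0].weight.set_value(paddle.zeros([4, 4]))
print(float(layers[1].weight.abs().sum()))   # still non-zero: layer 1 untouched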

[fixed] Training Error when using larger batch size

Error:
Training will fail when using a larger batch size:
SystemError: (Fatal) Operator set_value raises an thrust::system::system_error exception. The exception content is :parallel_for failed: cudaErrorInvalidConfiguration: invalid configuration argument. (at /paddle/paddle/fluid/imperative/tracer.cc:192)

Reason:
The reason is explained by the following issues from PaddlePaddle:
PaddlePaddle/Paddle#33057 (comment)

In short, this error is raised because of a CUDA thrust bug, which is ignored in newer CUDA versions.

Solution:
Installing the Paddle develop version will fix the problem.
You can find instructions on how to install it here:
https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html

In detail, the problem is fixed by the following patch:
https://github.com/PaddlePaddle/Paddle/pull/33748/files/617e3eda9dfcd76cb6a7ebaa1535340f1023d3f1

Requiring both a training set and a validation set for ViT model evaluation is unreasonable

Describe the bug
When evaluating a ViT model on the ImageNet2012 dataset, I only downloaded the validation set because the full dataset is quite large. Running the evaluation command fails with an error saying the training data is missing.

After some investigation,
main_single_gpu.py loads both the training set and the validation set before running evaluation.

Suggestion: add a separate script dedicated to model evaluation.

Add LINEAR_SCALE_LR options

Describe your feature request
Add a linear_scale_lr option in the config and use the batch_size to linearly scale the learning rate

Describe the reference code or paper
N/A

Describe the possible solution
Add argument in config.py
Add if-condition in main_single_gpu.py and main_multi_gpu.py
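
For reference, a minimal sketch of such a scaling rule (the base batch size of 256 and the function name are assumptions for illustration, not the project's actual config keys):

# Hypothetical helper; the real option would be wired through config.py and the main scripts.
BASE_BATCH_SIZE = 256   # reference batch size that the configured base LR was tuned for

def linear_scaled_lr(base_lr, batch_size, num_gpus, base_batch_size=BASE_BATCH_SIZE):
    """Scale the learning rate linearly with the global batch size."""
    global_batch_size = batch_size * num_gpus
    return base_lr * global_batch_size / base_batch_size

# e.g. base LR 5e-4 tuned for batch 256; 8 GPUs x 64 per GPU -> LR doubled to 1e-3
lr = linear_scaled_lr(5e-4, batch_size=64, num_gpus=8)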

Additional context
N/A

[Image Classification][Model Training] VOLO Token Labeling

Describe your feature request
Token labeling is used in VOLO model training, implement related classes and methods.

Describe the reference code or paper

  1. The soft target loss and token label loss (official code: https://github.com/sail-sg/volo/blob/main/loss/cross_entropy.py)
  2. Token Labeling in main (official code: https://github.com/sail-sg/volo/blob/main/main.py)
  3. Token Label related classes and methods (official code: https://github.com/zihangJiang/TokenLabeling/tree/main/tlt/data)
  4. The dense label map download link (9.2G) is listed in VOLO official repo: https://github.com/sail-sg/volo#4-train

Describe the possible solution
Implemented according to the official code

Additional context
N/A

Suggestion on model weight file downloads

The existing pretrained weight files are all linked from the GitHub project pages. This looks nice, but it is inconvenient in practice:

When developing on AI Studio or locally, all weight files have to be downloaded in advance (from Google Drive or Baidu Netdisk) and then uploaded to the expected location, which is cumbersome.

image

Suggestion

Could PaddleViT follow PaddleSeg's approach: keep the GitHub links, but also specify download URLs for pretrained weights (e.g. pretrain-model) in the per-project config files, so that the weights can be downloaded automatically from the config during training, for example as in the figure below:

image

[Image Classification][Model Training] CrossViT training settings

Describe your feature request
Check and modify the training settings for CrossViT models, add missing processing and training methods.

Describe the reference code or paper

Describe the possible solution

  • Paper:
    • "... based on DeiT, and apply their default hyper-parameter for training..."
    • Augmentation: rand augmentation (m9n2), mixup(0.8), cutmix(1.0), random erasing(0.25)
    • Droppath: 0.1
    • label smoothing: 0.1
    • Epochs: 300 (30 warm-ups)
    • 32 GPUs
    • Batch size: 4096
    • cosine lr decay, linear warmup
    • init lr: 0.004
    • weight decay: 0.05
    • optimizer: AdamW
    • warmup start lr: 1e-6
  • Code:
    • dropout: 0.0
    • droppath: 0.1
    • clip-grad: None
    • min lr: 1e-5
  • Model weights init:
    • linear layer: trunc_normal (.02)
    • layernorm layer: constant weight (1.0), bias (0.0)
    • official code here

Additional context
Add any other context or screenshots about the feature request here.

ValueError: The ``path`` (./vit_base_patch16_224) to load model not exists.

When I run the example from the README.md "Quick Demo for Image Classification" section on AI Studio (classic version):

%cd PaddleViT/image_classification/ViT/

import paddle

from config import get_config
from transformer import build_vit as build_model


# config files in ./configs/
config = get_config('./configs/vit_base_patch16_224.yaml')
# build model
model = build_model(config)
# load pretrained weights, .pdparams is NOT needed
model_state_dict = paddle.load('./vit_base_patch16_224')
model.set_dict(model_state_dict)

the following error occurs while loading the pretrained weights:

/home/aistudio/PaddleViT/image_classification/ViT
merging config from ./configs/vit_base_patch16_224.yaml
W1123 07:14:56.871081  9894 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1123 07:14:56.875849  9894 device_context.cc:465] device: 0, cuDNN Version: 7.6.
---------------------------------------------------------------------------ValueError                                Traceback (most recent call last)/tmp/ipykernel_9894/1201834454.py in <module>
     12 model = build_model(config)
     13 # load pretrained weights, .pdparams is NOT needed
---> 14 model_state_dict = paddle.load('./vit_base_patch16_224')
     15 model.set_dict(model_state_dict)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/framework/io.py in load(path, **configs)
    983 
    984     else:
--> 985         load_result = _legacy_load(path, **configs)
    986 
    987     return load_result
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/framework/io.py in _legacy_load(path, **configs)
   1001     else:
   1002         # file prefix and directory are compatible cases
-> 1003         model_path, config = _build_load_path_and_config(path, config)
   1004         # check whether model file exists
   1005         if config.model_filename is None:
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/framework/io.py in _build_load_path_and_config(path, config)
    159                 "example, it should be written as `paddle.load('model.pdparams')` instead of " \
    160                 "`paddle.load('model')`."
--> 161         raise ValueError(error_msg % path)
    162     else:
    163         if prefix_format_exist:
ValueError: The ``path`` (./vit_base_patch16_224) to load model not exists. If you want to load the results saved by `fluid.save_dygraph`, please specify the full file name, not just the file name prefix. For example, it should be written as `paddle.load('model.pdparams')` instead of `paddle.load('model')`.

The error message shows that the full file name of the pretrained weights must be specified when loading them, i.e. the .pdparams extension is required.
Adding the .pdparams suffix when loading the pretrained weights fixes the problem:

model_state_dict = paddle.load('./vit_base_patch16_224.pdparams')

Add Recompute Feature in Model training

Describe your feature request
Add recompute for dygraph model training, which aims to enlarge the batch size during training by freeing intermediate activation memory.

Describe the reference code or paper
N/A

@jarygrace Let's add this feature asap, thanks!
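
For context, a rough sketch of how recompute is typically used in Paddle dygraph code, assuming paddle.distributed.fleet.utils.recompute is available in the installed Paddle version (the toy encoder below is a placeholder, not PaddleViT's actual implementation):

import paddle
import paddle.nn as nn
from paddle.distributed.fleet.utils import recompute

class ToyEncoder(nn.Layer):
    def __init__(self, depth=4, dim=64, use_recompute=True):
        super().__init__()
        self.use_recompute = use_recompute
        self.layers = nn.LayerList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        for layer in self.layers:
            if self.use_recompute and self.training:
                # Drop intermediate activations and recompute them in the backward pass,
                # trading extra compute for lower memory (hence larger batch sizes).
                x = recompute(layer, x)
            else:
                x = layer(x)
        return x

model = ToyEncoder()
inputs = paddle.rand([2, 64])
inputs.stop_gradient = False  # recompute needs at least one input that requires grad
model(inputs).sum().backward()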

Allow users to choose the imagenet mean in config file

Describe your feature request
Currently dataset.py hard-codes the ImageNet mean and std, which is inflexible and hard for new users to find.
These could be exposed in the config file so that users can set the default mean/std values.
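
A minimal sketch of what this could look like with the yacs config style the project already uses (the field names DATA.IMAGENET_MEAN / DATA.IMAGENET_STD are assumptions for illustration):

from yacs.config import CfgNode as CN

_C = CN()
_C.DATA = CN()
# Hypothetical config fields; defaults are the standard ImageNet statistics.
_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406]
_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225]

def get_config():
    return _C.clone()

# dataset.py would then read the values from the config instead of hard-coding them:
config = get_config()
mean, std = config.DATA.IMAGENET_MEAN, config.DATA.IMAGENET_STD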

[Image Classification] CvT Model

Describe your feature request
Reproduce the CvT model

Describe the reference code or paper
Paper: https://arxiv.org/pdf/2103.15808.pdf
official repo: https://github.com/VITA-Group/TransGAN

Describe the possible solution

  1. implement the model (refer to official impl. here)
  2. Add configs for different model (refer to official impl. here)
  3. port official pretrained weights (refer to official impl. here)
  4. Debug and evaluate PaddleViT impl.
  5. Impl model training

Additional context
N/A

Training loss does not decrease and accuracy is very poor

Why do I get very poor results when I train ViT on my own dataset with a single GPU using the provided config? With the same data, I can reach 99% accuracy with paddle-resnet.
Q1

Hard-coded arguments in the ViT Transformer Encoder

In PaddleViT/image_classification/ViT/transformer.py L300, the arguments qkv_bias, mlp_ratio, dropout, and attention_dropout are hard-coded when creating the EncoderLayer, so the corresponding parameters passed to Encoder's __init__ have no effect:

class Encoder(nn.Layer):
    """Transformer encoder
    Encoder encoder contains a list of EncoderLayer, and a LayerNorm.
    Attributes:
        layers: nn.LayerList contains multiple EncoderLayers
        encoder_norm: nn.LayerNorm which is applied after last encoder layer
    """
    def __init__(self,
                 embed_dim,
                 num_heads,
                 depth,
                 qkv_bias=True,
                 mlp_ratio=4.0,
                 dropout=0.,
                 attention_dropout=0.,
                 droppath=0.):
        super(Encoder, self).__init__()
        # stochatic depth decay
        depth_decay = [x.item() for x in paddle.linspace(0, droppath, depth)]
        layer_list = []
        for i in range(depth):
            encoder_layer = EncoderLayer(embed_dim,
                                         num_heads,
                                         qkv_bias=True,
                                         mlp_ratio=4.,
                                         dropout=0.,
                                         attention_dropout=0.,
                                         droppath=depth_decay[i])
            layer_list.append(copy.deepcopy(encoder_layer))
        self.layers = nn.LayerList(layer_list)
……

It should be changed to:

class Encoder(nn.Layer):
    """Transformer encoder
    Encoder encoder contains a list of EncoderLayer, and a LayerNorm.
    Attributes:
        layers: nn.LayerList contains multiple EncoderLayers
        encoder_norm: nn.LayerNorm which is applied after last encoder layer
    """
    def __init__(self,
                 embed_dim,
                 num_heads,
                 depth,
                 qkv_bias=True,
                 mlp_ratio=4.0,
                 dropout=0.,
                 attention_dropout=0.,
                 droppath=0.):
        super(Encoder, self).__init__()
        # stochatic depth decay
        depth_decay = [x.item() for x in paddle.linspace(0, droppath, depth)]
        layer_list = []
        for i in range(depth):
            encoder_layer = EncoderLayer(embed_dim,
                                         num_heads,
                                         qkv_bias=qkv_bias,
                                         mlp_ratio=mlp_ratio,
                                         dropout=dropout,
                                         attention_dropout=attention_dropout,
                                         droppath=depth_decay[i])
            layer_list.append(copy.deepcopy(encoder_layer))
        self.layers = nn.LayerList(layer_list)
……

That is all.

Seg result is less than expected

https://github.com/BR-IDL/PaddleViT/blob/develop/semantic_segmentation/configs/upernet_swin/upernet_swin_base_patch4_windown7_512x512_160k_ade20k.yaml

Trained with 8 cards, but the result is lower than expected:

#--------------------------------------------------
2021-11-18 14:47:19 [INFO] [EVAL] #Images: 2000 mIoU: 0.0108 Acc: 0.3663 Kappa: 0.2675
2021-11-18 14:47:19 [INFO] [EVAL] Class IoU:
[2.903e-01 3.102e-01 6.999e-01 0.000e+00 3.243e-01 0.000e+00 2.000e-04
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00 0.000e+00 0.000e+00]
2021-11-18 14:47:19 [INFO] [EVAL] Class Acc:
[0.3054 0.3275 0.7344 0. 0.3642 0. 1. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
#--------------------------------------------------

Add onecycle scheduler

Describe your feature request
Add onecycle scheduler, which is used in convmixer

Describe the reference code or paper
N/A

Describe the possible solution
Refer to the onecycle implementation in timm.

Additional context
N/A

small issue

Steps to reproduce the behavior:

  1. Go to 'PaddleViT/object_detection/Swin/'
  2. Run 'main_single_gpu.py'

Additional context

Issue
In PaddleViT/object_detection/Swin/:
main_single_gpu.py: line 391 is missing the definition of train_loss
utils.py: missing import math

Will Mobile-Former of PaddleViT come soon ?

Describe your feature request
Will PaddleViT add Mobile-Former soon and release pretrained weights on ImageNet?

Describe the reference code or paper
Paper -> Mobile-Former: Bridging MobileNet and Transformer

Describe the possible solution

Additional context
Add any other context or screenshots about the feature request here.

Question about the output of the ViT Transformer VisualTransformer model

In PaddleViT/image_classification/ViT/transformer.py, is the output of the VisualTransformer model missing attn:

class VisualTransformer(nn.Layer):
    ……
    def forward(self, x):
        x = self.patch_embedding(x)
        x, attn = self.encoder(x)
        logits = self.classifier(x[:, 0]) # take only cls_token as classifier
        return logits

I personally think the model should also return attn in its output:

class VisualTransformer(nn.Layer):
    ……
    def forward(self, x):
        x = self.patch_embedding(x)
        x, attn = self.encoder(x)
        logits = self.classifier(x[:, 0]) # take only cls_token as classifier
        return logits, attn

The reasons are as follows:

  • First, the per-layer attention weights attn are already returned by Attention, EncoderLayer, and Encoder. If the model output does not return attn, the return values of those classes become redundant and possibly meaningless.
  • Second, the per-layer attention weights attn may be needed later for visualization. I guess the earlier classes return them precisely with visualization in mind.

In summary, I suggest that the model output also return the per-layer attention weights attn.

Minor issues in the PaddleViT-Seg tutorial documentation

The documentation still uses uppercase "SETR" in several places, but the config directory name is lowercase.
FireShot Capture 009 - PaddleViT_semantic_segmentation_configs_setr at develop · BR-IDL_Padd_ - github com
Original:
CUDA_VISIBLE_DEVICES=0 python3 train.py
--config ./configs/SETR/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml
Should be changed to:
CUDA_VISIBLE_DEVICES=0 python3 train.py
--config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml

Add optional parameters to LabelSmoothingCrossEntropyLoss

Describe your feature request
Add optional parameters to LabelSmoothingCrossEntropyLoss.
In losses.py (taking DeiT as an example), LabelSmoothingCrossEntropyLoss has few optional parameters:

class LabelSmoothingCrossEntropyLoss(nn.Layer):
    """ cross entropy loss for label smoothing
    Args:
        smoothing: float, smoothing rate
        x: tensor, predictions (before softmax) with shape [N, num_classes]
        target: tensor, target label with shape [N]
    Return:
        loss: float, cross entropy loss value
    """
    def __init__(self, smoothing=0.1):
        super().__init__()
        assert 0 <= smoothing < 1.0
        self.smoothing = smoothing
        self.confidence = 1 - smoothing

    def forward(self, x, target):
        log_probs = F.log_softmax(x)  # [N, num_classes]
        # target_index is used to get prob for each of the N samples
        target_index = paddle.zeros([x.shape[0], 2], dtype='int64')  # [N, 2]
        target_index[:, 0] = paddle.arange(x.shape[0])
        target_index[:, 1] = target
        nll_loss = -log_probs.gather_nd(index=target_index)  # index: [N]
        smooth_loss = -log_probs.mean(axis=-1)
        loss = self.confidence * nll_loss + self.smoothing * smooth_loss
        return loss.mean()

Describe the reference code or paper

Describe the possible solution
Call paddle.nn.functional.cross_entropy directly, which allows more parameters to be set.
Convert the labels to one-hot form with paddle.nn.functional.one_hot first, smooth them with paddle.nn.functional.label_smooth, and finally set soft_label=True in paddle.nn.functional.cross_entropy.
The code is as follows:

class LabelSmoothingCrossEntropyLoss(nn.Layer):
    def __init__(self,
                 smoothing=0.1,
                 weight=None,
                 ignore_index=-100,
                 reduction='mean',
                 soft_label=True,
                 axis=-1,
                 use_softmax=True,
                 name=None):
        super(LabelSmoothingCrossEntropyLoss, self).__init__()
        assert 0 <= smoothing < 1.0
        self.smoothing = smoothing
        self.weight = weight
        self.reduction = reduction
        self.ignore_index = ignore_index
        self.soft_label = soft_label
        self.axis = axis
        self.use_softmax = use_softmax
        self.name = name

    def forward(self, input, label):
        label = paddle.nn.functional.one_hot(label, num_classes=input.shape[1])
        label = paddle.nn.functional.label_smooth(label, epsilon=self.smoothing)        
        ret = paddle.nn.functional.cross_entropy(
            input,
            label,            
            weight=self.weight,
            ignore_index=self.ignore_index,
            reduction=self.reduction,
            soft_label=self.soft_label,
            axis=self.axis,
            use_softmax=self.use_softmax,
            name=self.name)
        return ret

So far, simple tests show that the results match those of the existing implementation.

Additional context
In the Paddle ViT course assignment on training ResNet18, I found that overfitting was quite severe, so I wanted to try label smoothing to mitigate it. After searching the official Paddle API docs, I found there is no dedicated LabelSmoothingCrossEntropyLoss, so I implemented one using Paddle's existing one_hot and label_smooth functions.

Question about the single-machine multi-GPU parallel code

For multi-GPU training, the multi-GPU training environment needs to be initialized.

if nranks > 1:
    # Initialize parallel environment if not done.
    if not paddle.distributed.parallel.parallel_helper._is_parallel_ctx_initialized():
        logger.info("using dist training")
        # Initialize the parallel training environment in dynamic graph mode; currently both NCCL and GLOO contexts are initialized for communication.
        paddle.distributed.init_parallel_env()
        ddp_model = paddle.DataParallel(model)
    else:
        ddp_model = paddle.DataParallel(model)

What I don't understand: what does the check if not paddle.distributed.parallel.parallel_helper._is_parallel_ctx_initialized(): mean? Shouldn't paddle.distributed.init_parallel_env() always have to be called?

Also, does any code need to be modified to run multi-machine multi-GPU training?
Thanks.

Suggestion: add an attn_head_size parameter to the ViT Transformer Attention

In the ViT transformer implementation (ViT Transformer Attention), the per-head size attn_head_size of the multi-head attention is computed from the embed_dim and num_heads passed in:

self.attn_head_size = int(embed_dim / self.num_heads)

I think this implementation has at least two problems:

  • First, there is no check that embed_dim is divisible by num_heads. When embed_dim is not divisible by num_heads, or when num_heads > embed_dim, the transpose_multihead operation raises an error:
    def transpose_multihead(self, x):
        new_shape = x.shape[:-1] + [self.num_heads, self.attn_head_size]
        x = x.reshape(new_shape)
        x = x.transpose([0, 2, 1, 3])
        return x
  • Second, attn_head_size is constrained by embed_dim and num_heads, so when pretraining a model one cannot freely choose attn_head_size; the code is not flexible enough.

The way to solve both problems is to add an attn_head_size parameter to Attention's __init__ method. This does not affect loading existing pretrained weights, yet it allows attn_head_size to be set flexibly when pretraining. And since attn_head_size is then independent of the input dimension embed_dim, there is no longer any need to verify that embed_dim is divisible by num_heads.
Both kinds of implementation exist in today's mainstream frameworks.
The first, computing attn_head_size from the embed_dim and num_heads parameters, includes:
PaddlePaddle: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/nn/layer/transformer.py#L109
PyTorch: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/transformer.py
transformers: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py#L226
The second, passing attn_head_size in as a parameter, includes:
TensorFlow: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/multi_head_attention.py#L126
TensorFlow Addons: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/layers/multihead_attention.py
I personally strongly recommend the second approach: the API is more flexible to use, and the code reads more smoothly and makes more sense.
For example, consider the definition of all_head_size in the original implementation:

self.all_head_size = self.attn_head_size * self.num_heads

Since all_head_size == embed_dim, there is no need to define it at all. This variable is used only in __init__:

        self.qkv = nn.Linear(embed_dim,
                             self.all_head_size*3,  # weights for q, k, and v
                             weight_attr=w_attr_1,
                             bias_attr=b_attr_1 if qkv_bias else False)

and in forward:

new_shape = z.shape[:-2] + [self.all_head_size]

The output dimension self.all_head_size*3 of the qkv projection in __init__ can simply be changed to embed_dim*3, and the self.all_head_size used in new_shape in forward can be replaced by reading the input dimension of x at the start of the method, like this:

embed_dim = x.shape[-1]
……
new_shape = z.shape[:-2] + [embed_dim]

That is my concern about defining self.all_head_size in the source code.
There is also the question of whether the final output Linear layer is necessary:

        self.out = nn.Linear(embed_dim,
                             embed_dim,
                             weight_attr=w_attr_2,
                             bias_attr=b_attr_2)

In forward, there is a # reshape comment right above the line that performs the final linear projection:

        z = z.reshape(new_shape)
        # reshape
        z = self.out(z)

Its intent is presumably to map the dimension back to the input dimension embed_dim so that the subsequent residual connection works, but since all_head_size == embed_dim, what is there to reshape?
So I think this linear projection of the output is unnecessary.
However, if we use the second approach and pass attn_head_size in as a parameter instead of computing it from embed_dim and num_heads, the code above reads much more smoothly and makes much more sense.
The second approach, passing attn_head_size as a parameter, only requires changing a few lines of the original code, as follows:

from typing import Tuple, Union

import paddle
import paddle.nn as nn
from paddle import ParamAttr
from paddle import Tensor


class Attention(nn.Layer):
    """ Attention module

    Attention module for ViT, here q, k, v are assumed the same.
    The qkv mappings are stored as one single param.

    Attributes:
        num_heads: number of heads
        attn_head_size: feature dim of single head
        all_head_size: feature dim of all heads
        qkv: a nn.Linear for q, k, v mapping
        scales: 1 / sqrt(single_head_feature_dim)
        out: projection of multi-head attention
        attn_dropout: dropout for attention
        proj_dropout: final dropout before output
        softmax: softmax op for attention
    """
    def __init__(self,
                 embed_dim: int,
                 num_heads: int,
                 attn_head_size: int,
                 qkv_bias: Union[bool, ParamAttr],
                 dropout: float = 0.,
                 attention_dropout: float = 0.):
        super().__init__()
        """
        增加了一个attn_head_size的参数,attn_head_size和num_heads的大小不受embed_dim的限制,使API的使用更灵活。
        """
        self.num_heads = num_heads
        # self.attn_head_size = int(embed_dim / self.num_heads)
        self.attn_head_size = attn_head_size
        self.all_head_size = self.attn_head_size * self.num_heads  # Attention Layer's hidden_size

        w_attr_1, b_attr_1 = self._init_weights()
        self.qkv = nn.Linear(embed_dim,
                             self.all_head_size*3,  # weights for q, k, and v
                             weight_attr=w_attr_1,
                             bias_attr=b_attr_1 if qkv_bias else False)

        self.scales = self.attn_head_size ** -0.5

        w_attr_2, b_attr_2 = self._init_weights()
        # self.out = nn.Linear(embed_dim,
        #                      embed_dim,
        #                      weight_attr=w_attr_2,
        #                      bias_attr=b_attr_2)
        # Aggregate the multi-head outputs and project back to the input dimension embed_dim for the residual connection
        self.out = nn.Linear(self.all_head_size,
                             embed_dim,
                             weight_attr=w_attr_2,
                             bias_attr=b_attr_2)

        self.attn_dropout = nn.Dropout(attention_dropout)
        self.proj_dropout = nn.Dropout(dropout)
        self.softmax = nn.Softmax(axis=-1)

    def _init_weights(self) -> Tuple[ParamAttr, ParamAttr]:
        weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform())
        bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform())
        return weight_attr, bias_attr

    def transpose_multihead(self, x: Tensor) -> Tensor:
        new_shape = x.shape[:-1] + [self.num_heads, self.attn_head_size]
        x = x.reshape(new_shape)
        x = x.transpose([0, 2, 1, 3])
        return x

    def forward(self, x: Tensor) -> Tuple[Tensor, Tensor]:
        qkv = self.qkv(x).chunk(3, axis=-1)
        q, k, v = map(self.transpose_multihead, qkv)

        attn = paddle.matmul(q, k, transpose_y=True)
        attn = attn * self.scales
        attn = self.softmax(attn)
        attn_weights = attn
        attn = self.attn_dropout(attn)

        z = paddle.matmul(attn, v)
        z = z.transpose([0, 2, 1, 3])
        new_shape = z.shape[:-2] + [self.all_head_size]
        z = z.reshape(new_shape)
        # Aggregate the multi-head outputs and project back to the input dimension embed_dim for the residual connection
        z = self.out(z)
        z = self.proj_dropout(z)
        return z, attn_weights

Test:

def main():
    t = paddle.randn([4, 16, 96])     # [batch_size, num_patches, embed_dim]
    print('input shape = ', t.shape)

    model = Attention(embed_dim=96,
                      num_heads=8,
                      attn_head_size=128,
                      qkv_bias=False,
                      dropout=0.,
                      attention_dropout=0.)

    print(model)

    out, attn_weights = model(t)
    print(out.shape)
    print(attn_weights.shape)

    for name, param in model.named_parameters():
        print(f'param name: {name},\tparam shape: {param.shape} ')


if __name__ == "__main__":
    main()

Output:

input shape =  [4, 16, 96]
Attention(
  (qkv): Linear(in_features=96, out_features=3072, dtype=float32)
  (out): Linear(in_features=1024, out_features=96, dtype=float32)
  (attn_dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
  (proj_dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
  (softmax): Softmax(axis=-1)
)
[4, 16, 96]
[4, 8, 16, 16]
param name: qkv.weight,	param shape: [96, 3072] 
param name: out.weight,	param shape: [1024, 96] 
param name: out.bias,	param shape: [96] 

These are just some rough personal suggestions; I hope the maintainers will evaluate and consider adopting them.

small issue

Describe the bug
resume training error
AttributeError: 'Momentum' object has no attribute 'set_dict'

To Reproduce
Steps to reproduce the behavior:
1. Go to 'PaddleViT/object_detection/Swin/'
2. Run python main_single_gpu.py -resume='./output/train-20211210-09-50-43/Swin-Epoch-45'

Loading the model weights succeeds; the error occurs when restoring the optimizer state.

Screenshots
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.2.2\plugins\python-ce\helpers\pydev\pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.2.2\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "F:/***/pp_swin/main_single_gpu.py", line 400, in <module>
    main()
  File "F:/***/pp_swin/main_single_gpu.py", line 313, in main
    optimizer.set_dict(opt_state)
AttributeError: 'Momentum' object has no attribute 'set_dict'

Version (please complete the following information):

  • Paddle Version: 2.2.0
  • Python Version: 3.6
  • GPU/CPU mode: GPU
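
A possible workaround (my assumption, not verified across Paddle versions) is to call the newer set_state_dict API on the optimizer, since the old set_dict alias appears to be gone in Paddle 2.x:

    import paddle

    # hypothetical sketch: replace the removed 'set_dict' call with 'set_state_dict'
    opt_state = paddle.load(config.MODEL.RESUME + '.pdopt')  # checkpoint path/suffix is an assumption
    optimizer.set_state_dict(opt_state)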

Support validation without create the training dataset and dataloader

Describe your feature request
The current validation mode (config.EVAL is True) still creates and loads the training dataset and dataloader, which is inflexible when users only have a validation set.

So the main method needs to support eval mode without touching the training set.

Describe the reference code or paper
N/A

Describe the possible solution
I have a fix in the ViT model, which can be applied to the other classification models.
Please refer to commit 9a7c105 for details.
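
For illustration, the change could look roughly like the following (a sketch only; the get_dataset/get_dataloader helper names are assumptions about the existing code):

    # sketch: build the training pipeline only when not running in eval-only mode
    if not config.EVAL:
        dataset_train = get_dataset(config, mode='train')
        dataloader_train = get_dataloader(config, dataset_train, mode='train')
    else:
        dataloader_train = None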

Issues with the command-line arguments and the Usage model-loading example in the READMEs

I found two problems in the README.md of every PaddleViT model (the examples below all use the README.md of the BEiT model under PaddleViT/image_classification/BEiT/):

  • First, in the Usage example code, the .pdparams suffix is missing when loading the pretrained weights, and the comment saying .pdparams is NOT needed is also wrong; it is the value of the -pretrained command-line argument below that does not need .pdparams. The two have been mixed up.
from config import get_config
from beit import build_beit as build_model
# config files in ./configs/
config = get_config('./configs/beit_base_patch16_224.yaml')
# build model
model = build_model(config)
# load pretrained weights, .pdparams is NOT needed
model_state_dict = paddle.load('./beit_base_patch16_224_ft22kto1k')
model.set_dict(model_state_dict)

The , .pdparams is NOT needed part of the comment should be removed, and the .pdparams suffix should be added when loading the model:

from config import get_config
from beit import build_beit as build_model
# config files in ./configs/
config = get_config('./configs/beit_base_patch16_224.yaml')
# build model
model = build_model(config)
# load pretrained weights
model_state_dict = paddle.load('./beit_base_patch16_224_ft22kto1k.pdparams')
model.set_dict(model_state_dict)
  • Second, the command-line argument values in the Evaluation and Training sections have an extra pair of single quotes; running them directly in a terminal produces a FileNotFoundError:
FileNotFoundError: [Errno 2] No such file or directory: "'./configs/beit_base_patch16_224.yaml'"

I ran into this error before when running the pretrained-model training and evaluation commands in a terminal, and other students in the group hit the same problem. The cause seems to be that, when the command-line arguments are parsed, the quotes end up included in the string argument values, so no quotes should be added when assigning values to these arguments. Therefore the single quotes around the Evaluation and Training argument values should be removed.
Single-GPU evaluation:

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg='./configs/beit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \
    -eval \
    -pretrained='./beit_base_patch16_224_ft22kto1k'

I changed it to:

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg=./configs/beit_base_patch16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=16 \
    -data_path=/path/to/dataset/imagenet/val \
    -eval \
    -pretrained=/path/to/pretrained/model/beit_base_patch16_224_ft22kto1k  # .pdparams is NOT needed

Multi-GPU evaluation:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg='./configs/beit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \
    -eval \
    -pretrained='./beit_base_patch16_224_ft22kto1k'

I changed it to:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg=./configs/beit_base_patch16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=16 \
    -data_path=/path/to/dataset/imagenet/val \
    -eval \
    -pretrained=/path/to/pretrained/model/beit_base_patch16_224_ft22kto1k  # .pdparams is NOT needed

Single-GPU training:

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
  -cfg='./configs/beit_base_patch16_224.yaml' \
  -dataset='imagenet2012' \
  -batch_size=32 \
  -data_path='/dataset/imagenet' \

I changed it to:

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
  -cfg=./configs/beit_base_patch16_224.yaml \
  -dataset=imagenet2012 \
  -batch_size=32 \
  -data_path=/path/to/dataset/imagenet/train \

Multi-GPU training:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg='./configs/beit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \ 

I changed it to:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg=./configs/beit_base_patch16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=16 \
    -data_path=/path/to/dataset/imagenet/train \ 

I will submit a PR for this shortly; please review it.

[Image Classification] MobileViT

Describe your feature request
Add and debug the multi scale sampler

Describe the reference code or paper
Refer to the source code listed in the original paper, from here

Describe the possible solution
An initial version exists; it still needs debugging and testing.

Additional context
N/A

[Object Detection] Cascade Mask R-CNN

Describe your feature request
Cascade Mask R-CNN

Describe the reference code or paper
Swin detection official code (from mmdet) here

Describe the possible solution
Mask R-CNN is already implemented in PaddleViT here

Additional context
Add any other context or screenshots about the feature request here.

Is a class-number parameter missing from the mixup setup in the training code?

Describe the question
In the classification directory, when Mixup is set up in main_single_gpu.py / main_multi_gpu.py, the NUM_CLASSES parameter is not passed. As a result, when a non-ImageNet-1k dataset is loaded, the existing main_single_gpu.py / main_multi_gpu.py raises an error: the current model's number of classes is not 1000, but Mixup still defaults to 1000 classes, so the loss cannot be computed.

Expected behavior
Pass the class-number parameter through, so that the training code is easier to use.

Screenshots
The number of classes is not initialized when the Mixup instance is created.
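
For illustration, the fix could simply forward the configured class count when building Mixup (a sketch; the module path and constructor arguments of the Mixup class used here are assumptions):

    from mixup import Mixup  # assuming the repo's Mixup implementation

    # hypothetical sketch: pass the dataset's class count instead of relying on the default of 1000
    mixup_fn = Mixup(num_classes=config.MODEL.NUM_CLASSES)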

Problems installing the dependencies from PaddleViT-Seg's requirements.txt

Describe the bug
The dependencies to be installed from PaddleViT-Seg's requirements.txt are as follows:

  • cityscapesScripts==2.2.0
  • detail==4.0
  • numpy==1.20.3
  • opencv-python==4.5.2.52
  • scipy==1.6.3
  • yacs==0.1.8

Problem 1:
The detail package cannot be found.
My workaround: remove detail from requirements.txt.

Problem 2:
opencv-python has no installable version 4.5.2.52.
My workaround: switch to a different opencv-python version.

Training on my own dataset

The code says "Dataset related classes and methods for ViT training and validation.
Cifar10, Cifar100 and ImageNet2012 are supported."
My question is: if I want to train on my own dataset, do I need to write my own dataset class?
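
If a custom dataset class does turn out to be necessary, a minimal paddle.io.Dataset subclass might look like the sketch below (an illustration only, not PaddleViT's actual dataset code; the sample list and transform are placeholders):

    from PIL import Image
    from paddle.io import Dataset

    class MyImageDataset(Dataset):
        """Minimal custom dataset sketch: samples is a list of (image_path, int_label) tuples."""
        def __init__(self, samples, transform=None):
            super().__init__()
            self.samples = samples
            self.transform = transform  # e.g. paddle.vision.transforms.Compose([...])

        def __getitem__(self, idx):
            path, label = self.samples[idx]
            image = Image.open(path).convert('RGB')
            if self.transform is not None:
                image = self.transform(image)
            return image, label

        def __len__(self):
            return len(self.samples)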

TransGAN code has some errors

Describe the bug
It looks like the code was ported from the original author's PyTorch implementation, and some of the PyTorch code was converted to Paddle incorrectly.

To Reproduce
1:
VIT_custom.py, line 59:

return input * paddle.rsqrt(paddle.mean(input ** 2, dim=2, keepdim=True) + 1e-8)

should be

return input * paddle.rsqrt(paddle.mean(input ** 2, axis=2, keepdim=True) + 1e-8)

2:
VIT_custom.py, line 88:

class CustomAct(nn.Layer):
    """ CustomAct layer
    Custom act method set, defalut "gelu"
    """
    def __init__(self, act_layer):
        super().__init__()
        if act_layer == "gelu":
            self.act_layer = gelu
        elif act_layer == "leakyrelu":
            self.act_layer = leakyrelu
        else:
            self.act_layer = gelu

in which leakyrelu has not been defined.
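
A possible fix (my sketch, not verified against the original TransGAN code) is to define a module-level leakyrelu helper next to the existing gelu helper; the 0.2 negative slope is an assumption:

    import paddle.nn.functional as F

    # sketch: module-level leaky ReLU helper so CustomAct can reference it
    def leakyrelu(x):
        return F.leaky_relu(x, negative_slope=0.2)  # slope value is an assumption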

The model ema loading for resume training raises error

Describe the bug
For classification models trained with EMA, an error occurs when resuming training.

The EMA checkpoint name and the loading logic are not correct.

To Reproduce
Steps to reproduce the behavior:

  1. Start model training with -resume enabled and EMA = True
  2. In main_single_gpu.py/main_multi_gpu.py, the model_ema name and loading raise an error

Expected behavior
Model EMA should be loaded without error

Additional context

  1. The model name issue can be solved by using default names for EMA checkpoints and adding an existence check.
  2. The model loading issue can be solved by using model_ema.module.set_state_dict.
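
A rough sketch of the resume path described above (illustrative only; the checkpoint suffix and attribute names are assumptions based on this description):

    import os
    import paddle

    # hypothetical sketch of loading the EMA weights when resuming
    ema_path = config.MODEL.RESUME + '-EMA.pdparams'  # default EMA checkpoint name (assumption)
    if os.path.isfile(ema_path):
        ema_state = paddle.load(ema_path)
        model_ema.module.set_state_dict(ema_state)    # load into the wrapped EMA module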

LETR

Do you have a model for the paper "Line Segment Detection Using Transformers without Edges"?

About the loss computation

Cross-entropy loss should measure the distance between the predicted probability distribution and the true distribution, so in principle the loss should be computed after softmax. However, in the training code the loss is computed directly as loss = criterion(output, label), and softmax is only applied afterwards. Shouldn't softmax be computed before the loss, or does the order of the two actually make no difference to training?
