
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

This repo is the official implementation of "ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video"

If you find our work useful in your research, please cite:

@article{li2023zeroi2v,
  title={ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video},
  author={Li, Xinhao and Wang, Limin},
  journal={arXiv preprint arXiv:2310.01324},
  year={2023}
}

We will publish our source code and pretrained model weights after the review process.

Introduction

In this paper, we present a zero-cost adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks, i.e., we introduce zero extra cost to the adapted models during inference.
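The core idea can be sketched in a few lines of PyTorch (a toy illustration, not the official code; all shapes and values below are made up): a low-rank linear adapter applied to the output of a frozen projection can be folded into that projection's weight, so the deployed model keeps the original architecture and inference cost.

```python
import torch

d, k = 8, 2                       # hypothetical feature dim and bottleneck dim
W_old = torch.randn(d, d)         # frozen pre-trained weight
W_up = torch.randn(d, k) * 0.01   # trainable low-rank adapter factors
W_down = torch.randn(k, d) * 0.01

x = torch.randn(4, d)             # a batch of token features

# Training-time path: frozen projection plus a low-rank adapter branch.
y_train = x @ W_old.T + (x @ W_old.T) @ (W_up @ W_down).T

# Inference-time path: adapter merged into a single weight, zero extra cost.
W_new = (torch.eye(d) + W_up @ W_down) @ W_old
y_merged = x @ W_new.T

print(torch.allclose(y_train, y_merged, atol=1e-5))
```

The two paths produce the same outputs, which is why merging after training changes nothing about the model's predictions.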


Models

Kinetics 400

| Backbone | Pretrain | GFLOPs | Param (M) | New Param (M) | acc@1 | Views | Checkpoint | Checkpoint before reparam |
| :------: | :------: | :----: | :-------: | :-----------: | :---: | :----: | :--------: | :-----------------------: |
| ViT-B/16 | CLIP | 422 | 86 | 0 | 83.0 | 8x1x3 | checkpoint | checkpoint |
| ViT-L/14 | CLIP | 7783 | 304 | 0 | 87.2 | 32x1x3 | checkpoint | checkpoint |

Something Something V2

| Backbone | Pretrain | GFLOPs | Param (M) | New Param (M) | acc@1 | Views | Checkpoint | Checkpoint before reparam |
| :------: | :------: | :----: | :-------: | :-----------: | :---: | :----: | :--------: | :-----------------------: |
| ViT-B/16 | CLIP | 422 | 86 | 0 | 67.7 | 8x3x1 | checkpoint | checkpoint |
| ViT-L/14 | CLIP | 7783 | 304 | 0 | 72.2 | 32x3x1 | checkpoint | checkpoint |

Installation

pip install -U openmim
mim install mmengine 'mmcv>=2.0.0rc1'
mim install "mmdet>=3.0.0rc5"
mim install "mmpose>=1.0.0rc0"
git clone https://github.com/leexinhao/ZeroI2V.git
cd ZeroI2V
pip install -v -e .
# install CLIP
pip install git+https://github.com/openai/CLIP.git

Our project is based on MMAction2. Please refer to install.md for more detailed instructions.

Data Preparation

All the datasets (K400, SSv2, UCF101 and HMDB51) used in this work are supported in MMAction2.

Training

The training configs for the different experiments are provided in configs/recognition/. To run an experiment, use the following command, where <PATH/TO/CONFIG> is the training config you want to use. The default training setting is 8 GPUs with a batch size of 64.

bash tools/dist_train.sh <PATH/TO/CONFIG> <NUM_GPU>

We also provide a training script in run_exp.sh. You can simply change the training config to train different models.

Evaluation

The code runs evaluation automatically after training. To evaluate a model on its own, use the following command:

bash tools/dist_test.sh <PATH/TO/CONFIG> <CHECKPOINT_FILE> <NUM_GPU> --eval top_k_accuracy

Reparameterize the linear adapter

Please refer to tools/weight_reparam.py.
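The actual logic lives in tools/weight_reparam.py; as a rough illustration only (the key names below are hypothetical, not the repo's), a reparameterization step of this kind folds each adapter's low-rank update into the frozen weight and then drops the adapter parameters so the checkpoint matches a plain ViT:

```python
import torch

def merge_adapter(state, base_key, up_key, down_key):
    # Fold (I + W_up @ W_down) into the frozen base weight, then remove
    # the adapter parameters from the state dict.
    W_old, W_up, W_down = state[base_key], state[up_key], state[down_key]
    d = W_old.shape[0]
    state[base_key] = (torch.eye(d) + W_up @ W_down) @ W_old
    del state[up_key], state[down_key]
    return state

# Toy state dict standing in for a real "before reparam" checkpoint.
state = {
    "attn.proj.weight": torch.randn(8, 8),
    "attn.adapter.up": torch.zeros(8, 2),   # zero-init up => identity merge
    "attn.adapter.down": torch.randn(2, 8),
}
merge_adapter(state, "attn.proj.weight", "attn.adapter.up", "attn.adapter.down")
print(sorted(state.keys()))
```

After merging, only the original backbone parameters remain, which is what makes the "after reparam" checkpoints loadable into the unmodified architecture.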

Test speed and throughput

Please refer to tools/test_speed.py and tools/test_throughput.py.

TODO

  • Release part of the source code
  • Release the full source code
  • Release pretrained model weights


zeroi2v's Issues

About splitting heads for temporal and spatial modeling

Hi, could you tell me whether your implementation is based on the official OpenAI CLIP script? I noticed that they implement multi-head attention with PyTorch's built-in function, so modifying the multi-head attention operation is a little complex. Many thanks.
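For readers curious about the head-splitting idea behind this question, here is a toy sketch (my own simplification, not the repository's implementation; the exact shift offsets and head assignment follow the paper, not this snippet) of letting a subset of heads attend to temporally shifted keys/values while the others stay purely spatial:

```python
import torch

B, T, N, H, Dh = 2, 4, 16, 8, 8   # batch, frames, tokens, heads, head dim
q = torch.randn(B, T, H, N, Dh)
k = torch.randn(B, T, H, N, Dh)
v = torch.randn(B, T, H, N, Dh)

temporal_heads = 2                 # last `temporal_heads` heads look at frame t-1
k_shift, v_shift = k.clone(), v.clone()
k_shift[:, 1:, -temporal_heads:] = k[:, :-1, -temporal_heads:]
v_shift[:, 1:, -temporal_heads:] = v[:, :-1, -temporal_heads:]

# Plain per-frame attention: the shifted heads see another frame's keys/values,
# so temporal modeling is obtained without any extra FLOPs or parameters.
attn = torch.softmax(q @ k_shift.transpose(-2, -1) / Dh ** 0.5, dim=-1)
out = attn @ v_shift               # shape (B, T, H, N, Dh), same as input
print(out.shape)
```

Because the shift is just an indexing operation, the attention computation itself is unchanged from the image model.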

Reproduce results: About the configuration file of HMDB/UCF

Hello, I apologize for disturbing you during the weekend, but the following issues have been a headache for me and I really need your help. Thank you very much!

I made some changes to the K400 configuration file you provided in order to reproduce the results on the HMDB dataset. The results I obtained on the first two splits are as follows (ViT-B/16, clip_len=32, frame_interval=4, epoch=50):
Split 1: acc/top1: 0.6804 acc/top5: 0.9229 acc/mean1: 0.6804 data_time: 0.0339 time: 1.3507
Split 2: acc/top1: 0.6510 acc/top5: 0.9131 acc/mean1: 0.6510 data_time: 0.0055 time: 0.1720
These results are far below the ones you reported (ViT-B/16, acc: 73.7%).

Is there anything special to note in the training configuration for HMDB, or could you provide configuration files for the HMDB51/UCF101 datasets so that I can reproduce the HMDB and UCF results in your paper? Thank you for your reply!

Does HMDB use reparameterization of the linear adapter?

Hello, I would like to ask whether the ViT-B/16 result on HMDB (73.7%) in your paper was obtained after reparameterizing the linear adapter.

I ask because my reproduced result is over 73% without reparameterization, while the reparameterized result is around 55%.

Code Release Plan

Thanks for your great work! I think it will be of great help to my future research.

Any specific time to release the code? I can't wait to follow your research.

Thank You!

Question about data input

Hello, I am very interested in this work and have read your paper carefully. In the paper you split multi-head attention so that k heads focus on extracting spatial information and the remaining n-k heads focus on temporal information. What I am curious about is: since you use an image model, how is the video data fed in?
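A minimal sketch of the usual way to feed video into an image backbone (an assumption consistent with the paper's setup, not code from the repo): fold the temporal axis into the batch dimension so each frame is processed as an ordinary image, with temporal interaction handled inside the adapted attention.

```python
import torch

B, T, C, H, W = 2, 8, 3, 224, 224       # batch, frames, channels, height, width
video = torch.randn(B, T, C, H, W)

# Fold time into the batch so the image transformer sees (B*T) frames.
frames = video.reshape(B * T, C, H, W)
print(frames.shape)
```

After the backbone, the features can be reshaped back to (B, T, ...) wherever temporal modeling is needed.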

Missing files.

Hello, the work you have done is very meaningful and I am very interested in it. However, some training and testing scripts seem to be missing, such as dist_test.sh, dist_train.sh, and run_exp.sh. If you have time, could you please add them? Thank you very much!

Questions about the dimension of W_up and W_down

Thanks for your great work, ZeroI2V! The idea is quite impressive. I just have a few quick questions about your linear adapter.

You have mentioned in your paper:
W_new = W_Adapter * W_old = (I + W_up * W_down)*W_old

So, assuming W_old's dimension is m * n, W_new is also m * n, is that correct?

In that case W_up should be m * k, W_down should be k * m, and W_Adapter is a square matrix here, is that correct?

Also, why does LoRA require more tunable parameters compared with the linear adapter?

Really appreciate your help and your work! Thanks!
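The shapes discussed in this question can be checked numerically (a quick sanity check with made-up dimensions, not code from the repo):

```python
import torch

m, n, k = 6, 4, 2
W_old = torch.randn(m, n)
W_up, W_down = torch.randn(m, k), torch.randn(k, m)

W_adapter = torch.eye(m) + W_up @ W_down   # square, m x m
W_new = W_adapter @ W_old                  # same m x n shape as W_old
print(W_new.shape)
```

Since W_adapter is m x m, multiplying it into W_old indeed leaves the weight's shape unchanged, which is what makes the merge architecture-preserving.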
