
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

This repo is the official implementation of "ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video"

If you find our work useful in your research, please cite:

@article{li2023zeroi2v,
  title={ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video},
  author={Li, Xinhao and Wang, Limin},
  journal={arXiv preprint arXiv:2310.01324},
  year={2023}
}

We will publish our source code and pretrained model weights after the review process.

Introduction

In this paper, we present a zero-cost adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks, i.e., we introduce zero extra cost to the adapted models during inference.
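The core idea can be sketched in a few lines of PyTorch (a toy illustration, not the official code; all shapes and values below are made up): a low-rank linear adapter applied to the output of a frozen projection can be folded into that projection's weight, so the deployed model keeps the original architecture and inference cost.

```python
import torch

d, k = 8, 2                       # hypothetical feature dim and bottleneck dim
W_old = torch.randn(d, d)         # frozen pre-trained weight
W_up = torch.randn(d, k) * 0.01   # trainable low-rank adapter factors
W_down = torch.randn(k, d) * 0.01

x = torch.randn(4, d)             # a batch of token features

# Training-time path: frozen projection plus a low-rank adapter branch.
y_train = x @ W_old.T + (x @ W_old.T) @ (W_up @ W_down).T

# Inference-time path: adapter merged into a single weight, zero extra cost.
W_new = (torch.eye(d) + W_up @ W_down) @ W_old
y_merged = x @ W_new.T

print(torch.allclose(y_train, y_merged, atol=1e-5))
```

The two paths produce the same outputs, which is why merging after training changes nothing about the model's predictions.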


Models

Kinetics 400

| Backbone | Pretrain | GFLOPs | Param (M) | New Param (M) | acc@1 | Views | Checkpoint | Checkpoint before reparam |
| :------: | :------: | :----: | :-------: | :-----------: | :---: | :----: | :--------: | :-----------------------: |
| ViT-B/16 | CLIP | 422 | 86 | 0 | 83.0 | 8x1x3 | checkpoint | checkpoint |
| ViT-L/14 | CLIP | 7783 | 304 | 0 | 87.2 | 32x1x3 | checkpoint | checkpoint |

Something Something V2

| Backbone | Pretrain | GFLOPs | Param (M) | New Param (M) | acc@1 | Views | Checkpoint | Checkpoint before reparam |
| :------: | :------: | :----: | :-------: | :-----------: | :---: | :----: | :--------: | :-----------------------: |
| ViT-B/16 | CLIP | 422 | 86 | 0 | 67.7 | 8x3x1 | checkpoint | checkpoint |
| ViT-L/14 | CLIP | 7783 | 304 | 0 | 72.2 | 32x3x1 | checkpoint | checkpoint |

Installation

pip install -U openmim
mim install mmengine 'mmcv>=2.0.0rc1'
mim install "mmdet>=3.0.0rc5"
mim install "mmpose>=1.0.0rc0"
git clone https://github.com/leexinhao/ZeroI2V.git
cd ZeroI2V
pip install -v -e .
# install CLIP
pip install git+https://github.com/openai/CLIP.git

Our project is based on MMAction2. Please refer to install.md for more detailed instructions.

Data Preparation

All the datasets (K400, SSv2, UCF101 and HMDB51) used in this work are supported in MMAction2.

Training

The training configs for the different experiments are provided in configs/recognition/. To run an experiment, use the following command, where <PATH/TO/CONFIG> is the training config you want to use. The default training setting is 8 GPUs with a batch size of 64.

bash tools/dist_train.sh <PATH/TO/CONFIG> <NUM_GPU>

We also provide a training script in run_exp.sh. You can simply change the training config to train different models.

Evaluation

The code runs evaluation automatically after training. To evaluate a model on its own, use the following command:

bash tools/dist_test.sh <PATH/TO/CONFIG> <CHECKPOINT_FILE> <NUM_GPU> --eval top_k_accuracy

Reparameterize the linear adapter

Please refer to tools/weight_reparam.py.
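The actual logic lives in tools/weight_reparam.py; as a rough illustration only (the key names below are hypothetical, not the repo's), a reparameterization step of this kind folds each adapter's low-rank update into the frozen weight and then drops the adapter parameters so the checkpoint matches a plain ViT:

```python
import torch

def merge_adapter(state, base_key, up_key, down_key):
    # Fold (I + W_up @ W_down) into the frozen base weight, then remove
    # the adapter parameters from the state dict.
    W_old, W_up, W_down = state[base_key], state[up_key], state[down_key]
    d = W_old.shape[0]
    state[base_key] = (torch.eye(d) + W_up @ W_down) @ W_old
    del state[up_key], state[down_key]
    return state

# Toy state dict standing in for a real "before reparam" checkpoint.
state = {
    "attn.proj.weight": torch.randn(8, 8),
    "attn.adapter.up": torch.zeros(8, 2),   # zero-init up => identity merge
    "attn.adapter.down": torch.randn(2, 8),
}
merge_adapter(state, "attn.proj.weight", "attn.adapter.up", "attn.adapter.down")
print(sorted(state.keys()))
```

After merging, only the original backbone parameters remain, which is what makes the "after reparam" checkpoints loadable into the unmodified architecture.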

Test speed and throughput

Please refer to tools/test_speed.py and tools/test_throughput.py.

TODO

  • Release part of the source code
  • Release the full source code
  • Release pretrained model weights


zeroi2v's Issues

About splitting heads for temporal and spatial modeling

Hi, could you tell me whether your implementation is based on the official OpenAI CLIP script? I noticed that they implement multi-head attention with PyTorch's built-in function, so modifying the multi-head attention operation is a little complex. Many thanks.
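For readers curious about the head-splitting idea behind this question, here is a toy sketch (my own simplification, not the repository's implementation; the exact shift offsets and head assignment follow the paper, not this snippet) of letting a subset of heads attend to temporally shifted keys/values while the others stay purely spatial:

```python
import torch

B, T, N, H, Dh = 2, 4, 16, 8, 8   # batch, frames, tokens, heads, head dim
q = torch.randn(B, T, H, N, Dh)
k = torch.randn(B, T, H, N, Dh)
v = torch.randn(B, T, H, N, Dh)

temporal_heads = 2                 # last `temporal_heads` heads look at frame t-1
k_shift, v_shift = k.clone(), v.clone()
k_shift[:, 1:, -temporal_heads:] = k[:, :-1, -temporal_heads:]
v_shift[:, 1:, -temporal_heads:] = v[:, :-1, -temporal_heads:]

# Plain per-frame attention: the shifted heads see another frame's keys/values,
# so temporal modeling is obtained without any extra FLOPs or parameters.
attn = torch.softmax(q @ k_shift.transpose(-2, -1) / Dh ** 0.5, dim=-1)
out = attn @ v_shift               # shape (B, T, H, N, Dh), same as input
print(out.shape)
```

Because the shift is just an indexing operation, the attention computation itself is unchanged from the image model.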

Reproduce results: About the configuration file of HMDB/UCF

Hello, I apologize for disturbing you during the weekend, but the following issues have been a headache for me and I really need your help. Thank you very much!

I made some changes to the K400 configuration file you provided in order to reproduce the results on the HMDB dataset. The results I obtained on the first two splits are as follows (ViT-B/16, clip_len=32, frame_interval=4, epoch=50):
Split 1: acc/top1: 0.6804 acc/top5: 0.9229 acc/mean1: 0.6804 data_time: 0.0339 time: 1.3507
Split 2: acc/top1: 0.6510 acc/top5: 0.9131 acc/mean1: 0.6510 data_time: 0.0055 time: 0.1720
These results are far below the ones you reported (ViT-B/16, acc: 73.7%).

Is there anything special to note in the training configuration for HMDB, or could you provide configuration files for the HMDB51/UCF101 datasets so that I can reproduce the HMDB and UCF results in your paper? Thank you for your reply!

Does HMDB use reparameterization of the linear adapter?

Hello, I would like to ask whether the ViT-B/16 result on HMDB (73.7%) in your paper was obtained after reparameterizing the linear adapter.

I ask because my reproduced result is over 73% without reparameterization, while the reparameterized result is around 55%.

Code Release Plan

Thanks for your great work! I think it will be of great help to my future research.

Any specific time to release the code? I can't wait to follow your research.

Thank You!

Question about data input

Hello, I am very interested in this work and have read your paper carefully. In the paper you split multi-head attention so that k heads focus on extracting spatial information and the remaining n-k heads focus on temporal information. What I am curious about is: since you use an image model, how is the video data fed in?
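A minimal sketch of the usual way to feed video into an image backbone (an assumption consistent with the paper's setup, not code from the repo): fold the temporal axis into the batch dimension so each frame is processed as an ordinary image, with temporal interaction handled inside the adapted attention.

```python
import torch

B, T, C, H, W = 2, 8, 3, 224, 224       # batch, frames, channels, height, width
video = torch.randn(B, T, C, H, W)

# Fold time into the batch so the image transformer sees (B*T) frames.
frames = video.reshape(B * T, C, H, W)
print(frames.shape)
```

After the backbone, the features can be reshaped back to (B, T, ...) wherever temporal modeling is needed.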

Missing files.

Hello, the work you have done is very meaningful and I am very interested in it. However, some training and testing scripts seem to be missing, such as dist_test.sh, dist_train.sh, and run_exp.sh. If you have time, could you please add them? Thank you very much!

Questions about the dimension of W_up and W_down

Thanks for your great work, ZeroI2V! The idea is quite impressive. I just have a few quick questions about your linear adapter.

You have mentioned in your paper:
W_new = W_Adapter * W_old = (I + W_up * W_down)*W_old

So, assuming W_old's dimension is m * n, W_new is also m * n, is that correct?

In that case W_up should be m * k, W_down should be k * m, and W_Adapter is a square matrix here, is that correct?

Also, why does LoRA require more tunable parameters compared with the linear adapter?

Really appreciate your help and your work! Thanks!
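The shapes discussed in this question can be checked numerically (a quick sanity check with made-up dimensions, not code from the repo):

```python
import torch

m, n, k = 6, 4, 2
W_old = torch.randn(m, n)
W_up, W_down = torch.randn(m, k), torch.randn(k, m)

W_adapter = torch.eye(m) + W_up @ W_down   # square, m x m
W_new = W_adapter @ W_old                  # same m x n shape as W_old
print(W_new.shape)
```

Since W_adapter is m x m, multiplying it into W_old indeed leaves the weight's shape unchanged, which is what makes the merge architecture-preserving.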
