
bike's Introduction

Hi, I'm Wenhao Wu 👋

知乎 (Zhihu) · GitHub · LinkedIn · Google Scholar · X

I am Wenhao Wu (吴文灏), a Ph.D. student in the School of Computer Science at The University of Sydney, supervised by Prof. Wanli Ouyang. I collaborate closely with the Department of Computer Vision Technology (VIS) at Baidu, led by Dr. Jingdong Wang (IEEE Fellow). I received my M.S.E. degree from the Multimedia Laboratory (MMLab@SIAT), University of Chinese Academy of Sciences, supervised by Prof. Shifeng Chen and Prof. Yu Qiao. I was also fortunate to intern or work as a research assistant at MMLab@CUHK, Baidu, iQIYI, SenseTime, Samsung Research, and the Chinese Academy of Sciences. I am honored to have been awarded the 11th Baidu Scholarship (2023).

My current research interests include Cross-Modal Learning and Video Understanding. I have published 20+ papers at top international CV/AI conferences and journals such as CVPR, ICCV, ECCV, AAAI, IJCAI, ACM MM, and IJCV.


🔭 Research Interest

My research interests broadly lie in the areas of Computer Vision and Deep Learning, including:

  • Cross-Modal Learning (2022-Present): Video-Language Matching, Multimodal Large Language Model (MLLM)
  • Video Foundation Model (2017-Present): Video Recognition, Efficient Video Tuning
  • Video-related Applications (2017-2022): Video Sampler, Temporal Action Detection, Anomaly Detection in Video
  • Self-supervised Learning (2021-2022): Contrastive Video Learning, Masked Video Modeling
  • Low-level Vision (2021-2022): Image Colorization, Style Transfer, Image Rescaling

🔥 News

  • 2024.01: I am honored to receive the 11th 🎖Baidu Scholarship🎖, a prestigious fellowship awarding 200,000 RMB (about $30,000) to 10 select Ph.D. students in Artificial Intelligence worldwide, chosen from thousands of applicants.
  • 2023.11: We release GPT4Vis, which provides a quantitative evaluation of GPT-4 for visual understanding across images, videos, and point clouds, spanning 16 popular datasets.
  • 2023.11: We release Side4Video, a Spatial-Temporal Side Network for memory-efficient image-to-video transfer learning, which significantly reduces training memory cost for action recognition (↓75%) and text-video retrieval (↓30%).
  • 2023.08: The extension of Text4Vis has been accepted by IJCV.
  • 2023.07: Two first-author papers (temporal modeling: ATM; cross-modal retrieval: UA) are accepted by ICCV 2023.
  • 2023.02: Two first-author papers on video understanding (BIKE, Cap4Video) are accepted by CVPR 2023. Cap4Video, which uses GPT to enhance text-video learning, is selected as a 🎉Highlight paper🎉 (top 2.5%).
  • 2022.11: Two papers (video recognition: Text4Vis; style transfer: AdaCM) are accepted by AAAI 2023.
  • 2022.07: Three papers (video sampling: NSNet, TSQNet; cross-modal learning: CODER) are accepted by ECCV 2022.
  • 2022.06: Our MaMiCo, a new video self-supervised learning work, is accepted by ACM MM 2022 (🎉Oral Presentation🎉).

bike's People

Contributors

eltociear, whwu95


bike's Issues

Frozen label encoder

What is the difference and connection between the frozen label encoder and the category encoder? I see in Table 6(a) that adding the frozen label encoder improves the result. Also, what is "(technical) Transf"?
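For context, here is a minimal sketch of what a frozen label (text) encoder usually means in CLIP-based recognition: class names are encoded once, with no gradient, and the cached features act as classifier weights. The class names and prompt below are illustrative, not the repository's exact pipeline.

```python
import torch
import clip  # OpenAI CLIP package; BIKE ships a modified copy under clip/

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Illustrative class names and prompt; the real pipeline reads them from the dataset.
class_names = ["archery", "bowling", "juggling balls"]
text_inputs = clip.tokenize([f"a video of {c}" for c in class_names]).to(device)

# "Frozen" means no gradients flow into the text encoder: encode once, cache, reuse.
with torch.no_grad():
    label_features = model.encode_text(text_inputs)  # [n_cls, feat_dim]
    label_features = label_features / label_features.norm(dim=-1, keepdim=True)
```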

CUDA out of memory

I am using four 3090 Ti cards and have set the batch size very small, but this error occurs every time as soon as the first epoch starts:

```
Traceback (most recent call last):
  File "train.py", line 522, in <module>
    main(args)
  File "train.py", line 316, in main
    prec1, output_list, labels_list = validate(epoch, val_loader, classes, device, model, video_head, config, n_class, logger, save_score)
  File "train.py", line 433, in validate
    cls_feature, text_features = model.module.encode_text(text_inputs, return_token=True)  # [n_cls, feat_dim]
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 443, in encode_text
    x = self.transformer(x)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 253, in forward
    x = checkpoint(r, x)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 235, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 96, in forward
    outputs = run_function(*args)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 227, in forward
    x = x + self.drop_path(self.attention(self.ln_1(x)))
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 219, in attention
    return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1153, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 5131, in multi_head_attention_forward
    v = v.contiguous().view(v.shape[0], bsz * num_heads, head_dim).transpose(0, 1)
RuntimeError: CUDA out of memory. Tried to allocate 1.96 GiB (GPU 0; 23.70 GiB total capacity; 15.94 GiB already allocated; 657.56 MiB free; 16.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
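Not an official fix, but two mitigations worth trying, both grounded in the error message and traceback above: set the allocator option the message itself suggests, and make sure validation-time text encoding runs under torch.no_grad() so no activations are retained. The helper name below is hypothetical.

```python
import os

# 1) Allocator hint taken from the error message; must be set before CUDA
#    is first initialized (top of train.py, or as an env var on the command line).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

def encode_text_for_validation(model, text_inputs):
    """Validation needs no gradients; no_grad() stops PyTorch from keeping
    activations for the whole text transformer during encode_text.
    (model / text_inputs are the objects appearing in the traceback above.)"""
    with torch.no_grad():
        return model.module.encode_text(text_inputs, return_token=True)
```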

About testing on Charades: should the final result use the last recorded mAP?

Hi!

I noticed that AverageMeter() is used in calculating the mAP, and the output is the global average, which is not a problem when calculating accuracy.

However, mAP should be computed over all predictions at once. Therefore, the value of the last maper.value().numpy() call should be used directly as the final result.

Am I understanding this correctly?
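To make the concern concrete: AP is not additive across batches, so averaging per-batch AP values generally differs from computing AP over all predictions at once. A toy single-class example (my own sketch, not the repository's code):

```python
import torch

def average_precision(scores, labels):
    # AP for a single class: precision averaged at the rank of each positive.
    order = scores.argsort(descending=True)
    labels = labels[order]
    hits = labels.cumsum(0)
    ranks = torch.arange(1, len(labels) + 1)
    precision_at_hits = hits[labels.bool()] / ranks[labels.bool()]
    return precision_at_hits.mean()

scores = torch.tensor([0.9, 0.1, 0.8, 0.3])
labels = torch.tensor([1, 0, 0, 1])

full = average_precision(scores, labels)  # AP over all predictions at once
batched = torch.stack([average_precision(scores[:2], labels[:2]),
                       average_precision(scores[2:], labels[2:])]).mean()
print(full, batched)
```

Here the full-set AP is 5/6 ≈ 0.833 while the per-batch average is 0.75, so the two protocols genuinely disagree.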

About warning: "None of the inputs have requires_grad=True. Gradients will be None"

I'm reaching out regarding an issue I've encountered while working with the model. I'm receiving a warning that says, 'None of the inputs have requires_grad=True. Gradients will be None.' I've been trying to troubleshoot this and was wondering if you might have insights into resolving this particular warning.

From my understanding, there might be an issue with how the requires_grad attribute is set on the inputs, leading to missing gradients during training. But the inputs (images) should not need requires_grad=True.

Could you kindly offer guidance or share any specific steps or considerations to address this warning? Any advice or direction you could provide would be greatly appreciated.

Thank you very much for your time and assistance.
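For context, this warning is emitted by torch.utils.checkpoint when every tensor passed to it has requires_grad=False, which is exactly what happens when gradient checkpointing runs inside a fully frozen branch (e.g. a frozen CLIP text encoder). A hedged sketch of two common workarounds; forward_block is a hypothetical helper, not the repository's code:

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_block(block, x, use_checkpoint=True):
    # Workaround 1: skip checkpointing when x carries no gradient; there is
    # nothing to recompute in backward, so the checkpoint buys nothing and
    # only triggers the warning.
    if not use_checkpoint or not x.requires_grad:
        return block(x)
    return checkpoint(block, x)

# Workaround 2 (only if some parameters inside the block ARE trainable):
# give checkpoint a differentiable input so gradients can flow to them.
#   x = x.detach().requires_grad_(True)
```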

Questions about the pre-generated attributes on UCF101/HMDB51 (I could not replicate your results on UCF101)

Congratulations on your excellent work! I saw that you provided the attribute JSON file for K400. Could you also provide the train/val JSON files for UCF101 and HMDB51? The main reason I ask is that I cannot reproduce your few-shot results on UCF101 (shot = 1/2/5): you report 95.2/96.1/96.5, while my results are 90.22/94.15/95.32. I have trained multiple times and taken the highest value. These are all results of the video branch only, without the attribute branch mentioned in your paper. Also, I wish you a happy National Day! Looking forward to your reply, and thank you!

Questions about the pre-generated attributes

Congratulations on your excellent work! It seems the currently published code does not include the 'pre-generated attributes', which should be in a JSON file, and the test code does not include the attributes branch either. Will the complete code be published soon?

Question about Kinetics-400 88.7

The README reports ViT-L/14; 16×336; 4×3 views; top-1 = 88.7 on Kinetics-400. However, the accuracy in the corresponding log link is not 88.7 but only 87.9.
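One plausible (unconfirmed) explanation for such gaps: a training log usually records single-view validation accuracy, while headline numbers use multi-view testing, here 4 temporal clips × 3 spatial crops, whose scores are averaged before the argmax. A sketch of that aggregation, with dummy scores:

```python
import torch

# Per-view class scores for one video: 4 clips x 3 crops = 12 views.
num_views, num_classes = 12, 400
view_scores = torch.rand(num_views, num_classes).softmax(dim=-1)

# Multi-view testing averages scores across views before classifying,
# which generally scores higher than single-view validation.
pred = view_scores.mean(dim=0).argmax().item()
```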

About inference

Hi! I would like to know whether the attributes branch is used during inference.

Key models

Congratulations on your work. May I ask if you could upload some key models to the cloud for download (e.g., k400-vit-l-14-f16.pt)?

num_sample: 1 → 4

When changing num_sample from 1 to 4, how should the training function be modified? In particular, this code:

```python
images = images.view((-1, config.data.num_segments, 3) + images.size()[-2:])  # bt 3 h w
b, t, c, h, w = images.size()
images = images.view(-1, c, h, w)
```

and

```python
image_embedding, cls_embedding, text_embedding, logit_scale = model(images, texts, return_token=True)
```

after the model's incoming parameters change from a tensor to a list.
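A hedged sketch of one way to handle this, assuming num_sample yields several augmented copies of each video that should be folded into the batch dimension, with the paired texts repeated to stay aligned. This is a guess at the intent, not the repository's official fix; all shapes are dummies:

```python
import torch

num_sample, b, t, c, h, w = 4, 8, 16, 3, 224, 224

# With num_sample > 1 the loader yields num_sample augmented copies per video.
images = torch.randn(b, num_sample, t, c, h, w)

# Fold the sample dimension into the batch, then flatten frames as before:
images = images.view(-1, t, c, h, w)   # [(b * num_sample), t, c, h, w]
images = images.view(-1, c, h, w)      # [(b * num_sample * t), c, h, w]

# Repeat the paired text tokens so each augmented copy keeps its caption.
texts = torch.randint(0, 49408, (b, 77))            # dummy CLIP token ids
texts = texts.repeat_interleave(num_sample, dim=0)  # [(b * num_sample), 77]
```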
