
BIKE's Issues

CUDA out of memory

I am using four 3090 Ti cards and have set the batch size very small, but this out-of-memory error occurs every time the first epoch starts.

Traceback (most recent call last):
  File "train.py", line 522, in <module>
    main(args)
  File "train.py", line 316, in main
    prec1, output_list, labels_list = validate(epoch, val_loader, classes, device, model, video_head, config, n_class, logger, save_score)
  File "train.py", line 433, in validate
    cls_feature, text_features = model.module.encode_text(text_inputs, return_token=True)  # [n_cls, feat_dim]
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 443, in encode_text
    x = self.transformer(x)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 253, in forward
    x = checkpoint(r, x)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 235, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 96, in forward
    outputs = run_function(*args)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 227, in forward
    x = x + self.drop_path(self.attention(self.ln_1(x)))
  File "/home/chenshengyi/depthstudy/gesturecode/BIKE-main/clip/model.py", line 219, in attention
    return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1153, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/opt/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 5131, in multi_head_attention_forward
    v = v.contiguous().view(v.shape[0], bsz * num_heads, head_dim).transpose(0, 1)
RuntimeError: CUDA out of memory. Tried to allocate 1.96 GiB (GPU 0; 23.70 GiB total capacity; 15.94 GiB already allocated; 657.56 MiB free; 16.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
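Since the crash happens inside validate(), one thing worth checking is whether the validation forward passes run with autograd disabled. A minimal sketch (not the repository's code; validate_step is a hypothetical name) of what that would look like:

```python
import torch

# A minimal sketch: validation only needs forward passes, so running it under
# torch.inference_mode() stops autograd from retaining activations, which is
# often enough to avoid validation-time OOM like the one in this traceback.
@torch.inference_mode()
def validate_step(model, images):
    return model(images)

# The error message's own suggestion can also be tried before launching:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

If the existing validate() already runs under torch.no_grad(), another common culprit is accumulating per-batch outputs on the GPU across the whole epoch; moving them to CPU each batch with .cpu() keeps GPU memory flat.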

Frozen label encoder

What is the difference and connection between the frozen label encoder and the category encoder? I see in Table 6(a) that adding a frozen label encoder improves the result. Also, what is (technical) Transf?

key models

Congratulations on your work. May I ask if you could upload some key models to the cloud for download (e.g. k400-vit-l-14-f16.pt)?

Questions about the pre-generated attributes on UCF101/HMDB51: I couldn't replicate your results on UCF101

Congratulations on your excellent work! I saw that you provided the attribute JSON file for K400. May I ask if you could also provide the train/val JSON files for UCF and HMDB51? The main reason is that when conducting few-shot training on the UCF101 dataset (shot = 1/2/5), I cannot reproduce your results (95.2/96.1/96.5); my results are (90.22/94.15/95.32). I have trained multiple times and taken the highest value. All of the above are results of the video branch alone, without the attribute branch mentioned in your paper. Also, I wish you a happy National Day! Looking forward to your reply. Thank you!

Questions about the pre-generated attributes

Congratulations on your excellent work! It seems that the currently published code doesn't include the 'pre-generated attributes', which should be in a JSON file, and I also find that the test code doesn't include the attributes branch. Will the complete code be published soon?

num_sample: 1 → 4

When changing num_sample from 1 to 4, how should the code in train.py be modified? In particular, these lines:

images = images.view((-1, config.data.num_segments, 3) + images.size()[-2:])  # bt 3 h w
b, t, c, h, w = images.size()
images = images.view(-1, c, h, w)

and

image_embedding, cls_embedding, text_embedding, logit_scale = model(images, texts, return_token=True)

After this change, the inputs passed to the model become a list instead of a single tensor.
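One possible adaptation, sketched under the assumption that with num_sample = 4 the loader yields a list of augmented clips, each shaped [B, T*3, H, W] (all shape names here are hypothetical, not the repository's): fold the extra samples into the batch dimension before the original view() logic.

```python
import torch

# Hypothetical shapes: batch B, num_segments T, num_sample S.
B, T, S, H, W = 2, 8, 4, 32, 32
images = [torch.randn(B, T * 3, H, W) for _ in range(S)]  # what the loader might yield

images = torch.cat(images, dim=0)                        # [B*S, T*3, H, W]
images = images.view((-1, T, 3) + images.size()[-2:])    # [B*S, T, 3, H, W]
b, t, c, h, w = images.size()
images = images.view(-1, c, h, w)                        # [B*S*T, 3, H, W]
```

The texts/labels would then need to be repeated S times along the batch dimension so that the loss still sees matching batch sizes after model(images, texts, return_token=True).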

About warning: "None of the inputs have requires_grad=True. Gradients will be None"

I'm reaching out regarding an issue I've encountered while working with the model. I'm receiving a warning that says, 'None of the inputs have requires_grad=True. Gradients will be None.' I've been trying to troubleshoot this and was wondering if you might have insights into resolving this particular warning.

From my understanding, there might be an issue with how the requires_grad attribute of the inputs is set, leading to the absence of gradients during training. But the inputs (images) should not need requires_grad=True.

Could you kindly offer guidance or share any specific steps or considerations to address this warning? Any advice or direction you could provide would be greatly appreciated.

Thank you very much for your time and assistance.
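For what it's worth, this exact warning is raised by check_backward_validity() inside torch.utils.checkpoint: it fires whenever checkpoint() is called and none of its tensor inputs require grad, e.g. when the checkpointed module is entirely frozen, as a frozen label encoder would be. A minimal reproduction:

```python
import warnings
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(4, 4)
for p in layer.parameters():
    p.requires_grad_(False)  # fully frozen module, as with a frozen text encoder

x = torch.randn(2, 4)  # ordinary input; requires_grad stays False, as it should

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y = checkpoint(layer, x, use_reentrant=True)  # triggers the warning
```

If nothing inside the checkpointed transformer is trainable, the warning is harmless: it only signals that the recomputation done by checkpoint() buys nothing there, so it can be silenced or the checkpoint call skipped for frozen modules.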

About testing Charades: should the final result use the last recorded mAP?

Hi!

I noticed that AverageMeter() is used when calculating mAP and the output is the global average; this is not a problem when calculating accuracy.

However, mAP should be computed over all predictions at once. Therefore, the value of the last maper.value().numpy() should be used directly as the final result.

Am I understanding this correctly?
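That reading matches how AP behaves numerically: averaging per-batch APs is generally not equal to the AP computed over the pooled predictions, so only the meter's final value over the full set is correct. A toy sketch (hypothetical scores, single class) illustrating the gap:

```python
def average_precision(scores, labels):
    # AP for one class: mean of precision at each positive, ranked by score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

batch1 = ([0.9, 0.2], [1, 0])
batch2 = ([0.8, 0.7], [0, 1])

# Running average of per-batch APs (what an AverageMeter reports):
per_batch = (average_precision(*batch1) + average_precision(*batch2)) / 2  # 0.75

# AP over all predictions pooled together (what mAP should be):
global_ap = average_precision(batch1[0] + batch2[0], batch1[1] + batch2[1])  # 5/6
```

Here the per-batch average is 0.75 while the pooled AP is about 0.833, because pooling lets a confident positive from one batch outrank a false positive from another.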

question about Kinetics400 88.7

ViT-L/14, 16×336, 4×3 views is reported at top-1 = 88.7 on Kinetics-400. However, the accuracy in the corresponding log link is only 87.9, not 88.7.

about inference

Hi! I want to know whether the attributes branch is used during inference.
