
cap4video's Issues

Checkpoint weights

Hi,

Great work and thanks for sharing your code!
Just wondering, do you have plans to release the checkpoint weights of the models you already trained, so that we can directly run inference with them?

Thanks!

Some questions about the file.

Good afternoon, I'm reading and trying to run the code from the files you uploaded, but I haven't managed to run it successfully, possibly because the file "sim_matrix" is not provided. Also, may I ask when the pre-extracted video frame features will be uploaded? I hope for your reply.

Caption encoder and query encoder share weights?

I am confused: the caption encoder and query encoder share weights, so which parameters are optimized when computing the QC matching? And why do we need to pass the caption embeddings (C×D) through MHA and then multiply the result with the query embedding?

> In our paper, the query-video branch and the query-caption branch are trained separately. We first train the query-video branch for 5 epochs. Once that branch is trained, we continue training the query-caption branch.

I looked at your code and found that train_video.py already uses the captions. How, then, should I understand your statement that the first 5 epochs train the query-video branch? (In my understanding, if the first 5 epochs are meant to train the query-video branch, no captions should appear; if captions are present, the query encoder also processes caption information, so the first 5 epochs would not be training the query-video branch alone.)
I am not sure whether my understanding is correct; I am confused about this part and look forward to your reply.

Originally posted by @shams2023 in #4 (comment)

Can you provide inference code using a text query?

Thanks for your great work.

Could you share the inference and visualization code that ranks videos based on a text query, as mentioned in the visualization? (e.g. `infer.py --query 'a person is discussing a car'`)

I look forward to your response. Thank you.

Sample inference code

Hi,
Do you have sample inference code to load the model, preprocess the video and text, and get the similarity score?

Thanks !
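Pending an official script, CLIP-style retrieval inference typically reduces to cosine similarity between L2-normalized text and video embeddings. A minimal sketch, assuming the encoders have already been run (the encoder calls themselves are omitted and the function names here are hypothetical, not from the repo):

```python
import numpy as np

def cosine_similarity_matrix(text_emb: np.ndarray, video_emb: np.ndarray) -> np.ndarray:
    """Return the (T, V) similarity matrix between T queries and V videos,
    given embedding matrices of shape (T, D) and (V, D)."""
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=-1, keepdims=True)
    return t @ v.T

def rank_videos(sim_row: np.ndarray) -> np.ndarray:
    """Indices of videos sorted from most to least similar for one query."""
    return np.argsort(-sim_row)
```

With the real model, the two embedding matrices would come from the text tower and the (frame-pooled) video tower respectively.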

Low R1 performance in the 2nd stage

Thanks for sharing your code. Is it normal to get R1=30 with train_titles.py? After score fusion, the title similarity matrix does not improve on the video similarity matrix.
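Score fusion of this kind is commonly a weighted sum of the two similarity matrices. A minimal sketch, where the weight `alpha` is an illustrative assumption rather than the value used in the paper:

```python
import numpy as np

def fuse_scores(sim_qv: np.ndarray, sim_qc: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """Weighted fusion of the query-video and query-caption (title)
    similarity matrices. alpha is a hypothetical weight; the repo's
    actual fusion rule may differ."""
    assert sim_qv.shape == sim_qc.shape
    return alpha * sim_qv + (1.0 - alpha) * sim_qc
```

If the title matrix is weak on its own (e.g. R1=30), sweeping `alpha` on a validation split shows whether it contributes anything after fusion.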

Questions about [SEP] token

In the code, both the query-video branch and the query-caption branch use [SEP] embedding as the global feature of query or caption, but the paper mentions [CLS] embedding. So should I use [SEP] embedding or [CLS] embedding? Thank you.
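For context (an assumption based on OpenAI CLIP's implementation, not a confirmation from the authors): CLIP's text encoder takes its global feature at the end-of-text token, which has the largest token id and is what some codebases label [SEP]. A sketch of that selection:

```python
import numpy as np

def global_text_feature(token_ids: np.ndarray, hidden: np.ndarray) -> np.ndarray:
    """Pick, per sequence, the hidden state at the end-of-text token.
    token_ids: (B, L); hidden: (B, L, D). Because CLIP's EOT token has
    the largest id in the vocabulary, argmax recovers its position even
    with zero-padding after it."""
    eot_pos = token_ids.argmax(axis=-1)                  # (B,)
    return hidden[np.arange(hidden.shape[0]), eot_pos]   # (B, D)
```

Under this reading, the [SEP]/[CLS] discrepancy would be naming rather than a different pooling choice, but only the authors can confirm.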

Two branch or two loss

In the paper, you mention that "To reduce conflict between the two branches, the query-video branch is trained first, followed by the query-caption branch". However, you also mention that "The total loss L is the sum of the Query-Video loss L_{QV} and the Query-Caption loss L_{QC}". Are the two branches trained separately? My question is: what loss is used when the query-video branch is trained first, and what loss when the query-caption branch is trained afterwards? In addition, how many epochs does it take to train the query-video branch first?

Resume training

Hi, resume training only loads the optimizer state, and the loss doesn't resume from where it stopped.
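To resume both the epoch counter and the loss curve, the checkpoint has to persist more than the optimizer state. A minimal, framework-agnostic sketch (the field names are assumptions, not the repo's schema):

```python
import pickle

def save_checkpoint(path, model_state, optimizer_state, epoch, loss):
    """Persist everything needed to resume: weights, optimizer, epoch, loss."""
    with open(path, "wb") as f:
        pickle.dump({"model": model_state, "optimizer": optimizer_state,
                     "epoch": epoch, "loss": loss}, f)

def load_checkpoint(path):
    """Return the full checkpoint dict; the training loop can then restore
    the model and optimizer and continue from checkpoint['epoch'] + 1."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

In a PyTorch codebase, `torch.save`/`torch.load` on the same dict (with `model.state_dict()` and `optimizer.state_dict()`) plays the role of pickle here.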

Training on other datasets

Hello, in your paper every dataset has multiple captions per video. If my dataset has only one caption per video, will that affect the results? Currently I am training on my own dataset; training looks normal, but the test results are very poor, R@1 = 1.0 (over 100 test samples).

Preprocess for other datasets

Hi, this is good work!
I want to run the code on other video retrieval datasets, so I would like to know how to convert the raw videos into frames, e.g. the sample rate and the size of each frame used in your code.
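Absent the authors' answer, a common recipe for CLIP-based retrieval is uniform temporal sampling of a fixed number of frames, each resized to 224x224. A sketch of the index selection only (the 12-frame count is an illustrative assumption, not the repo's setting):

```python
def sample_frame_indices(total_frames: int, num_frames: int = 12) -> list:
    """Uniformly sample num_frames frame indices from a video.
    Takes the center of each of num_frames equal-length segments;
    the decoded frames would then be resized/cropped to 224x224."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    return [int((i + 0.5) * total_frames / num_frames) for i in range(num_frames)]
```

The actual decoding can be done with OpenCV, decord, or ffmpeg; only the sampled indices need to be read.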

Inference on pretrained model

Hi,

your work is very nice.

I wonder if you could share the pre-trained model and instructions on how to use it, or at least instructions for running the pre-trained model on a test set. For example: given a set of 20 videos, their captions, and a query text, retrieve the video closest to the query text.

Thank you!

How the entire dataset is converted into captions

Thank you very much for your work!
How do you convert the videos of an entire dataset into captions? I currently want to convert all the images or videos in a dataset into captions, but the code from [ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic] only converts one image into captions at a time, so I would like to know what I need to do to caption an entire dataset.
I really hope to receive your guidance. Thank you again!
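Assuming a per-image captioner (e.g. ZeroCap wrapped in a single-image function), captioning a dataset is just a loop over files that saves the results. A hypothetical sketch, where `caption_one` stands in for the single-image model call (it is not a real ZeroCap API):

```python
import json
from pathlib import Path

def caption_dataset(image_paths, caption_one, out_path="captions.json"):
    """Apply a single-image captioner to every file and save id -> caption.
    caption_one(path) is the per-image captioning call; for videos, the
    same loop would run over frames sampled from each clip."""
    captions = {Path(p).stem: caption_one(p) for p in image_paths}
    with open(out_path, "w") as f:
        json.dump(captions, f, indent=2)
    return captions
```

For large datasets you would additionally want to skip already-captioned ids, so that an interrupted run can resume.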

Question of the caption file.

Hi,

Thanks for releasing the code and data.
I checked the provided caption data and found that there are two additional keys in the dataset, 'title' and 'titles'.
Can you provide some explanation of them? For example, how did you obtain the data (from a URL or from a captioning model), and what is the difference between the two sets?

Thank you!

some questions

[screenshot of a formula from the paper]
Does the C here refer to the number of auxiliary captions generated for a video?
The ablation study says that the best results are already obtained with a single auxiliary caption, so how should I understand C?

Question about implementation details.

Hello, I think this is good work.
However, in the code you set batch_size=256, but the paper states that it is 128 (maybe the version of the paper I downloaded from arXiv is wrong?).
I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
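One plausible reason the batch size matters: in CLIP-style retrieval training the other in-batch samples serve as negatives in a symmetric contrastive (InfoNCE) loss, so halving the batch halves the negatives per query. A numpy sketch of that objective (an assumption about the training loss, not a confirmed detail of this repo; the temperature value is illustrative):

```python
import numpy as np

def symmetric_infonce(sim: np.ndarray, temperature: float = 0.05) -> float:
    """Symmetric cross-entropy over a (B, B) in-batch similarity matrix,
    where the diagonal holds the matched (positive) pairs. A larger B
    gives each query more negatives, which can change final accuracy."""
    logits = sim / temperature

    def xent(l: np.ndarray) -> float:
        # Row-wise log-softmax, then the negative log-likelihood of the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the text->video (rows) and video->text (columns) directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With a uniform similarity matrix the loss is log(B), so the loss scale itself shifts with batch size even before any learning happens.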
