
cap4video's Issues

Checkpoint weights

Hi,

Great work and thanks for sharing your code!
Just wondering, do you have plans to release the checkpoint weights of the models you already trained, so that we can directly run inference with them?

Thanks!

Some questions about the file.

Good afternoon, I'm reading and trying to run the code from the files you uploaded, but I haven't managed to run it successfully, possibly because the file "sim_matrix" is not provided. Also, may I ask when the pre-extracted video frame features will be uploaded? I hope for your reply.

Caption encoder and query encoder share weights?

I am confused: the caption encoder and query encoder share weights, so which parameters are optimized when computing the QC matching? And why do we need to pass the caption embeddings (C×D) through MHA and then multiply the result with the query embedding?

> In our paper, the query-video branch and the query-caption branch are trained separately. We first train the query-video branch for 5 epochs. Once that branch is trained, we continue training the query-caption branch.

I looked at your code and found that train_video.py already uses the captions. How, then, should I understand your statement that the first 5 epochs train the query-video branch? (In my understanding, if the first 5 epochs are meant to train the query-video branch, no captions should appear; if captions are present, the query encoder also processes caption information, so the first 5 epochs would not be training the query-video branch alone.)
I am not sure whether my understanding is correct; I am confused about this part and look forward to your reply.

Originally posted by @shams2023 in #4 (comment)

Can you provide inference code using a text query?

Thanks for your great work.

Could you share the inference and visualization code that ranks videos based on a text query, as mentioned in the visualization? (e.g. `infer.py --query 'a person is discussing a car'`)

I look forward to your response. Thank you.

Sample inference code

Hi,
Do you have sample inference code to load the model, preprocess the video and text, and get the similarity score?

Thanks !
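Pending an official script, CLIP-style retrieval inference typically reduces to cosine similarity between L2-normalized text and video embeddings. A minimal sketch, assuming the encoders have already been run (the encoder calls themselves are omitted and the function names here are hypothetical, not from the repo):

```python
import numpy as np

def cosine_similarity_matrix(text_emb: np.ndarray, video_emb: np.ndarray) -> np.ndarray:
    """Return the (T, V) similarity matrix between T queries and V videos,
    given embedding matrices of shape (T, D) and (V, D)."""
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=-1, keepdims=True)
    return t @ v.T

def rank_videos(sim_row: np.ndarray) -> np.ndarray:
    """Indices of videos sorted from most to least similar for one query."""
    return np.argsort(-sim_row)
```

With the real model, the two embedding matrices would come from the text tower and the (frame-pooled) video tower respectively.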

Low R1 performance in the 2nd stage

Thanks for sharing your code. Is it normal to get R1=30 with train_titles.py? After score fusion, the title similarity matrix does not improve on the video similarity matrix.
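Score fusion of this kind is commonly a weighted sum of the two similarity matrices. A minimal sketch, where the weight `alpha` is an illustrative assumption rather than the value used in the paper:

```python
import numpy as np

def fuse_scores(sim_qv: np.ndarray, sim_qc: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """Weighted fusion of the query-video and query-caption (title)
    similarity matrices. alpha is a hypothetical weight; the repo's
    actual fusion rule may differ."""
    assert sim_qv.shape == sim_qc.shape
    return alpha * sim_qv + (1.0 - alpha) * sim_qc
```

If the title matrix is weak on its own (e.g. R1=30), sweeping `alpha` on a validation split shows whether it contributes anything after fusion.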

Questions about [SEP] token

In the code, both the query-video branch and the query-caption branch use [SEP] embedding as the global feature of query or caption, but the paper mentions [CLS] embedding. So should I use [SEP] embedding or [CLS] embedding? Thank you.
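For context (an assumption based on OpenAI CLIP's implementation, not a confirmation from the authors): CLIP's text encoder takes its global feature at the end-of-text token, which has the largest token id and is what some codebases label [SEP]. A sketch of that selection:

```python
import numpy as np

def global_text_feature(token_ids: np.ndarray, hidden: np.ndarray) -> np.ndarray:
    """Pick, per sequence, the hidden state at the end-of-text token.
    token_ids: (B, L); hidden: (B, L, D). Because CLIP's EOT token has
    the largest id in the vocabulary, argmax recovers its position even
    with zero-padding after it."""
    eot_pos = token_ids.argmax(axis=-1)                  # (B,)
    return hidden[np.arange(hidden.shape[0]), eot_pos]   # (B, D)
```

Under this reading, the [SEP]/[CLS] discrepancy would be naming rather than a different pooling choice, but only the authors can confirm.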

Two branch or two loss

In the paper, you mention that "To reduce conflict between the two branches, the query-video branch is trained first, followed by the query-caption branch". However, you also mention that "The total loss L is the sum of the Query-Video loss L_{QV} and the Query-Caption loss L_{QC}". Are the two branches trained separately? My question is: what loss is used when the query-video branch is trained first, and what loss when the query-caption branch is trained afterwards? In addition, how many epochs does it take to train the query-video branch first?

Resume training

Hi, resume training only loads the optimizer state, and the loss doesn't resume from where it stopped.
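To resume both the epoch counter and the loss curve, the checkpoint has to persist more than the optimizer state. A minimal, framework-agnostic sketch (the field names are assumptions, not the repo's schema):

```python
import pickle

def save_checkpoint(path, model_state, optimizer_state, epoch, loss):
    """Persist everything needed to resume: weights, optimizer, epoch, loss."""
    with open(path, "wb") as f:
        pickle.dump({"model": model_state, "optimizer": optimizer_state,
                     "epoch": epoch, "loss": loss}, f)

def load_checkpoint(path):
    """Return the full checkpoint dict; the training loop can then restore
    the model and optimizer and continue from checkpoint['epoch'] + 1."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

In a PyTorch codebase, `torch.save`/`torch.load` on the same dict (with `model.state_dict()` and `optimizer.state_dict()`) plays the role of pickle here.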

Training on other datasets

Hello, in your paper every dataset has multiple captions per video. If my dataset has only one caption per video, will that affect the results? Currently I am training on my own dataset; training looks normal, but the test results are very poor, R@1 = 1.0 (over 100 test samples).

Preprocess for other datasets

Hi, this is good work!
I want to run the code on other video retrieval datasets, so I would like to know how to convert the raw videos into frames, e.g. the sample rate and the size of each frame used in your code.
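Absent the authors' answer, a common recipe for CLIP-based retrieval is uniform temporal sampling of a fixed number of frames, each resized to 224x224. A sketch of the index selection only (the 12-frame count is an illustrative assumption, not the repo's setting):

```python
def sample_frame_indices(total_frames: int, num_frames: int = 12) -> list:
    """Uniformly sample num_frames frame indices from a video.
    Takes the center of each of num_frames equal-length segments;
    the decoded frames would then be resized/cropped to 224x224."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    return [int((i + 0.5) * total_frames / num_frames) for i in range(num_frames)]
```

The actual decoding can be done with OpenCV, decord, or ffmpeg; only the sampled indices need to be read.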

Inference on pretrained model

Hi,

your work is very nice.

I wonder if you could share the pre-trained model and instructions on how to use it, or at least instructions for running the pre-trained model on a test set. For example: given a set of 20 videos, their captions, and a query text, retrieve the video closest to the query text.

Thank you!

How the entire dataset is converted into captions

Thank you very much for your work!
How do you convert the videos of an entire dataset into captions? I currently want to convert all the images or videos in a dataset into captions, but the code from [ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic] only converts one image into captions at a time, so I would like to know what I need to do to caption an entire dataset.
I really hope to receive your guidance. Thank you again!
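Assuming a per-image captioner (e.g. ZeroCap wrapped in a single-image function), captioning a dataset is just a loop over files that saves the results. A hypothetical sketch, where `caption_one` stands in for the single-image model call (it is not a real ZeroCap API):

```python
import json
from pathlib import Path

def caption_dataset(image_paths, caption_one, out_path="captions.json"):
    """Apply a single-image captioner to every file and save id -> caption.
    caption_one(path) is the per-image captioning call; for videos, the
    same loop would run over frames sampled from each clip."""
    captions = {Path(p).stem: caption_one(p) for p in image_paths}
    with open(out_path, "w") as f:
        json.dump(captions, f, indent=2)
    return captions
```

For large datasets you would additionally want to skip already-captioned ids, so that an interrupted run can resume.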

Question of the caption file.

Hi,

Thanks for releasing the code and data.
I checked the provided caption data and found that there are two additional keys in the dataset, 'title' and 'titles'.
Can you provide some explanation of them? For example, how did you obtain the data (from a URL or from a captioning model), and what is the difference between the two sets?

Thank you!

some questions

[screenshot of a formula from the paper]
Does the C here refer to the number of auxiliary captions generated for a video?
The ablation study says that the best results are already obtained with a single auxiliary caption, so how should I understand C?

Question about implementation details.

Hello, I think this is good work.
However, in the code you set batch_size=256, but the paper states that it is 128 (maybe the version of the paper I downloaded from arXiv is wrong?).
I reproduced the code and found that with batch_size=256 the accuracy on MSRVTT-1kA matches the paper, but with batch_size=128 it is only about 47%.
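One plausible reason the batch size matters: in CLIP-style retrieval training the other in-batch samples serve as negatives in a symmetric contrastive (InfoNCE) loss, so halving the batch halves the negatives per query. A numpy sketch of that objective (an assumption about the training loss, not a confirmed detail of this repo; the temperature value is illustrative):

```python
import numpy as np

def symmetric_infonce(sim: np.ndarray, temperature: float = 0.05) -> float:
    """Symmetric cross-entropy over a (B, B) in-batch similarity matrix,
    where the diagonal holds the matched (positive) pairs. A larger B
    gives each query more negatives, which can change final accuracy."""
    logits = sim / temperature

    def xent(l: np.ndarray) -> float:
        # Row-wise log-softmax, then the negative log-likelihood of the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the text->video (rows) and video->text (columns) directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With a uniform similarity matrix the loss is log(B), so the loss scale itself shifts with batch size even before any learning happens.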
