
lgva_videoqa's Issues

Processing of GLIP

Hi! Can you explain the specific steps of GLIP in detail?
rFeature = item_dict['bbox_features'][:, :, 0, :, :]
'bbox_features' does not look like features extracted through the image-encoder branch of CLIP; it looks more like features extracted directly by GLIP.

Clarification on Feature Representations from text_features_all.h5 and text_features_clip.h5

I'm currently working with the datasets stored in text_features_all.h5 and text_features_clip.h5 and have come across three specific features extracted from these files: text_query_features, text_query_token_features, and text_cands_features.

Could you provide a detailed explanation of what each of these three features represents within the context of the data?
Specifically, how do text_query_features, text_query_token_features, and text_cands_features differ in their representation of the text data, and what role does each play in the overall model?

I'm also curious about the extraction process for these features. Thank you very much for your support.
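
One way to make this question more concrete is to open both files and list every dataset with its shape and dtype. This is only a generic h5py inspection sketch with placeholder paths, not anything specific to this repository's release:

    import h5py

    # Placeholder paths; point these at wherever the released feature files live.
    for path in ["text_features_all.h5", "text_features_clip.h5"]:
        with h5py.File(path, "r") as f:
            print(f"== {path} ==")

            def show(name, obj):
                # Print every dataset's name, shape, and dtype, including nested groups.
                if isinstance(obj, h5py.Dataset):
                    print(f"  {name}: shape={obj.shape}, dtype={obj.dtype}")

            f.visititems(show)

The printed shapes should at least show how text_query_features, text_query_token_features, and text_cands_features relate to sequence length and embedding size, even before the authors confirm their exact meaning.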

Missing h5 extracted feature files

Thanks for your excellent code. However, some of the files required for --data_path and --feature_path are missing. Do you have any further plans to open-source them?

About bbox_features in NeXt-clip-bbox-features.zip

Thank you very much for the publicly available source code and dataset.

I have two questions that I hope you can answer:

  1. In NeXt-clip-bbox-features.zip, the array in each h5 file has shape (64, 2, 10, 768). I am curious what the 2 and the 10 represent. I also see that model.py uses rFeature = item_dict['bbox_features'][:, :, 0, :, :]. Could you explain what the slices at index 0 and index 1 of that dimension, i.e. the (64, 0, 10, 768) and (64, 1, 10, 768) slices, refer to? (See the small inspection sketch below.)

  2. The NExT-QA dataset seems to have a total of 5,440 videos, but there are 9,454 h5 files in both NeXt-clip-features and NeXt-clip-bbox-features. Could you explain this discrepancy?
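
For reference, the slice in model.py can be reproduced on a single per-video h5 file to see exactly which axis the index 0 picks out. The file name and the 'bbox_features' key below are guesses based on this thread, and what the 2 and the 10 actually mean is still the open question above:

    import h5py
    import numpy as np

    # Hypothetical file name; each file in NeXt-clip-bbox-features.zip reportedly
    # stores an array of shape (64, 2, 10, 768).
    with h5py.File("some_video_id.h5", "r") as f:
        bbox_features = np.array(f["bbox_features"])   # key name is a guess
    print(bbox_features.shape)                          # (64, 2, 10, 768)

    # In model.py the features arrive with a batch dimension, so the tensor is 5-D and
    # rFeature = item_dict['bbox_features'][:, :, 0, :, :] keeps only index 0 of the
    # third axis -- the "(64, 0, 10, 768)" slice asked about above.
    batched = bbox_features[None]        # add a fake batch dim -> (1, 64, 2, 10, 768)
    rFeature = batched[:, :, 0, :, :]    # -> (1, 64, 10, 768)
    print(rFeature.shape)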

Looking forward to your reply, and thank you very much!

Code reproduction problem

Hi, when I reproduce your code on the NExT-QA dataset, I find that the shape of cFeature (batch_size, num_frames, dim) is (64, 16, 768), but num_frames should be 64 for this dataset. So I would like to know whether the file text_features_blip_caption.h5 is incorrect?

About CLIP’s text-encoder

Hello, I still have some doubts about using CLIP to extract features of the question. By modifying the original CLIP code, we can obtain local question features with shape [bs, 77, 512], but it is not clear how to obtain the global question features you mention in the paper. Can you give me some advice?
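
For reference, in OpenAI's released CLIP code the global text embedding is the hidden state at the end-of-text (EOT) token position, projected with text_projection, while the full [bs, 77, 512] hidden states are the token-level (local) features. Whether the paper's authors do exactly this is not confirmed here; the sketch below only shows the standard CLIP route to both, assuming the ViT-B/32 checkpoint:

    import torch
    import clip

    model, _ = clip.load("ViT-B/32")
    device = next(model.parameters()).device
    text = clip.tokenize(["why does the man pick up the baby"]).to(device)  # example question

    with torch.no_grad():
        x = model.token_embedding(text).type(model.dtype)        # [bs, 77, 512]
        x = x + model.positional_embedding.type(model.dtype)
        x = x.permute(1, 0, 2)                                    # [77, bs, 512]
        x = model.transformer(x)                                  # causal mask is built into the module
        x = x.permute(1, 0, 2)                                    # [bs, 77, 512]
        x = model.ln_final(x).type(model.dtype)

        local_features = x                                        # token-level (local) features
        # The EOT token has the highest token id in each row; its projected hidden
        # state is what CLIP returns as the global sentence embedding.
        global_features = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ model.text_projection

    print(local_features.shape, global_features.shape)            # [1, 77, 512] [1, 512]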

The feature-extraction code for "bbox_features"

Thanks for doing such a great job!
Two questions:

  1. Is item_dict['video_features'] supposed to be obtained via a pre-trained CLIP?
  2. Is item_dict['bbox_features'] obtained via pre-trained GLIP?

When will the code for extracting the 'bbox_features' embeddings with pre-trained GLIP be available?

Processing of candidate words

Hi, I have some doubts about how the candidate answers are processed; could you help me?

Suppose we have 40 candidate words that we feed into CLIP's text encoder, following the approach in the paper. Do we encode them one by one to get a feature of shape [40, 512], or do we splice all the candidate words into one sentence and get a single feature of shape [1, 512]?
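
Which of the two options the paper actually uses is not confirmed here, but the first is what the released CLIP API does most naturally: tokenize each candidate as its own sequence and encode them as one batch, giving one 512-d vector per candidate. A minimal sketch, assuming plain English candidate words and the ViT-B/32 checkpoint:

    import torch
    import clip

    model, _ = clip.load("ViT-B/32")
    device = next(model.parameters()).device

    candidates = ["dog", "cat", "ball"]              # ...40 candidate words in practice
    tokens = clip.tokenize(candidates).to(device)    # [num_candidates, 77], one row per candidate

    with torch.no_grad():
        cand_features = model.encode_text(tokens)    # [num_candidates, 512], one global vector each

    print(cand_features.shape)                       # e.g. torch.Size([3, 512])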
