ecoxial2007 / lgva_videoqa
Language-Guided Visual Aggregation for Video Question Answering
Hi! Can you explain the specific steps of GLIP in detail?
rFeature = item_dict['bbox_features'][:, :, 0, :, :]
'bbox_features' does not look like features extracted by the image-encoder branch of CLIP; it looks like features extracted directly by GLIP.
Hi, can you release the code for distributed multi-GPU training?
Where is the bbox handling in CLIP?
I'm currently working with the datasets stored in text_features_all.h5 and text_features_clip.h5 and have come across three specific features extracted from these files: text_query_features, text_query_token_features, and text_cands_features.
Could you provide a detailed explanation of what each of these three features represents within the context of the data?
Specifically, how do text_query_features, text_query_token_features, and text_cands_features differ in their representation of text data, and what role does each play in the overall model?
I'm also curious about the extraction process for these features. Thank you very much for your support.
Thanks for your excellent code. However, some files for --data_path and --feature_path are missing. Do you have any further open-source plans?
Hi, I find that the qid in next_train_qa.json is not the same as that in the original train.csv. Could you explain the meaning of qid?
Can you release the code for video and text feature extraction? Many thanks!
Thank you very much for the publicly available source code and dataset.
I have two questions that I hope to receive your response to:
In NeXt-clip-bbox-features.zip, the shape of each h5 file is (64, 2, 10, 768). I am curious what the 2 and the 10 represent. I see that in model.py the author uses: rFeature = item_dict['bbox_features'][:, :, 0, :, :]. So, could you explain what the (64, 0, 10, 768) and (64, 1, 10, 768) slices refer to?
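A minimal numpy sketch of the indexing being asked about, assuming each h5 item loads as a (64, 2, 10, 768) array (the axis meanings here are assumptions, not the authors' documentation):

```python
import numpy as np

# Hypothetical stand-in for one item loaded from an h5 file with
# shape (64, 2, 10, 768); what each axis means is an assumption
# (frames, two feature variants, boxes per frame, embedding dim).
bbox_features = np.zeros((64, 2, 10, 768))

# Selecting index 0 along the second axis keeps one of the two
# per-frame feature sets, giving shape (64, 10, 768).
rFeature = bbox_features[:, 0, :, :]
print(rFeature.shape)  # (64, 10, 768)
```

Note that the snippet in the question uses five indices (`[:, :, 0, :, :]`), which suggests the tensor in the model carries a leading batch dimension on top of the per-item (64, 2, 10, 768) shape.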
The NExT-QA dataset seems to have a total of 5,440 videos, but there are 9,454 h5 files in both NeXt-clip-features and NeXt-clip-bbox-features.
Looking forward to your reply, and thank you very much!
Hi, when I reproduce your code on the NExT-QA dataset, I find that the shape of cFeature (batch_size, num_frames, dim) is (64, 16, 768). But num_frames should be 64 for the dataset. So I want to know whether the file text_features_blip_caption.h5 is incorrect?
Hello, I still have some doubts about using CLIP to extract features of the question. By modifying the original code of CLIP, we can obtain local question features with the shape [bs, 77, 512], but it is not clear how to obtain the global question features you mentioned in the paper. Can you give me some advice?
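For reference, OpenAI's CLIP takes the per-token activation at the end-of-text (EOT) position as the global sentence feature (followed by a final layer norm and projection). Since EOT has the largest id in CLIP's vocabulary, its position is found with argmax over the token ids. A numpy sketch of that pooling step, with hypothetical token features and ids (whether this repo does exactly the same is an assumption):

```python
import numpy as np

bs, ctx_len, dim = 2, 77, 512

# Hypothetical per-token features, shaped like the [bs, 77, 512]
# tensor from a modified CLIP text encoder.
token_features = np.random.randn(bs, ctx_len, dim)

# Hypothetical token ids; 49407 is the EOT id in CLIP's BPE vocab,
# so argmax over the sequence finds the EOT position.
token_ids = np.zeros((bs, ctx_len), dtype=int)
token_ids[0, 5] = 49407
token_ids[1, 9] = 49407

eot_pos = token_ids.argmax(axis=-1)
global_features = token_features[np.arange(bs), eot_pos]
print(global_features.shape)  # (2, 512)
```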
Thanks for doing such a great job!
Two issues:
When will the extraction feature embedding code for 'bbox_features' using pre-trained GLIP be available?
Hi, I have some doubts about the processing part of the candidates, can you help me?
Suppose we have 40 candidate words, which we put into CLIP's text encoder following the approach in the paper. Do we feed them into the text encoder one by one to get a feature of shape [40, 512]? Or do we splice all the candidate words into one sentence and get a feature of shape [1, 512]?
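A sketch contrasting the two options with a placeholder encoder (the real answer depends on the authors' code; `encode_text` here is a hypothetical stand-in for CLIP's text encoder, not its API):

```python
import numpy as np

def encode_text(sentences):
    """Placeholder for a CLIP-style text encoder: one 512-d vector per
    input string. The real model would tokenize and run a transformer."""
    return np.random.randn(len(sentences), 512)

candidates = [f"candidate word {i}" for i in range(40)]

# Option A: encode each candidate independently -> one row per candidate.
per_candidate = encode_text(candidates)
print(per_candidate.shape)  # (40, 512)

# Option B: splice everything into a single sentence -> a single vector.
joined = encode_text([" ".join(candidates)])
print(joined.shape)  # (1, 512)
```

Encoding candidates independently (Option A) is the usual choice when each candidate must later be scored against a question feature, since it preserves one embedding per answer option.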