
lgva_videoqa's Issues

Processing of GLIP

Hi! Can you explain the specific steps of GLIP in detail?
rFeature = item_dict['bbox_features'][:, :, 0, :, :]
'bbox_features' does not look like features extracted through the image-encoder branch of CLIP; it looks more like features extracted directly by GLIP.

Clarification on Feature Representations from text_features_all.h5 and text_features_clip.h5

I'm currently working with the datasets stored in text_features_all.h5 and text_features_clip.h5 and have come across three specific features extracted from these files: text_query_features, text_query_token_features, and text_cands_features.

Could you provide a detailed explanation of what each of these three features represents within the context of the data?
Specifically, how do text_query_features, text_query_token_features, and text_cands_features differ in their representation of the text data, and what role does each play in the overall model?

I'm also curious about the extraction process for these features. Thank you very much for your support.
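
One way to make this question more concrete is to open both files and list every dataset with its shape and dtype. This is only a generic h5py inspection sketch with placeholder paths, not anything specific to this repository's release:

    import h5py

    # Placeholder paths; point these at wherever the released feature files live.
    for path in ["text_features_all.h5", "text_features_clip.h5"]:
        with h5py.File(path, "r") as f:
            print(f"== {path} ==")

            def show(name, obj):
                # Print every dataset's name, shape, and dtype, including nested groups.
                if isinstance(obj, h5py.Dataset):
                    print(f"  {name}: shape={obj.shape}, dtype={obj.dtype}")

            f.visititems(show)

The printed shapes should at least show how text_query_features, text_query_token_features, and text_cands_features relate to sequence length and embedding size, even before the authors confirm their exact meaning.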

Missing h5 extracted feature files

Thanks for your excellent code. However, some of the files required for --data_path and --feature_path are missing. Do you have any further plans to open-source them?

About bbox_features in NeXt-clip-bbox-features.zip

Thank you very much for the publicly available source code and dataset.

I have two questions that I hope you can answer:

  1. In NeXt-clip-bbox-features.zip, the array in each h5 file has shape (64, 2, 10, 768). I am curious what the 2 and the 10 represent. I also see that model.py uses rFeature = item_dict['bbox_features'][:, :, 0, :, :]. Could you explain what the slices at index 0 and index 1 of that dimension, i.e. the (64, 0, 10, 768) and (64, 1, 10, 768) slices, refer to? (See the small inspection sketch below.)

  2. The NExT-QA dataset seems to have a total of 5,440 videos, but there are 9,454 h5 files in both NeXt-clip-features and NeXt-clip-bbox-features. Could you explain this discrepancy?
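
For reference, the slice in model.py can be reproduced on a single per-video h5 file to see exactly which axis the index 0 picks out. The file name and the 'bbox_features' key below are guesses based on this thread, and what the 2 and the 10 actually mean is still the open question above:

    import h5py
    import numpy as np

    # Hypothetical file name; each file in NeXt-clip-bbox-features.zip reportedly
    # stores an array of shape (64, 2, 10, 768).
    with h5py.File("some_video_id.h5", "r") as f:
        bbox_features = np.array(f["bbox_features"])   # key name is a guess
    print(bbox_features.shape)                          # (64, 2, 10, 768)

    # In model.py the features arrive with a batch dimension, so the tensor is 5-D and
    # rFeature = item_dict['bbox_features'][:, :, 0, :, :] keeps only index 0 of the
    # third axis -- the "(64, 0, 10, 768)" slice asked about above.
    batched = bbox_features[None]        # add a fake batch dim -> (1, 64, 2, 10, 768)
    rFeature = batched[:, :, 0, :, :]    # -> (1, 64, 10, 768)
    print(rFeature.shape)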

Looking forward to your reply, and thank you very much!

Code reproduction problem

Hi, when I reproduce your code on the NExT-QA dataset, I find that the shape of cFeature (batch_size, num_frames, dim) is (64, 16, 768), but num_frames should be 64 for this dataset. So I would like to know whether the file text_features_blip_caption.h5 is incorrect?

About CLIP’s text-encoder

Hello, I still have some doubts about using CLIP to extract features of the question. By modifying the original CLIP code, we can obtain local question features with shape [bs, 77, 512], but it is not clear how to obtain the global question features you mention in the paper. Can you give me some advice?
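
For reference, in OpenAI's released CLIP code the global text embedding is the hidden state at the end-of-text (EOT) token position, projected with text_projection, while the full [bs, 77, 512] hidden states are the token-level (local) features. Whether the paper's authors do exactly this is not confirmed here; the sketch below only shows the standard CLIP route to both, assuming the ViT-B/32 checkpoint:

    import torch
    import clip

    model, _ = clip.load("ViT-B/32")
    device = next(model.parameters()).device
    text = clip.tokenize(["why does the man pick up the baby"]).to(device)  # example question

    with torch.no_grad():
        x = model.token_embedding(text).type(model.dtype)        # [bs, 77, 512]
        x = x + model.positional_embedding.type(model.dtype)
        x = x.permute(1, 0, 2)                                    # [77, bs, 512]
        x = model.transformer(x)                                  # causal mask is built into the module
        x = x.permute(1, 0, 2)                                    # [bs, 77, 512]
        x = model.ln_final(x).type(model.dtype)

        local_features = x                                        # token-level (local) features
        # The EOT token has the highest token id in each row; its projected hidden
        # state is what CLIP returns as the global sentence embedding.
        global_features = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ model.text_projection

    print(local_features.shape, global_features.shape)            # [1, 77, 512] [1, 512]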

The feature-extraction code for "bbox_features"

Thanks for doing such a great job!
Two questions:

  1. Is item_dict['video_features'] supposed to be obtained via a pre-trained CLIP?
  2. Is item_dict['bbox_features'] obtained via pre-trained GLIP?

When will the code for extracting the 'bbox_features' embeddings with pre-trained GLIP be available?

Processing of candidate words

Hi, I have some doubts about how the candidate answers are processed; could you help me?

Suppose we have 40 candidate words that we feed into CLIP's text encoder, following the approach in the paper. Do we encode them one by one to get a feature of shape [40, 512], or do we splice all the candidate words into one sentence and get a single feature of shape [1, 512]?
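
Which of the two options the paper actually uses is not confirmed here, but the first is what the released CLIP API does most naturally: tokenize each candidate as its own sequence and encode them as one batch, giving one 512-d vector per candidate. A minimal sketch, assuming plain English candidate words and the ViT-B/32 checkpoint:

    import torch
    import clip

    model, _ = clip.load("ViT-B/32")
    device = next(model.parameters()).device

    candidates = ["dog", "cat", "ball"]              # ...40 candidate words in practice
    tokens = clip.tokenize(candidates).to(device)    # [num_candidates, 77], one row per candidate

    with torch.no_grad():
        cand_features = model.encode_text(tokens)    # [num_candidates, 512], one global vector each

    print(cand_features.shape)                       # e.g. torch.Size([3, 512])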
