
Language-Guided Visual Aggregation for Video Question Answering

This is the official implementation of our paper. All features and weights will be released on GitHub. You can also extract the video and text features yourself by following our code and documentation.

Environment

This code is tested with:

  • Ubuntu 20.04
  • PyTorch >= 1.8
  • CUDA >= 10.1
# create your virtual environment
conda create --name lgva python=3.7
conda activate lgva

# dependencies
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=10.1 -c pytorch
conda install pandas

# optional (for feature extraction); see also tools/*.py
pip install git+https://github.com/openai/CLIP.git

Dataset

Feature extraction

Please refer to ./tools/extract_embedding.py
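The extraction script samples a fixed number of frames per video before encoding them. As a minimal sketch of that sampling step (the function name and the uniform midpoint policy are our illustration, not necessarily the exact behavior of tools/extract_embedding.py):

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick `num_samples` frame indices spread evenly over a video of
    `num_frames` frames (illustrative helper only; see
    tools/extract_embedding.py for the actual pipeline)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Short video: keep every frame.
        return list(range(num_frames))
    # Take the midpoint of each of `num_samples` equal segments.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]
```

The selected frames would then be batched through a visual encoder such as CLIP's `encode_image`.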

Pre-extracted Features

| dataset | frame | bbox | caption | question&answer |
| --- | --- | --- | --- | --- |
| NExT-QA | BaiduDisk | BaiduDisk | BaiduDisk | BaiduDisk |
| MSVD | BaiduDisk | BaiduDisk | BaiduDisk | BaiduDisk |
| MSRVTT | BaiduDisk | BaiduDisk | BaiduDisk | uploading |

Due to the large number of videos in TGIF and ActivityNet, we do not plan to upload their features. You can process the original videos with a simple feature-extraction script. Likewise, extracting text features (questions and answers) takes little time, and you can extract them yourself from the JSON annotation files.
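Loading the questions and candidate answers from the annotation files can be sketched as follows, assuming a NExT-QA-style multiple-choice layout (the field names `question` and `a0`–`a4` are our assumption about the annotation format; check the released JSON files for the real keys):

```python
import json

def load_mc_annotations(path):
    """Load multiple-choice QA annotations from a JSON file.

    Each record yields the question text plus its five candidate
    answers. Field names (question, a0-a4) are assumed from a
    NExT-QA-style annotation layout.
    """
    with open(path) as f:
        records = json.load(f)
    return [
        {
            "question": r["question"],
            "answers": [r["a{}".format(i)] for i in range(5)],
        }
        for r in records
    ]
```

The resulting strings can then be tokenized and passed through the text encoder in the same way as in the feature-extraction script.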

Train & Val & Test

See trainval_msvd.sh and trainval_nextqa.sh:

python3 src/trainval.py \
        --dataset 'nextqa_mc' \
        --data_path './data/Annotation' \
        --feature_path '/home/liangx/Data/NeXt-QA' \
        --batch_size 256

python3 src/test.py \
        --dataset 'nextqa_mc' \
        --data_path './data/Annotation' \
        --feature_path '/home/liangx/Data/NeXt-QA' \
        --checkpoint './checkpoints/nextqa_mc/ckpt_0.6112890243530273.pth' \
        --batch_size 256 \
        --visible

LICENSE / Contact

This repo is released under the MIT License.

Citations

@inproceedings{Liang2023LanguageGuidedVA,
  title={Language-Guided Visual Aggregation Network for Video Question Answering},
  author={Xiao Liang and Di Wang and Quan Wang and Bo Wan and Lingling An and Lihuo He},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  year={2023},
  url={https://api.semanticscholar.org/CorpusID:264492577}
}

Acknowledgements

We reference the excellent repos of NExT-QA, VGT, ATP, and CLIP, in addition to repos specific to the datasets/baselines we examined (see paper). If you build on this work, please be sure to cite these works/repos as well.
