(CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

The official repository of our paper "MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding".

[Figure: teaser]

Model Overview

[Figure: model overview]

Requirements

Clone the repository and install the package (for example, inside a fresh conda environment) by running:

git clone https://github.com/boheumd/MA-LMM.git
cd MA-LMM
pip install -e .

Dataset

For the long-term video understanding task, we conduct experiments on LVU and two standard video summarization datasets (Breakfast, COIN).

For the video question answering task, we conduct experiments on MSRVTT, MSVD, and ActivityNet. For the video captioning task, we also conduct experiments on the Youcook2 dataset.

You can download the videos for each dataset using the scripts provided in lavis/datasets/download_scripts. Then extract the frames of each video at fps=10 and arrange the data as follows:

 ├── data
     └── activitynet
         ├── annotation
         ├── frames
         ├── videos
     └── breakfast
         ├── annotation
         ├── frames
         ├── videos
     └── coin
         ├── annotation
         ├── frames
         ├── videos
     └── lvu
         ├── annotation
         ├── frames
         ├── videos
     └── msrvtt
         ├── annotation
         ├── frames
         ├── videos
     └── msvd
         ├── annotation
         ├── frames
         ├── videos
     └── youcook2
         ├── annotation
         ├── frames
         ├── videos
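
The frame extraction step above can be sketched as follows. This is a minimal example, assuming `ffmpeg` is installed and on the PATH. The helper names (`build_ffmpeg_command`, `extract_frames`), the one-folder-per-video layout under `frames`, and the `frame%06d.jpg` naming pattern are illustrative assumptions, not taken from the repository, so check them against the dataset loaders before use.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, frames_root, fps=10):
    """Build an ffmpeg command that samples `fps` frames per second.

    Frames are written to <frames_root>/<video name>/frame%06d.jpg.
    The output naming pattern is an assumption made for illustration.
    """
    video_path = Path(video_path)
    out_dir = Path(frames_root) / video_path.stem
    return [
        "ffmpeg",
        "-i", str(video_path),
        "-vf", f"fps={fps}",  # ffmpeg's fps filter resamples to a fixed rate
        str(out_dir / "frame%06d.jpg"),
    ]


def extract_frames(video_path, frames_root, fps=10):
    """Create the output folder and run ffmpeg for a single video."""
    cmd = build_ffmpeg_command(video_path, frames_root, fps)
    Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Example: process every .mp4 file of one dataset (paths follow the tree above).
    for video in sorted(Path("data/msrvtt/videos").glob("*.mp4")):
        extract_frames(video, "data/msrvtt/frames", fps=10)
```

Building the command separately from running it makes it easy to print and inspect the exact ffmpeg invocation before processing a whole dataset.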

Running

Download Pre-trained LLM

We use Vicuna-v1.1 as our pre-trained LLM weights. Download them from this link and arrange them in the following format:

├── llm
     ├── vicuna-7b
     ├── vicuna-13b

Training

We train the model on 4 A100 GPUs. To train the model on a given dataset, run the following command:

bash run_scripts/${dataset}/train.sh

LVU dataset

    # Please choose the task from the list
    # ['director', 'genre', 'relationship', 'scene', 'way_speaking', 'writer', 'year']
    datasets.lvu_cls.task ${task}
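
Since the task name must be one of the seven values listed above, a small check can catch typos before a training run starts. The sketch below only validates the name and formats the `datasets.lvu_cls.task` key/value pair. The helper name `lvu_task_option` is hypothetical, and how the pair is passed to the training script is not shown here.

```python
# Task names copied from the list above.
LVU_TASKS = [
    "director", "genre", "relationship", "scene",
    "way_speaking", "writer", "year",
]


def lvu_task_option(task):
    """Return the config key/value pair for an LVU task, or raise on a bad name."""
    if task not in LVU_TASKS:
        raise ValueError(
            f"Unknown LVU task {task!r}; expected one of {LVU_TASKS}"
        )
    return ["datasets.lvu_cls.task", task]


if __name__ == "__main__":
    print(" ".join(lvu_task_option("genre")))
```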

Testing

We provide trained checkpoints for each dataset. Download saved_model.tar and extract it, then pass the checkpoint path to the test script of the corresponding dataset to run the evaluation.

bash run_scripts/${dataset}/test.sh ${checkpoint_path}
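
The extraction step mentioned above can also be done from Python with the standard `tarfile` module. This is a minimal sketch: the function name `extract_checkpoints` and the destination folder are illustrative, and the internal layout of saved_model.tar is not assumed. The function simply lists whatever files the archive contains so that you can pick the checkpoint path to pass to the test script.

```python
import tarfile
from pathlib import Path


def extract_checkpoints(archive_path, dest_dir):
    """Extract a tar archive and return the extracted files, sorted by path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())


if __name__ == "__main__":
    # The destination folder name is an arbitrary choice for this example.
    for path in extract_checkpoints("saved_model.tar", "saved_model"):
        print(path)
```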

Hyper-parameters

One important hyper-parameter is memory_bank_length. Set it in the training script of each dataset:

    # pre-defined length of the memory bank
    model.memory_bank_length ${value}
    # value=0 means without using the memory bank

Memory Bank Compression Code

The core memory bank compression algorithm is implemented here.
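
As a rough illustration of the idea (not the repository's implementation), the sketch below assumes the compression scheme described in the MA-LMM paper: whenever the bank exceeds its length limit, the two temporally adjacent features with the highest cosine similarity are averaged into one. It works on plain Python lists for a single token position, uses an unweighted average, and the names `cosine_similarity` and `compress_memory_bank` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compress_memory_bank(bank, max_length):
    """Merge the most similar adjacent features until len(bank) <= max_length.

    `bank` is a list of feature vectors in temporal order. Temporal order is
    preserved because only neighbouring entries are ever merged.
    """
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    bank = [list(feature) for feature in bank]  # do not modify the caller's data
    while len(bank) > max_length:
        sims = [
            cosine_similarity(bank[i], bank[i + 1])
            for i in range(len(bank) - 1)
        ]
        k = max(range(len(sims)), key=sims.__getitem__)  # most redundant pair
        merged = [(x + y) / 2 for x, y in zip(bank[k], bank[k + 1])]
        bank[k:k + 2] = [merged]  # replace the pair with its average
    return bank


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(compress_memory_bank(frames, max_length=2))
```

In this example the first two features are identical, so they are merged, while the distinct third feature is kept as is.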

Citation

If you find our code or our paper useful for your research, please star this repo and cite the following paper:

@inproceedings{he2024malmm,
  title = {MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding},
  author    = {He, Bo and Li, Hengduo and Jang, Young Kyun and Jia, Menglin and Cao, Xuefei and Shah, Ashish and Shrivastava, Abhinav and Lim, Ser-Nam},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2024}
}

Acknowledgement

We referenced the repo below for the code
