V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

[Teaser figure]

Contents:

  1. Getting Started
  2. Demo
  3. Benchmark
  4. Evaluation
  5. Training
  6. License
  7. Citation
  8. Acknowledgement

Getting Started

Installation

conda create -n vstar python=3.10 -y
conda activate vstar
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
export PYTHONPATH=$PYTHONPATH:path_to_vstar_repo
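
After installation, you can run a quick sanity check to confirm the core dependencies import correctly. This is a minimal sketch, assuming a CUDA-capable environment (flash-attn will not load without one):

# sanity_check.py -- verify that the core dependencies are importable.
import torch
import flash_attn  # only loads on CUDA-capable setups

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())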

Pre-trained Model

The VQA LLM can be downloaded here.
The visual search model can be downloaded here.
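
If you prefer to fetch the checkpoints programmatically, a sketch like the following works with huggingface_hub. The repo IDs below are hypothetical placeholders; substitute the actual ones from the download links above:

# download_weights.py -- fetch pre-trained checkpoints from the Hugging Face Hub.
# NOTE: the repo IDs below are placeholders; use the real ones from the
# download links in this section.
from huggingface_hub import snapshot_download

vqa_llm_dir = snapshot_download(repo_id="your-org/vstar-vqa-llm")
search_model_dir = snapshot_download(repo_id="your-org/vstar-visual-search")
print(vqa_llm_dir, search_model_dir)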

Training Dataset

The alignment stage of the VQA LLM uses the 558K subset of the LAION-CC-SBU dataset used by LLaVA, which can be downloaded here.

The instruction tuning stage requires several instruction tuning subsets, which can be found here.

The instruction tuning data requires images from COCO-2014, COCO-2017, and GQA. After downloading them, organize the data according to the structure below:

├── coco2014
│   └── train2014
├── coco2017
│   └── train2017
└── gqa
    └── images
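
A short script can verify that the layout matches. This is a sketch; it assumes the data root is ./data, so adjust the path to wherever you placed the downloads:

# check_data_layout.py -- confirm the instruction-tuning image folders exist.
import os

REQUIRED = [
    "coco2014/train2014",
    "coco2017/train2017",
    "gqa/images",
]

root = "data"  # assumed data root; adjust as needed
for rel in REQUIRED:
    path = os.path.join(root, rel)
    status = "ok" if os.path.isdir(path) else "MISSING"
    print(f"{status:8s} {path}")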

Demo

You can launch a local Gradio demo after installation by running python app.py. Note that the pre-trained model weights will be downloaded automatically if you have not downloaded them before.

You should see a web page like the one below:

[Demo screenshot]
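
Once the demo is running, you can also query it from Python with gradio_client. This is a sketch: the endpoint name and argument order depend on how app.py defines its interface, so inspect the running app first:

# query_demo.py -- hit the local Gradio demo programmatically.
from gradio_client import Client

client = Client("http://127.0.0.1:7860")
client.view_api()  # prints the available endpoints and their signatures

# Hypothetical call -- replace the arguments and api_name with what
# view_api() reports for your running app:
# result = client.predict("path/to/image.jpg", "your question", api_name="/predict")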

Benchmark

Our V*Bench is available here. The benchmark contains folders for different subtasks. Each folder contains image files and annotation JSON files, paired by filename. The format of the annotation files is:

{
  "target_object": [], // a list of target object names
  "bbox": [],          // a list of target object bounding boxes in <x, y, w, h> format
  "question": "",
  "options": []        // a list of options; the first one is the correct option by default
}
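
For reference, here is a sketch of how the annotation files can be loaded; the folder iteration assumes the subtask structure described above, and the path to the benchmark root is a placeholder:

# load_annotations.py -- iterate V*Bench subtask folders and read the
# annotation JSON files (images share the same filename stem).
import json
import os

bench_root = "vstar_bench"  # placeholder path to the downloaded benchmark
for subtask in sorted(os.listdir(bench_root)):
    folder = os.path.join(bench_root, subtask)
    if not os.path.isdir(folder):
        continue
    for fname in sorted(os.listdir(folder)):
        if not fname.endswith(".json"):
            continue
        with open(os.path.join(folder, fname)) as f:
            ann = json.load(f)
        # The first option is the correct one by convention.
        correct = ann["options"][0]
        print(subtask, fname, ann["target_object"], correct)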

Evaluation

To evaluate our model on the V*Bench benchmark, run

python vstar_bench_eval.py --benchmark-folder PATH_TO_BENCHMARK_FOLDER

To evaluate our visual search mechanism on the annotated targets from the V*Bench benchmark, run

python visual_search.py --benchmark-folder PATH_TO_BENCHMARK_FOLDER
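
To run both evaluations in one go, a small driver like the following works. This is a sketch; it assumes both scripts accept the same --benchmark-folder argument shown above:

# run_eval.py -- run both V*Bench evaluations with one command.
import subprocess
import sys

benchmark = sys.argv[1]  # path to the benchmark folder
for script in ("vstar_bench_eval.py", "visual_search.py"):
    subprocess.run(
        [sys.executable, script, "--benchmark-folder", benchmark],
        check=True,
    )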

The detailed evaluation results of our model can be found here.

Training

The training of the VQA LLM consists of two stages.

For the pre-training stage, enter the LLaVA folder and run

sh pretrain.sh

For the instruction tuning stage, enter the LLaVA folder and run

sh finetune.sh
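
Both stages can also be chained from Python. This is a sketch; it assumes the two scripts live in the LLaVA folder as described above:

# run_training.py -- run both VQA LLM training stages in sequence.
import subprocess

for stage in ("pretrain.sh", "finetune.sh"):
    subprocess.run(["sh", stage], cwd="LLaVA", check=True)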

For the training data preparation and training procedures of our visual search model, please check this doc.

License

This project is under the MIT license. See LICENSE for details.

Citation

Please consider citing our paper if you find this project helpful for your research:

@article{vstar,
  title={V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs},
  author={Penghao Wu and Saining Xie},
  journal={arXiv preprint arXiv:2312.14135},
  year={2023}
}

Acknowledgement

  • This work is built upon LLaVA and LISA.
