V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

[Teaser figure]

Contents:

  1. Getting Started
  2. Demo
  3. Benchmark
  4. Evaluation
  5. Training
  6. License
  7. Citation
  8. Acknowledgement

Getting Started

Installation

conda create -n vstar python=3.10 -y
conda activate vstar
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
export PYTHONPATH=$PYTHONPATH:path_to_vstar_repo
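
After installation, you can run a quick sanity check to confirm the core dependencies import correctly. This is a minimal sketch, assuming a CUDA-capable environment (flash-attn will not load without one):

# sanity_check.py -- verify that the core dependencies are importable.
import torch
import flash_attn  # only loads on CUDA-capable setups

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())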

Pre-trained Model

The VQA LLM can be downloaded here.
The visual search model can be downloaded here.
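
If you prefer to fetch the checkpoints programmatically, a sketch like the following works with huggingface_hub. The repo IDs below are hypothetical placeholders; substitute the actual ones from the download links above:

# download_weights.py -- fetch pre-trained checkpoints from the Hugging Face Hub.
# NOTE: the repo IDs below are placeholders; use the real ones from the
# download links in this section.
from huggingface_hub import snapshot_download

vqa_llm_dir = snapshot_download(repo_id="your-org/vstar-vqa-llm")
search_model_dir = snapshot_download(repo_id="your-org/vstar-visual-search")
print(vqa_llm_dir, search_model_dir)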

Training Dataset

The alignment stage of the VQA LLM uses the 558K subset of the LAION-CC-SBU dataset used by LLaVA, which can be downloaded here.

The instruction tuning stage requires several instruction tuning subsets, which can be found here.

The instruction tuning data requires images from COCO-2014, COCO-2017, and GQA. After downloading them, organize the data according to the structure below:

├── coco2014
│   └── train2014
├── coco2017
│   └── train2017
└── gqa
    └── images
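
A short script can verify that the layout matches. This is a sketch; it assumes the data root is ./data, so adjust the path to wherever you placed the downloads:

# check_data_layout.py -- confirm the instruction-tuning image folders exist.
import os

REQUIRED = [
    "coco2014/train2014",
    "coco2017/train2017",
    "gqa/images",
]

root = "data"  # assumed data root; adjust as needed
for rel in REQUIRED:
    path = os.path.join(root, rel)
    status = "ok" if os.path.isdir(path) else "MISSING"
    print(f"{status:8s} {path}")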

Demo

You can launch a local Gradio demo after installation by running python app.py. Note that the pre-trained model weights will be downloaded automatically if you have not downloaded them before.

You should see a web page like the one below:

[Demo screenshot]
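
Once the demo is running, you can also query it from Python with gradio_client. This is a sketch: the endpoint name and argument order depend on how app.py defines its interface, so inspect the running app first:

# query_demo.py -- hit the local Gradio demo programmatically.
from gradio_client import Client

client = Client("http://127.0.0.1:7860")
client.view_api()  # prints the available endpoints and their signatures

# Hypothetical call -- replace the arguments and api_name with what
# view_api() reports for your running app:
# result = client.predict("path/to/image.jpg", "your question", api_name="/predict")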

Benchmark

Our V*Bench is available here. The benchmark contains folders for different subtasks. Each folder contains image files and annotation JSON files, paired by filename. The format of the annotation files is:

{
  "target_object": [], // a list of target object names
  "bbox": [],          // a list of target object bounding boxes in <x, y, w, h> format
  "question": "",
  "options": []        // a list of options; the first one is the correct option by default
}
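
For reference, here is a sketch of how the annotation files can be loaded; the folder iteration assumes the subtask structure described above, and the path to the benchmark root is a placeholder:

# load_annotations.py -- iterate V*Bench subtask folders and read the
# annotation JSON files (images share the same filename stem).
import json
import os

bench_root = "vstar_bench"  # placeholder path to the downloaded benchmark
for subtask in sorted(os.listdir(bench_root)):
    folder = os.path.join(bench_root, subtask)
    if not os.path.isdir(folder):
        continue
    for fname in sorted(os.listdir(folder)):
        if not fname.endswith(".json"):
            continue
        with open(os.path.join(folder, fname)) as f:
            ann = json.load(f)
        # The first option is the correct one by convention.
        correct = ann["options"][0]
        print(subtask, fname, ann["target_object"], correct)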

Evaluation

To evaluate our model on the V*Bench benchmark, run

python vstar_bench_eval.py --benchmark-folder PATH_TO_BENCHMARK_FOLDER

To evaluate our visual search mechanism on the annotated targets from the V*Bench benchmark, run

python visual_search.py --benchmark-folder PATH_TO_BENCHMARK_FOLDER
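
To run both evaluations in one go, a small driver like the following works. This is a sketch; it assumes both scripts accept the same --benchmark-folder argument shown above:

# run_eval.py -- run both V*Bench evaluations with one command.
import subprocess
import sys

benchmark = sys.argv[1]  # path to the benchmark folder
for script in ("vstar_bench_eval.py", "visual_search.py"):
    subprocess.run(
        [sys.executable, script, "--benchmark-folder", benchmark],
        check=True,
    )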

The detailed evaluation results of our model can be found here.

Training

The training of the VQA LLM consists of two stages.

For the pre-training stage, enter the LLaVA folder and run

sh pretrain.sh

For the instruction tuning stage, enter the LLaVA folder and run

sh finetune.sh
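
Both stages can also be chained from Python. This is a sketch; it assumes the two scripts live in the LLaVA folder as described above:

# run_training.py -- run both VQA LLM training stages in sequence.
import subprocess

for stage in ("pretrain.sh", "finetune.sh"):
    subprocess.run(["sh", stage], cwd="LLaVA", check=True)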

For the training data preparation and training procedures of our visual search model, please check this doc.

License

This project is under the MIT license. See LICENSE for details.

Citation

Please consider citing our paper if you find this project helpful for your research:

@article{vstar,
  title={V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs},
  author={Penghao Wu and Saining Xie},
  journal={arXiv preprint arXiv:2312.14135},
  year={2023}
}

Acknowledgement

  • This work is built upon LLaVA and LISA.
