
mmvp's Introduction

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie

Teaser

Contents:

  1. Getting Started
  2. Benchmark
  3. Evaluation
  4. Training
  5. License
  6. Citation
  7. Acknowledgement

Getting Started

Installation

conda create -n mmvp python=3.10 -y
conda activate mmvp
cd LLaVA
pip install -e .
pip install flash-attn --no-build-isolation

Pre-trained Model

The Interleaved-MoF models (based on LLaVA-1.5 13B) can be found here.

Benchmarks

MMVP Benchmark

Our MMVP Benchmark is available here. It is specially crafted to measure multimodal LLMs' visual capability via VQA. The benchmark consists of a folder containing all 300 test images and an annotation CSV file with the questions and correct answers. The format of the data is:

├── MMVP Images
│   ├── 1.jpg
│   ├── 2.jpg
│   ├── 3.jpg
│   ├── ...
│   └── 300.jpg
└── Questions.csv
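
For reference, here is a minimal sketch of iterating the benchmark with pandas. The "Question" and "Options" column names follow the evaluation code quoted in the issues below; the "Index" column name is an assumption, so check Questions.csv for the exact headers.

import os
import pandas as pd
from PIL import Image

benchmark_dir = "PATH_TO_MMVP_BENCHMARK_FOLDER"   # folder holding "MMVP Images" and Questions.csv
questions = pd.read_csv(os.path.join(benchmark_dir, "Questions.csv"))

for _, row in questions.iterrows():
    # images are named 1.jpg ... 300.jpg; "Index" is an assumed column name
    image = Image.open(os.path.join(benchmark_dir, "MMVP Images", f"{row['Index']}.jpg")).convert("RGB")
    prompt = f"{row['Question']} {row['Options']}"
    # feed (image, prompt) to the multimodal LLM under evaluation and record its answer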

MMVP-VLM Benchmark

Our MMVP-VLM Benchmark is available here. It is distilled from the MMVP benchmark above, with the language description for each image simplified. It is designed to evaluate the visual capability of VLMs such as CLIP. The benchmark is organized into 9 different visual patterns, each containing 15 pairs of zero-shot questions. An annotation CSV file contains the question, the corresponding visual pattern, and the images. The format of the data is:

├── MMVP_VLM_Images
│   ├── Orientation
│   │   ├── 1.jpg
│   │   ├── 2.jpg
│   │   ├── ...
│   │   └── 30.jpg
│   ├── Presence
│   │   ├── 31.jpg
│   │   ├── 32.jpg
│   │   ├── ...
│   │   └── 60.jpg
│   ├── ...
│   └── Camera_Perspective
│       ├── 241.jpg
│       ├── 242.jpg
│       ├── ...
│       └── 270.jpg
└── Questions.csv
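
As a rough illustration (not the repository's exact evaluation code), the sketch below scores one MMVP-VLM image with a Hugging Face CLIP model. The two contrasting statements come from Questions.csv, and the pair-level scoring rule noted in the comment is an assumption to verify against scripts/evaluate_vlm.py.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

image = Image.open("MMVP_VLM_Images/Orientation/1.jpg").convert("RGB")
statements = ["...", "..."]  # the two contrasting statements for this pair, taken from Questions.csv

inputs = processor(text=statements, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image   # shape [1, 2]
choice = logits_per_image.argmax(dim=-1).item()           # index of the statement CLIP prefers
# a pair is typically counted correct only if both of its images are matched to the
# right statement (assumption; see evaluate_vlm.py for the exact rule)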

Evaluation

To evaluate on the MMVP benchmark, run

python scripts/evaluate_mllm.py --directory PATH_TO_MMVP_BENCHMARK_FOLDER --model-path PATH_TO_MODEL_EVALUATED

The script provides an evaluation for LLaVA-based models and generates a jsonl file containing each question, the correct answer, and the model's response. Feel free to modify the script and apply it to other models.
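
The jsonl file can be loaded for inspection as in the sketch below; the field names ("question", "correct_answer", "response") are assumptions and should be matched to the keys actually written by evaluate_mllm.py.

import json

with open("PATH_TO_MODEL_RESPONSE_EVALUATED") as f:
    records = [json.loads(line) for line in f]

for r in records[:5]:
    # field names are assumptions; inspect one line of the jsonl to confirm them
    print(r["question"])
    print("gold:", r["correct_answer"], "| model:", r["response"])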

After generating the model's responses, one can manually check the accuracy or use an LLM (e.g., GPT-4) to generate the score.

python scripts/gpt_grader.py --openai_api_key YOUR_OPENAI_API_KEY --answer_file PATH_TO_MODEL_RESPONSE_EVALUATED
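
A minimal grading sketch with the OpenAI Python client (v1+) is shown below; the actual prompt and answer parsing in scripts/gpt_grader.py may differ.

from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

def grade(question: str, correct_answer: str, model_response: str) -> bool:
    # ask the grader model whether the free-form response matches the ground truth
    prompt = (
        f"Question: {question}\n"
        f"Ground-truth answer: {correct_answer}\n"
        f"Model response: {model_response}\n"
        "Does the model response match the ground-truth answer? Reply yes or no."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")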

Here are the results of SOTA models on the MMVP benchmark. They show that these leading models consistently struggle with these straightforward questions probing visual grounding.

To evaluate on MMVP-VLM, run

python scripts/evaluate_vlm.py --directory PATH_TO_MMVPVLM_BENCHMARK_FOLDER
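
To aggregate scores per visual pattern, something like the sketch below can be used once per-image correctness has been computed (e.g., with the CLIP snippet above). The assumptions that images 2k-1 and 2k form the k-th pair and that a pair counts only if both images are matched correctly should be verified against evaluate_vlm.py.

from collections import defaultdict

def pair_accuracy(results):
    # results: list of dicts like {"index": 1, "pattern": "Orientation", "correct": True}
    pairs = defaultdict(dict)
    for r in results:
        pair_id = (r["index"] + 1) // 2            # images 1&2 -> pair 1, 3&4 -> pair 2, ...
        pairs[(r["pattern"], pair_id)][r["index"]] = r["correct"]

    per_pattern = defaultdict(list)
    for (pattern, _), members in pairs.items():
        per_pattern[pattern].append(all(members.values()))   # both images must be right

    return {p: 100.0 * sum(ok) / len(ok) for p, ok in per_pattern.items()}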

Here are the results of SOTA CLIP models on the MMVP-VLM benchmark. They show that scaling up parameters and image resolution in CLIP models yields very little improvement in discerning these visual patterns.


Training

The training of the Interleaved-MoF MLLM follows the training procedure of LLaVA. Please follow the data preparation process in LLaVA, and replace the data paths in the training scripts with your local directories.

For the pre-training stage, enter the LLaVA folder and run

sh pretrain.sh

For the instruction tuning stage, enter the LLaVA folder and run

sh finetune.sh

One can also find the plug-and-play changes necessary for Interleaved-MoF in "LLaVA/llava/model/llava_arch.py#L155". The function prepare_inputs_labels_for_multimodal_withdino spatially interleaves DINOv2 and CLIP features before feeding them to the LLM.
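
For intuition only, the sketch below shows one way to spatially interleave two aligned visual token sequences; the actual implementation in prepare_inputs_labels_for_multimodal_withdino may differ in details such as projection and token alignment.

import torch

def interleave_tokens(clip_feats: torch.Tensor, dino_feats: torch.Tensor) -> torch.Tensor:
    # clip_feats, dino_feats: [batch, num_tokens, dim], already projected to the same dim
    assert clip_feats.shape == dino_feats.shape
    b, n, d = clip_feats.shape
    stacked = torch.stack((clip_feats, dino_feats), dim=2)   # [b, n, 2, d]
    return stacked.reshape(b, 2 * n, d)                      # c1, d1, c2, d2, ...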

License

This project is under the MIT license. See LICENSE for details.

Citation

Please consider citing our paper if you find this project helpful for your research:

@misc{tong2024eyes,
      title={Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs}, 
      author={Shengbang Tong and Zhuang Liu and Yuexiang Zhai and Yi Ma and Yann LeCun and Saining Xie},
      year={2024},
      eprint={2401.06209},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

  • This work is built upon LLaVA.


mmvp's Issues

evaluate on the MMVP

Thank you for your great work! Sorry, I have some questions.

When I run "python scripts/evaluate_mllm.py --directory PATH_TO_MMVP_BENCHMARK_FOLDER --model-path PATH_TO_MODEL_EVALUATED", is the model path "/myroot/MoF_Models"?

The accuracy of LLaVA-1.5-7b with CLIP encoder is 60.0 on MMVP

I evaluated LLaVA-1.5-7b on the MMVP dataset and found that its accuracy is 60.0%, which is significantly higher than the 24.7% reported in Table 3.
Upon comparing the evaluation code, I discovered that the prompt used in

cur_prompt = row['Question'] + " " + row['Options']

differs from the one used by LLaVA-1.5, which is: '{question}\nA. {}\nB. {}\nAnswer with the option's letter from the given choices directly.'
Given the first prompt, the model generates a long sentence as the answer, whereas with the second prompt, the model provides the option directly. This difference in prompts leads to the large discrepancy in accuracy.
Anyway, the question is a binary choice, where a random guess would result in 50% accuracy; therefore, an accuracy of 60% also implies a significant problem with MLLMs :)
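
For reference, the two prompt styles being compared look roughly like this (the example row reuses a question quoted later on this page; LLaVA-1.5 additionally formats the options as "A. ... / B. ..."):

row = {
    "Question": "Is the elderly person standing or sitting in the picture?",
    "Options": "(a) Standing (b) Sitting",
}

# prompt built by scripts/evaluate_mllm.py (elicits a free-form sentence)
cur_prompt = row["Question"] + " " + row["Options"]

# LLaVA-1.5 multiple-choice style (elicits just the option letter)
mc_prompt = (
    f"{row['Question']}\n{row['Options']}\n"
    "Answer with the option's letter from the given choices directly."
)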

pretrain_dino_mm_mlp_adapter

When I run 'finetune.py', it breaks partway through with an error (screenshots attached to the issue).

So, where can I get 'pretrain_dino_mm_mlp_adapter'?

instruction fine-tuning failed.

Hello, thank you very much for your open-source contribution. I have a question. When I ran the second stage of instruction fine-tuning with your project, the program would get stuck without reporting any errors when I configured DeepSpeed with zero3.json or zero2.json; it simply could not proceed. However, when I configured DeepSpeed with zero3_offload.json, training proceeded normally. Could you tell me the reason behind this? Did you encounter this when running your open-source project?

My training machine has the following hardware configuration: 8x A800 (80 GB). The DeepSpeed version is 0.12.6 and the transformers version is 4.31.0.

leaked api key

Hi, I just popped by to say that you may have leaked your OpenAI API key. I suggest revoking it, unless you have already done so.

OPENAI_API_KEY="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" python llava/eval/eval_gpt_review_bench.py \

DINOv2 num_patches set to 256

Hi!
Why is the DINOv2 num_patches set to 256?
The image size is 336 and the kernel (patch) size is 14, so num_patches should be the same as CLIP's, which is 576.
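
For what it's worth, the patch arithmetic is easy to check: 256 patches at patch size 14 corresponds to a 224-pixel input, while a 336-pixel input gives 576 (this is only an observation about the numbers, not a confirmed explanation of the setting).

def num_patches(image_size: int, patch_size: int = 14) -> int:
    # a ViT splits the image into (image_size / patch_size)^2 non-overlapping patches
    return (image_size // patch_size) ** 2

assert num_patches(336) == 576   # matches CLIP ViT-L/14 at 336 px
assert num_patches(224) == 256   # matches a 224 px input, hence the hard-coded 256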

Implementation Details of Additive-MoF

Thank you for the excellent work.
Is the code for Additive-MoF publicly available?
Alternatively, could you provide details on the implementation?
For example, do you L2-normalize both the CLIP features and the DINO features before passing them through the adapter? Any other specific details would be greatly appreciated.

Index Overflow in position_ids when evaluating on MMVP

Hello,

I've encountered an issue inside transformers/models/llama/modeling_llama.py. When I execute the following command using four 4090 GPUs:

python scripts/evaluate_mllm.py --directory data/datasets--MMVP--MMVP --model-path llava_weights/llava-v1.5-MMVP--MoF_Models

I face an error related to position_ids: index out of bounds. During debugging, I found that this error occurs because position_ids.max returns an unusually large value, 3974362539628152236.
Specifically, the error message is: "../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [233,0,0], thread: [0,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed." While debugging transformers/models/llama/modeling_llama.py, I found that the value of position_ids is normal at the beginning but suddenly overflows after a while.
I've strictly followed MMVP's instructions for setting up the environment, so I don't know what's wrong.

How to reproduce the result of Figure 6?

Hi @tsb0601 @s9xie , thanks for your great project.

I would like to inquire about the process of reproducing the result of Figure 6 (i.e., MLLMs' performance on visual patterns). In scripts/evaluate_vlm.py, it seems that we have to obtain the logits of the image and text.

However, taking LLaVA as an example, we cannot directly obtain its logits. Or do you feed the input "<image> Is it a photo of <statement>" to LLaVA and obtain the result from that? I am a little confused about this.

Can you provide instructions or a script to evaluate LLaVA on MMVP-VLM? With this, I can follow the same steps to reproduce the results for other models.

Answer sheet of LLMs on MMVP?

Hello. Can you provide the answer sheet containing all of the LLMs' answers? I want to see the full answers behind Figures 3 and 4 in your paper.

Correction Needed for Incorrect Answers in MMVP Benchmark Questions 279 and 280

Hi, first of all, thank you for your excellent work on the MMVP Benchmark. It's a great resource.

I've encountered a discrepancy in the answers provided for questions 279 and 280 regarding the position (standing or sitting) of an elderly person in the respective images. As per the Questions.csv file, the answers are listed as follows:

  • "279,Is the elderly person standing or sitting in the picture?,(a) Standing (b) Sitting,(a)"
  • "280,Is the elderly person standing or sitting in the picture?,(a) Standing (b) Sitting,(b)"

However, upon reviewing the images linked for these questions, it seems there's a mismatch:

  • Figure 279

  • Figure 280

Based on these images, it appears the correct answers should be inverted:

  • For question 279, the elderly person is sitting, so the correct answer should be (b) Sitting.

  • For question 280, the elderly person is standing, so the correct answer should be (a) Standing.

Given that the benchmark consists of only 150 pairs of questions, even a single discrepancy can lead to a significant accuracy difference of approximately 2 × 0.67% (each pair accounts for 1/150 ≈ 0.67% of the benchmark). It would be beneficial for the integrity of the benchmark to correct this inconsistency.

Thank you for considering this correction.

Discrepancy between the code and Table 1 in the paper

Hi, thanks for your insightful work!
I am using your MMVP benchmark to test different CLIP models' performance. However, when I run the exact code from evaluate_vlm.py, I cannot get the same results as Table 1 in the paper. My results are:

Orientation and Direction: 26.7
Presence of Specific Features: 13.3
State and Condition: 26.7
Quantity and Count: 6.7
Positional and Relational Context: 6.7
Color and Appearance: 40
Structural and Physical Characteristics: 26.7
Texts: 13.3
Viewpoint and Perspective: 20

These numbers differ from the first row of Table 1 in the paper, and in fact from every row of Table 1. Could you confirm this? Thanks very much!

CLIP-blind pairs

Hi,
Great work! Have you also released the code for finding CLIP-blind pairs?

GPT4-V Prompt Used in Evaluation

Hi, thanks for this amazing work!

It would be great if you could consider releasing the prompts you used for evaluating GPT-4V on the MMVP benchmark!
