hallusionbench's Introduction

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models [CVPR 2024]

You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

Tianrui Guan*, Fuxiao Liu*, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou

Updates

🔥🔥🔥

We welcome everyone to contribute failure cases of large multimodal models (e.g., GPT-4V) to our community!

🔥🔥🔥

Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements to image reasoning tasks, as shown by the recently released GPT-4V(ision), LLaVA-1.5, and others. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: the models may ignore the image context and rely solely on a (possibly contradictory) language prior for reasoning. Conversely, the vision modules in VLMs are weaker than the LLMs and can produce misleading visual representations, which the LLMs then translate into confident mistakes. To study these two types of VLM mistakes, i.e., language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that remains challenging even for GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which yields novel insights into the illusion and hallucination of VLMs and how to improve them in the future.

If you find our work useful, please cite our papers:

@misc{guan2023hallusionbench,
      title={HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models}, 
      author={Tianrui Guan and Fuxiao Liu and Xiyang Wu and Ruiqi Xian and Zongxia Li and Xiaoyu Liu and Xijun Wang and Lichang Chen and Furong Huang and Yaser Yacoob and Dinesh Manocha and Tianyi Zhou},
      year={2023},
      eprint={2310.14566},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{liu2023mitigating,
      title={Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning}, 
      author={Fuxiao Liu and Kevin Lin and Linjie Li and Jianfeng Wang and Yaser Yacoob and Lijuan Wang},
      year={2023},
      eprint={2306.14565},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{liu2023mmc,
      title={MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning}, 
      author={Fuxiao Liu and Xiaoyang Wang and Wenlin Yao and Jianshu Chen and Kaiqiang Song and Sangwoo Cho and Yaser Yacoob and Dong Yu},
      year={2023},
      eprint={2311.10774},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Dataset Download

To keep evaluation simple, we provide all questions in yes/no form.

Updated on     Questions and Annotations    Figures                Question Count    Figure Count
Oct 27, 2023   HallusionBench.json          hallusion_bench.zip    254               69
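
As a quick sanity check, the counts above can be recomputed from the JSON file. This is a minimal sketch that assumes HallusionBench.json sits in the working directory (setup is described in the Evaluation section below); keying figures by distinct image file is our assumption, and the official count of 69 may group figures differently:

import json

# Sanity-check sketch for the table above; assumes HallusionBench.json is in
# the current directory (see the Evaluation section below for setup).
with open("HallusionBench.json") as f:
    data = json.load(f)

print("questions:", len(data))  # expect 254

# Counting distinct image files is an assumption about how figures are keyed;
# the official figure count of 69 may be grouped differently.
figures = {s["filename"] for s in data if s["visual_input"] != "0"}
print("figures:", len(figures))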

Evaluation

  1. Clone the repo:
git clone https://github.com/tianyi-lab/HallusionBench.git
cd ./HallusionBench
  2. Download the images hallusion_bench.zip and unzip the folder in the same directory.

  3. The questions and image locations are saved in ./HallusionBench.json. A data sample looks as follows:

{'category': 'VD', 'subcategory': 'illusion', 'visual_input': '1', 'set_id': '0', 'figure_id': '0', 'sample_note': 'circle', 'question_id': '0', 'question': 'Is the right orange circle the same size as the left orange circle?', 'gt_answer_details': 'The right orange circle is the same size as the left orange circle.', 'gt_answer': '1', 'filename': './hallusion_bench/VD/illusion/0_0.png'}

The key visual_input indicates whether the question requires visual input such as an image. If visual_input=1, the question requires visual input; if visual_input=0, it is a text-only question that needs no visual input.

  4. Run your model on ./HallusionBench.json and save the output file as ./HallusionBench_result.json, adding each model response under the key 'model_prediction' (a minimal sketch of this loop appears below, after step 5). We provide a sample result here.
  5. Finally, run the following code for evaluation:
python evaluation.py

You can use your own API key for GPT-4 evaluation by editing the code here.
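
For step 4, a minimal sketch of the prediction loop is below; run_model is a hypothetical placeholder for your own VLM inference call, not a function provided by this repo:

import json
from typing import Optional

# Minimal sketch of step 4: fill in 'model_prediction' for every sample.
# run_model is a hypothetical placeholder for your own VLM inference call.
def run_model(question: str, image_path: Optional[str]) -> str:
    raise NotImplementedError("call your VLM here and return its answer text")

with open("./HallusionBench.json") as f:
    data = json.load(f)

for sample in data:
    # visual_input == "0" marks text-only questions, so no image is passed.
    image = sample.get("filename") if sample["visual_input"] != "0" else None
    sample["model_prediction"] = run_model(sample["question"], image)

with open("./HallusionBench_result.json", "w") as f:
    json.dump(data, f, indent=2)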

Leaderboard

Definition

  • Visual Dependent (VD) Questions: questions that do not have an affirmative answer without the visual context.
    • Easy: original images obtained from the Internet.
    • Hard: images edited from the original images.
  • Visual Supplement (VS) Questions: questions that can be answered without visual input; the visual component merely provides supplemental information.
    • Easy: no visual input. An uncertain answer without hallucination is also considered a correct response.
    • Hard: with visual input. The answer must follow the provided figure and visual context.

Metric

  • Accuracy per Figure (Consistency Test): accuracy computed per figure. To make sure the model truly understands the image, we ask several variants of questions based on the same knowledge about the same figure, and count the figure as correct only if the model answers all of them correctly. For example, the model should not give inconsistent responses to the questions "Is A bigger than B?" and "Is B smaller than A?".
  • Accuracy per Question: accuracy over all questions, including easy and hard questions.
  • Accuracy per Question Pair: we ask the same question on similar images (or with and without an image). We consider the same question text on different visual contexts a question pair (usually an easy question and a corresponding hard question). This metric calculates the accuracy over all question pairs.
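
The three metrics can be expressed compactly in code. The sketch below assumes each judged record in the result file carries a binary "correct" field (a hypothetical name; the actual key written by evaluation.py may differ) and groups records by the annotation fields shown earlier:

import json
from collections import defaultdict

# Sketch of the three metrics; assumes each judged record has a binary
# "correct" field (hypothetical name) plus the annotation fields shown above.
data = json.load(open("./HallusionBench_result.json"))

def all_correct(records):
    return all(int(r["correct"]) for r in records)

# Accuracy per Question: plain average over all questions.
q_acc = sum(int(r["correct"]) for r in data) / len(data)

# Accuracy per Figure (consistency test): every question asked about the
# same figure must be answered correctly for the figure to count.
figures = defaultdict(list)
for r in data:
    figures[(r["category"], r["subcategory"], r["set_id"], r["figure_id"])].append(r)
fig_acc = sum(all_correct(v) for v in figures.values()) / len(figures)

# Accuracy per Question Pair: the same question text across different
# visual contexts (same set_id and question_id, varying figure_id).
pairs = defaultdict(list)
for r in data:
    pairs[(r["category"], r["subcategory"], r["set_id"], r["question_id"])].append(r)
pair_acc = sum(all_correct(v) for v in pairs.values()) / len(pairs)

print(f"question: {q_acc:.2%}  figure: {fig_acc:.2%}  pair: {pair_acc:.2%}")
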
Model                                             Question Pair Acc   Figure Acc   Easy Question Acc   Hard Question Acc   Question Acc   Json
GPT4V (Sep 25, 2023 version, Human Eval)          31.42               44.22        79.56               38.37               67.58          VD, VS
GPT4V (Sep 25, 2023 version, GPT Eval)            28.79               39.88        75.60               37.67               65.28          VD, VS
Claude 3 (GPT Eval)                               21.76               28.61        55.16               41.40               56.86          VD, VS
LLaVA-1.5 (Human Eval)                            9.45                25.43        50.77               29.07               47.12          VD, VS
LLaVA-1.5 (GPT Eval)                              10.55               24.86        49.67               29.77               46.94          VD, VS
Gemini Pro Vision (Dec 2023 version, GPT Eval)    7.69                8.67         35.60               30.23               36.85          VD, VS
GUA_VL (GPT Eval)                                 16.70               23.12        53.63               39.77               51.82          VD, VS
BLIP2-T5 (GPT Eval)                               15.16               20.52        45.49               43.49               48.09          VD, VS
Qwen-VL (GPT Eval)                                5.93                6.65         31.43               24.88               39.15          VD, VS
Open-Flamingo (GPT Eval)                          6.37                11.27        39.56               27.21               38.44          VD, VS
MiniGPT5 (GPT Eval)                               10.55               9.83         36.04               28.37               40.30          VD, VS
MiniGPT4 (GPT Eval)                               8.79                10.12        31.87               27.67               35.78          VD, VS
InstructBLIP (GPT Eval)                           9.45                10.11        35.60               45.12               45.26          VD, VS
BLIP2 (GPT Eval)                                  5.05                12.43        33.85               40.70               40.48          VD, VS
mPLUG_Owl-v2 (GPT Eval)                           13.85               19.94        44.84               39.07               47.30          VD, VS
mPLUG_Owl-v1 (GPT Eval)                           9.45                10.40        39.34               29.77               43.93          VD, VS
LRV_Instruction (GPT Eval)                        8.79                13.01        39.78               27.44               42.78          VD, VS
ViLT (GPT Eval)                                   8.35                11.27        37.80               45.35               44.46          VD, VS
GiT (GPT Eval)                                    5.27                6.36         26.81               31.86               34.37          VD, VS

Reproducing the GPT4V Results on the Leaderboard

  1. We saved the output of GPT4V together with our annotations. Put HallusionBench.tsv in the root directory of this repo, or set input_file_name in gpt4v_benchmark.py to the location of the HallusionBench.tsv file.

  2. (Optional) If you don't have access to the GPT API, you don't need to run it, since we have saved the evaluation results. They can be downloaded for Visual Dependent and Visual Supplement. Put the json files in the root directory of this repo, or set save_json_path_vd and save_json_path_vs in gpt4v_benchmark.py to their respective locations.

  3. Run python gpt4v_benchmark.py.
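
For reference, the variables from steps 1 and 2 would be set along these lines (the variable names come from the steps above; the paths are illustrative placeholders, not the repo's actual defaults):

# In gpt4v_benchmark.py (variable names per steps 1-2 above; the paths are
# illustrative placeholders, not the repo's actual defaults).
input_file_name = "./HallusionBench.tsv"     # annotated GPT-4V output
save_json_path_vd = "./gpt4v_eval_vd.json"   # downloaded Visual Dependent results
save_json_path_vs = "./gpt4v_eval_vs.json"   # downloaded Visual Supplement results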

Examples and Analysis

The repository includes nine annotated examples (Examples 1-9) with detailed analysis.

License

This repository is under BSD 3-Clause License.

hallusionbench's People

Contributors

eltociear, fuxiaoliu, rayguan97


hallusionbench's Issues

How to generate "model_prediction"

Could you please give us an example of how to "run your model on ./HallusionBench.json and save the output file as ./HallusionBench_result.json", adding the output of the model under the key 'model_prediction'?

GPT version question

Hi, I ran into a problem while evaluating this benchmark with the lmms-eval repository.
Using gpt-4, my average score was 23; however, the API has a maximum request limit, and during that scoring run some samples may have exceeded the limit, causing the GPT requests to fail.
I then switched to gpt-4o, and the final average score was 25+.
May I ask whether it is acceptable to use gpt-4o directly for evaluation?

[Results] Sharing HallusionBench results evaluated by VLMEvalKit

Dear authors,
First, congratulations on your great work, which we think is a valuable resource for evaluating hallucination in VLMs. We have implemented HallusionBench in VLMEvalKit and evaluated dozens of models on the benchmark. We would like to share these results with you and also invite you to try our VLM evaluation toolkit. It supports dozens of VLMs and benchmarks, and provides extensive evaluation results.

LLaVA without images

Thanks for the insightful work.
I am curious how LLaVA-1.5 runs inference without images. I don't think official inference code for the no-image case is provided; do you know how this is done?

Calculating performance through json

Sorry, but in China it is not possible to access the OpenAI service legally. I noticed that GPT's evaluation results are already stored in a json file. Is it possible to evaluate model performance directly from this json file, assuming the model's answers are stored in the same form as hallucionbench_result_sample.json?

Could you provide some guidance?

Only evaluate on hard set

Good work! I noticed in the leaderboard that the hard scores are much lower, so I want to evaluate LMM performance on the hard set only. How can I do that? (Do I need to run evaluation only on the questions with visual_input == '2'?) Thanks!

Language and Vision Diagnosis table from the paper

Hi! Thanks for the great work.
I want to reproduce the right side of Table 3 (the three columns "Language Hallucination", "Visual Illusion", and "Mixed").
I ran your evaluation code, but I can't find the place that computes these numbers, and I don't understand how to obtain them from the given numbers.

Thanks a lot!
Avshalom
