
pope's Introduction

POPE: Polling-based Object Probing Evaluation for Object Hallucination

This repo provides the source code & data of our paper: Evaluating Object Hallucination in Large Vision-Language Models (EMNLP 2023).

@inproceedings{Li-hallucination-2023,
  title={Evaluating Object Hallucination in Large Vision-Language Models},
  author={Yifan Li and Yifan Du and Kun Zhou and Jinpeng Wang and Wayne Xin Zhao and Ji-Rong Wen},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023},
  url={https://openreview.net/forum?id=xozJw0kZXF}
}


Update

  • [10/8] 🎉🎉🎉 Our work is accepted to EMNLP 2023!
  • [6/25] We upload the caption results used in Table 2. You can find them under ./caption_data.
  • [6/3] We incorporate the code for using POPE with SEEM to handle unannotated datasets.
  • [6/1] We release the code of POPE.

Collect ground-truth objects

POPE can be easily built on top of datasets that provide object annotations for each image. With the help of automatic segmentation tools like SEEM, you can also build POPE on any dataset you want to test.

From annotations

If you want to build POPE on datasets with object annotations (e.g., COCO), you should first organize the annotations in a json file with the following format:

{"image": "COCO_val2014_000000131089.jpg", "objects": ["person", "baseball bat"]}
{"image": "COCO_val2014_000000393225.jpg", "objects": ["bowl", "spoon", "carrot"]}

Here image is the filename of the image and objects is the list of objects extracted from its annotations. You can also add other keys (e.g., image_id in COCO), but the above two must be included.

We provide the annotation results of the validation set of COCO 2014 under ./segmentation/. We refer to LisaAnne/Hallucination to collect objects from segmentations and captions.
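
For reference, here is a minimal sketch of how such a file could be produced from the COCO instance annotations. This is only an illustration (not the exact script used in this repo); it assumes pycocotools is installed, and the paths are placeholders.

import json
from pycocotools.coco import COCO

# Placeholder path to the COCO 2014 validation instance annotations.
coco = COCO("annotations/instances_val2014.json")

with open("coco_ground_truth_objects.json", "w") as f:
    for img in coco.loadImgs(coco.getImgIds()):
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img["id"]))
        # Collect the unique category names annotated in this image.
        objects = sorted({coco.loadCats(a["category_id"])[0]["name"] for a in anns})
        f.write(json.dumps({"image": img["file_name"], "objects": objects}) + "\n")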

From automatic segmentation results

Besides building POPE from object annotations, our method also supports building POPE on raw images. Leveraging automatic segmentation tools (e.g., SEEM), we can first extract the objects in each image and then build POPE as in the method above. If you want to build POPE with SEEM, you should first list the images in a json file with the following format:

[{"image": "COCO_val2014_000000131089.jpg"}, {"image": "COCO_val2014_000000393225.jpg"}]

Build POPE

Once you have the ground-truth objects prepared, you can build your own POPE by running:

python main.py

You can customize your POPE by specifying these configs:

  • --auto_seg: Whether to use automatic segmentation results. Default = False.
  • --img_path: The path to the json file that contains the paths of the images to be segmented.
  • --seg_path: The path to the segmentation results that contain the ground-truth objects in each image.
  • --seg_num: The number of images to be segmented. Default = 1000.
  • --sample_num: The number of negative objects to be sampled for each image. Default = 3.
  • --img_num: The number of images for building POPE. Default = 500.
  • --template: The prompt template. Default = "Is there a {} in the image?".
  • --dataset: The dataset name used for the filename of the built POPE. Default = "coco".
  • --save_path: The save path of the built POPE. Default = "./output/"

If you want to employ SEEM to segment images, you can build POPE by running:

python main.py --auto_seg True --img_path ./segmentation/coco_val_images.json --seg_num 1000

After the execution, you will find 5 json files under "./output/{dataset}/":

  • {dataset}_pope_random.json: The POPE built with the random negative sampling strategy. Each question is a dict in the following format:

    {"question_id": 1, "image": "COCO_val2014_000000016631.jpg", "text": "Is there a person in the image?", "label": "yes"}
    {"question_id": 2, "image": "COCO_val2014_000000016631.jpg", "text": "Is there a refrigerator in the image?", "label": "no"}
  • {dataset}_pope_popular.json: The POPE built with the popular negative sampling strategy.

  • {dataset}_pope_adversarial.json: The POPE built with the adversarial negative sampling strategy.

  • {dataset}_ground_truth_objects.json: The appearance frequencies of all objects in the selected images.

  • {dataset}_co_occur.json: The co-occurrence frequencies of all objects in the selected images.

Evaluation

Now you can use the built POPE files to evaluate LVLMs and measure their object hallucination. The answers of the LVLM should be organized in a json file in the following format:

{"question": "is there a bird in the image?", "answer": "yes"}
{"question": "is there a tree in the image?", "answer": "no"}

Note that the answers must be in the same order as the POPE questions.
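
A minimal sketch of how such an answer file could be produced from one of the built POPE files; query_lvlm below is a hypothetical placeholder for your own model's inference call, and the file paths are placeholders:

import json

def query_lvlm(image_file, question):
    # Hypothetical placeholder: replace with your own LVLM inference; it should return "yes" or "no".
    raise NotImplementedError

with open("./output/coco/coco_pope_random.json") as f:
    questions = [json.loads(line) for line in f]

with open("answers_coco_pope_random.json", "w") as f:
    for q in questions:  # keep the original question order
        answer = query_lvlm(q["image"], q["text"])
        f.write(json.dumps({"question": q["text"], "answer": answer}) + "\n")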

Then you should specify ans_file (i.e., the answer file of the LVLM) and label_file (i.e., the POPE file) in evaluate.py and evaluate the results by running:

python evaluate.py

The program will report accuracy, precision, recall, F1 score, and the yes ratio as metrics.
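
For reference, these metrics can be computed roughly as follows. This is only a sketch of the evaluation logic (not necessarily identical to evaluate.py); the file paths are placeholders, and real model answers may need extra parsing to map them to "yes"/"no".

import json

labels = [json.loads(line)["label"] for line in open("./output/coco/coco_pope_random.json")]
answers = [json.loads(line)["answer"].strip().lower() for line in open("answers_coco_pope_random.json")]

# Treat "yes" as the positive class.
tp = sum(a == "yes" and g == "yes" for a, g in zip(answers, labels))
fp = sum(a == "yes" and g == "no" for a, g in zip(answers, labels))
tn = sum(a == "no" and g == "no" for a, g in zip(answers, labels))
fn = sum(a == "no" and g == "yes" for a, g in zip(answers, labels))

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
yes_ratio = (tp + fp) / len(labels)
print(f"Accuracy {accuracy:.4f}, Precision {precision:.4f}, Recall {recall:.4f}, F1 {f1:.4f}, Yes ratio {yes_ratio:.4f}")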

pope's Issues

Question about the code of CHAIR

Hello, thanks for your contribution to VLM evaluation! Besides the POPE metric, will you also release the code for the CHAIR metric used in your paper? Thanks!

Reproduce Table 6: the json files of COCO, A-OKVQA and GQA

Hi, thanks for your valuable contributions!
I want to reproduce the results of Table 6. However, I have encountered an issue when running the script to generate the POPE dataset: it appears that running the script produces different JSON files each time.
Would it be possible for you to share the final JSON files for the COCO, A-OKVQA, and GQA datasets after running the build script?
python main.py --auto_seg True --img_path ./segmentation/<coco><A-OKVQA><GQA>_val_images.json --seg_num 1000
Having access to these final JSON files would greatly assist me in my research.
Looking forward to your early reply! :)
Best wishes!

Word 'imange' in POPE questions

Hello,

Thank you for the nice work!

Just a quick question... I noticed 'imange' appears instead of 'image' in the annotation json files. Is it a typo, or is 'imange' intentional (since there are many of them)? Sorry if I missed anything from the paper.

Thanks for the help!

Question about Figure 2.

First of all, congrats on your great work!

My question is how to obtain the hallucination counts of a model, as shown in Figure 2. I assume the file segmentation/coco_ground_truth_segmentation.json is the ground truth indicating which objects actually appear in each image. Then, based on the caption data stored in caption_data/Instruction1_instructblip.json, I assume you use the model's generations to describe the images so that you can compare the objects the model predicts against the ground-truth objects.

But how do you compare the predicted objects with the ground-truth objects? Are predicted objects not included in the ground-truth list simply counted as hallucinations? And how do you handle synonymous objects, for example a predicted 'man' versus a ground-truth 'person'? That shouldn't count as a hallucination, right?

Looking forward to your response :)

Add LICENSE

Hi, could you please consider adding a LICENSE file to the repo?

Hello, I also found the label error in practice.

Hello, I also found the same error in practice.

May I ask how you obtained this object name list for each image?
I called the COCO API to check it, and found that it is not consistent with the name-list screenshot you showed above.

Hello, @Maxlinn

We build the POPE questions based on the object annotations of COCO. I've reviewed the annotations of the images you've referenced and confirmed that these objects are indeed present. Here are some examples:

[screenshots of the corresponding COCO annotations]

Anyway, thanks for your detailed check and we'll take your feedback into consideration as we look to enhance our dataset in the future.

Originally posted by @zwbx in #8 (comment)

Nice Work!

Looking forward to the release of the code!

Generated Questions for A-OKVQA and GQA

Thanks for your great work! We are currently following your work on object hallucination and developing techniques to mitigate this issue.
May I know if you can publish the generated questions for GQA and A-OKVQA with random, popular, and adversarial splits?

GQA 500 samples

Hi, thanks for your amazing work!
I would like to ask if you can provide the 500 samples randomly selected from GQA. This would be incredibly helpful for my evaluation, and I will definitely give proper citation.

"there are 132,062 questions in the balanced validation set of GQA, we only randomly sample 500 among them for evaluation"

I will be waiting for your answers :)
Best wishes!

Project Publication

Hi authors, congrats on the great work. May I know if you have any plan to publish this project at some conference venues?

InstructBLIP generation parameters

Hello, I want to ask what generation parameters and which language model or pretrained weights you used to produce the results in the paper. In our experiments, we found that InstructBLIP results are much higher than those reported in the paper.

For example, on the popular and adversarial splits, our accuracy is 84% and 81%, whereas the paper reports 71% and 62%.

quite a few labeling errors in pope coco annotations

Hello aoidragon and the team, thanks for your work on POPE!

Recently I have been using POPE COCO (random/adversarial/popular) to measure the hallucination of my LVLM; however, I found quite a few labeling errors in the released data. I am using the newest commit. I sincerely hope you could look into this matter.

Here are some cases:

random: {'question_id': 1271, 'image': 'COCO_val2014_000000244455.jpg', 'text': 'Is there a bicycle in the image?', 'label': 'yes'}

There seems to be no bicycle.

random: {'question_id': 1761, 'image': 'COCO_val2014_000000346707.jpg', 'text': 'Is there a bowl in the image?', 'label': 'yes'}

There is no bowl.

random: {'question_id': 1593, 'image': 'COCO_val2014_000000485485.jpg', 'text': 'Is there a car in the image?', 'label': 'yes'}

There seems to be no car.

adversarial: {'question_id': 2375, 'image': 'COCO_val2014_000000325347.jpg', 'text': 'Is there a sports ball in the image?', 'label': 'yes'}

There is no ball.

adversarial: {'question_id': 1051, 'image': 'COCO_val2014_000000018150.jpg', 'text': 'Is there a backpack in the image?', 'label': 'yes'}

There is no backpack.

adversarial: {'question_id': 1327, 'image': 'COCO_val2014_000000287035.jpg', 'text': 'Is there a tv in the image?', 'label': 'yes'}

There is no tv.
