
pope's Introduction

POPE: Polling-based Object Probing Evaluation for Object Hallucination

This repo provides the source code & data of our paper: Evaluating Object Hallucination in Large Vision-Language Models (EMNLP 2023).

@inproceedings{Li-hallucination-2023,
  title={Evaluating Object Hallucination in Large Vision-Language Models},
  author={Yifan Li and Yifan Du and Kun Zhou and Jinpeng Wang and Wayne Xin Zhao and Ji-Rong Wen},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023},
  url={https://openreview.net/forum?id=xozJw0kZXF}
}


Update

  • [10/8] 🎉🎉🎉 Our work is accepted to EMNLP 2023!
  • [6/25] We upload the caption results used in Table 2. You can find them under ./caption_data.
  • [6/3] We incorporate the code for using POPE with SEEM to handle unannotated datasets.
  • [6/1] We release the code of POPE.

Collect ground-truth objects

POPE can be easily built on top of datasets that provide object annotations for each image. With the help of automatic segmentation tools like SEEM, you can also build POPE on any dataset you want to test.

From annotations

If you want to build POPE on datasets with object annotations (e.g., COCO), you should first organize the annotations in a json file with the following format:

{"image": "COCO_val2014_000000131089.jpg", "objects": ["person", "baseball bat"]}
{"image": "COCO_val2014_000000393225.jpg", "objects": ["bowl", "spoon", "carrot"]}

Here image is the filename of the image and objects is the list of objects extracted from its annotations. You can also add other keys (e.g., image_id in COCO), but the above two must be included.

We provide the annotation results of the validation set of COCO 2014 under ./segmentation/. We refer to LisaAnne/Hallucination to collect objects from segmentations and captions.
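
For reference, here is a minimal sketch of how such a file could be produced from the COCO instance annotations. This is only an illustration (not the exact script used in this repo); it assumes pycocotools is installed, and the paths are placeholders.

import json
from pycocotools.coco import COCO

# Placeholder path to the COCO 2014 validation instance annotations.
coco = COCO("annotations/instances_val2014.json")

with open("coco_ground_truth_objects.json", "w") as f:
    for img in coco.loadImgs(coco.getImgIds()):
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img["id"]))
        # Collect the unique category names annotated in this image.
        objects = sorted({coco.loadCats(a["category_id"])[0]["name"] for a in anns})
        f.write(json.dumps({"image": img["file_name"], "objects": objects}) + "\n")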

From automatic segmentation results

Besides building POPE from object annotations, our method also supports building POPE on raw images. Leveraging automatic segmentation tools (e.g., SEEM), we can first extract the objects in each image and then build POPE as in the method above. If you want to build POPE with SEEM, you should first list the images in a json file with the following format:

[{"image": "COCO_val2014_000000131089.jpg"}, {"image": "COCO_val2014_000000393225.jpg"}]

Build POPE

Once you have the ground-truth objects prepared, you can build your own POPE by running:

python main.py

You can customize your POPE by specifying these configs:

  • --auto_seg: Whether to use automatic segmentation results. Default = False.
  • --img_path: The path to the json file that contains the paths of the images to be segmented.
  • --seg_path: The path to the segmentation results that contain the ground-truth objects in each image.
  • --seg_num: The number of images to be segmented. Default = 1000.
  • --sample_num: The number of negative objects to be sampled for each image. Default = 3.
  • --img_num: The number of images for building POPE. Default = 500.
  • --template: The prompt template. Default = "Is there a {} in the image?".
  • --dataset: The dataset name used for the filename of the built POPE. Default = "coco".
  • --save_path: The save path of the built POPE. Default = "./output/"

If you want to employ SEEM to segment images, you can build POPE by running:

python main.py --auto_seg True --img_path ./segmentation/coco_val_images.json --seg_num 1000

After the execution, you will find 5 json files under "./output/{dataset}/":

  • {dataset}_pope_random.json: The POPE built with the random negative sampling strategy. Each question is a dict in the following format:

    {"question_id": 1, "image": "COCO_val2014_000000016631.jpg", "text": "Is there a person in the image?", "label": "yes"}
    {"question_id": 2, "image": "COCO_val2014_000000016631.jpg", "text": "Is there a refrigerator in the image?", "label": "no"}
  • {dataset}_pope_popular.json: The POPE built with the popular negative sampling strategy.

  • {dataset}_pope_adversarial.json: The POPE built with the adversarial negative sampling strategy.

  • {dataset}_ground_truth_objects.json: The appearance frequencies of all objects in the selected images.

  • {dataset}_co_occur.json: The co-occurrence frequencies of all objects in the selected images.

Evaluation

Now you can use the built POPE files to evaluate LVLMs and measure their object hallucination. The answers of the LVLM should be organized in a json file in the following format:

{"question": "is there a bird in the image?", "answer": "yes"}
{"question": "is there a tree in the image?", "answer": "no"}

Note that the answers must be in the same order as the POPE questions.
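
A minimal sketch of how such an answer file could be produced from one of the built POPE files; query_lvlm below is a hypothetical placeholder for your own model's inference call, and the file paths are placeholders:

import json

def query_lvlm(image_file, question):
    # Hypothetical placeholder: replace with your own LVLM inference; it should return "yes" or "no".
    raise NotImplementedError

with open("./output/coco/coco_pope_random.json") as f:
    questions = [json.loads(line) for line in f]

with open("answers_coco_pope_random.json", "w") as f:
    for q in questions:  # keep the original question order
        answer = query_lvlm(q["image"], q["text"])
        f.write(json.dumps({"question": q["text"], "answer": answer}) + "\n")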

Then you should specify ans_file (i.e., the answer file of the LVLM) and label_file (i.e., the POPE file) in evaluate.py and evaluate the results by running:

python evaluate.py

The program will report accuracy, precision, recall, F1 score, and the yes ratio as metrics.
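
For reference, these metrics can be computed roughly as follows. This is only a sketch of the evaluation logic (not necessarily identical to evaluate.py); the file paths are placeholders, and real model answers may need extra parsing to map them to "yes"/"no".

import json

labels = [json.loads(line)["label"] for line in open("./output/coco/coco_pope_random.json")]
answers = [json.loads(line)["answer"].strip().lower() for line in open("answers_coco_pope_random.json")]

# Treat "yes" as the positive class.
tp = sum(a == "yes" and g == "yes" for a, g in zip(answers, labels))
fp = sum(a == "yes" and g == "no" for a, g in zip(answers, labels))
tn = sum(a == "no" and g == "no" for a, g in zip(answers, labels))
fn = sum(a == "no" and g == "yes" for a, g in zip(answers, labels))

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
yes_ratio = (tp + fp) / len(labels)
print(f"Accuracy {accuracy:.4f}, Precision {precision:.4f}, Recall {recall:.4f}, F1 {f1:.4f}, Yes ratio {yes_ratio:.4f}")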

pope's Issues

Question about the code of CHAIR

Hello, thanks for your contribution to VLM evaluation! Besides the POPE metric, will you also release the code for the CHAIR metric used in your paper? Thanks!

Reproduce Table 6: the json files of COCO, A-OKVQA and GQA

Hi, thanks for your valuable contributions!
I want to reproduce the results of Table 6. However, I have encountered an issue when running the script to generate the POPE dataset: it appears that running the script produces different JSON files each time.
Would it be possible for you to share the final JSON files for the COCO, A-OKVQA, and GQA datasets after running the build script?
python main.py --auto_seg True --img_path ./segmentation/<coco><A-OKVQA><GQA>_val_images.json --seg_num 1000
Having access to these final JSON files would greatly assist me in my research.
Looking forward to your early reply! :)
Best wishes!

Word 'imange' in POPE questions

Hello,

Thank you for the nice work!

Just a quick question... I noticed 'imange' appears instead of 'image' in the annotation json files. Is it a typo, or is 'imange' intentional (since there are many of them)? Sorry if I missed anything from the paper.

Thanks for the help!

Question about Figure 2.

First of all, congrats on your great work!

My question is how to obtain the hallucination counts of a model, as shown in Figure 2. I assume the file segmentation/coco_ground_truth_segmentation.json is the ground truth indicating which objects actually appear in each image. Then, based on the caption data stored in caption_data/Instruction1_instructblip.json, I assume you use the model's generations to describe the images so that you can compare the objects the model predicts against the ground-truth objects.

But how do you compare the predicted objects with the ground-truth objects? Are predicted objects not included in the ground-truth list simply counted as hallucinations? And how do you handle synonymous objects, for example a predicted 'man' versus a ground-truth 'person'? That shouldn't count as a hallucination, right?

Looking forward to your response :)

Add LICENSE

Hi, could you please consider adding a LICENSE file to the repo?

Hello, I also found the label error in practice.

Hello, I also found the same error in practice.

May I ask how you obtained this object name list for each image?
I called the COCO API to check it, and found that it is not consistent with the name-list screenshot you showed above.

Hello, @Maxlinn

We build the POPE questions based on the object annotations of COCO. I've reviewed the annotations of the images you've referenced and confirmed that these objects are indeed present. Here are some examples:

[screenshots of the corresponding COCO annotations]

Anyway, thanks for your detailed check and we'll take your feedback into consideration as we look to enhance our dataset in the future.

Originally posted by @zwbx in #8 (comment)

Nice Work!

Looking forward to the release of the code!

Generated Questions for A-OKVQA and GQA

Thanks for your great work! We are currently following your work on object hallucination and developing techniques to mitigate this issue.
May I know if you can publish the generated questions for GQA and A-OKVQA with random, popular, and adversarial splits?

GQA 500 samples

Hi, thanks for your amazing work!
I would like to ask if you can provide the 500 samples randomly selected from GQA. This would be incredibly helpful for my evaluation, and I will definitely give proper citation.

"there are 132,062 questions in the balanced validation set of GQA, we only randomly sample 500 among them for evaluation"

I will be waiting for your answers :)
Best wishes!

Project Publication

Hi authors, congrats on the great work. May I know if you have any plan to publish this project at some conference venues?

InstructBLIP generation parameters

Hello, I want to ask what generation parameters and which language model or pretrained weights you used to produce the results in the paper. In our experiments, we found that InstructBLIP results are much higher than those reported in the paper.

For example, on the popular and adversarial splits, our accuracy is 84% and 81%, whereas the paper reports 71% and 62%.

quite a few labeling errors in pope coco annotations

Hello aoidragon and the team, thanks for your work on POPE!

Recently I have been using POPE COCO (random/adversarial/popular) to measure the hallucination of my LVLM; however, I found quite a few labeling errors in the released data. I am using the newest commit. I sincerely hope you could look into this matter.

Here are some cases:

random: {'question_id': 1271, 'image': 'COCO_val2014_000000244455.jpg', 'text': 'Is there a bicycle in the image?', 'label': 'yes'}

There seems to be no bicycle.

random: {'question_id': 1761, 'image': 'COCO_val2014_000000346707.jpg', 'text': 'Is there a bowl in the image?', 'label': 'yes'}

There is no bowl.

random: {'question_id': 1593, 'image': 'COCO_val2014_000000485485.jpg', 'text': 'Is there a car in the image?', 'label': 'yes'}

There seems to be no car.

adversarial: {'question_id': 2375, 'image': 'COCO_val2014_000000325347.jpg', 'text': 'Is there a sports ball in the image?', 'label': 'yes'}

There is no ball.

adversarial: {'question_id': 1051, 'image': 'COCO_val2014_000000018150.jpg', 'text': 'Is there a backpack in the image?', 'label': 'yes'}

There is no backpack.

adversarial: {'question_id': 1327, 'image': 'COCO_val2014_000000287035.jpg', 'text': 'Is there a tv in the image?', 'label': 'yes'}

There is no tv.
