AMBER is an LLM-free multi-dimensional benchmark for MLLM hallucination evaluation.
AMBER has fine-grained annotations and an automated evaluation pipeline
- 🔥 [11.14] Our data and annotations are being finalized and coming soon!
- 🔥 [11.14] Our paper is available at https://arxiv.org/abs/2311.07397.
1. spaCy is used for near-synonym judgment
pip install -U spacy
python -m spacy download en_core_web_lg
2. NLTK is used for object extraction
pip install nltk
Download image.zip from the following link:
Coming soon!
Use query.json to generate responses from your MLLM, and save them in the following format:
[
{"id": 1, "response": "This image ..."},
# For the generative task, the "response" is the description of the image.
{"id": 2, "response": "No"},
# For the discriminative task, the "response" is just "Yes" or "No".
...
]
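The format above could be produced with a small helper like the one below. This is a minimal sketch, not part of AMBER itself. It assumes each entry in query.json carries an "id" field. The names `build_responses`, `save_responses`, `normalize_yes_no`, `answer_fn` and `is_discriminative` are hypothetical. How you call your MLLM, and how you tell generative queries from discriminative ones, is left to the caller.

```python
import json
from typing import Callable, Dict, List


def normalize_yes_no(text: str) -> str:
    """Map a free-form answer to "Yes" or "No".

    Simple heuristic (an assumption): anything that does not start
    with "yes" is treated as "No".
    """
    return "Yes" if text.strip().lower().startswith("yes") else "No"


def build_responses(
    queries: List[Dict],
    answer_fn: Callable[[Dict], str],
    is_discriminative: Callable[[Dict], bool],
) -> List[Dict]:
    """Run a model over the queries and collect {"id", "response"} records.

    answer_fn is a hypothetical wrapper around your MLLM; is_discriminative
    is a caller-supplied predicate, since the rule for telling the two
    task types apart is not specified here.
    """
    results = []
    for query in queries:
        raw = answer_fn(query)
        if is_discriminative(query):
            response = normalize_yes_no(raw)
        else:
            response = raw.strip()
        results.append({"id": query["id"], "response": response})
    return results


def save_responses(results: List[Dict], path: str) -> None:
    """Write the records as a JSON list, matching the format shown above."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
```

The file written by `save_responses` can then be passed to inference.py via `--inference_data`. Note that the `#` comment lines in the format example are explanatory only: they are not valid JSON and should not appear in the saved file.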
python inference.py --inference_data path/to/your/inference/file