shikiw / opera

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

License: MIT License

Python 91.58% Shell 0.10% Makefile 0.01% Dockerfile 0.05% Jsonnet 0.01% MDX 5.76% Cuda 0.43% C++ 0.04% Cython 0.01% C 0.01% Jupyter Notebook 2.02%
large-multimodal-models llama multimodal vision-language-learning vision-language-model chatbot chatgpt gpt-4

opera's Introduction

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation (CVPR 2024 Highlight)


This repository provides the official PyTorch implementation of the following paper:

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Qidong Huang1,2, Xiaoyi Dong2, Pan Zhang2, Bin Wang2, Conghui He2, Jiaqi Wang2, Dahua Lin2, Weiming Zhang1, Nenghai Yu1
1University of Science and Technology of China, 2Shanghai AI Laboratory

Overview

(teaser figure)

Hallucination, a pervasive challenge for multimodal large language models (MLLMs), has significantly impeded their real-world use in applications that demand precise judgment. Existing methods mitigate the issue either by training on specially designed data or by inferencing with external knowledge from other sources, both of which incur inevitable additional costs. In this paper, we present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy, serving as a nearly free lunch that alleviates hallucination without additional data, knowledge, or training. Our approach begins with the interesting observation that most hallucinations are closely tied to the knowledge aggregation patterns manifested in the self-attention matrix: MLLMs tend to generate new tokens by focusing on a few summary tokens rather than on all previous tokens. This partial over-trust causes the model to neglect image tokens and to describe the image content with hallucinations. Based on this observation, OPERA introduces a penalty term on the model logits during beam-search decoding to mitigate the over-trust issue, along with a rollback strategy that retrospects the presence of summary tokens among the previously generated tokens and re-allocates the token selection if necessary. In extensive experiments, OPERA shows significant hallucination-mitigating performance on different MLLMs and metrics, proving its effectiveness and generality.
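To make the decoding-time idea concrete, the toy sketch below illustrates an over-trust penalty in spirit only: it scans a local window of self-attention over the generated response for a columnar "summary token" pattern and turns its strength into a scalar that a beam-search step could subtract from a beam's score. The function name and simplifications (a single layer, head-max aggregation, no threshold/rollback) are ours for illustration; the actual implementation lives in transformers-4.29.2/src/transformers/generation/utils.py.

import torch

def overtrust_penalty(self_attn, response_start, scale_factor=50.0,
                      penalty_weights=1.0, window=10):
    # self_attn: [num_heads, seq_len, seq_len] self-attention from one decoder layer.
    # Aggregate over heads, keep only response-to-response attention, then look at a
    # small local window over the most recently generated tokens.
    attn = self_attn.max(dim=0).values
    local = attn[response_start:, response_start:][-window:, -window:] * scale_factor
    # A column whose scaled (and capped) values stay high across all recent rows marks
    # a candidate "summary token" that recent generation over-trusts; its column-wise
    # product gives the penalty strength.
    column_scores = torch.clamp(local, max=1.0).prod(dim=0)
    return penalty_weights * column_scores.max()

# Toy usage with random attention: the response starts at position 40 of a 64-token sequence.
attn = torch.rand(32, 64, 64).softmax(dim=-1)
print(overtrust_penalty(attn, response_start=40))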

Setup

The main implementation of OPERA is in transformers-4.29.2/src/transformers/generation/utils.py.

So you can use OPERA decoding simply by installing our modified transformers package:

conda env create -f environment.yml
conda activate opera
python -m pip install -e transformers-4.29.2
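
As an optional sanity check (assuming the editable install above succeeded), you can verify that Python resolves transformers to the bundled 4.29.2 copy:

python -c "import transformers; print(transformers.__version__, transformers.__file__)"
# expected: 4.29.2 and a path under .../transformers-4.29.2/src/transformers/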

Note: to implement OPERA on another version of transformers, follow these steps:

  • Find the file at transformers-4.29.2/src/transformers/generation/utils.py.
  • Add the arguments to the transformers.generate function here.
  • Add the code to the transformers.generate function here.
  • Copy and paste the opera_decoding function here.

TL;DR

After setting up the environment, you can use OPERA directly with your own MLLM as follows:

# specify the location indexes of some input tokens
START_INDEX_of_IMAGE_TOKENS = <the location index of the first image token>
END_INDEX_of_IMAGE_TOKENS = <the location index of the last image token>
NUM_of_TOKENS_IN_THE_PROMPT = <the total number of tokens in the user prompt (including image tokens)>

key_position = {
  "image_start": START_INDEX_of_IMAGE_TOKENS, 
  "image_end": END_INDEX_of_IMAGE_TOKENS, 
  "response_start": NUM_of_TOKENS_IN_THE_PROMPT,
}

# add some arguments in the generate function
outputs = MLLM_model.generate(
    input_ids=input_ids,
    inputs_embeds=inputs_embeds,
    attention_mask=attention_mask,
    do_sample=False,
    num_beams=5,
    max_new_tokens=512,
    # opera
    opera_decoding=True,
    key_position=key_position,
    scale_factor=50,
    threshold=15,
    num_attn_candidates=5,
    penalty_weights=1,
)
# for a more efficient version, please use the setting below:
outputs = MLLM_model.generate(
    input_ids=input_ids,
    inputs_embeds=inputs_embeds,
    attention_mask=attention_mask,
    do_sample=False,
    num_beams=5,
    max_new_tokens=512,
    # opera
    opera_decoding=True,
    key_position=key_position,
    scale_factor=50,
    threshold=25,
    num_attn_candidates=1,
    penalty_weights=1,
)

Please refer to demo.ipynb for more details.
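
As a concrete illustration of how these indices might be derived for a LLaVA-1.5-style model (whose visual encoder contributes 576 image tokens spliced into the prompt), here is a hedged sketch; the token counts below are placeholders, so compute them from your own tokenizer and prompt template:

# Hedged illustration for a LLaVA-1.5-style layout, where 576 image tokens are spliced
# into the prompt embeddings. The token counts below are placeholders; derive them from
# your own tokenizer / prompt template.
num_tokens_before_image = 35   # e.g. system prompt + "USER:" tokens (placeholder)
num_image_tokens = 576         # visual tokens contributed by the LLaVA-1.5 encoder
num_tokens_after_image = 24    # remainder of the user prompt (placeholder)

key_position = {
    "image_start": num_tokens_before_image,
    "image_end": num_tokens_before_image + num_image_tokens - 1,
    "response_start": num_tokens_before_image + num_image_tokens + num_tokens_after_image,
}
print(key_position)  # {'image_start': 35, 'image_end': 610, 'response_start': 635}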

Evaluation

The following evaluation requires the MSCOCO 2014 dataset. Please download it and extract it to your data path.

In addition, you need to prepare the checkpoints of the 7B base models (InstructBLIP, MiniGPT-4, LLaVA-1.5, and Shikra).

Arguments

| Argument | Example | Description |
| --- | --- | --- |
| --model | llava-1.5 | The MLLM to use; this codebase supports instructblip, minigpt4, llava-1.5, and shikra. |
| --data-path | /path/to/dataset | Path to the dataset file or folder, e.g., COCO_2014/val2014/. |
| --pope-type | random | Type of POPE evaluation; supports random, popular, and adversarial. |
| --scale_factor | 50 | The factor used to scale up the self-attention weights. Default: 50. |
| --threshold | 15 | The attention threshold for triggering retrospection. Default: 15. |
| --num_attn_candidates | 5 | The number of candidates per beam. Default: 5. |
| --penalty_weights | 1 | The weight of the penalty term during decoding. Default: 1. |

POPE

python pope_eval.py --model MODEL_NAME --data_path /path/to/COCO --pope-type random --gpu-id GPU_IDs --beam 5 --scale_factor 50 --threshold 15 --num_attn_candidates 5 --penalty_weights 1

Result on Random split:

| Model | Accuracy | Precision | Recall | F1 score | Yes ratio |
| --- | --- | --- | --- | --- | --- |
| InstructBLIP 7B | 90.3 | 93.8 | 87.0 | 90.3 | 47.8 |
| MiniGPT-4 7B | 79.8 | 89.7 | 68.7 | 77.8 | 39.5 |
| LLaVA-1.5 7B | 89.4 | 90.4 | 88.8 | 89.6 | 50.6 |

Result on Popular split:

| Model | Accuracy | Precision | Recall | F1 score | Yes ratio |
| --- | --- | --- | --- | --- | --- |
| InstructBLIP 7B | 83.4 | 81.2 | 87.0 | 84.0 | 53.6 |
| MiniGPT-4 7B | 73.6 | 75.9 | 69.0 | 72.3 | 45.4 |
| LLaVA-1.5 7B | 86.0 | 84.1 | 88.8 | 86.4 | 52.8 |

Result on Adversarial split:

| Model | Accuracy | Precision | Recall | F1 score | Yes ratio |
| --- | --- | --- | --- | --- | --- |
| InstructBLIP 7B | 80.7 | 77.3 | 87.0 | 81.9 | 56.3 |
| MiniGPT-4 7B | 71.6 | 72.9 | 68.9 | 70.8 | 47.3 |
| LLaVA-1.5 7B | 79.1 | 74.4 | 88.8 | 81.0 | 59.7 |
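
For reference, the five columns above follow the standard binary yes/no definitions. The helper below is a hypothetical illustration of how such numbers can be computed from predicted and ground-truth answers; it is not part of pope_eval.py:

def pope_metrics(predictions, labels):
    # predictions / labels: lists of "yes"/"no" strings of equal length.
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    yes_ratio = (tp + fp) / len(pairs)
    return accuracy, precision, recall, f1, yes_ratio

print(pope_metrics(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))
# (0.75, 0.5, 1.0, 0.666..., 0.5)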

CHAIR

  • Generate the MLLM's responses and save them in a jsonl file:
python chair_eval.py --model MODEL_NAME --data_path /path/to/COCO --gpu-id GPU_IDs --beam 5 --scale_factor 50 --threshold 15 --num_attn_candidates 5 --penalty_weights 1

Note: Please check out our released results in log/chair_eval_results for reproduction.

  • Calculate CHAIR using the generated jsonl file:
python chair.py --cap_file /path/to/jsonl --image_id_key image_id --caption_key caption --coco_path /path/to/COCO/annotations_trainval2014/annotations/ --save_path /path/to/save/jsonl
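
For intuition, CHAIRi is the fraction of mentioned objects that do not appear in the image's ground-truth annotations, and CHAIRs is the fraction of captions containing at least one such object. The sketch below illustrates the two ratios on pre-extracted object sets (hypothetical helper; the real chair.py also maps caption words to COCO object synonyms before comparing):

def chair_scores(caption_objects, gt_objects):
    # caption_objects: one set of mentioned COCO objects per generated caption.
    # gt_objects:      the ground-truth object set for the corresponding image.
    hallucinated_captions = 0
    total_mentions, hallucinated_mentions = 0, 0
    for mentioned, truth in zip(caption_objects, gt_objects):
        bad = mentioned - truth
        hallucinated_captions += 1 if bad else 0
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
    chair_s = hallucinated_captions / len(caption_objects)
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    return chair_s, chair_i

print(chair_scores([{"dog", "frisbee"}, {"cat"}], [{"dog"}, {"cat", "sofa"}]))
# (0.5, 0.333...): one of two captions hallucinates; one of three mentions is hallucinated.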

GPT-4V

The GPT-4V evaluation requires you to specify your API key in Line 88 of gpt4v_eval.py.

python gpt4v_eval.py --model MODEL_NAME --data_path /path/to/COCO --gpu-id GPU_IDs --scale_factor 50 --threshold 15 --num_attn_candidates 5 --penalty_weights 1

Acknowledgement

This repo is based on the MLLM codebases of LAVIS and MiniGPT-4 and the CHAIR code of Maxlinn. Thanks for their impressive work!

Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{huang2024opera,
  title={Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation},
  author={Huang, Qidong and Dong, Xiaoyi and Zhang, Pan and Wang, Bin and He, Conghui and Wang, Jiaqi and Lin, Dahua and Zhang, Weiming and Yu, Nenghai},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13418--13427},
  year={2024}
}

opera's People

Contributors

shannany0606, shikiw


opera's Issues

Model inference speed is slow

Is it an environment problem, or is the algorithm itself slow? Are there any recommended parameter settings that balance speed and performance?

Discrepancy in Random Split Numbers for POPE (#2910 vs #3000)

Hi,

Thanks for the excellent work!

I noticed that the number of samples in the random split of POPE mentioned in your work is different from the official GitHub page of POPE (2910 vs 3000). Could you tell me if there were additional steps, such as data filtering, applied in this work?

Best,

Does the method for finding the `knowledge aggregation pattern` have any relevant papers in the NLP domain?

Hello, thank you for your great work!

I'm confused about the knowledge aggregation pattern and the anchor token. Does it refer to the number of segments with different meanings in the LLM's response?

For the example below, would the "knowledge aggregation pattern" count be 2?
The image features a blue bowl filled with a delicious mixture of bananas, nuts, and oatmeal. The bowl is placed on a dining table, and a spoon is resting inside the bowl, ready to be used for enjoying the meal. In addition to the bowl of food, there are a few other items on the table. A bottle can be seen on the left side of the table, while a cup is positioned towards the top right corner. A book is also present on the right side of the table, adding to the cozy atmosphere of the scene.

Are there any relevant papers in the NLP domain on the method for finding the "knowledge aggregation pattern"?

I'm wondering if this method can be transferred to detecting semantic transitions in an LLM's response.

Thank you for your time! And hope for your reply~

(My email is [email protected])

Visualization issue

Hi! When I use the vis.ipynb script to visualize my attention layers, I get the error TypeError: tuple indices must be integers or slices, not tuple. Why does this happen, and how can I fix it? I am visualizing the LLaVA-1.5-7B model, and the error is raised when model.generate is called with return_dict_in_generate=True.

Problem reproducing the POPE results

Hi, I tried to reproduce the POPE beam-5 and greedy results with the LLaVA-v1.5-7b model, but my numbers differ a lot from those in the paper.
For beam-5 I run the code below (with the OPERA arguments commented out) and set beam=5; for greedy I run the same code with beam=1.

out = model.generate(
    {"image": image, "prompt": qu},
    use_nucleus_sampling=args.sample,
    num_beams=args.beam,
    max_new_tokens=10,
    output_attentions=True,
    # opera_decoding=True,
    # scale_factor=args.scale_factor,
    # threshold=args.threshold,
    # num_attn_candidates=args.num_attn_candidates,
    # penalty_weights=args.penalty_weights,
)

Moreover, the numbers are identical on every run; I tried changing the seed in eval_configs/llava-1.5_eval.yaml to other values, but the results stay the same.
How can I reproduce the beam and greedy results?

AttributeError: 'MiniGPT4' object has no attribute 'embed_tokens'

I tried to reproduce your result using the MiniGPT-4 backbone. It seems there is a bug in the /home/czr/contrast_decoding_LVLMs/sota_to_compare/OPERA/minigpt4/models/mini_gpt4.py file.

Here is the complete output:

Initializing Model
Loading VIT
Loading VIT Done
Do not use Q-Former here.
Loading LLAMA
Loading checkpoint shards: 100%|████████████████████████████| 2/2 [00:17<00:00, 8.52s/it]
Loading LLAMA Done
Load BLIP2-LLM Checkpoint: /home/czr/contrast_decoding_LVLMs/model_checkpoints/pretrained_minigpt4_llama2_7b.pth
Compose(
    Resize(size=(224, 224), interpolation=bicubic, max_size=None, antialias=warn)
    ToTensor()
    Lambda()
)
Done!
0%| | 0/50 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/czr/contrast_decoding_LVLMs/sota_to_compare/OPERA/chair_eval.py", line 174, in <module>
    out = model.generate(
  File "/home/czr/anaconda3/envs/minigptv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/czr/contrast_decoding_LVLMs/sota_to_compare/OPERA/minigpt4/models/mini_gpt4.py", line 362, in generate
    inputs_embeds, attention_mask, img_start_pos = self.prompt_wrap(img_embeds, atts_img, instruction)
  File "/home/czr/contrast_decoding_LVLMs/sota_to_compare/OPERA/minigpt4/models/mini_gpt4.py", line 226, in prompt_wrap
    p_before_embed = self.embed_tokens(p_before_tokens.input_ids)
  File "/home/czr/anaconda3/envs/minigptv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'MiniGPT4' object has no attribute 'embed_tokens'

Could you please tell me what is going wrong so I can reproduce your reported results and cite your paper? Appreciate it :)

GPU information

Hi, is there any detailed information about the GPUs used in the paper?

Truncation of generated results

Thank you for your great work!
I tried to run inference with my own model (fine-tuned from LLaVA-1.5) using OPERA's method, but about 1/3 of the inference results appear to be truncated early.
I followed the method in demo.ipynb and changed Line 14 of eval_configs/llava-1.5_eval.yaml to the path of the fine-tuned model. With beam=5 and max_new_tokens=1024, the inference results show early truncation like the following:

{
    "question_id": 26, 
    "image": "test_imgs/0318_14.jpeg", 
    "text": "Please analyze the relationship between these animals.", 
    "type": "Relation reasoning", 
    "caption": "In the image, a white bird is perched on the back of a gray elePHant. This unique situation suggests a certain level of tolerance and possibly even symbiotic relationship between the two animals. It is possible that the elephat and th"
}

How can I solve this problem?

Shikra Version

Which version of the Shikra merged 7B model are you using? Please...

reproducing the result

Hello, thank you for the great work!

I'm trying to reproduce the POPE / CHAIR results.

When I evaluate the POPE popular split on LLaVA-1.5, I get the following result:

Accuracy: 0.868
Precision: 0.8780821917808219
Recall: 0.8546666666666667
F1 score: 0.8662162162162163
Yes ratio: 0.4866666666666667

which is a bit higher than the numbers reported in README.md.
The result was the same whether I set the random seed to 42 (the default) or left it unset.
Is there anything else that could affect the result?

I have also run the LLaVA-1.5 CHAIR evaluation:

CHAIRs    : 49.9
CHAIRi    : 14.1
Recall    : 78.3
Len       : 94.7

This is worse than any baseline in Table 1.
Could you please share the list of 500 random COCO images so I can reproduce the paper's results?

Does it work well on videos?

Nice work! Have you tried it on videos? Does it still work well on video-text models such as VideoChat, Video-LLaMA, and Video-ChatGPT?

Issue about visualization

When I run the vis.ipynb, the following issue occurs:
attns = [attn.clone() for attn in out.attentions]
AttributeError: 'tuple' object has no attribute 'clone'

It seems that attn is a tuple, not the attention map itself.
The model I used is LLaVA-v1.5-7B and the transformers version is 4.29.2.

Can you help me fix this issue? Thank you very much!

Can not reproduce the results on LLaVA-1.5

Hi, I cannot reproduce the results reported in your paper after installing the modified transformers package according to your guidance. Can you share your complete code for LLaVA and InstructBLIP?

Questions about function prepare_inputs_labels_for_multimodal

Hi, when I was running the code, I found that most of the time it hit the branch at line 248 of prepare_inputs_labels_for_multimodal (OPERA/minigpt4/models/llava_arch.py). Does it mean there is only one token in the prompt? Why does this happen?

Question about the heatmap

A quick question: in the heatmap drawn with vis, does a vertical stripe mean that this token (on the x-axis) has a large influence on the generation of subsequent tokens?

Random 500 samples in MSCOCO?

Hi, great and insightful work!

In your paper, when evaluating OPERA and other decoding methods on CHAIR, 500 samples are selected randomly from MSCOCO. I'm wondering if you could provide the index list of your sampled images to help reproduce the results.

Best,

CUDA error: out of memory

Thank you for your wonderful work!
I ran demo.ipynb on an A6000 with 48 GB of memory, but it reports CUDA error: out of memory.

Is 48 GB not enough to load the model? The LVLM model I used is llava-1.5.

Thank you very much!

reproducing shikra problem

When reproducing the results of Shikra, the evaluation outputs are wrong, as follows:
Done!
load data finished
Start eval...
0%| | 0/2910 [00:00<?, ?it/s]/root/autodl-tmp/OPERA/transformers-4.29.2/src/transformers/generation/utils.py:1262: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer
0%| | 1/2910 [00:06<5:10:01, 6.39s/it]
noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer
0%| | 2/2910 [00:10<4:06:04, 5.08s/it]
noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer
0%| | 3/2910 [00:14<3:45:29, 4.65s/it]
noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer
0%|▏ | 4/2910 [00:18<3:35:52, 4.46s/it]
noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer noreferrer
0%|▏ | 5/2910 [00:23<3:30:31, 4.35s/it]

Error when running vis.ipynb

Hi, thank you very much for open-sourcing the code of this paper. I ran into the following bug when running vis.ipynb; do you know what causes it?

ValueError: Unrecognized configuration class <class 'minigpt4.models.llava_llama.LlavaConfig'> to build an AutoTokenizer.
Model type should be one of AlbertConfig, AlignConfig, BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BlipConfig, Blip2Config, BloomConfig, BridgeTowerConfig, CamembertConfig, CanineConfig, ChineseCLIPConfig, ClapConfig, CLIPConfig, CLIPSegConfig, CodeGenConfig, ConvBertConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DebertaConfig, DebertaV2Config, DistilBertConfig, DPRConfig, ElectraConfig, ErnieConfig, ErnieMConfig, EsmConfig, FlaubertConfig, FNetConfig, FSMTConfig, FunnelConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, GPTSanJapaneseConfig, GroupViTConfig, HubertConfig, IBertConfig, JukeboxConfig, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LiltConfig, LlamaConfig, LongformerConfig, LongT5Config, LukeConfig, LxmertConfig, M2M100Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MgpstrConfig, MobileBertConfig, MPNetConfig, MT5Config, MvpConfig, NezhaConfig, NllbMoeConfig, NystromformerConfig, OneFormerConfig, OpenAIGPTConfig, OPTConfig, OwlViTConfig, PegasusConfig, PegasusXConfig, PerceiverConfig, Pix2StructConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, RagConfig, RealmConfig, ReformerConfig, RemBertConfig, RetriBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2TextConfig, Speech2Text2Config, SpeechT5Config, SplinterConfig, SqueezeBertConfig, SwitchTransformersConfig, T5Config, TapasConfig, TransfoXLConfig, ViltConfig, VisualBertConfig, Wav2Vec2Config, Wav2Vec2ConformerConfig, WhisperConfig, XCLIPConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YosoConfig.

CHAIR hallucination evaluation

Hello! I encountered some problems when using code to reproduce the CHAIR metric in the paper.
When I set max_new_tokens to 64, I obtained CHAIRs=19.4 and CHAIRi=6.4, which is somewhat different from the CHAIRs=14.2 and CHAIRi=5.2 in the paper. When max_new_tokens is set to 512, the results are similar to those in the paper. In these two experiments, I only changed max_new_tokens in the model.generate call in chair_eval.py (from 512 to 64).
Therefore, I would like to ask how to reproduce the max_new_tokens=64 result from the paper. Do I need to change any other parameters or code?

What should key_position be on mPLUG-Owl2?

I'm trying to apply OPERA to mPLUG-Owl2; however, I'm stuck on which values to assign to the "image_start", "image_end", and "response_start" keys. I tried to copy the setup code from LLaVA-1.5, but it didn't work. I'm also having a hard time understanding why NUM_IMAGE_TOKENS = 576 for LLaVA-1.5.

Could you kindly give me a hint on what these three values represent and how to set them for mPLUG-Owl2? Thank you!

CHAIR Reproduction Bugs

Thanks for your great work and open-sourcing your codes!

I am working on reproducing the CHAIR results for OPERA (presented in Tables 2 and 3). There is a size-mismatch error when running python chair_eval.py on a single 4090 GPU. May I ask if there is something wrong with my evaluation script? For reference, both the evaluation script and the reported error are presented below (I have tried both max_new_tokens=512 and max_new_tokens=64 and encounter the same issue).

evaluation script

python chair_eval.py \
    --model llava-1.5 \
    --data_path ./data/coco/val2014/ \
    --gpu-id 1 --beam 5 --scale_factor 50 \
    --threshold 15 --num_attn_candidates 5 \
    --penalty_weights 1

reported error (happens after a few steps)

Traceback (most recent call last):
  File "/home/xingy/OPERA/chair_eval.py", line 172, in <module>
    out = model.generate(
  File "/home/xingy/anaconda3/envs/opera/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/xingy/OPERA/minigpt4/models/llava.py", line 211, in generate
    output_ids = self.llama_model.generate(
  File "/home/xingy/anaconda3/envs/opera/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/xingy/OPERA/transformers-4.29.2/src/transformers/generation/utils.py", line 1649, in generate
    return self.opera_beam_search(
  File "/home/xingy/OPERA/transformers-4.29.2/src/transformers/generation/utils.py", line 3353, in opera_beam_search
    attn_previous = torch.cat(
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 645 but got size 667 for tensor number 1 in the list.

opera_greedy_search implemented ?

I am just opening this issue to confirm that opera_greedy_search is not implemented yet, even though it is called in your utils.py. Is that right? Thanks.

Question about whether a sentence is hallucinated

When the model does not generate hallucinated content, there are still tokens with an obvious "columnar" attention pattern. What do you do to avoid applying the penalty and rollback in such cases?

Reproducing MiniGPT-4's POPE result

Hi, authors! Excellent work! I'm curious how I can reproduce MiniGPT-4's POPE result. I have executed the provided script, but the results seem to be inconsistent with those reported in Table 4.

Questions about Figure 3 in paper

We are interested in Figure 3 of the paper, since it appears to be the starting point of the work.
We would like to know how you define knowledge aggregation patterns when collecting the data for Figure 3.
Is it by visual inspection?
And how was "within 10 tokens" calculated?

Looking forward to your reply, thank you!

Over-Trust Logit Penalty

Thank you for this amazing paper! I have briefly checked your code, but I did not find the specific implementation of the Over-Trust Logit Penalty. Could you please point me to the relevant code? Thank you very much for your help.

Issue about the visual case provided in the paper

Thank you for your remarkable work. I utilized the visualization code you provided to visualize the attention maps from the images in your log folder, but I did not find the case presented in the paper. Could you please provide the file ID for the visualization case mentioned in the paper, if it's convenient for you?

Random 500 samples in MSCOCO

Hi, great and insightful work!
In your paper, when evaluating OPERA and other decoding methods on CHAIR, 500 samples are selected randomly from MSCOCO. I'm wondering if you could provide the index list of your sampled images to help reproduce the results.
Best,

my email is [email protected], thanks!

Attention map plotting

Dear authors,

Thank you for this wonderful paper! I reproduced your Figure 2 (attention map from InstructBLIP) and got the following result. I did not notice the outstanding pattern highlighted in the paper (in the red box). Compared with the word "Additionally", the word "that" seems to have much more impact. Do you have any idea where things might have gone wrong? Thank you for your help.

(attached image: rejected_attention_map)

Questions about the IM_START and IM_END tokens

Thanks authors for the great work! I have a question regarding the

"image_start": START_INDEX_of_IMAGE_TOKENS, 
"image_end": END_INDEX_of_IMAGE_TOKENS, 

As these two tokens have been deprecated since the 1.3 release for LLaVA, could you please provide some instructions on how to specify these in the settings? Looking forward to the reply.
