amazon-science / mm-cot

Official implementation for "Multimodal Chain-of-Thought Reasoning in Language Models" (stay tuned and more will be updated)

Home Page: https://arxiv.org/abs/2302.00923

License: Apache License 2.0

Python 99.69% Shell 0.31%

mm-cot's Introduction

Multimodal Chain-of-Thought Reasoning in Language Models

"Imagine learning a textbook without figures or tables."

Multimodal-CoT incorporates vision features in a decoupled training framework. The framework consists of two training stages: (i) rationale generation and (ii) answer inference. Both stages share the same model architecture but differ in the input and output.
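
The two stages described above can be sketched roughly as follows. This is only an illustration of the data flow, not the repository's code: the prompt strings, the helper names, and the two model callables are assumptions, and the vision features that the real model consumes alongside the text are left out.

```python
from typing import Callable, Sequence, Tuple


def build_stage1_input(question: str, context: str, options: Sequence[str]) -> str:
    """Stage (i) input: question, context, and multiple-choice options."""
    letters = "ABCDE"
    opts = " ".join(f"({letters[i]}) {o}" for i, o in enumerate(options))
    return f"Question: {question}\nContext: {context}\nOptions: {opts}"


def build_stage2_input(stage1_input: str, rationale: str) -> str:
    """Stage (ii) input: the stage (i) input with the generated rationale appended."""
    return f"{stage1_input}\n{rationale}"


def two_stage_answer(
    question: str,
    context: str,
    options: Sequence[str],
    rationale_model: Callable[[str], str],  # hypothetical stand-in for the stage (i) model
    answer_model: Callable[[str], str],     # hypothetical stand-in for the stage (ii) model
) -> Tuple[str, str]:
    """Run rationale generation, then answer inference on the augmented input."""
    x1 = build_stage1_input(question, context, options)
    rationale = rationale_model(x1)
    x2 = build_stage2_input(x1, rationale)
    return rationale, answer_model(x2)
```

Both stages take text in and produce text out, so the same architecture can serve for each; only the input (with or without the rationale) and the target (rationale or answer) change.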

Requirements

Install all required Python dependencies:

pip install -r requirements.txt

Datasets

Download the dataset from the following repository:

https://github.com/lupantech/ScienceQA/tree/main/data

The vision features (detr, resnet, clip, vit) are available at https://huggingface.co/cooelf/vision_features/tree/main

Alternatively, you may download the extracted vision features (detr, resnet, clip) from vision_features and unzip the files under the vision_features folder.

Extract Features (optional)

The processed vision features for ScienceQA are available at https://huggingface.co/cooelf/vision_features/tree/main.

The following instructions show how we obtain those features.

Download the image files from Google Drive and unzip all the images (train, dev, test) into the same folder (images). The structure should be:

images
├── 1
│   └── image.png
├── 2
│   └── image.png
├── 3
│   └── image.png
├── 5
│   └── image.png
├── 7
│   └── image.png

Run extract_features.py --data_root images --output_dir vision_features --img_type vit

If you want to use your own images, structure them as shown above or modify the script extract_features.py.
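
Before running the extraction on your own images, it may help to confirm that the folder matches the layout shown above. The helper below is a small sketch written for this page, not part of the repository; the function name and the default file name image.png are assumptions taken from the directory tree.

```python
from pathlib import Path


def find_missing_images(data_root, filename="image.png"):
    """Return the names of sub-folders under data_root that lack the expected image file."""
    root = Path(data_root)
    missing = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        if not (folder / filename).is_file():
            missing.append(folder.name)
    return missing
```

Calling find_missing_images("images") returns an empty list when every problem folder contains an image.png.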

Extract Captions (optional)

The processed captions for ScienceQA are available at data/instruct_captions.json.

The following instructions show how we obtain those captions.

Install LAVIS and prepare Vicuna weights to use InstructBLIP for caption extraction.

https://github.com/salesforce/LAVIS/tree/f982acc73288408bceda2d35471a8fcf55aa04ca/projects/instructblip

Assume that the images are stored in the images folder.

python extract_caption.py

Instructions

Training

# rationale generation
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
    --data_root data/ScienceQA/data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg rationale --img_type vit \
    --bs 2 --eval_bs 4 --epoch 50 --lr 5e-5 --output_len 512 \
    --use_caption --use_generate --prompt_format QCM-E \
    --output_dir experiments

# answer inference
CUDA_VISIBLE_DEVICES=0,1,2,3 python main_central.py \
    --data_root data/ScienceQA/data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg answer --img_type vit \
    --bs 4 --eval_bs 8 --epoch 50 --lr 5e-5 --output_len 64 \
    --use_caption --use_generate --prompt_format QCMG-A \
    --output_dir experiments \
    --eval_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_eval.json \
    --test_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_test.json
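
The --eval_le and --test_le paths must point at the directory written by the rationale stage. Judging only from the paths and logs on this page (bs 2 on four GPUs gives bs8, bs 8 on two GPUs gives bs16), the directory name appears to combine the arguments as sketched below, with the batch size multiplied by the number of visible GPUs. This pattern is inferred, not taken from main.py, and the helper name is hypothetical.

```python
def experiment_dir_name(user_msg, model, img_type, prompt_format,
                        lr, bs, n_gpus, output_len, epoch):
    """Guess the experiment sub-directory name from the training arguments."""
    model_tag = model.replace("/", "-")  # "declare-lab/flan-alpaca-large" -> "declare-lab-flan-alpaca-large"
    return (
        f"{user_msg}_{model_tag}_{img_type}_{prompt_format}"
        f"_lr{lr}_bs{bs * n_gpus}_op{output_len}_ep{epoch}"
    )


# Arguments of the rationale-generation command above, run on four GPUs.
name = experiment_dir_name(
    "rationale", "declare-lab/flan-alpaca-large", "vit", "QCM-E",
    5e-5, 2, 4, 512, 50,
)
print(f"experiments/{name}/predictions_ans_eval.json")
```

If you train with a different batch size or GPU count, the bs part of the path will likely change, so adjust --eval_le and --test_le accordingly.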

Inference

Our trained models are available at https://huggingface.co/cooelf/mm-cot/tree/main. To use our trained models, please put them under the models folder.

# rationale generation
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
    --data_root data/ScienceQA/data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg rationale --img_type vit \
    --bs 2 --eval_bs 4  --epoch 50 --lr 5e-5 --output_len 512 \
    --use_caption --use_generate --prompt_format QCM-E \
    --output_dir experiments \
    --evaluate_dir models/mm-cot-large-rationale

# answer inference
CUDA_VISIBLE_DEVICES=0,1,2,3 python main_central.py \
    --data_root data/ScienceQA/data \
    --caption_file data/instruct_captions.json \
    --model declare-lab/flan-alpaca-large \
    --user_msg answer --img_type vit \
    --bs 4 --eval_bs 8 --epoch 50 --lr 5e-5 --output_len 64  \
    --use_caption --use_generate --prompt_format QCMG-A \
    --output_dir experiments \
    --eval_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_eval.json \
    --test_le experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_test.json \
    --evaluate_dir models/mm-cot-large-answer

Citing MM-CoT

@article{zhang2023multicot,
  title={Multimodal Chain-of-Thought Reasoning in Language Models},
  author={Zhang, Zhuosheng and Zhang, Aston and Li, Mu and Zhao, Hai and Karypis, George and Smola, Alex},
  journal={arXiv preprint arXiv:2302.00923},
  year={2023}
}

License

This project is licensed under the Apache-2.0 License.

Acknowledgement

Parts of our code are adapted from ScienceQA, Transformers, and pytorch-image-models.

We thank Pan Lu for providing the parameter sizes for the ScienceQA baselines.


mm-cot's Issues

I can't fetch the vision_features datasets

Hi, thanks for your significant work!
I'd like to reproduce your experiment, but I'm having trouble getting the vision_features datasets.
I would appreciate any available link or suggestions for this problem.

Why is Multimodal Chain-of-Thought still significantly better than UnifiedQA when there is no visual input?

Dear authors,

Thanks for your exciting and solid work.

May I ask why Multimodal Chain-of-Thought is still significantly better than UnifiedQA when there is no visual input (e.g., the text context category and the no context category of ScienceQA)? I understand that one potential reason is your decoupled framework. But even without the decoupled framework (Table 5), Multimodal-CoT outperforms UnifiedQA (Table 4) by a large margin when both are evaluated on the no context category.

Besides, the "w/o Vision Features" setup in Table 5 sees a drastic decrease in the no context category. Does this mean that questions without visual input also benefit from a model trained together with visual information?

The performance of Multimodal-CoT w/o two-stage on SOC

Dear authors,

Thanks for your amazing work.

I wonder why Multimodal-CoT w/o two-stage outperforms the full Multimodal-CoT on SOC in Table 5. This is a huge gap and it seems odd.

Out of memory during eval but not train?

Description:
During the evaluation phase, the computer's memory (host RAM, not CUDA memory) keeps increasing, and the program is eventually killed.
Server Base Configuration :
GPU : V100S * 2
RAM : 256GB
May I ask if you have modified the source files in the HuggingFace Transformers? What configurations are needed to implement the code?

There appear to be 1 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '

When using the trained model, I got this error at 8% progress:

[1] 75076 killed python main.py --model allenai/unifiedqa-t5-base --user_msg rationale detr
resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`image_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

We are trying to train the model but are getting the following error. Please help us resolve it.

'E'], epoch=50, lr=5e-05, bs=2, input_len=512, output_len=512, eval_bs=4, eval_acc=None, train_split='train', val_split='val', test_split='test', use_generate=True, final_eval=False, user_msg='rationale', img_type='vit', eval_le=None, test_le=None, evaluate_dir=None, caption_file='data/instruct_captions.json', use_caption=True, prompt_format='QCM-E', seed=42)

Downloading tokenizer_config.json: 0%| | 0.00/2.50k [00:00<?, ?B/s]
Downloading tokenizer_config.json: 100%|##########| 2.50k/2.50k [00:00<00:00, 332kB/s]

Downloading (…)cial_tokens_map.json: 0%| | 0.00/2.20k [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json: 100%|##########| 2.20k/2.20k [00:00<00:00, 441kB/s]

Downloading config.json: 0%| | 0.00/1.53k [00:00<?, ?B/s]
Downloading config.json: 100%|##########| 1.53k/1.53k [00:00<00:00, 382kB/s]

Downloading model.safetensors: 0%| | 0.00/990M [00:00<?, ?B/s]
...
Downloading model.safetensors: 100%|##########| 990M/990M [11:47<00:00, 1.40MB/s]
Some weights of T5ForMultimodalGeneration were not initialized from the model checkpoint at declare-lab/flan-alpaca-base and are newly initialized: ['encoder.image_dense.weight', 'encoder.mha_layer.in_proj_weight', 'encoder.mha_layer.out_proj.weight', 'encoder.image_dense.bias', 'encoder.gate_dense.weight', 'encoder.gate_dense.bias', 'encoder.mha_layer.out_proj.bias', 'encoder.mha_layer.in_proj_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Downloading generation_config.json: 0%| | 0.00/142 [00:00<?, ?B/s]
Downloading generation_config.json: 100%|##########| 142/142 [00:00<00:00, 15.8kB/s]

0%| | 0/318150 [00:00<?, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Traceback (most recent call last):
File "C:\Users\pakale\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 748, in convert_to_tensors
tensor = as_tensor(value)
File "C:\Users\pakale\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 720, in as_tensor
return torch.tensor(value)
ValueError: expected sequence of length 577 at dim 1 (got 145)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "...\mm-cot-scienceqa\main.py", line 380, in
T5Trainer(
File "...\mm-cot-scienceqa\main.py", line 269, in T5Trainer
trainer.train()
File "C:\Users\pakale\Anaconda3\lib\site-packages\transformers\trainer.py", line 1591, in train
return inner_training_loop(
File "C:\Users\pakale\Anaconda3\lib\site-packages\transformers\trainer.py", line 1870, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "C:\Users\pakale\Anaconda3\lib\site-packages\accelerate\data_loader.py", line 448, in iter
current_batch = next(dataloader_iter)
File "C:\Users\pakale\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 633, in next
data = self._next_data()
File "C:\Users\pakale\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "C:\Users\pakale\Anaconda3\lib\site-packages\torch\utils\data_utils\fetch.py", line 54, in fetch
return self.collate_fn(data)
File "C:\Users\pakale\Anaconda3\lib\site-packages\transformers\trainer_utils.py", line 737, in call
return self.data_collator(features)
File "C:\Users\pakale\Anaconda3\lib\site-packages\transformers\data\data_collator.py", line 586, in call
features = self.tokenizer.pad(
File "C:\Users\pakale\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 3303, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
File "C:\Users\pakale\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 223, in init
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "C:\Users\pakale\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 764, in convert_to_tensors
raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (image_ids in this case) have excessive nesting (inputs type list where type int is expected).

0%| | 0/318150 [00:01<?, ?it/s]

====Input Arguments====
{
"data_root": "data",
"output_dir": "experiments",
"model": "declare-lab/flan-alpaca-base",
"options": [
"A",
"B",
"C",
"D",
"E"
],
"epoch": 50,
"lr": 5e-05,
"bs": 2,
"input_len": 512,
"output_len": 512,
"eval_bs": 4,
"eval_acc": null,
"train_split": "train",
"val_split": "val",
"test_split": "test",
"use_generate": true,
"final_eval": false,
"user_msg": "rationale",
"img_type": "vit",
"eval_le": null,
"test_le": null,
"evaluate_dir": null,
"caption_file": "data/instruct_captions.json",
"use_caption": true,
"prompt_format": "QCM-E",
"seed": 42
}
img_features size: (11208, 577, 768)
number of train problems: 12726

number of val problems: 4241

number of test problems: 4241

[14:38:56] [Model]: Loading declare-lab/flan-alpaca-base... main.py:66

       [Data]: Reading data...                                   main.py:67

experiments/rationale_declare-lab-flan-alpaca-base_vit_QCM-E_lr5e-05_bs0_op512_ep50
model parameters: 251907840
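
The first ValueError in the traceback above (expected sequence of length 577 at dim 1, got 145) indicates that items within one batch carry feature matrices with different row counts, while the log reports a loaded feature file of shape (11208, 577, 768). One possible cause, stated here as an assumption, is that some problems are assigned a feature matrix of a different shape than the loaded file. The plain-Python sketch below, with hypothetical helper names, shows one way to group batch items by feature shape before they reach the collator.

```python
def nested_shape(value):
    """Return the shape of a rectangular nested list, or raise ValueError if it is ragged."""
    if not isinstance(value, (list, tuple)):
        return ()
    if not value:
        return (0,)
    child_shapes = {nested_shape(v) for v in value}
    if len(child_shapes) != 1:
        raise ValueError(f"ragged input: child shapes {sorted(child_shapes)}")
    return (len(value),) + child_shapes.pop()


def indices_by_feature_shape(batch_image_ids):
    """Map each distinct per-item feature shape to the batch indices that have it."""
    by_shape = {}
    for i, feats in enumerate(batch_image_ids):
        by_shape.setdefault(nested_shape(feats), []).append(i)
    return by_shape


# Two items whose row counts differ, as in the error message (3 columns for brevity).
batch = [
    [[0.0] * 3 for _ in range(577)],
    [[0.0] * 3 for _ in range(145)],
]
print(indices_by_feature_shape(batch))
```

More than one key in the result means the batch cannot be stacked into a single tensor, which is the condition the error reports.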

Tried to run it on Windows, but it ultimately failed...

I'm excited to hear that the text-to-text model is lighter weight than GPT-3, so I tried running it on my PC.

If you're trying to do the same thing as me, the following might help you.

TypeError: linear(): argument 'input' (position 1) must be Tensor, not NoneType

Hi, I tried the Colab notebook but I am getting the error below:
TypeError: linear(): argument 'input' (position 1) must be Tensor, not NoneType

Below is the full error:
`---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_8332\3201768157.py in
----> 1 outputs = model.generate(input_ids, max_length=512) # reads the vision feature if file detacted
2 show_result(outputs)
3 #outputs

~\anaconda3\lib\site-packages\torch\autograd\grad_mode.py in decorate_context(*args, **kwargs)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
28 return cast(F, decorate_context)
29

~\anaconda3\lib\site-packages\transformers\generation\utils.py in generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
1389
1390 # 11. run greedy search
-> 1391 return self.greedy_search(
1392 input_ids,
1393 logits_processor=logits_processor,

~\anaconda3\lib\site-packages\transformers\generation\utils.py in greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
2177
2178 # forward pass to get next token
-> 2179 outputs = self(
2180 **model_inputs,
2181 return_dict=True,

~\anaconda3\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []

~\Desktop\My Projects\mm-cot\model.py in forward(self, input_ids, image_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
116 hidden_states = encoder_outputs[0]
117
--> 118 image_embedding = self.image_dense(image_ids)
119 image_att, _ = self.mha_layer(hidden_states, image_embedding, image_embedding)
120

~\anaconda3\lib\site-packages\torch\nn\modules\linear.py in forward(self, input)
112
113 def forward(self, input: Tensor) -> Tensor:
--> 114 return F.linear(input, self.weight, self.bias)
115
116 def extra_repr(self) -> str:

TypeError: linear(): argument 'input' (position 1) must be Tensor, not NoneType`

Is the issue with Vision features? Can anyone help me debug this?
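
The traceback shows image_ids arriving as None at self.image_dense(image_ids) in model.py, so the linear layer receives no tensor at all. A small guard such as the hypothetical helper below would surface the problem earlier and with a clearer message; it is a debugging sketch, not part of the repository.

```python
def require_image_ids(image_ids):
    """Return image_ids unchanged, or raise a descriptive error if it is missing."""
    if image_ids is None:
        raise ValueError(
            "image_ids is None; the vision features were not passed to the model. "
            "Check that the feature file was loaded and forwarded with the text inputs."
        )
    return image_ids
```

Calling it on the vision features right before generation turns the opaque TypeError into an explicit message about the missing input.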

Implementation Mm-cot

Great work from you and your team. Quick question: we are thinking about using this method with a larger Falcon model. Do you think that would give an even greater performance gap over GPT-3.5? The idea is that if a 1B model can do this, what could a 40B model do?

typo in utils.prompt line 104 and 106

Original code:
    elif output_format == 'AL':
        output = f"Answer: The answer is {answer}. BECAUSE: {solution}"
    elif output_format == 'AE':
        output = f"Answer: The answer is {answer}. BECAUSE: {lecture}"

Should be changed to:
    elif output_format == 'AL':
        output = f"Answer: The answer is {answer}. BECAUSE: {lecture}"
    elif output_format == 'AE':
        output = f"Answer: The answer is {answer}. BECAUSE: {solution}"

If L means {lecture} and E means {solution}
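
Under the reading proposed in this issue (L for lecture, E for solution), the two branches could also be written as a lookup table so that the letter and the field cannot drift apart. This is a standalone sketch with a hypothetical function name, not the code in the repository.

```python
# Which field follows "BECAUSE:" for each output format, per the issue's reading.
RATIONALE_FIELD = {"AL": "lecture", "AE": "solution"}


def build_output(output_format, answer, lecture, solution):
    """Format the target string for the 'AL' and 'AE' output formats."""
    fields = {"lecture": lecture, "solution": solution}
    try:
        rationale = fields[RATIONALE_FIELD[output_format]]
    except KeyError:
        raise ValueError(f"unsupported output_format: {output_format!r}") from None
    return f"Answer: The answer is {answer}. BECAUSE: {rationale}"
```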

Conflict in dependencies required

Installing on Google Colab, I got:

INFO: pip is looking at multiple versions of huggingface-hub to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r mm-cot/requirements.txt (line 6) and huggingface-hub==0.0.12 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested huggingface-hub==0.0.12
    sentence-transformers 2.2.2 depends on huggingface-hub>=0.4.0
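
The conflict comes down to a version comparison: the pinned huggingface-hub 0.0.12 is lower than the 0.4.0 minimum that sentence-transformers 2.2.2 requires. The sketch below reproduces that check with a deliberately simplified parser that handles plain dotted numeric versions only; pip itself follows the full PEP 440 rules.

```python
def parse_version(text):
    """Split a plain dotted version such as '0.4.0' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(pinned, minimum):
    """Return True if the pinned version meets a '>=' requirement."""
    return parse_version(pinned) >= parse_version(minimum)


# The pin from the error message against the requirement of sentence-transformers 2.2.2.
print(satisfies_minimum("0.0.12", "0.4.0"))
```

The check fails for 0.0.12, so no single version of huggingface-hub can satisfy both requirements until one of them is changed.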

Use Caption?

Hi, I saw that you are not using captions. Would using them help?

Request for Release of Multimodal-CoT Large 738M Model

I've recently come across your paper detailing the impressive capabilities of the Multimodal-CoT Large 738M model, particularly its performance across various metrics (95.91, 82.00, 90.82, 95.26, 88.80, 92.89, 92.44, 90.31, and 91.68).

I am writing to inquire about the possibility of its public release because we have noted that the GitHub version, which shows a performance score of 90.45, differs from the one reported in your paper (91.68 performance score). Access to this model could significantly aid in ongoing research and development efforts in our field.

Thank you for your time and your contributions to the field. I look forward to your response and the opportunity to work with this innovative model.

ImportError: cannot import name 'Conv2dSame' from 'timm.models.layers' (unknown location)

When I run the script extract_features.py, I encountered the following error:
ImportError: cannot import name 'Conv2dSame' from 'timm.models.layers' (unknown location)
My Python version is 3.9.12. Do I need to use Python 3.8 or 3.7 instead?
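
The "(unknown location)" part of the message is typically what Python prints when the module it imported from has no file path, for example when a directory without an __init__.py is picked up as a namespace package. As a diagnostic sketch using only the standard library (the helper name is made up for this page), you could check where a module name actually resolves from:

```python
import importlib.util


def describe_module(name):
    """Return (origin, search_locations) for a module, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    return spec.origin, list(spec.submodule_search_locations or [])


# An origin of None with non-empty search locations suggests a namespace package.
print(describe_module("timm"))
```

If the reported location is not the installed timm package, the import is probably being resolved from somewhere else on sys.path, which would be independent of the Python version.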

Can not repro the result

Hi, I followed your guidance and reran the code, but the result turned out like this (lower than the score reported in the paper with code):
{ "eval_accuracy": 0.8490921952369724, "eval_loss": 0.032915953546762466, "eval_runtime": 567.9533, "eval_samples_per_second": 7.467, "eval_steps_per_second": 0.468 }
"acc_natural": "87.52", "acc_social": "77.17", "acc_language": "85.82", "acc_has_text": "87.88", "acc_has_image": "82.90", "acc_no_context": "86.83", "acc_grade_1_6": "84.65", "acc_grade_7_12": "85.37", "acc_average": "84.91"

Did I do something wrong?

Question about two stages training?

Hi, I wonder whether the second-stage fine-tuning starts from the fine-tuned first-stage T5 model or from the initial T5 model?

A Question

In Multimodal Chain-of-Thought Reasoning in Language Models, do text and numbers count as different modalities?

Vision feature of questions that contains more than one image

Hi all, thanks for the awesome work from authors.

We found that some samples in the dataset contain one image for the question and several images for the corresponding choices. However, the paper does not provide details about how visual features are processed in this case. We have discussed this with other researchers, and we guess that only one image is used to generate the vision features. Is that right?

Also, according to the official website, ScienceQA contains 10332 samples that have an image in the question, but the data length in detr.npy is 11208. Is the remainder generated from the images in the choices?

experiments/rationale_allenai-unifiedqa-t5-base_detr_QCM-LE_lr5e-05_bs16_op512_ep20/predictions_ans_eval.json

When I run run_training.sh, I get:

$ bash run_training.sh
args Namespace(data_root='data', output_dir='experiments', model='allenai/unifiedqa-t5-base', options=['A', 'B', 'C', 'D', 'E'], epoch=20, lr=5e-05, bs=8, input_len=512, output_len=512, eval_bs=4, eval_acc=10, train_split='train', val_split='val', test_split='test', use_generate=False, final_eval=True, user_msg='rationale', img_type='detr', eval_le=None, test_le=None, evaluate_dir=None, caption_file='data/captions.json', use_caption=False, prompt_format='QCM-LE', seed=42)
====Input Arguments====
{
"data_root": "data",
"output_dir": "experiments",
"model": "allenai/unifiedqa-t5-base",
"options": [
"A",
"B",
"C",
"D",
"E"
],
"epoch": 20,
"lr": 5e-05,
"bs": 8,
"input_len": 512,
"output_len": 512,
"eval_bs": 4,
"eval_acc": 10,
"train_split": "train",
"val_split": "val",
"test_split": "test",
"use_generate": false,
"final_eval": true,
"user_msg": "rationale",
"img_type": "detr",
"eval_le": null,
"test_le": null,
"evaluate_dir": null,
"caption_file": "data/captions.json",
"use_caption": false,
"prompt_format": "QCM-LE",
"seed": 42
}
img_features size: (11208, 100, 256)
number of train problems: 12726

number of val problems: 4241

number of test problems: 4241

[16:21:38] [Model]: Loading allenai/unifiedqa-t5-base... main.py:68
[Data]: Reading data... main.py:69
Some weights of T5ForMultimodalGeneration were not initialized from the model checkpoint at allenai/unifiedqa-t5-base and are newly initialized: ['gate_dense.bias', 'mha_layer.in_proj_bias', 'mha_layer.in_proj_weight', 'mha_layer.out_proj.weight', 'mha_layer.out_proj.bias', 'gate_dense.weight', 'image_dense.weight', 'image_dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
model parameters: 226643712
***** Running training *****
Num examples = 12726
Num Epochs = 20
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 15920
0%| | 0/15920 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/Workspace/sxk/2023/mm-cot-main/main.py", line 380, in
T5Trainer(
File "/home/Workspace/sxk/2023/mm-cot-main/main.py", line 269, in T5Trainer
trainer.train()
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/transformers/trainer.py", line 1498, in train
return inner_training_loop(
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/transformers/trainer.py", line 1740, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/transformers/trainer.py", line 2470, in training_step
loss = self.compute_loss(model, inputs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/transformers/trainer.py", line 2502, in compute_loss
outputs = model(**inputs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/Workspace/sxk/2023/mm-cot-main/model.py", line 98, in forward
encoder_outputs = self.encoder(
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1035, in forward
layer_outputs = layer_module(
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 666, in forward
self_attention_outputs = self.layer[0](
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 572, in forward
attention_output = self.SelfAttention(
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 498, in forward
query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, dim_per_head)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/anaconda3/envs/s20230223e310mmcot/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

0%| | 0/15920 [00:03<?, ?it/s]args Namespace(data_root='data', output_dir='experiments', model='allenai/unifiedqa-t5-base', options=['A', 'B', 'C', 'D', 'E'], epoch=20, lr=5e-05, bs=8, input_len=512, output_len=64, eval_bs=4, eval_acc=10, train_split='train', val_split='val', test_split='test', use_generate=False, final_eval=True, user_msg='answer', img_type='detr', eval_le='experiments/rationale_allenai-unifiedqa-t5-base_detr_QCM-LE_lr5e-05_bs16_op512_ep20/predictions_ans_eval.json', test_le='experiments/rationale_allenai-unifiedqa-t5-base_detr_QCM-LE_lr5e-05_bs16_op512_ep20/predictions_ans_test.json', evaluate_dir=None, caption_file='data/captions.json', use_caption=False, prompt_format='QCMG-A', seed=42)
====Input Arguments====
{
"data_root": "data",
"output_dir": "experiments",
"model": "allenai/unifiedqa-t5-base",
"options": [
"A",
"B",
"C",
"D",
"E"
],
"epoch": 20,
"lr": 5e-05,
"bs": 8,
"input_len": 512,
"output_len": 64,
"eval_bs": 4,
"eval_acc": 10,
"train_split": "train",
"val_split": "val",
"test_split": "test",
"use_generate": false,
"final_eval": true,
"user_msg": "answer",
"img_type": "detr",
"eval_le": "experiments/rationale_allenai-unifiedqa-t5-base_detr_QCM-LE_lr5e-05_bs16_op512_ep20/predictions_ans_eval.json",
"test_le": "experiments/rationale_allenai-unifiedqa-t5-base_detr_QCM-LE_lr5e-05_bs16_op512_ep20/predictions_ans_test.json",
"evaluate_dir": null,
"caption_file": "data/captions.json",
"use_caption": false,
"prompt_format": "QCMG-A",
"seed": 42
}
img_features size: (11208, 100, 256)
number of train problems: 12726

number of val problems: 4241

number of test problems: 4241

[16:22:05] [Model]: Loading allenai/unifiedqa-t5-base... main.py:68
[Data]: Reading data... main.py:69
Some weights of T5ForMultimodalGeneration were not initialized from the model checkpoint at allenai/unifiedqa-t5-base and are newly initialized: ['gate_dense.bias', 'mha_layer.out_proj.bias', 'gate_dense.weight', 'image_dense.weight', 'mha_layer.in_proj_bias', 'image_dense.bias', 'mha_layer.in_proj_weight', 'mha_layer.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
File "/home/Workspace/sxk/2023/mm-cot-main/main.py", line 380, in
T5Trainer(
File "/home/Workspace/sxk/2023/mm-cot-main/main.py", line 101, in T5Trainer
eval_set = ScienceQADatasetImg(
File "/home/Workspace/sxk/2023/mm-cot-main/utils_data.py", line 165, in init
test_le_data =json.load(open(test_le))["preds"]
FileNotFoundError: [Errno 2] No such file or directory: 'experiments/rationale_allenai-unifiedqa-t5-base_detr_QCM-LE_lr5e-05_bs16_op512_ep20/predictions_ans_eval.json'
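
In the log above, the rationale stage crashed before writing its predictions, so the answer stage then failed on the missing file. A pre-flight check along the following lines, using a hypothetical helper name, would report the missing inputs before the second stage loads the model.

```python
from pathlib import Path


def missing_rationale_files(eval_le, test_le):
    """Return the rationale prediction paths that do not exist on disk."""
    return [p for p in (eval_le, test_le) if not Path(p).is_file()]


# Paths taken from the --eval_le and --test_le values in the log above.
run_dir = "experiments/rationale_allenai-unifiedqa-t5-base_detr_QCM-LE_lr5e-05_bs16_op512_ep20"
missing = missing_rationale_files(
    f"{run_dir}/predictions_ans_eval.json",
    f"{run_dir}/predictions_ans_test.json",
)
if missing:
    print("Run the rationale stage first; missing files:", missing)
```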

Environment
Linux version 3.10.0-693.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Aug 22 21:09:27 UTC 2017

python=3.10.9

conda 4.14.0

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
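
As a sanity check on the training log in this issue, the reported "Total optimization steps" can be reproduced from the example count, the total batch size, and the number of epochs. The sketch assumes one gradient accumulation step (as the log states) and that the last partial batch of each epoch is kept; the function name is made up for this page.

```python
import math


def total_optimization_steps(num_examples, total_batch_size, epochs):
    """Steps per epoch (rounded up to include the last partial batch) times epochs."""
    return math.ceil(num_examples / total_batch_size) * epochs


# 12726 training problems, total batch size 16, 20 epochs, as in the log above.
print(total_optimization_steps(12726, 16, 20))  # 15920
```

The same formula gives 318150 for the earlier log with batch size 2 and 50 epochs, which matches the progress bar shown there.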

torch.cuda.OutOfMemoryError: CUDA out of memory.

GPU Info

$ nvidia-smi
Thu Feb 23 06:54:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

command to run

CUDA_VISIBLE_DEVICES=0 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 8 --eval_bs 4 --eval_acc 10 --output_len 512 \
    --final_eval --prompt_format QCM-LE

error message

[06:54:23] [Model]: Loading allenai/unifiedqa-t5-base...                                                                                                                                                                                                            main.py:68

           [Data]: Reading data...                                                                                                                                                                                                                                  main.py:69

Some weights of T5ForMultimodalGeneration were not initialized from the model checkpoint at allenai/unifiedqa-t5-base and are newly initialized: ['mha_layer.out_proj.weight', 'image_dense.weight', 'mha_layer.in_proj_bias', 'image_dense.bias', 'mha_layer.in_proj_weight', 'gate_dense.bias', 'mha_layer.out_proj.bias', 'gate_dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
model parameters:  226643712
***** Running training *****
  Num examples = 12726
  Num Epochs = 20
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 31820
  0%|                                                                                                                                                                                                                                               | 0/31820 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/test/deploy/mm-cot/main.py", line 380, in <module>
    T5Trainer(
  File "/home/test/deploy/mm-cot/main.py", line 269, in T5Trainer
    trainer.train()
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/trainer.py", line 1498, in train
    return inner_training_loop(
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/trainer.py", line 1740, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/trainer.py", line 2470, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/trainer.py", line 2502, in compute_loss
    outputs = model(**inputs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/deploy/mm-cot/model.py", line 144, in forward
    decoder_outputs = self.decoder(
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1035, in forward
    layer_outputs = layer_module(
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 692, in forward
    cross_attention_outputs = self.layer[1](
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 606, in forward
    attention_output = self.EncDecAttention(
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 535, in forward
    attn_weights = nn.functional.dropout(
  File "/home/test/deploy/mm-cot/venv/lib/python3.9/site-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 11.17 GiB total capacity; 10.70 GiB already allocated; 20.25 MiB free; 10.80 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|

OverflowError: out of range integral type conversion attempted

I get the following error:

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
100% 1061/1061 [2:04:03<00:00, 3.45s/it]Traceback (most recent call last):
File "/content/mm-cot/main.py", line 395, in <module>
T5Trainer(
File "/content/mm-cot/main.py", line 284, in T5Trainer
metrics = trainer.evaluate(eval_dataset = test_set, max_length=args.output_len)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer_seq2seq.py", line 159, in evaluate
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3043, in evaluate
output = eval_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3343, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
File "/content/mm-cot/main.py", line 215, in compute_metrics_rougel
preds = tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=True)
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3469, in batch_decode
return [
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3470, in <listcomp>
self.decode(
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3509, in decode
return self._decode(
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 546, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted

I get this when I run inference for rationale generation:

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
  --data_root data/ScienceQA/data \
  --caption_file data/instruct_captions.json \
  --model declare-lab/flan-alpaca-large \
  --user_msg rationale --img_type vit \
  --bs 2 --eval_bs 4  --epoch 50 --lr 5e-5 --output_len 512 \
  --use_caption --use_generate --prompt_format QCM-E \
  --output_dir experiments \
  --evaluate_dir models/mm-cot-large-rationale

This happens after the 1061 iterations have completed. As a consequence, it does not generate experiments/rationale_declare-lab-flan-alpaca-large_vit_QCM-E_lr5e-05_bs8_op512_ep50/predictions_ans_eval.json, which the answer inference stage expects as input.
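
For anyone else tracing the missing file: the directory name in that path looks like it is assembled from the command-line arguments. The sketch below reconstructs it. The function name and the "batch size times GPU count" rule are my inference from the paths quoted in this thread (`--bs 2` on four GPUs gives `bs8`), not something taken from main.py.

```python
def experiment_dir(output_dir, user_msg, model, img_type, prompt_format,
                   lr, bs, n_gpus, output_len, epoch):
    """Rebuild the experiment directory name seen in the error messages (inferred pattern)."""
    model_name = model.replace("/", "-")   # "declare-lab/flan-alpaca-large" -> "declare-lab-flan-alpaca-large"
    return (f"{output_dir}/{user_msg}_{model_name}_{img_type}_{prompt_format}"
            f"_lr{lr}_bs{bs * n_gpus}_op{output_len}_ep{epoch}")

# Arguments copied from the command above; n_gpus=4 matches CUDA_VISIBLE_DEVICES=0,1,2,3.
expected = experiment_dir("experiments", "rationale", "declare-lab/flan-alpaca-large",
                          "vit", "QCM-E", 5e-5, 2, 4, 512, 50)
print(expected + "/predictions_ans_eval.json")
```

If the pattern holds, running the two stages with a different number of visible GPUs would change the directory name, which may be worth checking when the answer stage cannot find the rationale predictions.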

D2L

Hugging Face spiece.model private

I am unable to run the rationale generation command for inference. I get the error below.

raise RepositoryNotFoundError(
transformers.utils.hub.RepositoryNotFoundError: 401 Client Error: Repository not found for url: https://huggingface.co/models/rationale/resolve/main/spiece.model. If the repo is private, make sure you are authenticated.

raise EnvironmentError(
OSError: models/rationale is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True

Multi-label classification does not work.

I am trying to use it for zero-shot image classification. For an image in which both Adidas and Nike are present and a text input of ["Adidas", "Nike"], the output is "Adidas". Ideally, both text labels should have been picked up.

How are the vision features generated here? How to view detr.npy and clip.npy images

I need help understanding how the vision features are generated for this research.
I tried viewing the contents of detr.npy, clip.npy, etc. with Image and matplotlib to understand what they are, but could not view them meaningfully.

I would appreciate some help understanding this.
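
A note that may help here: judging by the shapes printed in other issues (for example `img_features size: (11208, 100, 256)` for detr), the .npy files appear to hold per-image feature arrays rather than pixel data, which would explain why they do not render as pictures. The sketch below shows one way to inspect such a file with NumPy. It writes a small random stand-in file so that it runs anywhere; the file name and the stand-in data are assumptions, not the real features.

```python
import os
import tempfile

import numpy as np

# Stand-in for a downloaded feature file. The real detr features reported in this
# thread have shape (11208, 100, 256); a tiny random array keeps the sketch runnable.
path = os.path.join(tempfile.mkdtemp(), "detr_demo.npy")
np.save(path, np.random.rand(4, 100, 256).astype(np.float32))

features = np.load(path)
print(features.shape, features.dtype)   # (num_images, vectors_per_image, feature_dim)

# One image's entry is a (vectors_per_image, feature_dim) matrix of floats,
# so summary statistics say more than plotting it as an image would.
first = features[0]
print(first.shape, float(first.min()), float(first.max()))
```

To inspect the real files, point `path` at the downloaded detr.npy or clip.npy instead of the stand-in.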

requirements specification

Hi, while trying to run rationale generation inference, I hit this first issue:

self.mha_layer = torch.nn.MultiheadAttention(embed_dim=config.hidden_size, kdim=config.hidden_size, vdim=config.hidden_size, num_heads=1, batch_first=True) 
TypeError: __init__() got an unexpected keyword argument 'batch_first'

I then commented out the offending parameter and ran into this second issue:

File "/home/l1094547/.conda/envs/vmmcot/lib/python3.8/site-packages/torch/nn/functional.py", line 4079, in multi_head_attention_forward
    k = k.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1)      
RuntimeError: shape '[-1, 512, 768]' is invalid for input of size 307200

I believe the real problem is that my torch version is not the required one. Could you add it to the requirements? The usual conda YAML file would be perfect, but simply knowing your torch version might do the trick.

Thanks a lot for your work
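
As far as I know, the `batch_first` argument of `torch.nn.MultiheadAttention` was introduced in PyTorch 1.9.0, so the first error would be expected on older builds. The check below is a minimal sketch of how to compare a version string against that threshold. The helper names are mine, and the 1.9.0 threshold is my recollection and worth verifying against the PyTorch release notes.

```python
import re

def version_tuple(version):
    """Turn '1.13.1+cu117' into (1, 13, 1); local suffixes such as '+cu117' are ignored."""
    numbers = [int(n) for n in re.findall(r"\d+", version.split("+")[0])[:3]]
    numbers += [0] * (3 - len(numbers))   # pad '1.9' to (1, 9, 0)
    return tuple(numbers)

def supports_batch_first(torch_version, minimum="1.9.0"):
    """True if the version is at least the assumed minimum for batch_first."""
    return version_tuple(torch_version) >= version_tuple(minimum)

for candidate in ("1.8.1", "1.9.0", "1.13.1+cu117"):
    print(candidate, supports_batch_first(candidate))
```

In a real environment one would pass `torch.__version__` to `supports_batch_first`.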

Can not train on GPU.

I tried running the training command you provided. On my machine, it only trained on the CPU and not on the GPU.
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
  --data_root data/ScienceQA/data \
  --caption_file data/instruct_captions.json \
  --model declare-lab/flan-alpaca-large \
  --user_msg rationale --img_type vit \
  --bs 2 --eval_bs 4 --epoch 50 --lr 5e-5 --output_len 512 \
  --use_caption --use_generate --prompt_format QCM-E \
  --output_dir experiments

killed during inference

Hi Expert, can you give me some guidance?
I have tuned some parameters, but it still does not work.

Code:


CUDA_VISIBLE_DEVICES=0 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 1 --eval_bs 1 --eval_acc 5 --output_len 256 \
    --final_eval --prompt_format QCM-LE \
    --evaluate_dir models/MM-CoT-UnifiedQA-base-Rationale

MY env:

ERROR:

Validation (prediction) phase, server jammed.

I can train the model in the first phase, but when it comes to validation, the server gets stuck.

The server configuration is as follows:

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
Stepping: 7
CPU MHz: 1408.689
CPU max MHz: 3500.0000
CPU min MHz: 1000.0000

GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
| 30%   41C    P8    35W / 350W |  17947MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:86:00.0 Off |                  N/A |
| 30%   37C    P8    32W / 350W |  13549MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

If you run the program in the state shown above, it hangs with high CPU and memory usage and low GPU usage.

requirements.txt refers to both nltk==3.5 and nltk==3.8.

Firstly, thank you for sharing your work.

requirements.txt refers to both nltk==3.5 and nltk==3.8.

This causes an error when installing from requirements.txt.

For now I am trying to proceed by removing nltk==3.5 from requirements.txt.
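
In case it is useful for spotting this kind of conflict before running pip, here is a small sketch that scans requirement lines for packages pinned to more than one version. The function name and the sample list are hypothetical; only the two nltk pins come from this issue, and the other entry is a made-up placeholder.

```python
import re
from collections import defaultdict

def conflicting_pins(lines):
    """Return {package: sorted versions} for packages pinned with == to more than one version."""
    pins = defaultdict(set)
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments and surrounding whitespace
        match = re.fullmatch(r"([A-Za-z0-9_.\-]+)==(\S+)", line)
        if match:
            pins[match.group(1).lower()].add(match.group(2))
    return {name: sorted(versions) for name, versions in pins.items() if len(versions) > 1}

# Hypothetical excerpt: "examplepkg" is a placeholder, the nltk lines are from the issue.
sample = ["examplepkg==1.0", "nltk==3.5", "nltk==3.8  # duplicate pin"]
print(conflicting_pins(sample))
```

To check the real file, pass `open("requirements.txt").read().splitlines()` instead of `sample`.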

Question about vision feature extractor

Hi authors, first of all, thanks for your amazing work on the ScienceQA dataset!

I have a question about the image processing details in your work. I'm not very familiar with computer vision, so I'm a bit confused about how you processed these images. Did you use CLIPFeatureExtractor or DetrFeatureExtractor, or did you just use the last hidden state of DetrModel? And could you share the image processing code if possible?

Thanks!

T5ForMultimodalGeneration Inference

I was trying to use the model for inference, but that isn't supported yet, right?

Maybe I am overcomplicating this, but the way I see it, one would have to change the model.generate() method to work with T5ForMultimodalGeneration because of the additional input argument (image_ids). At least that's what I tried to do, but I haven't succeeded yet and thought it would be better to ask before spending more time on debugging.

Cheers

RuntimeError: shape '[8, 512, 768]' is invalid for input of size 614400

When running the indicated command for rationale training:

CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 8 --eval_bs 4 --eval_acc 10 --output_len 512 \
    --final_eval --prompt_format QCM-LE

It leads to the following error:

model parameters:  226643712
***** Running Evaluation *****
  Num examples = 4241
  Batch size = 4
Traceback (most recent call last):
  File "main.py", line 380, in <module>
    T5Trainer(
  File "main.py", line 272, in T5Trainer
    metrics = trainer.evaluate(eval_dataset = test_set)
  File "x/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 79, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "x/lib/python3.8/site-packages/transformers/trainer.py", line 2758, in evaluate
    output = eval_loop(
  File "x/lib/python3.8/site-packages/transformers/trainer.py", line 2936, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)      
  File "x/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 168, in prediction_step
    return super().prediction_step(
  File "x/lib/python3.8/site-packages/transformers/trainer.py", line 3177, in prediction_step
    loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
  File "x/lib/python3.8/site-packages/transformers/trainer.py", line 2502, in compute_loss
    outputs = model(**inputs)
  File "x/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "x/mm-cot/model.py", line 119, in forward
    image_att, _ = self.mha_layer(hidden_states, image_embedding, image_embedding)
  File "x/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "x/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 1153, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "x/lib/python3.8/site-packages/torch/nn/functional.py", line 5122, in multi_head_attention_forward
    k = k.contiguous().view(k.shape[0], bsz * num_heads, head_dim).transpose(0, 1)
RuntimeError: shape '[8, 512, 768]' is invalid for input of size 614400

When running the rationale inference command :

CUDA_VISIBLE_DEVICES=0,1 python main.py     --model allenai/unifiedqa-t5-base     --user_msg rationale --img_type detr     --bs 8 --eval_bs 4 --eval_acc 10 --output_len 512     --final_eval --prompt_format QCM-LE     --evaluate_dir models/MM-CoT-UnifiedQA-base-Rationale

I encounter a similar issue :

File "x/lib/python3.8/site-packages/torch/nn/functional.py", line 5122, in multi_head_attention_forward
    k = k.contiguous().view(k.shape[0], bsz * num_heads, head_dim).transpose(0, 1)
RuntimeError: shape '[4, 512, 768]' is invalid for input of size 307200

I followed each data processing step indicated in the README, though.

Thanks in advance for your help with this issue.
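
One observation on the numbers, offered as an inference rather than a confirmed diagnosis: they are what you would get if the attention layer read the batch-first tensors as sequence-first, as one would expect when `batch_first` is not honored. In that layout a text tensor of shape (8, 512, 768) is read as target length 8 and batch size 512, so the key is reshaped to (8, 512, 768), while the image tensor only holds 8 x 100 x 768 values. The constants below are taken from the logs in this thread; the helper names are mine.

```python
EMBED_DIM = 768       # last dimension in both error messages (T5-base hidden size)
TEXT_LEN = 512        # input_len in the argument dumps in this thread
IMAGE_VECTORS = 100   # per-image vectors in the detr features, shape (11208, 100, 256)

def key_elements(batch):
    """Elements in a batch-first image tensor of shape (batch, IMAGE_VECTORS, EMBED_DIM)."""
    return batch * IMAGE_VECTORS * EMBED_DIM

def view_elements(batch):
    """Elements required by the view (batch, TEXT_LEN, EMBED_DIM) named in the error."""
    return batch * TEXT_LEN * EMBED_DIM

for batch in (8, 4):
    # The first number matches "input of size ..." in the errors; the second is what the view needs.
    print(batch, key_elements(batch), view_elements(batch))
```

The first column reproduces the 614400 and 307200 from the two tracebacks, so checking the installed torch version against the one the authors used seems like a reasonable next step.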

BUG: FileNotFoundError: [Errno 2] No such file or directory: 'vision_features/name_map.json'

Thank you for your great work and code. I tried it but got this error. Can you help me solve it? It seems the JSON files are missing.

CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 8 --eval_bs 4 --eval_acc 10 --output_len 512 \
    --final_eval --prompt_format QCM-LE

args Namespace(bs=8, caption_file='data/captions.json', data_root='data', epoch=20, eval_acc=10, eval_bs=4, eval_le=None, evaluate_dir=None, final_eval=True, img_type='detr', input_len=512, lr=5e-05, model='allenai/unifiedqa-t5-base', options=['A', 'B', 'C', 'D', 'E'], output_dir='experiments', output_len=512, prompt_format='QCM-LE', seed=42, test_le=None, test_split='test', train_split='train', use_caption=False, use_generate=False, user_msg='rationale', val_split='val')
====Input Arguments====
{
"data_root": "data",
"output_dir": "experiments",
"model": "allenai/unifiedqa-t5-base",
"options": [
"A",
"B",
"C",
"D",
"E"
],
"epoch": 20,
"lr": 5e-05,
"bs": 8,
"input_len": 512,
"output_len": 512,
"eval_bs": 4,
"eval_acc": 10,
"train_split": "train",
"val_split": "val",
"test_split": "test",
"use_generate": false,
"final_eval": true,
"user_msg": "rationale",
"img_type": "detr",
"eval_le": null,
"test_le": null,
"evaluate_dir": null,
"caption_file": "data/captions.json",
"use_caption": false,
"prompt_format": "QCM-LE",
"seed": 42
}
Traceback (most recent call last):
File "main.py", line 374, in <module>
problems, qids, name_maps, image_features = load_data_img(args) # probelms, test question ids, shot example ids
File "/root/mm-cot/utils_data.py", line 39, in load_data_img
name_maps = json.load(open('vision_features/name_map.json'))
FileNotFoundError: [Errno 2] No such file or directory: 'vision_features/name_map.json'
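
A quick pre-flight check can make this kind of failure easier to read than a traceback. The sketch below lists which expected input files are absent. `vision_features/name_map.json` comes from the traceback above; the `<img_type>.npy` naming is an assumption based on the detr.npy and clip.npy files mentioned in other issues, and the function name is hypothetical.

```python
import os
import tempfile
from pathlib import Path

def missing_inputs(root, img_type):
    """List expected input files that are absent under `root`."""
    expected = [
        Path(root) / "vision_features" / "name_map.json",    # path from the traceback
        Path(root) / "vision_features" / f"{img_type}.npy",  # assumed naming, e.g. detr.npy
    ]
    return [str(path) for path in expected if not path.is_file()]

# Demo in a throwaway directory: first nothing exists, then only name_map.json does.
demo_root = tempfile.mkdtemp()
before = missing_inputs(demo_root, "detr")
os.makedirs(os.path.join(demo_root, "vision_features"))
Path(demo_root, "vision_features", "name_map.json").write_text("{}")
after = missing_inputs(demo_root, "detr")
print(before)
print(after)
```

Calling `missing_inputs(".", "detr")` from the repository root before starting training would report the files that still need to be downloaded and unzipped.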

Question: PC requirements

What are the minimum disk and RAM requirements?

CUDA out of memory during training

Description:
I encountered a CUDA out of memory error while training my model on one 3090. I ran the following command on the terminal:

CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 8 --eval_bs 4 --eval_acc 10 --output_len 512 \
    --final_eval --prompt_format QCM-LE

The input arguments and the error message are shown below:

====Input Arguments====
{
"data_root": "data",
"output_dir": "experiments",
"model": "allenai/unifiedqa-t5-base",
"options": [
"A",
"B",
"C",
"D",
"E"
],
"epoch": 20,
"lr": 5e-05,
"bs": 8,
"input_len": 512,
"output_len": 512,
"eval_bs": 4,
"eval_acc": 10,
"train_split": "train",
"val_split": "val",
"test_split": "test",
"use_generate": false,
"final_eval": true,
"user_msg": "rationale",
"img_type": "detr",
"eval_le": null,
"test_le": null,
"evaluate_dir": null,
"caption_file": "data/captions.json",
"use_caption": false,
"prompt_format": "QCM-LE",
"seed": 42
}
img_features size: (11208, 100, 256)
number of train problems: 12726
number of val problems: 4241
number of test problems: 4241

...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 1; 23.70 GiB total capacity; 896.00 KiB already allocated; 2.69 MiB free; 2.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
It seems that the program ran out of memory while allocating 96.00 MiB on GPU 1. The GPU has a total capacity of 23.70 GiB, and only 2.69 MiB free memory was available at the time. The error message suggests trying to set max_split_size_mb to avoid fragmentation.

Is there any way to run on a single 3090? I would like to know how many GPUs are needed to train this model. Thank you.
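
One thing that stands out in the message: PyTorch itself had allocated under 1 MiB on GPU 1, yet only 2.69 MiB was free. The sketch below parses the sizes out of the message and computes how much memory was neither free nor reserved by PyTorch. That remainder is likely held by other processes (plus this process's CUDA context), which is my interpretation and not something the log states. The parser and its name are hypothetical helpers, not part of mm-cot or PyTorch.

```python
import re

UNITS_IN_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}

def parse_oom(message):
    """Extract the sizes (converted to MiB) that PyTorch prints in a CUDA OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
        "total": r"([\d.]+) (KiB|MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (KiB|MiB|GiB) already allocated",
        "free": r"([\d.]+) (KiB|MiB|GiB) free",
        "reserved": r"([\d.]+) (KiB|MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * UNITS_IN_MIB[match.group(2)]
    return sizes

# Message text copied from the error above.
message = ("CUDA out of memory. Tried to allocate 96.00 MiB (GPU 1; 23.70 GiB total capacity; "
           "896.00 KiB already allocated; 2.69 MiB free; 2.00 MiB reserved in total by PyTorch)")
sizes = parse_oom(message)

# Memory that is neither free nor reserved by this process's PyTorch allocator.
held_elsewhere = sizes["total"] - sizes["free"] - sizes["reserved"]
print(f"held outside the PyTorch allocator: {held_elsewhere / 1024:.2f} GiB")
```

If that reading is right, checking nvidia-smi for other jobs on GPU 1, or setting CUDA_VISIBLE_DEVICES to a single idle GPU, may be more relevant here than tuning max_split_size_mb.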

How to use the mm-cot framework as a utility library with a local LLM?

Hi! Much appreciation for the excellent work!

I am working on a vision-QA task using BLIP2, which consists of three modules:
ViT, which extracts vision features
Q-Former, which narrows the gap between the vision and language modalities
T5-XXL, which receives the question and the Q-Former output to generate answers.

I wonder if it's possible to use mm-cot as a utility library in the BLIP2 model to enhance vision-QA inference?

Running `extract_caption.py` produces a lot of garbled text. Should the models from `https://huggingface.co/Salesforce/instructblip-vicuna-7b/tree/main` be put in the `llm` folder?

"blip2_vicuna_instruct" can't find lead to nonetype

File "/home/hiccup/Desktop/mm-cot-main/extract_caption.py", line 10, in <module>
model, vis_processors, _ = load_model_and_preprocess(name="blip2_vicuna_instruct", model_type="vicuna7b", is_eval=True, device=device)
File "/home/hiccup/app/miniconda/lib/python3.9/site-packages/lavis/models/__init__.py", line 195, in load_model_and_preprocess
model = model_cls.from_pretrained(model_type=model_type)
AttributeError: 'NoneType' object has no attribute 'from_pretrained'

Where is the main_central.py

Thank you for sharing your nice work.

Where is the main_central.py?

where is the code for a-okvqa?

The A-OKVQA results are reported in the paper, however, I did not find related code in the repository or did I miss it?

How to train

How do I prepare the dataset for training? Where is the documentation?

[17:28:39] [Model]: Loading declare-lab/flan-alpaca-large...

Thanks for the great work,
I am trying to reproduce the work of this paper on Google Colab with a 166 GB disk and a T4 GPU, but at the training stage for both rationale generation and answer inference I get this output:

2023-09-29 17:27:49.955571: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
args Namespace(data_root='/content/mm-cot/data', output_dir='/content/mm-cot/experiments', model='declare-lab/flan-alpaca-large', options=['A', 'B', 'C', 'D', 'E'], epoch=50, lr=5e-05, bs=2, input_len=512, output_len=512, eval_bs=4, eval_acc=None, train_split='train', val_split='val', test_split='test', use_generate=True, final_eval=False, user_msg='rationale', img_type='vit', eval_le=None, test_le=None, evaluate_dir=None, caption_file='data/instruct_captions.json', use_caption=True, prompt_format='QCM-E', seed=42)
====Input Arguments====
{
  "data_root": "/content/mm-cot/data",
  "output_dir": "/content/mm-cot/experiments",
  "model": "declare-lab/flan-alpaca-large",
  "options": [
    "A",
    "B",
    "C",
    "D",
    "E"
  ],
  "epoch": 50,
  "lr": 5e-05,
  "bs": 2,
  "input_len": 512,
  "output_len": 512,
  "eval_bs": 4,
  "eval_acc": null,
  "train_split": "train",
  "val_split": "val",
  "test_split": "test",
  "use_generate": true,
  "final_eval": false,
  "user_msg": "rationale",
  "img_type": "vit",
  "eval_le": null,
  "test_le": null,
  "evaluate_dir": null,
  "caption_file": "data/instruct_captions.json",
  "use_caption": true,
  "prompt_format": "QCM-E",
  "seed": 42
}
img_features size:  torch.Size([11208, 145, 1024])
number of train problems: 12726

number of val problems: 4241

number of test problems: 4241

[17:28:39] [Model]: Loading declare-lab/flan-alpaca-large...

Then the cell stops and the experiments folder is empty. Can anyone explain what the problem is? (I am still new to the field.)

Where is Gold Rationale from?

Hi authors, thanks for your great work.
I have one question about the phase 1 process: how do you train $R$ using $\{X_{language}, X_{vision}\}$?
I see that Figure 2 shows a 'Gold Rationale'. So do you construct some rationales as ground truth to guide the phase 1 training process?

Thanks!

I can't find main_central.py.

Hello, why can't I find main_central.py?

Question on fine-tuning time

Thank you for sharing the paper and code.
While reading the Experimental Settings section in 5.2 Implementation, I had a question about fine-tuning time.

Could you please let me know the approximate fine-tuning time for Multimodal-CoT, if you remember?

I am trying to understand the paper and code in order to re-implement them.
However, due to limited computing resources (no multi-GPU setup), I have to use cloud services.
This has led me to estimate the fine-tuning time, as cloud providers charge by the hour.

OverflowError: can't convert negative int to unsigned

I get the following error:
File "E:\workspace-py\BigModel\mm-cot\main.py", line 380, in <module>
T5Trainer(
File "E:\workspace-py\BigModel\mm-cot\main.py", line 272, in T5Trainer
metrics = trainer.evaluate(eval_dataset = test_set, max_length=args.output_len)
File "E:\software.conda\ai\lib\site-packages\transformers\trainer_seq2seq.py", line 159, in evaluate
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "E:\software.conda\ai\lib\site-packages\transformers\trainer.py", line 3043, in evaluate
output = eval_loop(
File "E:\software.conda\ai\lib\site-packages\transformers\trainer.py", line 3343, in evaluation_loop
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
File "E:\workspace-py\BigModel\mm-cot\main.py", line 204, in compute_metrics_rougel
preds = tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=True)
File "E:\software.conda\ai\lib\site-packages\transformers\tokenization_utils_base.py", line 3469, in batch_decode
return [
File "E:\software.conda\ai\lib\site-packages\transformers\tokenization_utils_base.py", line 3470, in <listcomp>
self.decode(
File "E:\software.conda\ai\lib\site-packages\transformers\tokenization_utils_base.py", line 3509, in decode
return self._decode(
File "E:\software.conda\ai\lib\site-packages\transformers\tokenization_utils_fast.py", line 546, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: can't convert negative int to unsigned

I get this when I run inference for rationale generation:
python main.py --data_root data
--caption_file data/instruct_captions.json
--model allenai/unifiedqa-t5-base
--user_msg rationale
--img_type detr
--bs 2
--eval_bs 4
--epoch 50
--lr 5e-5
--output_len 512
--use_caption
--use_generate
--prompt_format QCM-E
--output_dir experiments
--evaluate_dir models/mm-cot-large-rationale

This happens after those 1061 iterations are completed.
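
A commonly reported cause of this error, and of the "out of range integral type conversion attempted" variant earlier in this list, is that the predictions array passed to `batch_decode` contains -100, the value the Hugging Face Trainer uses as padding when it gathers predictions and labels. I have not verified that this is what happens here. The sketch below shows the usual workaround of replacing -100 with the pad token id before decoding. The token ids in the example are arbitrary, and `PAD_TOKEN_ID = 0` stands in for `tokenizer.pad_token_id`.

```python
import numpy as np

PAD_TOKEN_ID = 0      # T5's pad id; in main.py this would be tokenizer.pad_token_id
IGNORE_INDEX = -100   # padding value used when predictions are gathered

def sanitize_predictions(preds):
    """Replace the negative placeholder ids so the tokenizer can decode the array."""
    preds = np.asarray(preds)
    return np.where(preds == IGNORE_INDEX, PAD_TOKEN_ID, preds)

# Arbitrary illustrative ids, padded with -100 as the Trainer would do.
batch = np.array([[71, 1525, 1, -100, -100],
                  [37, 1, -100, -100, -100]])
clean = sanitize_predictions(batch)
print(clean)
```

Under that assumption, applying the same replacement to `preds` at the top of `compute_metrics_rougel` (the frame shown in the traceback) would be the place to try it.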

Question :The code to generate Vision Features

I would like to study the vision features further. Would it be possible to share the code that generates the .npy files?
I much appreciate the hard work here.

KeyError: 'true_false'

CUDA_VISIBLE_DEVICES=0,1 python main.py \
    --model allenai/unifiedqa-t5-base \
    --user_msg rationale --img_type detr \
    --bs 4 --eval_bs 2 --eval_acc 10 --output_len 512 \
    --final_eval --prompt_format QCM-LE

Running the command above reports an error. Which part of the data contains true_false?

Traceback (most recent call last):
File "/environment/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'true_false'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "main.py", line 381, in <module>
args = args
File "main.py", line 312, in T5Trainer
scores = get_scores(results_ans, results_rationale, results_reference, os.path.join(args.data_root, "scienceqa/problems.json"))
File "/home/featurize/work/mm-cot/mm-cot-main/utils_evaluate.py", line 54, in get_scores
print(res_pd['true_false'])
File "/environment/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3458, in __getitem__
indexer = self.columns.get_loc(key)
File "/environment/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 'true_false'

Demo usage

Thank you for the great work on Multimodal Chain of Thought and for open-sourcing the code! The results are really impressive. I was wondering if there is a Colab notebook or example script for trying this work on demo images and text?
