
promptcap's Introduction

PromptCap

This repository contains the code and models for our paper PromptCap: Prompt-Guided Task-Aware Image Captioning. Please refer to the project page for a quick overview. This paper was also accepted to ICCV 2023 under the title PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3.

Replicating results

Since Codex has been deprecated, it is hard to replicate the PromptCap results exactly. For ease of use, we release all of our logs, including the prompts we give to GPT-3 (Codex) and GPT-3's answers for each question in OK-VQA and A-OKVQA, in Evaluation Logs.
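For a quick look at these logs, a minimal sketch like the one below can be used. The file name OKVQA_val_gpt3.json is one of the released log files; the per-entry fields are an assumption, so check the actual files in Evaluation Logs for the exact schema.

import json

# Minimal sketch for inspecting the released evaluation logs.
# The per-entry fields (question id, prompt, GPT-3 answer) are assumptions;
# check the actual files in Evaluation Logs for the exact schema.
with open("OKVQA_val_gpt3.json") as f:
    logs = json.load(f)

entries = logs if isinstance(logs, list) else list(logs.items())
print(f"{len(entries)} entries")
print(entries[0])  # inspect one entry to see the stored prompt and answer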

QuickStart

Installation

pip install promptcap

Two pipelines are included. One is for image captioning, and the other is for visual question answering.

Captioning Pipeline

Please follow the prompt format shown below, which gives the best performance.

Generate a prompt-guided caption as follows:

import torch
from promptcap import PromptCap

model = PromptCap("tifa-benchmark/promptcap-coco-vqa")  # also supports OFA checkpoints, e.g. "OFA-Sys/ofa-large"

if torch.cuda.is_available():
  model.cuda()

prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))

To try generic captioning, just use "what does the image describe?"

prompt = "what does the image describe?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))

PromptCap also supports taking OCR inputs:

prompt = "please describe this image according to the given question: what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(model.caption(prompt, image, ocr))

Visual Question Answering Pipeline

Notice: this is not the pipeline we used for the paper. Please refer to the Replicating Results section for our GPT-3 results.

Unlike typical VQA models, which perform classification on VQAv2, PromptCap is open-domain and can be paired with arbitrary text-QA models. Here we provide a pipeline that combines PromptCap with UnifiedQA.

import torch
from promptcap import PromptCap_VQA

# The QA model supports all UnifiedQA variants, e.g. "allenai/unifiedqa-v2-t5-large-1251000"
vqa_model = PromptCap_VQA(promptcap_model="tifa-benchmark/promptcap-coco-vqa", qa_model="allenai/unifiedqa-t5-base")

if torch.cuda.is_available():
  vqa_model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(vqa_model.vqa(question, image))

Similarly, PromptCap supports OCR inputs:

question = "what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(vqa_model.vqa(question, image, ocr=ocr))

Because of the flexibility of UnifiedQA, PromptCap also supports multiple-choice VQA:

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
choices = ["gloves", "socks", "shoes", "coats"]
print(vqa_model.vqa_multiple_choice(question, image, choices))

Reference code for re-training PromptCap

We provide the original code we used for PromptCap in Original Codes. Note that this is not a runnable pipeline, because Codex and the OpenAI text-completion API are deprecated, and the CLIP embeddings for the whole COCO dataset are too large to release. Nevertheless, it is still valuable for follow-up work to know the details of our implementation.

  1. Training data generation: refer to Original Codes/promptcap-gen for how we generate the prompt-guided caption training data with Codex.
  2. Training data filtering: refer to Original Codes/example-filtering for how we filter the training data.
  3. Training PromptCap with GPT-3 synthesized data:

We release the training data synthesized by Codex in vqa2_train_1010.zip. To train PromptCap from OFA, first process the data according to add a task and then fine-tune according to how to train. As the field is developing so quickly, we recommend training PromptCap with newer vision-language models, such as BLIP-2 and LLaVA.

  4. Inference with GPT-3: the prompt logs are in Evaluation Logs. For the construction of the prompts, refer to Original Codes/GPT3-inference.
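For illustration, the GPT-3 inference step is roughly a few-shot prompt built from PromptCap captions and question-answer pairs. The sketch below is only an assumed approximation of that format; the instruction wording and field names are ours, not the exact prompt from the paper, which is in Original Codes/GPT3-inference and the released logs.

# Rough sketch of few-shot prompt construction for GPT-3 inference.
# The instruction wording and example fields below are assumptions;
# refer to Original Codes/GPT3-inference and Evaluation Logs for the
# exact prompts used in the paper.

def build_prompt(in_context_examples, test_caption, test_question):
    prompt = "Please answer the question according to the context.\n\n"
    for ex in in_context_examples:  # (caption, question, answer) triples
        prompt += f"Context: {ex['caption']}\n"
        prompt += f"Question: {ex['question']}\n"
        prompt += f"Answer: {ex['answer']}\n\n"
    prompt += f"Context: {test_caption}\n"
    prompt += f"Question: {test_question}\n"
    prompt += "Answer:"
    return prompt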

Bibtex

@article{hu2022promptcap,
  title={PromptCap: Prompt-Guided Task-Aware Image Captioning},
  author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo},
  journal={arXiv preprint arXiv:2211.09699},
  year={2022}
}

promptcap's People

Contributors

jonathanrayner, yushi-hu

promptcap's Issues

Fine-Tuning?

How can we fine-tune this model? Is there any script available?

VQA samples for in-context learning during inference

Hi, thanks for your nice work. I'm wondering what the source of the VQA samples used for in-context learning during OK-VQA evaluation is. Is it the VQAv2 training set, the OK-VQA training set, or the OK-VQA validation set? I would appreciate any information about this.

Reproduce paper results

Hi!

Thanks for the repository and the paper (cool idea!).

I was wondering how I can reproduce your results with either GPT-3 or Flan-T5. What you show in the README uses UnifiedQA, which, as far as I can see, works without any few-shot demonstrations and, in my experiments, performs significantly worse on OK-VQA (around 32%, even with a larger T5 than the one you use in the README).
Would I need to run https://github.com/Yushi-Hu/PromptCap/blob/main/new_pica/run_pica_okvqa.sh ?
If so, could you make the needed files available? They seem to be custom files that do not come with the standard datasets.

Thank you!
Benno

caption_generate_gpt3

Are you using GPT-3 here to generate question-aware image captions? If so, does each image in the dataset have 5 human-annotated captions, as you mention in the paper? Or do only the 20 in-context examples here come with 5 human-annotated captions, while the rest of the images in the dataset still have only one caption each? If you have time, please give some suggestions.

prompt += f"Original contexts: {'. '.join([it['caption'] for it in vqa_dataset[i]['caption']])}\n"

PromptCAP is not working on google colab

Hey there!
Thanks a lot for the amazing work and for making it public.
Unfortunately, when I tried to run the code on Colab, I got the following error:

TypeError Traceback (most recent call last)

in <cell line: 7>()
5 image = "/content/temp1.jpg"
6
----> 7 print(model.caption(prompt, image))

2 frames

/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py in generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1573 if self.config.is_encoder_decoder and "encoder_outputs" not in model_kwargs:
1574 # if model is encoder decoder encoder_outputs are created and added to model_kwargs
-> 1575 model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
1576 inputs_tensor, model_kwargs, model_input_name, generation_config
1577 )

TypeError: OFAModel._prepare_encoder_decoder_kwargs_for_generation() takes from 3 to 4 positional arguments but 5 were given

The code that I tried to run is as follows:

In one cell, I run:
!pip install promptcap

In another cell, I run:
import torch
from promptcap import PromptCap

model = PromptCap("tifa-benchmark/promptcap-coco-vqa") # also support OFA checkpoints. e.g. "OFA-Sys/ofa-large"
if torch.cuda.is_available():
    model.cuda()

prompt = "what does the image describe?"
image = "/content/temp1.jpg"

print(model.caption(prompt, image))

Any help will be appreciated.

Batch Inference (Captioning)

Thanks for your interesting work and for sharing the code.

In the README, you only provide examples of how to generate captions for one image at a time (batch size = 1). Could you (@Yushi-Hu) explain how to generate captions in batches (multiple questions and their corresponding images) in one go, instead of calling the model iteratively, to improve time efficiency?

Evaluation result on OK-VQA

Hi, thanks for your interesting work.

I wonder how you evaluated your final results on OK-VQA.

The paper says it was evaluated with the soft accuracy of VQAv2,
so I tried to evaluate your OKVQA_val_gpt3.json from Evaluation Logs using the VQA evaluation code:
https://github.com/GT-Vision-Lab/VQA

It shows a score of 58.89, but the paper reports 60.4.
I used the "mscoco_val2014_annotations.json" file from the OK-VQA website as the annotation file.

Did you not use the VQA evaluation code, or are the log files not the final results?

Thank you
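For reference, the VQAv2 soft accuracy mentioned above scores a prediction against the 10 human answers per question. The sketch below follows that definition but omits the answer normalization (lowercasing, article and punctuation stripping) performed by the official VQA evaluation code, so small differences from the official scorer are expected.

# Sketch of VQAv2 soft accuracy: a prediction gets min(#matching human
# answers / 3, 1), averaged over the 10 leave-one-annotator-out subsets.
# Answer normalization from the official evaluation code is omitted here.

def soft_accuracy(pred, human_answers):
    accs = []
    for i in range(len(human_answers)):
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == pred for a in others)
        accs.append(min(matches / 3.0, 1.0))
    return sum(accs) / len(accs)

print(soft_accuracy("gloves", ["gloves"] * 7 + ["mittens"] * 3))  # 1.0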

Could you provide more details on how to reproduce PromptCap?

I have downloaded your training data: vqa2_train_1010.zip. How could I use this dataset via OFASys to reproduce a model like PromptCap? For example, how do I load this type of dataset in OFASys? Could you please provide the YAML config file you used to fine-tune your PromptCap model?

API key

Hey,

You've pushed your code with the OpenAI API key in it. Not sure if you wanted to put it out there, just letting you know :)

training code

Will the code for training the model be provided? thanks!

Error with “model = PromptCap("vqascore/promptcap-coco-vqa")”

Hello, I encountered the following error while using this interface. Is it not working?

vqascore/promptcap-coco-vqa
<super: <class 'OFATokenizer'>, >
Traceback (most recent call last):
File "/home/lh/mukea-clip/pc.py", line 4, in
model = PromptCap("vqascore/promptcap-coco-vqa") # also support OFA checkpoints. e.g. "OFA-Sys/ofa-large"
File "/home/lh/.conda/envs/pytorch/lib/python3.6/site-packages/promptcap/promptcap.py", line 12, in init
self.tokenizer = OFATokenizer.from_pretrained(ckpt)
File "/home/lh/.conda/envs/pytorch/lib/python3.6/site-packages/promptcap/tokenization_ofa.py", line 67, in from_pretrained
tokenizer = super().from_pretrained(pretrained_model_name_or_path, *init_inputs, **kwargs)
File "/home/lh/.conda/envs/pytorch/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1664, in from_pretrained
local_files_only=local_files_only,
File "/home/lh/.conda/envs/pytorch/lib/python3.6/site-packages/transformers/file_utils.py", line 2242, in get_file_from_repo
use_auth_token=use_auth_token,
File "/home/lh/.conda/envs/pytorch/lib/python3.6/site-packages/transformers/file_utils.py", line 1854, in cached_path
local_files_only=local_files_only,
File "/home/lh/.conda/envs/pytorch/lib/python3.6/site-packages/transformers/file_utils.py", line 2103, in get_from_cache
"Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
