clip-caption-reward's People

Contributors

ak391, chenxwh, j-min

clip-caption-reward's Issues

ONNX version?

Hello, thanks for the code.
Do you have an ONNX release?

My Raspberry Pi 4B takes 1 min 20 s to process a single image.
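For anyone attempting the export themselves, here is a minimal sketch covering just the CLIP RN50 visual encoder (my own illustration, not part of this repo; the autoregressive caption decoder uses beam search and would need substantially more work to export):

    import clip
    import torch

    # Load the CLIP RN50 backbone on CPU (weights load as float32 there).
    model, _ = clip.load('RN50', device='cpu')

    dummy = torch.randn(1, 3, 224, 224)      # one preprocessed 224x224 image
    torch.onnx.export(
        model.visual, dummy, 'clip_rn50_visual.onnx',
        input_names=['image'], output_names=['features'],
        opset_version=14,                    # attention pooling may need a recent opset
    )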

Error running scripts/clip_prepro_feats.py

Hello,

I'm encountering an error (further below) when running:
python scripts/clip_prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root datasets/COCO/images --model_type RN50

which throws:

  File "scripts/clip_prepro_feats.py", line 117, in main
    tmp_att, tmp_fc = model.encode_image(image)
  (…)
  query should be unbatched 2D or batched 3D tensor but received 4-D query tensor

I have followed the steps in the README, and the line that I'm running is the exact same one from the README. My installation includes:
torch==1.12.0
torchvision==0.13.0
clip @ git+https://github.com/openai/CLIP.git@d50d76daa670286dd6cacf3bcd80b5e4823fc8e1

I can run the steps that come before (text processing) and after (visual feature extraction for CLIP-S) without problems.

Do you have any idea what is wrong here? Thank you in advance for any pointers!
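For readers hitting the same error: the message comes from the stricter input-shape check that torch's multi-head attention gained around version 1.12, which rejects 4-D queries. A minimal illustration of the constraint (my own sketch, not this repo's code):

    import torch
    import torch.nn as nn

    # CLIP RN50's attention pooling uses embed_dim=2048 with 32 heads.
    mha = nn.MultiheadAttention(embed_dim=2048, num_heads=32)

    x = torch.randn(1, 2048, 7, 7)            # 4-D CNN feature map
    # mha(x, x, x)                            # raises the "4-D query tensor" error on torch >= 1.12

    tokens = x.flatten(2).permute(2, 0, 1)    # (49, 1, 2048): the 3-D (L, N, E) layout it expects
    out, _ = mha(tokens, tokens, tokens)

Pinning torch to a version below 1.12 may also sidestep the check.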

Phase 2 of CLIP-S

Sorry to bother you again! I have a question about the model-saving strategy in the CLIP-S (and CLIP-S + grammar) training.
I notice the code always saves the model with the highest CIDEr score as the best model during training.

However, under CLIP-S RL most metrics decrease during training, with CLIP-S being the exception. How do you choose the "best" model? By the highest CLIP-S score? If so, should the other RL settings (such as CIDEr and CIDEr + CLIP-S) adopt the same criterion to choose the "best" model, instead of the highest CIDEr score?
[screenshot: validation metrics during training]
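For context, a generic sketch of the selection policy being asked about (hypothetical names and numbers, not this repo's code), where the checkpoint is picked by the metric that matches the RL reward:

    def select_best(history, metric='CLIP-S'):
        """Return the checkpoint whose validation score on `metric` is highest."""
        return max(history, key=lambda item: item[1][metric])[0]

    # Hypothetical numbers: CIDEr and CLIP-S disagree on the best checkpoint.
    history = [('ckpt1.pth', {'CIDEr': 110.0, 'CLIP-S': 0.72}),
               ('ckpt2.pth', {'CIDEr': 104.0, 'CLIP-S': 0.78})]
    print(select_best(history))               # -> ckpt2.pth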

Pre-trained model

Excuse me, is the downloaded pre-trained model fully trained? Why does testing it give the result shown in the screenshot below?
[screenshot: test output]

Phase 1 validation throws: shape '[4, -1, 512]' is invalid for input of size 64000

Hello,

I have successfully generated all features (both text and visual) for the COCO dataset. However, when running MLE training, the code throws the following error at the moment it starts validation at 96% of the first epoch:

File "/home/soaresbu/clip-captioning/captioning/utils/clipscore.py", line 177, in forward
    refclip_s = self.calc_refclip_s(
  File "/home/soaresbu/clip-captioning/captioning/utils/clipscore.py", line 124, in calc_refclip_s
    ref_text_feat = ref_text_feat.view(B, -1, dim)
RuntimeError: shape '[4, -1, 512]' is invalid for input of size 64000

Any idea what could be wrong here? Am I missing something when generating the CLIP-S features with python scripts/clipscore_prepro_feats.py?
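A quick back-of-the-envelope check of the failing view (B=4 and dim=512 are taken from the error message):

    total = 64000                  # elements in ref_text_feat
    B, dim = 4, 512
    print(total // dim)            # 125 reference-caption features in the flattened batch
    print(total / (B * dim))       # 31.25 -> not an integer, so view(B, -1, dim) must fail
    # 125 is not divisible by the batch size 4, which suggests the images in
    # this batch do not all carry the same number of reference captions.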

Selecting part of the dataset

If I want to use only part of the dataset for training, how should I modify the input JSON file? I don't understand the relationship between the several input files; could you please explain it? Thank you!
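A hedged sketch of one way to do this, assuming data/dataset_coco.json follows the Karpathy-split format with a top-level "images" list whose entries carry a "split" field:

    import json

    with open('data/dataset_coco.json') as f:
        data = json.load(f)

    # Keep only the first 10,000 training images, but all val/test images,
    # then re-run the preprocessing scripts on the reduced file.
    train = [im for im in data['images'] if im['split'] in ('train', 'restval')]
    other = [im for im in data['images'] if im['split'] not in ('train', 'restval')]
    data['images'] = train[:10000] + other

    with open('data/dataset_coco_small.json', 'w') as f:
        json.dump(data, f)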

Factor in the weighted reward sum is inconsistent with the paper

Hello,

In the paper, the "reward augmented with the grammar score" is defined as:
$$R(I,c) = \text{CLIP-S}(I,c) + \lambda\, g(c)$$ with $\lambda = 2.0$.
However, in the code it is the CLIP-S reward that is multiplied by the 2.0 scalar (on top of the 2.5 factor in the definition of CLIP-S): rewards = opt.clipscore_reward_weight * rewards

The two rewards are then summed directly, and I can't find the grammar reward being scaled anywhere (in particular not by 4.0, which would balance the 2.0 factor on the CLIP-S reward), so I think there has been an inversion somewhere.

Since I am reproducing the results with my own code, it would be really helpful to know which scaling is correct, because one constrains the language model more than the other.
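To spell out the discrepancy (this derivation is mine, using the CLIPScore definition $\text{CLIP-S}(I,c) = 2.5 \max(\cos(v_I, v_c), 0)$ mentioned above):

$$R_{\text{paper}}(I,c) = \text{CLIP-S}(I,c) + 2.0\, g(c)$$
$$R_{\text{code}}(I,c) = 2.0\,\text{CLIP-S}(I,c) + g(c) = 2.0\left(\text{CLIP-S}(I,c) + 0.5\, g(c)\right)$$

So the code's effective $\lambda$ is 0.5 rather than 2.0, and matching the paper under the code's parameterization would require scaling the grammar reward by 4.0, since $2.0\,\text{CLIP-S} + 4.0\, g = 2.0\left(\text{CLIP-S} + 2.0\, g\right)$.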

Config file and performance reproduce

I re-trained the MLE phase (on 8 V100s) using your released config file configs/phase1/clipRN50_mle.yml, but the performance is lower than reported in the paper (CIDEr: 106.5 vs. 110.3). Does the config file correspond to the experiment reported in the paper?
[screenshot: evaluation metrics]

The warmup step count is set to 20000 in the config file; is that too large? The learning rate kept rising for the entire training run (warmup only, never decreasing).
[screenshot: learning-rate curve]
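A back-of-the-envelope check of the warmup length (the batch size and epoch count below are hypothetical placeholders; ~113k is the Karpathy train+restval image count):

    num_images = 113_287                          # Karpathy train + restval
    batch_size = 160                              # hypothetical; use the value from the yml
    epochs = 15                                   # hypothetical
    steps_per_epoch = num_images // batch_size    # ~708
    total_steps = steps_per_epoch * epochs        # ~10,620, well below 20,000
    # If total_steps never reaches the warmup step count, a linear warmup never
    # finishes and the learning rate rises for the entire run, which matches
    # the curve in the screenshot above.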

About language_evaluation

Hi, authors. Could you please provide the details of the language_evaluation package used in eval_finecapeval.py for the evaluation on FineCapEval?
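In case it helps other readers: the repo appears to depend on the language-evaluation package (github.com/bckim92/language-evaluation). A minimal usage sketch, assuming eval_finecapeval.py computes the standard COCO-style metrics:

    import language_evaluation

    predicts = ['a man riding a bike down a street']
    answers = [['a man rides his bicycle on the road']]   # one list of references per prediction

    evaluator = language_evaluation.CocoEvaluator()       # BLEU, METEOR, ROUGE-L, CIDEr, ...
    print(evaluator.run_evaluation(predicts, answers))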

Confusion about the config

Hello,
I am trying to reproduce your results and I am confused by these parameters:

input_label_h5: data/cocotalk_label.h5
input_fc_dir: data/cocotalk_clip_RN50_fc
input_att_dir: data/cocotalk_clip_RN50_att

Could you please elaborate on these parameters? For example, how can I regenerate cocotalk_label.h5, cocotalk_clip_RN50_fc, and cocotalk_clip_RN50_att?
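For reference, a hedged sketch of how these files are typically produced, based on the commands quoted earlier in this thread and on the self-critical.pytorch codebase this repo builds on (the exact --output_dir naming is an assumption, so check the scripts' arguments):

    # Tokenize the captions -> writes data/cocotalk.json and data/cocotalk_label.h5
    python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk

    # Extract CLIP RN50 visual features -> writes the *_fc and *_att directories
    python scripts/clip_prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root <COCO images root> --model_type RN50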
