j-min / clip-caption-reward
PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022)
Home Page: https://arxiv.org/abs/2205.13115
License: Other
Hello, thanks for the code.
Do you have an ONNX release?
My Raspberry Pi 4B takes 1 min 20 s to caption a single image :(
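For reference, a minimal ONNX export sketch of just the CLIP-RN50 visual encoder (an untested assumption on my part; the output path and opset are arbitrary, and the attention pooling may need extra tracing workarounds):

    import torch
    import clip

    # export only the visual encoder; "clip_rn50_visual.onnx" is a hypothetical path
    model, _ = clip.load("RN50", device="cpu")
    model.eval()
    dummy = torch.randn(1, 3, 224, 224)
    torch.onnx.export(
        model.visual,
        dummy,
        "clip_rn50_visual.onnx",
        input_names=["image"],
        output_names=["features"],
        opset_version=14,
    )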
Hello,
I'm encountering an error (further below) when running:
python scripts/clip_prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root datasets/COCO/images --model_type RN50
Which throws:
File "scripts/clip_prepro_feats.py", line 117, in main tmp_att, tmp_fc = model.encode_image(image) (…) query should be unbatched 2D or batched 3D tensor but received 4-D query tensor
I have followed the steps in the README, and the line that I'm running is the exact same one from the README. My installation includes:
torch==1.12.0
torchvision==0.13.0
clip @ git+https://github.com/openai/CLIP.git@d50d76daa670286dd6cacf3bcd80b5e4823fc8e1
I can run the steps that come before (text processing) and after (visual feature extraction for CLIP-S) without problems.
Would you have any idea of what is wrong here? Thank you in advance for any pointers!
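For what it's worth, the failing check seems to come from the stricter MultiheadAttention input validation in torch 1.12, so pinning torch<1.12 may be the simplest workaround. Alternatively, a sketch of reshaping a 4-D NCHW feature map into the batched 3-D layout the attention expects (an assumption, untested against this repo):

    import torch

    x = torch.randn(2, 2048, 7, 7)   # hypothetical (N, C, H, W) feature map
    x = x.flatten(start_dim=2)       # (N, C, H*W)
    x = x.permute(2, 0, 1)           # (H*W, N, C): a batched 3-D query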
Sorry to bother you again! A question about the model-saving strategy in CLIP-S (similarly, CLIP-S + grammar) training.
I notice the code always saves the model with the highest CIDEr score as the best model during training.
But under CLIP-S RL, most metrics decrease during training, except CLIP-S. How do you choose the "best" model? By the highest CLIP-S score? If so, under other RL objectives (such as CIDEr and CIDEr + CLIP-S), should we choose the "best" model the same way (by the optimized metric, instead of always the highest CIDEr)?
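For concreteness, a hypothetical sketch of metric-driven checkpoint selection (history, select_metric, and the save path are all made up, not this repo's API):

    import torch

    history = [{"CIDEr": 1.10, "CLIP-S": 0.74}, {"CIDEr": 1.02, "CLIP-S": 0.78}]
    model = torch.nn.Linear(4, 4)   # stand-in for the captioning model
    select_metric = "CLIP-S"        # the metric the RL objective optimizes

    best = float("-inf")
    for metrics in history:
        if metrics[select_metric] > best:
            best = metrics[select_metric]
            torch.save(model.state_dict(), "model-best.pth")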
Hello,
I have successfully generated all features (both text and visual) for the COCO dataset. However, when running MLE training, the code throws the following error at the moment it starts validation at 96% of the first epoch:
File "/home/soaresbu/clip-captioning/captioning/utils/clipscore.py", line 177, in forward
refclip_s = self.calc_refclip_s(
File "/home/soaresbu/clip-captioning/captioning/utils/clipscore.py", line 124, in calc_refclip_s
ref_text_feat = ref_text_feat.view(B, -1, dim)
RuntimeError: shape '[4, -1, 512]' is invalid for input of size 64000
Any idea what could be wrong here? Am I missing something when generating the CLIP-S features with python scripts/clipscore_prepro_feats.py?
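The numbers in the error may already point at the cause: 64000 elements / 512 dims = 125 reference features, and 125 is not divisible by the batch size 4, which would suggest an uneven number of references per image. A minimal repro of the failing view (the shapes are taken from the error message):

    import torch

    ref_text_feat = torch.randn(125, 512)   # 125 * 512 = 64000 elements
    ref_text_feat.view(4, -1, 512)          # RuntimeError: same message as above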
If I want to select only part of the dataset for training, how should I modify the input JSON file? I don't understand the relationship between the several input files. Could you please explain? Thank you!
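For illustration, a minimal sketch of subsetting the Karpathy-split file, assuming dataset_coco.json keeps its images under a top-level "images" list (the subset size and output path are arbitrary):

    import json

    with open("data/dataset_coco.json") as f:
        data = json.load(f)

    data["images"] = data["images"][:1000]   # keep the first 1000 images (arbitrary)

    with open("data/dataset_coco_subset.json", "w") as f:
        json.dump(data, f)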
Hello,
In the paper, the reward is defined as the CLIP-S reward augmented with the grammar score.
However, in the code it is the CLIP-S reward that is multiplied by the 2.0 scalar (on top of the 2.5 factor from the definition of CLIP-S): rewards = opt.clipscore_reward_weight * rewards
The two rewards are then summed directly, and I can't find the grammar reward being scaled anywhere (in particular, not by 4.0, which would balance the 2.0 factor on the CLIP-S reward), so I think there has been an inversion somewhere.
Since I am reproducing the results with my own code, it would be really helpful to know which scaling is the correct one, because one constrains the language model more than the other.
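To make the two candidate scalings concrete, a hypothetical sketch using only the factors mentioned above (the cosine similarity and grammar score values are placeholders):

    cos_sim = 0.30   # placeholder CLIP image-text cosine similarity
    grammar = 0.90   # placeholder grammar probability

    clip_s = 2.5 * cos_sim                       # CLIP-S definition
    reward_in_code = 2.0 * clip_s + grammar      # scaling as found in the code
    reward_suspected = clip_s + 2.0 * grammar    # the inversion suspected above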
I re-trained the MLE phase (8 V100s) using your released config file configs/phase1/clipRN50_mle.yml, but the performance is lower than reported in the paper (CIDEr: 106.5 vs. 110.3). Does the config file correspond to the experiment reported in the paper?
The warmup is set to 20000 steps in the config file; is that too large? The learning rate kept rising throughout the entire training phase (warmup only, never decaying).
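A minimal sketch of why a 20000-step warmup can dominate the whole run, assuming linear warmup (the peak learning rate is hypothetical):

    warmup_steps = 20000

    def lr_at(step, peak_lr=5e-4):
        # the LR only ever increases before warmup_steps; if training ends
        # earlier, it never plateaus or decays
        return peak_lr * min(1.0, step / warmup_steps)

    print(lr_at(5000), lr_at(20000))   # 0.000125 0.0005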
Hi, authors. Could you please provide the details of the language_evaluation package used in eval_finecapeval.py for the evaluation on FineCapEval?
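In case it helps others, a minimal usage sketch, assuming the bckim92/language-evaluation package (the prediction and reference lists are made up):

    import language_evaluation

    predictions = ["a dog runs on the grass"]
    references = ["a dog is running across a field"]

    evaluator = language_evaluation.CocoEvaluator()
    results = evaluator.run_evaluation(predictions, references)
    print(results)   # BLEU / ROUGE-L / CIDEr, depending on the configured metrics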
Hello,
I am trying to reproduce your code and I am confused about these parameters:
input_label_h5: data/cocotalk_label.h5
input_fc_dir: data/cocotalk_clip_RN50_fc
input_att_dir: data/cocotalk_clip_RN50_att
Could you please elaborate on these parameters? How can I reproduce cocotalk_label.h5, cocotalk_clip_RN50_fc, and cocotalk_clip_RN50_att?
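If this repo follows the self-critical.pytorch-style pipeline it builds on (an assumption on my part), the label file would come from the label preprocessing script and the two feature directories from the CLIP feature extraction script, roughly:

python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk
python scripts/clip_prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root <COCO images root> --model_type RN50

with the _clip_RN50_fc and _clip_RN50_att suffixes presumably appended by the feature script.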
Hi,
Thank you for this amazing contribution!
I can't find the data/finecapeval.json file. Could you give me some guidance?
What is the difference between clip_prepro_feats.py and clipscore_prepro_feats.py? Why do we need to run both for visual feature extraction?