clip-caption-reward's People

Contributors

ak391, chenxwh, j-min

clip-caption-reward's Issues

ONNX version?

Hello, thanks for the code.
Do you have an ONNX release?

My Raspberry Pi 4B takes 1 min 20 s to process a single image.
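For anyone attempting the export themselves, here is a minimal sketch covering just the CLIP RN50 visual encoder (my own illustration, not part of this repo; the autoregressive caption decoder uses beam search and would need substantially more work to export):

    import clip
    import torch

    # Load the CLIP RN50 backbone on CPU (weights load as float32 there).
    model, _ = clip.load('RN50', device='cpu')

    dummy = torch.randn(1, 3, 224, 224)      # one preprocessed 224x224 image
    torch.onnx.export(
        model.visual, dummy, 'clip_rn50_visual.onnx',
        input_names=['image'], output_names=['features'],
        opset_version=14,                    # attention pooling may need a recent opset
    )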

Error running scripts/clip_prepro_feats.py

Hello,

I'm encountering an error (further below) when running:
python scripts/clip_prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root datasets/COCO/images --model_type RN50

which throws:

  File "scripts/clip_prepro_feats.py", line 117, in main
    tmp_att, tmp_fc = model.encode_image(image)
  (…)
  query should be unbatched 2D or batched 3D tensor but received 4-D query tensor

I have followed the steps in the README, and the line that I'm running is the exact same one from the README. My installation includes:
torch==1.12.0
torchvision==0.13.0
clip @ git+https://github.com/openai/CLIP.git@d50d76daa670286dd6cacf3bcd80b5e4823fc8e1

I can run the steps that come before (text processing) and after (visual feature extraction for CLIP-S) without problems.

Do you have any idea what is wrong here? Thank you in advance for any pointers!
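For readers hitting the same error: the message comes from the stricter input-shape check that torch's multi-head attention gained around version 1.12, which rejects 4-D queries. A minimal illustration of the constraint (my own sketch, not this repo's code):

    import torch
    import torch.nn as nn

    # CLIP RN50's attention pooling uses embed_dim=2048 with 32 heads.
    mha = nn.MultiheadAttention(embed_dim=2048, num_heads=32)

    x = torch.randn(1, 2048, 7, 7)            # 4-D CNN feature map
    # mha(x, x, x)                            # raises the "4-D query tensor" error on torch >= 1.12

    tokens = x.flatten(2).permute(2, 0, 1)    # (49, 1, 2048): the 3-D (L, N, E) layout it expects
    out, _ = mha(tokens, tokens, tokens)

Pinning torch to a version below 1.12 may also sidestep the check.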

Phase 2 of CLIP-S

Sorry to bother you again! I have a question about the model-saving strategy in the CLIP-S (and CLIP-S + grammar) training.
I notice the code always saves the model with the highest CIDEr score as the best model during training.

However, under CLIP-S RL most metrics decrease during training, with CLIP-S being the exception. How do you choose the "best" model? By the highest CLIP-S score? If so, should the other RL settings (such as CIDEr and CIDEr + CLIP-S) adopt the same criterion to choose the "best" model, instead of the highest CIDEr score?
[screenshot: validation metrics during training]
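For context, a generic sketch of the selection policy being asked about (hypothetical names and numbers, not this repo's code), where the checkpoint is picked by the metric that matches the RL reward:

    def select_best(history, metric='CLIP-S'):
        """Return the checkpoint whose validation score on `metric` is highest."""
        return max(history, key=lambda item: item[1][metric])[0]

    # Hypothetical numbers: CIDEr and CLIP-S disagree on the best checkpoint.
    history = [('ckpt1.pth', {'CIDEr': 110.0, 'CLIP-S': 0.72}),
               ('ckpt2.pth', {'CIDEr': 104.0, 'CLIP-S': 0.78})]
    print(select_best(history))               # -> ckpt2.pth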

Pre-trained model

Excuse me, is the downloaded pre-trained model fully trained? Why does testing it give the result shown in the screenshot below?
[screenshot: test output]

Phase 1 validation throws: shape '[4, -1, 512]' is invalid for input of size 64000

Hello,

I have successfully generated all features (both text and visual) for the COCO dataset. However, when running MLE training, the code throws the following error at the moment it starts validation at 96% of the first epoch:

File "/home/soaresbu/clip-captioning/captioning/utils/clipscore.py", line 177, in forward
    refclip_s = self.calc_refclip_s(
  File "/home/soaresbu/clip-captioning/captioning/utils/clipscore.py", line 124, in calc_refclip_s
    ref_text_feat = ref_text_feat.view(B, -1, dim)
RuntimeError: shape '[4, -1, 512]' is invalid for input of size 64000

Any idea what could be wrong here? Am I missing something when generating the CLIP-S features with python scripts/clipscore_prepro_feats.py?
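A quick back-of-the-envelope check of the failing view (B=4 and dim=512 are taken from the error message):

    total = 64000                  # elements in ref_text_feat
    B, dim = 4, 512
    print(total // dim)            # 125 reference-caption features in the flattened batch
    print(total / (B * dim))       # 31.25 -> not an integer, so view(B, -1, dim) must fail
    # 125 is not divisible by the batch size 4, which suggests the images in
    # this batch do not all carry the same number of reference captions.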

Selecting part of the dataset

If I want to use only part of the dataset for training, how should I modify the input JSON file? I don't understand the relationship between the several input files; could you please explain it? Thank you!
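A hedged sketch of one way to do this, assuming data/dataset_coco.json follows the Karpathy-split format with a top-level "images" list whose entries carry a "split" field:

    import json

    with open('data/dataset_coco.json') as f:
        data = json.load(f)

    # Keep only the first 10,000 training images, but all val/test images,
    # then re-run the preprocessing scripts on the reduced file.
    train = [im for im in data['images'] if im['split'] in ('train', 'restval')]
    other = [im for im in data['images'] if im['split'] not in ('train', 'restval')]
    data['images'] = train[:10000] + other

    with open('data/dataset_coco_small.json', 'w') as f:
        json.dump(data, f)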

Factor in the weighted reward sum is inconsistent with the paper

Hello,

In the paper, the "reward augmented with the grammar score" is defined as:
$$R(I,c) = \text{CLIP-S}(I,c) + \lambda\, g(c)$$ with $\lambda = 2.0$.
However, in the code it is the CLIP-S reward that is multiplied by the 2.0 scalar (on top of the 2.5 factor in the definition of CLIP-S): rewards = opt.clipscore_reward_weight * rewards

The two rewards are then summed directly, and I can't find the grammar reward being scaled anywhere (in particular not by 4.0, which would balance the 2.0 factor on the CLIP-S reward), so I think there has been an inversion somewhere.

Since I am reproducing the results with my own code, it would be really helpful to know which scaling is correct, because one constrains the language model more than the other.
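To spell out the discrepancy (this derivation is mine, using the CLIPScore definition $\text{CLIP-S}(I,c) = 2.5 \max(\cos(v_I, v_c), 0)$ mentioned above):

$$R_{\text{paper}}(I,c) = \text{CLIP-S}(I,c) + 2.0\, g(c)$$
$$R_{\text{code}}(I,c) = 2.0\,\text{CLIP-S}(I,c) + g(c) = 2.0\left(\text{CLIP-S}(I,c) + 0.5\, g(c)\right)$$

So the code's effective $\lambda$ is 0.5 rather than 2.0, and matching the paper under the code's parameterization would require scaling the grammar reward by 4.0, since $2.0\,\text{CLIP-S} + 4.0\, g = 2.0\left(\text{CLIP-S} + 2.0\, g\right)$.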

Config file and performance reproduce

I re-trained the MLE phase (on 8 V100s) using your released config file configs/phase1/clipRN50_mle.yml, but the performance is lower than reported in the paper (CIDEr: 106.5 vs. 110.3). Does the config file correspond to the experiment reported in the paper?
[screenshot: evaluation metrics]

The warmup step count is set to 20000 in the config file; is that too large? The learning rate kept rising for the entire training run (warmup only, never decreasing).
[screenshot: learning-rate curve]
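A back-of-the-envelope check of the warmup length (the batch size and epoch count below are hypothetical placeholders; ~113k is the Karpathy train+restval image count):

    num_images = 113_287                          # Karpathy train + restval
    batch_size = 160                              # hypothetical; use the value from the yml
    epochs = 15                                   # hypothetical
    steps_per_epoch = num_images // batch_size    # ~708
    total_steps = steps_per_epoch * epochs        # ~10,620, well below 20,000
    # If total_steps never reaches the warmup step count, a linear warmup never
    # finishes and the learning rate rises for the entire run, which matches
    # the curve in the screenshot above.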

About language_evaluation

Hi, authors. Could you please provide the details of the language_evaluation package used in eval_finecapeval.py for the evaluation on FineCapEval?
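In case it helps other readers: the repo appears to depend on the language-evaluation package (github.com/bckim92/language-evaluation). A minimal usage sketch, assuming eval_finecapeval.py computes the standard COCO-style metrics:

    import language_evaluation

    predicts = ['a man riding a bike down a street']
    answers = [['a man rides his bicycle on the road']]   # one list of references per prediction

    evaluator = language_evaluation.CocoEvaluator()       # BLEU, METEOR, ROUGE-L, CIDEr, ...
    print(evaluator.run_evaluation(predicts, answers))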

Confusion about the config

Hello,
I am trying to reproduce your results and I am confused by these parameters:

input_label_h5: data/cocotalk_label.h5
input_fc_dir: data/cocotalk_clip_RN50_fc
input_att_dir: data/cocotalk_clip_RN50_att

Could you please elaborate on these parameters? For example, how can I regenerate cocotalk_label.h5, cocotalk_clip_RN50_fc, and cocotalk_clip_RN50_att?
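For reference, a hedged sketch of how these files are typically produced, based on the commands quoted earlier in this thread and on the self-critical.pytorch codebase this repo builds on (the exact --output_dir naming is an assumption, so check the scripts' arguments):

    # Tokenize the captions -> writes data/cocotalk.json and data/cocotalk_label.h5
    python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk

    # Extract CLIP RN50 visual features -> writes the *_fc and *_att directories
    python scripts/clip_prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root <COCO images root> --model_type RN50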
