om-ai-lab / vl-checklist
Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations.
Hey, thanks for releasing your great work. Could you please add a license to your dataset so that we can cite and use it fairly? Thanks!
Thanks for this wonderful work; it is very inspiring. I am confused about how to get the heat-map shown in your paper. Looking forward to your reply at your convenience.
Hi,
Thanks for opening the source code.
I'm trying to reproduce the CLIP scores reported in the paper but have not been able to.
I use the sample config file, changing MODEL_NAME to CLIP (ViT-L/14).
I evaluate all the datasets in the corpus and then average the final accuracy.
I get the following scores, which are quite different from the paper:
Object: 0.8205209550766983
Attribute: 0.6806109948697314
Relation: 0.67975
How can I reproduce the scores in the paper?
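For reference, this is roughly how I averaged the numbers; a minimal sketch assuming each per-dataset run leaves a JSON result file with a "total_acc" field (the actual output layout and key names of vl-checklist may differ, and the paths here are hypothetical):
import glob, json

def average_accuracy(result_dir):
    # macro-average the per-dataset accuracies found in a results folder
    accs = []
    for path in glob.glob(f"{result_dir}/*.json"):
        with open(path) as f:
            accs.append(json.load(f)["total_acc"])  # assumed key name
    return sum(accs) / len(accs)

print("Object:", average_accuracy("output/clip_vit_l14/object"))  # hypothetical path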
In the attention.py demo, the get_attention_by_gradcam method takes both image_input and text_input, and I want to know why text_input is the one used here. The code is shown below.
def get_attention_by_gradcam(self, model, tokenizer, image_path, image_input, text_input, attr_name, target_layer):
    encoder_name = getattr(model, attr_name, None)
    encoder_name.encoder.layer[target_layer].crossattention.self.save_attention = True
    output = model(image_input, text_input)
    loss = output[:, 1].sum()
    model.zero_grad()
    loss.backward()
    image_size = 256
    temp = int(np.sqrt(image_size))
    # the mask zeroes out padding tokens so they are not counted in cams and grads;
    # because of the text preprocessing of ALBEF and TCL, the mask has no effect here
    mask = text_input.attention_mask.view(text_input.attention_mask.size(0), 1, -1, 1, 1)
    grads = encoder_name.encoder.layer[target_layer].crossattention.self.get_attn_gradients()
    cams = encoder_name.encoder.layer[target_layer].crossattention.self.get_attention_map()
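For reference (this also relates to the earlier heat-map question), a minimal sketch of how a heat-map could be derived from the grads and cams above, assuming both have shape [batch, heads, text_tokens, image_patches + 1]; it follows the usual ALBEF-style Grad-CAM recipe and is not taken from this repo.
import torch

def build_gradcam(cams, grads, mask, temp):
    # hypothetical helper: combine cross-attention maps and their gradients into a heat-map
    # drop the [CLS] image token, reshape each text token's attention over patches
    # into a temp x temp grid, and zero out padded text tokens with the mask
    cams = cams[:, :, :, 1:].reshape(cams.size(0), cams.size(1), -1, temp, temp) * mask
    # keep only positive gradients, as in Grad-CAM
    grads = grads[:, :, :, 1:].clamp(min=0).reshape(grads.size(0), grads.size(1), -1, temp, temp) * mask
    # weight the attention maps by their gradients and average over heads
    return (cams * grads).mean(dim=1)  # [batch, text_tokens, temp, temp]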
A related instance of the same question: in the 'albef' attention demo, attr_name is 'text_encoder'. The code is shown below.
def getAttMap(self, image_path, text):
    if self.model_name.lower() == 'albef':
        engine = ALBEF('ALBEF_4M.pth')
        model, tokenizer = engine.load_model(engine.model_id)
        image_input = engine.load_data(src_type='local', data=[image_path])[0]
        text_input = tokenizer(engine.pre_caption(text), return_tensors="pt")
        self.get_attention_by_gradcam(model, tokenizer, image_path, image_input, text_input,
                                      attr_name='text_encoder', target_layer=8)
Hi,
Thanks for open-sourcing your code. I am trying to reproduce the results for ALBEF in your paper, but without success. I was going through your code and noticed that the ITM logits/probabilities are used differently in the code than in the paper. The paper says, "If the model score on the original text description is higher than the score on the generated negative samples, we regard it as positive output." However, in the code only the ITM logit corresponding to "matching", z[1], is used; the code never compares the scores between the positive and negative text as described in the paper. Can you please clarify?
Thanks,
Ajinkya
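To make the distinction concrete, a small sketch of the two scoring rules being contrasted; itm_logits_pos and itm_logits_neg are illustrative names for the 2-class ITM outputs of an ALBEF-style model on the original and a generated negative caption, not identifiers from this repo.
import torch

def match_score(itm_logits):
    # rule apparently used in the code: take the probability of the
    # "match" class (index 1) on its own
    return torch.softmax(itm_logits, dim=-1)[:, 1]

def is_positive(itm_logits_pos, itm_logits_neg):
    # rule described in the paper: the sample counts as a positive output
    # only if the original caption scores higher than the negative one
    return match_score(itm_logits_pos) > match_score(itm_logits_neg)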
Hi, it is great work. I want to know where the code for generating the negative texts is.
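For illustration only, a toy sketch of how such a negative text could be built by swapping a single word; this is not the repo's generation code (which is what this issue asks to locate), and the replacement vocabulary is made up.
import random

def make_negative(caption, target_word, replacement_vocab):
    # hypothetical example: replace one word with a different word from a small vocabulary
    candidates = [w for w in replacement_vocab if w != target_word]
    return caption.replace(target_word, random.choice(candidates))

print(make_negative("a red car on the street", "red", ["blue", "green", "yellow"]))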
I tried to follow the instructions in the HAKE codebase, but the PIC dataset (and its website) is gone.
I think Attribute should be fixed to Relation.
Hi!
Thank you for publishing this great work. I was able to run the test with your ViLT model; is it possible to run tests with other models such as TCL and the rest? Their checkpoints are different, so it's not clear to me whether the code should support them or not.
Thank you and have a great week,
Amit
Hi, it is great work! Since region-based methods like Oscar use extracted features for evaluation, can you provide the features.tsv file or the detector used for object detection in your paper?
Many thanks!