om-ai-lab / vl-checklist
Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations.
Hey, thanks for releasing your great work. Could you please add a license to your dataset so that we can cite and use it fairly? Thanks!
Thanks for this wonderful work; it is very inspiring. I am confused about how to get the heat-map shown in your paper. Looking forward to your reply at your convenience.
Hi,
Thanks for opening the source code.
I'm trying to reproduce the CLIP scores reported in the paper but have not been able to.
I use the sample config file, changing MODEL_NAME to CLIP (ViT-L/14).
I evaluate all the datasets in the corpus and then average the final accuracy.
I get the following scores, which are quite different from the paper:
Object: 0.8205209550766983
Attribute: 0.6806109948697314
Relation: 0.67975
How can I reproduce the scores in the paper?
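For reference, this is roughly how I averaged the numbers; a minimal sketch assuming each per-dataset run leaves a JSON result file with a "total_acc" field (the actual output layout and key names of vl-checklist may differ, and the paths here are hypothetical):
import glob, json

def average_accuracy(result_dir):
    # macro-average the per-dataset accuracies found in a results folder
    accs = []
    for path in glob.glob(f"{result_dir}/*.json"):
        with open(path) as f:
            accs.append(json.load(f)["total_acc"])  # assumed key name
    return sum(accs) / len(accs)

print("Object:", average_accuracy("output/clip_vit_l14/object"))  # hypothetical path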
In the attention.py demo, the get_attention_by_gradcam method takes both image_input and text_input, and I want to know why text_input is the one used here. The code is shown below.
def get_attention_by_gradcam(self, model, tokenizer, image_path, image_input, text_input, attr_name, target_layer):
    encoder_name = getattr(model, attr_name, None)
    encoder_name.encoder.layer[target_layer].crossattention.self.save_attention = True
    output = model(image_input, text_input)
    loss = output[:, 1].sum()
    model.zero_grad()
    loss.backward()
    image_size = 256
    temp = int(np.sqrt(image_size))
    # the mask zeroes out padding tokens so they are not counted in cams and grads;
    # because of the text preprocessing of ALBEF and TCL, the mask has no effect here
    mask = text_input.attention_mask.view(text_input.attention_mask.size(0), 1, -1, 1, 1)
    grads = encoder_name.encoder.layer[target_layer].crossattention.self.get_attn_gradients()
    cams = encoder_name.encoder.layer[target_layer].crossattention.self.get_attention_map()
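For reference (this also relates to the earlier heat-map question), a minimal sketch of how a heat-map could be derived from the grads and cams above, assuming both have shape [batch, heads, text_tokens, image_patches + 1]; it follows the usual ALBEF-style Grad-CAM recipe and is not taken from this repo.
import torch

def build_gradcam(cams, grads, mask, temp):
    # hypothetical helper: combine cross-attention maps and their gradients into a heat-map
    # drop the [CLS] image token, reshape each text token's attention over patches
    # into a temp x temp grid, and zero out padded text tokens with the mask
    cams = cams[:, :, :, 1:].reshape(cams.size(0), cams.size(1), -1, temp, temp) * mask
    # keep only positive gradients, as in Grad-CAM
    grads = grads[:, :, :, 1:].clamp(min=0).reshape(grads.size(0), grads.size(1), -1, temp, temp) * mask
    # weight the attention maps by their gradients and average over heads
    return (cams * grads).mean(dim=1)  # [batch, text_tokens, temp, temp]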
A related instance of the same question: in the 'albef' attention demo, attr_name is 'text_encoder'. The code is shown below.
def getAttMap(self, image_path, text):
    if self.model_name.lower() == 'albef':
        engine = ALBEF('ALBEF_4M.pth')
        model, tokenizer = engine.load_model(engine.model_id)
        image_input = engine.load_data(src_type='local', data=[image_path])[0]
        text_input = tokenizer(engine.pre_caption(text), return_tensors="pt")
        self.get_attention_by_gradcam(model, tokenizer, image_path, image_input, text_input,
                                      attr_name='text_encoder', target_layer=8)
Hi,
Thanks for open-sourcing your code. I am trying to reproduce the results for ALBEF in your paper, but without success. I was going through your code and noticed that the ITM logits/probabilities are used differently in the code than in the paper. The paper says, "If the model score on the original text description is higher than the score on the generated negative samples, we regard it as positive output." However, in the code only the ITM logit corresponding to "matching", z[1], is used; the code never compares the scores between the positive and negative text as described in the paper. Can you please clarify?
Thanks,
Ajinkya
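To make the distinction concrete, a small sketch of the two scoring rules being contrasted; itm_logits_pos and itm_logits_neg are illustrative names for the 2-class ITM outputs of an ALBEF-style model on the original and a generated negative caption, not identifiers from this repo.
import torch

def match_score(itm_logits):
    # rule apparently used in the code: take the probability of the
    # "match" class (index 1) on its own
    return torch.softmax(itm_logits, dim=-1)[:, 1]

def is_positive(itm_logits_pos, itm_logits_neg):
    # rule described in the paper: the sample counts as a positive output
    # only if the original caption scores higher than the negative one
    return match_score(itm_logits_pos) > match_score(itm_logits_neg)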
Hi, it is great work. I want to know where the code for generating the negative texts is.
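For illustration only, a toy sketch of how such a negative text could be built by swapping a single word; this is not the repo's generation code (which is what this issue asks to locate), and the replacement vocabulary is made up.
import random

def make_negative(caption, target_word, replacement_vocab):
    # hypothetical example: replace one word with a different word from a small vocabulary
    candidates = [w for w in replacement_vocab if w != target_word]
    return caption.replace(target_word, random.choice(candidates))

print(make_negative("a red car on the street", "red", ["blue", "green", "yellow"]))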
I tried to follow the instructions in the HAKE codebase, but the PIC dataset (and its website) is gone.
I think Attribute should be fixed to Relation.
Hi!
Thank you for publishing this great work. I was able to run the test with your ViLT model; is it possible to run tests with other models such as TCL and the rest? Their checkpoints are different, so it's not clear to me whether the code should support them or not.
Thank you and have a great week,
Amit
Hi, it is great work! Since region-based methods like Oscar use extracted features for evaluation, can you provide the features.tsv file or the detector used for object detection in your paper?
Many thanks!