

中文 | English




VLE: Vision-Language Encoder

Multimodal pre-trained models are trained on massive multimodal data, and they can utilize information from different modalities and perform various cross-modal tasks.

In this repository, we introduce VLE (Vision-Language Encoder), an image-text multimodal understanding model built on pre-trained text and image encoders. It can be used for multimodal discriminative tasks such as visual question answering and image-text retrieval. In particular, on the visual commonsense reasoning (VCR) task, which requires high-level language understanding and reasoning skills, VLE achieves the best performance among public methods.

Recently, LLMs (Large Language Models) have achieved great success and have been used for a wide range of text tasks, including translation, question answering, text summarization, etc. While LLMs are unimodal, their abilities can be leveraged for multimodal understanding tasks. We propose a VQA+LLM pipeline that integrates multimodal models with LLMs for the visual question answering task. It helps the VQA model generate more accurate and fluent answers.

We open-source VLE-related resources to promote academic research and better serve our community.

Try our VLE-based VQA Demo at 🤗Space 👇👇👇

VLE-based VQA Demo

Chinese LERT | Chinese and English PERT | Chinese MacBERT | Chinese MiniRBT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge distillation tool TextBrewer | Model pruning tool TextPruner

More resources released by HFL: https://github.com/iflytek/HFL-Anthology

Table of Contents

  • Introduction: Introduction to VLE
  • Downloads: Download links for VLE
  • Comparison: Comparison of VLE with other models
  • VQA with LLM: Visual question answering with LLM
  • Usage: How to load VLE for different tasks

Introduction

Structure

The structure of VLE is similar to that of METER: it consists of two unimodal encoders, one for text and one for images, followed by a cross-modal fusion module. However, there are several structural differences between VLE and METER (a short code sketch for inspecting the loaded model follows this list):

  • VLE uses DeBERTa-v3 as the text encoder, which is stronger than RoBERTa-base used in METER.
  • In the large version of VLE (VLE-large), the hidden size of the cross-modal co-attention fusion module is scaled up to 1024 to increase capacity.
  • During fine-tuning, VLE introduces additional token_type_embeddings.
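
As a quick way to see these components in code, the sketch below loads a checkpoint and lists its top-level submodules. It assumes the models.VLE package from this repository (see Usage below) and that VLEModel behaves like a standard PyTorch nn.Module; the exact submodule names are an implementation detail.

from models.VLE import VLEModel

# Load a pre-trained checkpoint (see Downloads for available names).
model = VLEModel.from_pretrained("hfl/vle-base")

# List the top-level components (text encoder, image encoder, fusion layers)
# together with their parameter counts.
for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {module.__class__.__name__} ({n_params / 1e6:.1f}M parameters)")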

Pre-training

VLE is pre-trained with image-caption pairs. There are four objectives applied during the pre-training stage:

  • MLM (Masked Language Modeling): Given an image-caption pair, we randomly mask some input text tokens, and the model is trained to reconstruct the original tokens.
  • ITM (Image-Text Matching): Given a batch of matched or mismatched image-caption pairs, the model needs to identify which images and captions correspond to each other.
  • MPC (Masked Patch-box Classification): Given an image-caption pair with some patches masked, the model needs to predict the classes of the objects in the masked patches.
  • PBC (Patch-box Classification): Given an image-caption pair, the model needs to identify which patches are related to the caption.

VLE models are pre-trained on 14M public English image-caption pairs for 25k steps with a batch size of 2048.
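
As an illustration of the ITM objective above, mismatched pairs can be obtained by shuffling captions within a batch. The sketch below is a generic illustration under the assumption of pooled (batch, dim) text and image embeddings; it is not the repository's actual pre-training code.

import torch

def build_itm_batch(image_embeds, text_embeds):
    # image_embeds, text_embeds: (batch, dim) pooled embeddings (an assumption).
    # Label 1 marks a matched image-caption pair, label 0 a mismatched one.
    batch_size = image_embeds.size(0)
    labels = torch.randint(0, 2, (batch_size,))
    # Roll the captions by a random non-zero offset to create mismatched pairs.
    offset = int(torch.randint(1, batch_size, (1,)))
    mismatched = text_embeds.roll(shifts=offset, dims=0)
    paired_text = torch.where(labels.unsqueeze(-1).bool(), text_embeds, mismatched)
    return image_embeds, paired_text, labels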

The following figure illustrates the VLE structure and the pre-training objectives (for simplicity, we omit the PBC objective in the figure).

VLE structure and pre-training tasks

Adaptation for downstream tasks

Visual Question Answering (VQA)

  • We follow standard practice and train the models on VQA with both the training and validation data, then evaluate on the test-dev set. The pooler output from the last layer of the fusion module is used for classification (a sketch of such a head is shown below).
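
A minimal sketch of such a classification head is given below. The pooler output is assumed to be a single vector per example, and the hidden size, intermediate width, and answer-vocabulary size are placeholders rather than the repository's actual configuration.

import torch.nn as nn

class VQAHeadSketch(nn.Module):
    # Illustrative head on top of the fusion module's pooler output.
    def __init__(self, hidden_size=768, num_answers=3129):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 2),
            nn.LayerNorm(hidden_size * 2),
            nn.GELU(),
            nn.Linear(hidden_size * 2, num_answers),
        )

    def forward(self, pooler_output):
        # pooler_output: (batch, hidden_size) -> logits over the answer labels
        return self.classifier(pooler_output)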

Visual Commonsense Reasoning (VCR)

  • We format VCR as a multiple-choice task similar to RACE. For each object in the image in each example, we append the average of the embeddings of the patches that cover the object to the image feature embeddings before the fusion module (see the sketch below). We also assign token_type_ids to the objects in the image and the text to improve the alignment between modalities.
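
The patch-averaging step can be pictured with the sketch below. The box format and the way patches are matched to an object box are assumptions made only for illustration; they are not the repository's actual implementation.

import torch

def average_object_patches(patch_embeds, patch_boxes, object_box):
    # patch_embeds: (num_patches, dim) patch embeddings from the image encoder.
    # patch_boxes: (num_patches, 4) and object_box: (4,), both as (x1, y1, x2, y2).
    x1 = torch.maximum(patch_boxes[:, 0], object_box[0])
    y1 = torch.maximum(patch_boxes[:, 1], object_box[1])
    x2 = torch.minimum(patch_boxes[:, 2], object_box[2])
    y2 = torch.minimum(patch_boxes[:, 3], object_box[3])
    covers = (x2 > x1) & (y2 > y1)           # patches that overlap the object
    return patch_embeds[covers].mean(dim=0)  # appended before the fusion module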

Downloads

The model weights are in PyTorch format and can be downloaded from the 🤗 transformers model hub. You can either download the weights and configurations manually or initialize a VLE model with the from_pretrained(model_name) method in your code. See Usage for details.
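
For example, the following downloads and caches the base pre-trained weights on first use (this mirrors the Usage section below):

from models.VLE import VLEModel

model = VLEModel.from_pretrained("hfl/vle-base")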

Pre-trained Checkpoints

Model Text Encoder Image Encoder # Params* MODEL_NAME Link
VLE-base DeBERTa-v3-base CLIP-ViT-base-patch16 378M hfl/vle-base link
VLE-large DeBERTa-v3-large CLIP-ViT-large-patch14 930M hfl/vle-large link

* : We exclude task heads when counting the number of parameters.

Fine-tuned Checkpoints

Model Text Encoder Image Encoder MODEL_NAME Link
VLE-base-for-VQA DeBERTa-v3-base CLIP-ViT-base-patch16 hfl/vle-base-for-vqa link
VLE-large-for-VQA DeBERTa-v3-large CLIP-ViT-large-patch14 hfl/vle-large-for-vqa link
VLE-base-for-VCR-q2a DeBERTa-v3-base CLIP-ViT-base-patch16 hfl/vle-base-for-vcr-q2a link
VLE-large-for-VCR-q2a DeBERTa-v3-large CLIP-ViT-large-patch14 hfl/vle-large-for-vcr-q2a link
VLE-base-for-VCR-qa2r DeBERTa-v3-base CLIP-ViT-base-patch16 hfl/vle-base-for-vcr-qa2r link
VLE-large-for-VCR-qa2r DeBERTa-v3-large CLIP-ViT-large-patch14 hfl/vle-large-for-vcr-qa2r link

Comparison

In the following table, we compare the performance of VLE with METER and other multimodal models. The VQA results are on the test-dev set, and the VCR results are on the dev set.

Model VQA VCR (QA2R) VCR (Q2A) #Params #PT data*
CoCa 82.3 - - 2.1 B unknown
BeiT-3 84.2 - - 1.9 B 21M(I-T) + 14M(I) + 160G(T)
OFA 82.0 - - 930M 20M(I-T) + 39M(I) + 140G(T)
BLIP 78.3 - - 385M ~130M(I-T)
METER-base 77.7 (76.8†‡) 79.8§ 77.6§ 345M 9M(I-T)
METER-Huge 80.3 - - 878M 20M(I-T)
VLE-base 77.6 83.7§ 79.9§ 378M 15M(I-T)
VLE-large 79.3 87.5§ 84.3§ 930M 15M(I-T)

† : Result from our reimplementation.

‡ : Fine-tuning hyperparameters: lr=7e-6, batch_size={256, 512}, num_epochs=10

§ : Fine-tuning hyperparameters: lr=1e-5, batch_size=128, num_epochs=5

* : Pre-training data. I-T: Image-caption pairs. I: Images. T: Text.

From the above results, we can see that:

  • VLE is efficient in pre-training. Compared to models of similar size, VLE achieves comparable or even better performance on VQA with much less pre-training data.

  • VLE shows stronger reasoning ability. In particular, it significantly outperforms METER on Visual Commonsense Reasoning (VCR), which requires higher-level language understanding and reasoning skills than VQA.

VQA with LLM

Generating Accurate and Fluent VQA Answers

LLMs have achieved great success on a wide range of text tasks, and their abilities can also be leveraged for multimodal understanding tasks. Specifically, we present a VQA+LLM pipeline that integrates multimodal models with LLMs for the visual question answering task, which helps the VQA model generate more accurate and fluent answers.

The workflows are shown in the figure below.

Workflows

(a) VQA: This is the standard way to perform the VQA task with a discriminative model. The question and the image are fed into the multimodal model, and the model is trained to predict the correct answer labels.

(b) VQA + LLM: The captioning model generates a caption of the image. The caption, question, and answer candidates predicted by the VQA model are concatenated and fed to the LLM. The LLM is asked to give the most reasonable answer.
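
A minimal sketch of workflow (b) is shown below. The caption_model and llm callables are hypothetical, the VQA pipeline follows the interface in the Usage section, and the prompt wording is illustrative rather than the one used in the demo.

def vqa_with_llm(image, question, caption_model, vqa_pipeline, llm):
    # caption_model and llm are hypothetical callables; vqa_pipeline is a
    # VLEForVQAPipeline as in the Usage section.
    caption = caption_model(image)
    candidates = vqa_pipeline(image=image, question=question, top_k=5)
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        f"Candidate answers: {candidates}\n"
        "Based on the description and the candidates, give the most reasonable answer."
    )
    return llm(prompt)  # a more accurate and fluent final answer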

We find that VQA+LLM can not only make more accurate predictions, but also generate more fluent and readable predictions. We list some examples:

(Example images: "men and truck", "hatch")

The demo is available at: https://huggingface.co/spaces/hfl/VQA_VLE_LLM

Usage

Requirements

  • PIL
  • Transformers >= 4.25
  • PyTorch Lightning (only required for running fine-tuning scripts)

The model classes and utilities are defined in the *.py files under models/VLE. To use VLE in your own code, simply copy the models directory into your project.

To run the following demo code, clone the repository with git and cd into it, so that you are in the repository's root directory.

Load the VLEModel

from models.VLE import VLEModel, VLEProcessor
from PIL import Image
import torch

model_name = "hfl/vle-large"
images = [Image.open('pics/dogs.png')]
text = ["There are dogs on the grass."]

# Load the pre-trained model and its processor
model = VLEModel.from_pretrained(model_name)
vle_processor = VLEProcessor.from_pretrained(model_name)

# Preprocess the image-text pair into model inputs
multimodal_inputs = vle_processor(text=text, images=images, return_tensors='pt', padding=True)

# Forward pass
vle_output = model(**multimodal_inputs)

Inference

Visual Question Answering (VQA)

from models.VLE import VLEForVQA, VLEProcessor, VLEForVQAPipeline
from PIL import Image

model_name="hfl/vle-base-for-vqa"
text= "What is the color of the floor?"
image = Image.open("pics/door.png")

model = VLEForVQA.from_pretrained(model_name)
vle_processor = VLEProcessor.from_pretrained(model_name)
vqa_pipeline = VLEForVQAPipeline(model=model, device='cpu', vle_processor=vle_processor)

vqa_answers = vqa_pipeline(image=image, question=text, top_k=5)
print(f"Question: {text}. Answers: {vqa_answers}")

Image-Text Matching

from models.VLE import VLEForITM, VLEProcessor, VLEForITMPipeline
from PIL import Image

model_dir = 'hfl/vle-base'
itm_text = ["a photo of a cat.", "a photo of dogs."]
itm_images = Image.open("pics/dogs.png")

print("Init ITM model")
model = VLEForITM.from_pretrained(model_dir)
vle_processor = VLEProcessor.from_pretrained(model_dir)

print("init ITM pipeline")
itm_pipeline = VLEForITMPipeline(model=model, device='cpu', vle_processor=vle_processor)
itm_pred = itm_pipeline([{"image": itm_images, "text": itm_text[0]}, 
                         {"image": itm_images, "text": itm_text[1]}])

for t, pred in zip(itm_text,itm_pred):
    print(t,pred)

Patch Box Classification

from models.VLE import VLEForPBC, VLEProcessor, VLEForPBCPipeline
from PIL import Image

model_dir = 'hfl/vle-base'
pbc_text = "pink tongues"
pbc_image = Image.open("pics/dogs.png")

print("Init PBC model")
model = VLEForPBC.from_pretrained(model_dir)
vle_processor = VLEProcessor.from_pretrained(model_dir)

print("init PBC pipeline")
pbc_pipeline = VLEForPBCPipeline(model=model, device='cpu', vle_processor=vle_processor)
pbc_pred = pbc_pipeline(image=pbc_image,text=pbc_text)
print(pbc_text)
pbc_pred['image'].save('pics/pink_tongues.png')

Visual Commonsense Reasoning (VCR)

Please follow the instructions in examples/VCR/README.md

Fine-tuning

Fine-tuning on VQA

Please follow the instructions in examples/VQA/README.md

Follow us

You are welcome to follow the official WeChat account of HFL to keep up with the latest technical developments.

(QR code of the official WeChat account: qrcode.png)

Disclaimer

This repository's resources are solely intended for academic purposes, and we assume no responsibility for any unforeseen damages or losses that may result from their use.

This is not an official product by iFLYTEK Co., Ltd.

Issues

If you have questions, please submit them in a GitHub Issue.

  • Before submitting an issue, please check whether the FAQ answers your question, and search existing issues to see whether your problem has already been addressed.
  • Duplicate and unrelated issues will be handled by the stale bot (see "stale" on the GitHub Marketplace).
  • We will try our best to answer your questions, but there is no guarantee that every question will be answered.
  • Please ask questions politely and help build a friendly discussion community.

Contributors

airaria, gogojoestar, ymcui


vle's Issues

No gradients during backpropagation

Hello, when backpropagation reaches cross_modal_image_layers and cross_modal_text_layers, why are there no gradients? When setting up gradients in BERT's modelling_bert, it shows there are none. How can I obtain the gradients of each layer, for example the feature map and gradient of the last layer of cross_modal_image_layers? Thanks.

not all images have caption annotations

When I run the script "write_vqa.py" with the public datasets recommended by the author in "README.md", the console outputs the messages shown below:
./VLE/VLE-main/examples/VQA/write_vqa.py
100%|██████████| 443757/443757 [00:00<00:00, 851830.30it/s]
100%|██████████| 214354/214354 [00:00<00:00, 431325.80it/s]
100%|██████████| 447793/447793 [00:00<00:00, 809395.66it/s]
100%|██████████| 107394/107394 [00:00<00:00, 3530035.22it/s]
100%|██████████| 443757/443757 [00:00<00:00, 7974002.36it/s]
100%|██████████| 214354/214354 [00:00<00:00, 7733833.17it/s]
100%|██████████| 658111/658111 [00:06<00:00, 100919.83it/s]
100%|██████████| 443757/443757 [00:01<00:00, 359162.75it/s]
100%|██████████| 214354/214354 [00:00<00:00, 237090.17it/s]
not all images have caption annotations
82783 82774 82774
100%|██████████| 82774/82774 [56:48<00:00, 24.28it/s]
not all images have caption annotations
40504 40503 40503
100%|██████████| 40503/40503 [26:37<00:00, 25.35it/s]
0%| | 0/81434 [00:00<?, ?it/s]all images have caption annotations
81434 81434 81434
100%|██████████| 81434/81434 [49:52<00:00, 27.21it/s]
not all images have caption annotations
81434 36807 36807
100%|██████████| 36807/36807 [19:12<00:00, 31.93it/s]

Process finished with exit code 0

Some lines say "not all images have caption annotations". Should I be concerned about this message? Will it affect the subsequent training steps?
Please kindly answer my doubts.
Thanks.

KeyError: 'pi'

System environment:
Ubuntu20.04
torch 2.0.0+cu118
torchvision 0.15.1+cu118

Commandline:
#run_vqav2_ft.py --train_config_file=vqa_train_config.json

The error description is shown below:
/home/steven/anaconda3/envs/nlp/bin/python /home/steven/workstore/nlp/VLE-main/run_vqav2_ft.py --train_config_file=vqa_train_config.json
/home/steven/workstore/nlp/VLE-main/run_vqav2_ft.py:76: SyntaxWarning: "is" with a literal. Did you mean "=="?
max_epochs=_config["max_epoch"] if max_steps is -1 else 1000,
2023-09-20 13:52:41.071596: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Global seed set to 0
Some weights of VLEForVQA were not initialized from the model checkpoint at hfl/vle-base and are newly initialized: ['vqa_classifier.1.bias', 'vqa_classifier.3.bias', 'vqa_classifier.3.weight', 'vqa_classifier.0.bias', 'vqa_classifier.1.weight', 'vqa_classifier.0.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
File "/home/steven/workstore/nlp/VLE-main/run_vqav2_ft.py", line 107, in
main(train_config)
File "/home/steven/workstore/nlp/VLE-main/run_vqav2_ft.py", line 25, in main
model = VLEForVQA_PL(_config)
File "/home/steven/workstore/nlp/VLE-main/vqav2_train_module.py", line 73, in init
new_state_dict = extend_position_embedding(self.model.state_dict(), patch_size, config["image_size"])
File "/home/steven/workstore/nlp/VLE-main/models/VLE/modeling_vle.py", line 124, in extend_position_embedding
state_dict[keys['pi'][0]] = torch.arange(grid_after*grid_after + 1).unsqueeze(0)
KeyError: 'pi'

Process finished with exit code 1

When I debugged step by step, I found that the key "vision_model.embeddings.position_ids" cannot be found in the parameter list of the 'vle-base' model.
Has anybody encountered the same problem?
Please kindly help to solve this problem!
Thanks.
