awesome-vision-language-pretraining-papers's Introduction

Recent Advances in Vision and Language PreTrained Models (VL-PTMs)

Maintained by WANG Yue ([email protected]). Last update on 2021/02/26.

Image-based VL-PTMs

Representation Learning

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]

LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]

VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020 [code]

VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019/08, ACL 2020 [code]

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020

Unified Vision-Language Pre-Training for Image Captioning and VQA, AAAI 2020, [code], (VLP)

UNITER: Learning Universal Image-text Representations, ECCV 2020, [code]

Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks, arXiv 2019/12

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining, arXiv 2020/03

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, arXiv 2020/04, ECCV 2020

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020/04

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph, arXiv 2020/06

DeVLBert: Learning Deconfounded Visio-Linguistic Representations, ACM MM 2020, [code]

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels, ICLR 2021 submission

CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations, arXiv 2020/10

Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs, arXiv 2020/11

LAMP: Label Augmented Multimodal Pretraining, arXiv 2020/12

Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, AAAI 2021

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, arXiv 2021/02

Task-specific

VCR: Fusion of Detected Objects in Text for Visual Question Answering, EMNLP 2019, [code], (B2T2)

TextVQA: Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA, CVPR 2020, [code], (M4C)

VisDial: VD-BERT: A Unified Vision and Dialog Transformer with BERT, EMNLP 2020 [code], (VD-BERT)

VisDial: Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline, ECCV 2020 [code], (VisDial-BERT)

VLN: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, CVPR 2020, [code], (PREVALENT)

Text-image retrieval: ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data, arXiv 2020/01

Image captioning: XGPT: Cross-modal Generative Pre-Training for Image Captioning, arXiv 2020/03

Visual Question Generation: BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations, arXiv 2020/02

Text-image retrieval: Cross-Probe BERT for Efficient and Effective Cross-Modal Search, ICLR 2021 submission

Chart VQA: STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering, EMNLP 2020

Other Analysis

Multi-task Learning, 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020, [code]

Multi-task Learning, Unifying Vision-and-Language Tasks via Text Generation, arXiv 2021/02

Social Bias in VL Embedding, Measuring Social Biases in Grounded Vision and Language Embeddings, arXiv 2020/02, [code]

In-depth Analysis, Are we pretraining it right? Digging deeper into visio-linguistic pretraining

In-depth Analysis, Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models, ECCV 2020 Spotlight

Adversarial Training, Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 Spotlight

Adaptive Analysis, Adaptive Transformers for Learning Multimodal Representations, ACL SRW 2020

Neural Architecture Search, Deep Multimodal Neural Architecture Search, arXiv 2020/04

Dataset perspective, Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, arXiv 2021/02

Video-based VL-PTMs

VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019

Learning Video Representations Using Contrastive Bidirectional Transformers, arXiv 2019/06, (CBT)

M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019/08

BERT for Large-scale Video Segment Classification with Test-time Augmentation, ICCV 2019 YouTube8M workshop, [code]

Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog, AAAI 2020 DSTC8 workshop

Learning Spatiotemporal Features via Video and Text Pair Discrimination, arXiv 2020/01, (CPD), [code]

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020/02

ActBERT: Learning Global-Local Video-Text Representations, CVPR 2020

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, EMNLP 2020

Video-Grounded Dialogues with Pretrained Generation Language Models, ACL 2020

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training, arXiv 2020/07

Multimodal Pretraining for Dense Video Captioning, arXiv 2020/11

Parameter Efficient Multimodal Transformers for Video Representation Learning, arXiv 2020/12

Speech-based VL-PTMs

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models, arXiv 2019/06

Understanding Semantics from Speech Through Pre-training, arXiv 2019/09

SpeechBERT: Cross-Modal Pre-trained Language Model for End-to-end Spoken Question Answering, arXiv 2019/10

vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations, arXiv 2019/10

Effectiveness of self-supervised pre-training for speech recognition, arXiv 2019/11

Other Transformer-based multimodal networks

Multi-Modality Cross Attention Network for Image and Sentence Matching, CVPR 2020

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, ACL 2020

History for Visual Dialog: Do we really need it?, ACL 2020

Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020

Other Resources
