zuiwufenghua / awesome-vision-language-pretraining-papers

This project forked from yuewang-cuhk/awesome-vision-language-pretraining-papers

Recent Advances in Vision and Language PreTrained Models (VL-PTMs)

Maintained by WANG Yue. Last updated on 2021/02/26.

Table of Contents

Image-based VL-PTMs
Video-based VL-PTMs
Speech-based VL-PTMs
Other Transformer-based multimodal networks
Other Resources

Image-based VL-PTMs

Representation Learning

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]

LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]

VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020 [code]

VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019/08, ACL 2020 [code]

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020

Unified Vision-Language Pre-Training for Image Captioning and VQA, AAAI 2020, [code], (VLP)

UNITER: Learning Universal Image-text Representations, ECCV 2020, [code]

Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks, arXiv 2019/12

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining, arXiv 2020/03

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, arXiv 2020/04, ECCV 2020

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020/04

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph, arXiv 2020/06

DeVLBert: Learning Deconfounded Visio-Linguistic Representations, ACM MM 2020, [code]

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels, ICLR 2021 submission

CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations, arXiv 2020/10

Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs, arXiv 2020/11

LAMP: Label Augmented Multimodal Pretraining, arXiv 2020/12

Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, AAAI 2021

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, arXiv 2021

Task-specific

VCR: Fusion of Detected Objects in Text for Visual Question Answering, EMNLP 2019, [code], (B2T2)

TextVQA: Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA, CVPR 2020, [code], (M4C)

VisDial: VD-BERT: A Unified Vision and Dialog Transformer with BERT, EMNLP 2020 [code], (VD-BERT)

VisDial: Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline, ECCV 2020 [code], (VisDial-BERT)

VLN: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, CVPR 2020, [code], (PREVALENT)

Text-image retrieval: ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data, arXiv 2020/01

Image captioning: XGPT: Cross-modal Generative Pre-Training for Image Captioning, arXiv 2020/03

Visual Question Generation: BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations, arXiv 2020/02

Text-image retrieval: Cross-Probe BERT for Efficient and Effective Cross-Modal Search, ICLR 2021 submission

Chart VQA: STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering, EMNLP 2020

Other Analysis

Multi-task Learning, 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020, [code]

Multi-task Learning, Unifying Vision-and-Language Tasks via Text Generation, arXiv 2021/02

Social Bias in VL Embedding, Measuring Social Biases in Grounded Vision and Language Embeddings, arXiv 2020/02, [code]

In-depth Analysis, Are we pretraining it right? Digging deeper into visio-linguistic pretraining

In-depth Analysis, Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models, ECCV 2020 Spotlight

Adversarial Training, Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 Spotlight

Adaptive Analysis, Adaptive Transformers for Learning Multimodal Representations, ACL SRW 2020

Neural Architecture Search, Deep Multimodal Neural Architecture Search, arXiv 2020/04

Dataset perspective, Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, arXiv 2021/02

Video-based VL-PTMs

VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019

Learning Video Representations Using Contrastive Bidirectional Transformers, arXiv 2019/06, (CBT)

M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019/08

BERT for Large-scale Video Segment Classification with Test-time Augmentation, ICCV 2019 YouTube8M workshop, [code]

Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog, AAAI 2020 DSTC8 workshop

Learning Spatiotemporal Features via Video and Text Pair Discrimination, arXiv 2020/01, (CPD), [code]

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020/02

ActBERT: Learning Global-Local Video-Text Representations, CVPR 2020

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, EMNLP 2020

Video-Grounded Dialogues with Pretrained Generation Language Models, ACL 2020

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training, arXiv 2020/07

Multimodal Pretraining for Dense Video Captioning, arXiv 2020/11

Parameter Efficient Multimodal Transformers for Video Representation Learning, arXiv 2020/12

Speech-based VL-PTMs

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models, arXiv 2019/06

Understanding Semantics from Speech Through Pre-training, arXiv 2019/09

SpeechBERT: Cross-Modal Pre-trained Language Model for End-to-end Spoken Question Answering, arXiv 2019/10

vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations, arXiv 2019/10

Effectiveness of self-supervised pre-training for speech recognition, arXiv 2019/11

Other Transformer-based multimodal networks

Multi-Modality Cross Attention Network for Image and Sentence Matching, CVPR 2020

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, ACL 2020

History for Visual Dialog: Do we really need it?, ACL 2020

Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020

Other Resources

Two recent surveys on pretrained language models
- Pre-trained Models for Natural Language Processing: A Survey, arXiv 2020/03
- A Survey on Contextual Embeddings, arXiv 2020/03
Other surveys about multimodal research
- Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods, arXiv 2019
- Deep Multimodal Representation Learning: A Survey, arXiv 2019
- Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2018
- A Comprehensive Survey of Deep Learning for Image Captioning, ACM Computing Surveys 2018
Other repositories of relevant reading list
