Transformer-in-Vision

Recent Transformer-based computer vision (CV) and related works. Comments and contributions are welcome!

This list is kept up to date.

Resources

Survey

  • (arXiv 2022.03) A Roadmap for Big Model, [Paper]

  • (arXiv 2022.03) Transformers Meet Visual Learning Understanding: A Comprehensive Review, [Paper](https://arxiv.org/pdf/2203.12944.pdf)

  • (arXiv 2022.03) Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work, [Paper], [Project]

  • (arXiv 2022.02) A Survey of Vision-Language Pre-Trained Models, [Paper]

  • (arXiv 2022.02) VLP: A Survey on Vision-Language Pre-training, [Paper]

  • (arXiv 2022.02) Transformer for Graphs: An Overview from Architecture Perspective, [Paper]

  • (arXiv 2022.01) Video Transformers: A Survey, [Paper]

  • (arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP MLP, [Paper]

  • (arXiv 2021.11) A Survey of Visual Transformers, [Paper]

  • (arXiv 2021.09) Survey: Transformer based Video-Language Pre-training, [Paper]

  • (arXiv 2021.06) A Survey of Transformers, [Paper]

  • (arXiv 2021.06) Attention mechanisms and deep learning for machine vision: A survey of the state of the art, [Paper]

  • (arXiv 2021.06) Pre-Trained Models: Past, Present and Future, [Paper]

  • (arXiv 2021.05) Can Attention Enable MLPs To Catch Up With CNNs? [Paper]

  • (arXiv 2021.03) A Practical Survey on Faster and Lighter Transformers, [Paper]

  • (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]

  • (arXiv 2021.01) A Survey on Visual Transformer, [Paper]

  • (arXiv 2020.09) Efficient Transformers: A Survey, [Paper]

  • (arXiv 2020.01) Transformers in Vision: A Survey, [Paper]

Recent Papers

2022.04

  • (arXiv 2022.04) RELVIT: CONCEPT-GUIDED VISION TRANSFORMER FOR VISUAL RELATIONAL REASONING, [Paper]

  • (arXiv 2022.04) Unsupervised Human Action Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints Invariance, [Paper], [Code]

  • (arXiv 2022.04) Learning Future Object Prediction with a Spatiotemporal Detection Transformer, [Paper]

  • (arXiv 2022.04) R^2-Trans: Fine-Grained Visual Categorization with Redundancy Reduction, [Paper], [Code]

  • (arXiv 2022.04) A New Dataset and Transformer for Stereoscopic Video Super-Resolution, [Paper], [Code]

  • (arXiv 2022.04) Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization, [Paper]

  • (arXiv 2022.04) Multi-Scale Features and Parallel Transformers Based Image Quality Assessment, [Paper], [Code]

  • (arXiv 2022.04) BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training, [Paper]

  • (arXiv 2022.04) Human-Object Interaction Detection via Disentangled Transformer, [Paper]

  • (arXiv 2022.04) ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models, [Paper]

  • (arXiv 2022.04) Interactiveness Field in Human-Object Interactions, [Paper], [Code]

  • (arXiv 2022.04) DeiT III: Revenge of the ViT, [Paper]

  • (arXiv 2022.04) Residual Swin Transformer Channel Attention Network for Image Demosaicing, [Paper]

  • (arXiv 2022.04) Neighborhood Attention Transformer, [Paper], [Code]

  • (arXiv 2022.04) MiniViT: Compressing Vision Transformers with Weight Multiplexing, [Paper], [Code]

  • (arXiv 2022.04) ViTOL: Vision Transformer for Weakly Supervised Object Localization, [Paper], [Code]

  • (arXiv 2022.04) What Matters in Language Conditioned Robotic Imitation Learning, [Paper], [Code]

  • (arXiv 2022.04) Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes, [Paper]

  • (arXiv 2022.04) ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension, [Paper]

  • (arXiv 2022.04) Are Multimodal Transformers Robust to Missing Modality? [Paper]

  • (arXiv 2022.04) TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.04) X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks, [Paper]

  • (arXiv 2022.04) Event Transformer, [Paper]

  • (arXiv 2022.04) Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels, [Paper]

  • (arXiv 2022.04) ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation, [Paper], [Code]

  • (arXiv 2022.04) Multimodal Transformer for Nursing Activity Recognition, [Paper], [Code]

  • (arXiv 2022.04) Robust Cross-Modal Representation Learning with Progressive Self-Distillation, [Paper]

  • (arXiv 2022.04) Stripformer: Strip Transformer for Fast Image Deblurring, [Paper]

  • (arXiv 2022.04) No Token Left Behind: Explainability-Aided Image Classification and Generation, [Paper]

  • (arXiv 2022.04) Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition, [Paper], [Code]

  • (arXiv 2022.04) Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation, [Paper], [Code]

  • (arXiv 2022.04) DILEMMA: Self-Supervised Shape and Texture Learning with Transformers, [Paper]

  • (arXiv 2022.04) Learning Trajectory-Aware Transformer for Video Super-Resolution, [Paper], [Code]

  • (arXiv 2022.04) Learning to Induce Causal Structure, [Paper]

  • (arXiv 2022.04) Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection, [Paper], [Code]

  • (arXiv 2022.04) Category-Aware Transformer Network for Better Human-Object Interaction Detection, [Paper]

  • (arXiv 2022.04) Does Robustness on ImageNet Transfer to Downstream Tasks? [Paper]

  • (arXiv 2022.04) POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition, [Paper], [Code]

  • (arXiv 2022.04) Vision Transformers for Single Image Dehazing, [Paper], [Code]

  • (arXiv 2022.04) Underwater Image Enhancement Using Pre-trained Transformer, [Paper]

  • (arXiv 2022.04) Event Transformer. A sparse-aware solution for efficient event data processing, [Paper], [Code]

  • (arXiv 2022.04) PSTR: End-to-End One-Step Person Search With Transformers, [Paper], [Code]

  • (arXiv 2022.04) Adapting CLIP For Phrase Localization Without Further Training, [Paper], [Code]

  • (arXiv 2022.04) FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment, [Paper], [Project]

  • (arXiv 2022.04) DaViT: Dual Attention Vision Transformers, [Paper], [Code]

  • (arXiv 2022.04) Unsupervised Prompt Learning for Vision-Language Models, [Paper], [Code]

  • (arXiv 2022.04) Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer, [Paper], [Project]

  • (arXiv 2022.04) Unified Contrastive Learning in Image-Text-Label Space, [Paper], [Code]

  • (arXiv 2022.04) HunYuan_tvr for Text-Video Retrivial, [Paper]

  • (arXiv 2022.04) LEARNING TO COMPOSE SOFT PROMPTS FOR COMPOSITIONAL ZERO-SHOT LEARNING, [Paper]

  • (arXiv 2022.04) End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation, [Paper], [Code]

  • (arXiv 2022.04) Temporal Alignment Networks for Long-term Video, [Paper], [Code]

  • (arXiv 2022.04) Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection, [Paper], [Code]

  • (arXiv 2022.04) MixFormer: Mixing Features across Windows and Dimensions, [Paper], [Code]

  • (arXiv 2022.04) CM3: A CAUSAL MASKED MULTIMODAL MODEL OF THE INTERNET, [Paper]

  • (arXiv 2022.04) DO AS I CAN, NOT AS I SAY: GROUNDING LANGUAGE IN ROBOTIC AFFORDANCES, [Paper], [Project]

  • (arXiv 2022.04) TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization, [Paper], [Code]

  • (arXiv 2022.04) Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, [Paper], [Project]

  • (arXiv 2022.04) Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition, [Paper]

  • (arXiv 2022.04) Learning Audio-Video Modalities from Image Captions, [Paper]

  • (arXiv 2022.04) Improving Vision Transformers by Revisiting High-frequency Components, [Paper]

  • (arXiv 2022.04) POS-BERT: Point Cloud One-Stage BERT Pre-Training, [Paper], [Code]

  • (arXiv 2022.04) BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation, [Paper], [Code]

  • (arXiv 2022.04) BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning, [Paper]

  • (arXiv 2022.04) TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting, [Paper]

  • (arXiv 2022.04) Long Movie Clip Classification with State-Space Video Models, [Paper], [Code]

  • (arXiv 2022.04) TALLFormer: Temporal Action Localization with Long-memory Transformer, [Paper], [Code]

  • (arXiv 2022.04) MultiMAE: Multi-modal Multi-task Masked Autoencoders, [Paper], [Project]

  • (arXiv 2022.04) “This is my unicorn, Fluffy”: Personalizing frozen vision-language representations, [Paper]

  • (arXiv 2022.04) SE(3)-Equivariant Attention Networks for Shape Reconstruction in Function Space, [Paper]

  • (arXiv 2022.04) Multi-View Transformer for 3D Visual Grounding, [Paper], [Code]

  • (arXiv 2022.04) VISION TRANSFORMER EQUIPPED WITH NEURAL RESIZER ON FACIAL EXPRESSION RECOGNITION TASK, [Paper]

  • (arXiv 2022.04) Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition, [Paper], [Project]

  • (arXiv 2022.04) Detector-Free Weakly Supervised Group Activity Recognition, [Paper]

  • (arXiv 2022.04) Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos, [Paper], [Project]

  • (arXiv 2022.04) What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions, [Paper]

  • (arXiv 2022.04) MaxViT: Multi-Axis Vision Transformer, [Paper]

2022.03

  • (arXiv 2022.03) Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation, [Paper]

  • (arXiv 2022.03) ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval, [Paper]

  • (arXiv 2022.03) ReSTR: Convolution-free Referring Image Segmentation Using Transformers, [Paper], [Project]

  • (arXiv 2022.03) CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation, [Paper]

  • (arXiv 2022.03) Deformable Video Transformer, [Paper]

  • (arXiv 2022.03) End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps, [Paper]

  • (arXiv 2022.03) CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow, [Paper], [Code]

  • (arXiv 2022.03) VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, [Paper], [App]

  • (arXiv 2022.03) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing, [Paper], [Code]

  • (arXiv 2022.03) BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, [Paper], [Code]

  • (arXiv 2022.03) Visual Prompting: Modifying Pixel Space to Adapt Pre-trained Models, [Paper], [Code]

  • (arXiv 2022.03) Bringing Old Films Back to Life, [Paper], [Code]

  • (arXiv 2022.03) Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model, [Paper], [Code]

  • (arXiv 2022.03) SeqTR: A Simple yet Universal Network for Visual Grounding, [Paper], [Code]

  • (arXiv 2022.03) InstaFormer: Instance-Aware Image-to-Image Translation with Transformer, [Paper]

  • (arXiv 2022.03) Omni-DETR: Omni-Supervised Object Detection with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Learning Program Representations for Food Images and Cooking Recipes, [Paper], [Project]

  • (arXiv 2022.03) ITTR: Unpaired Image-to-Image Translation with Transformers, [Paper]

  • (arXiv 2022.03) VPTR: Efficient Transformers for Video Prediction, [Paper], [Code]

  • (arXiv 2022.03) Parameter-efficient Fine-tuning for Vision Transformers, [Paper]

  • (arXiv 2022.03) TubeDETR: Spatio-Temporal Video Grounding with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Exploring Plain Vision Transformer Backbones for Object Detection, [Paper]

  • (arXiv 2022.03) PROMPTDET: EXPAND YOUR DETECTOR VOCABULARY WITH UNCURATED IMAGES, [Paper], [Code]

  • (arXiv 2022.03) Few-Shot Object Detection with Fully Cross-Transformer, [Paper]

  • (arXiv 2022.03) Unified Transformer Tracker for Object Tracking, [Paper]

  • (arXiv 2022.03) X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval, [Paper], [Code]

  • (arXiv 2022.03) Fine-tuning Image Transformers using Learnable Memory, [Paper]

  • (arXiv 2022.03) MAT: Mask-Aware Transformer for Large Hole Image Inpainting, [Paper], [Code]

  • (arXiv 2022.03) mc-BEiT: Multi-choice Discretization for Image BERT Pre-training, [Paper]

  • (arXiv 2022.03) End-to-End Transformer Based Model for Image Captioning, [Paper]

  • (arXiv 2022.03) Hybrid Routing Transformer for Zero-Shot Learning, [Paper]

  • (arXiv 2022.03) TREATMENT LEARNING TRANSFORMER FOR NOISY IMAGE CLASSIFICATION, [Paper]

  • (arXiv 2022.03) Do Vision-Language Pretrained Models Learn Primitive Concepts? [Paper]

  • (arXiv 2022.03) Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs, [Paper]

  • (arXiv 2022.03) SepViT: Separable Vision Transformer, [Paper]

  • (arXiv 2022.03) MatteFormer: Transformer-Based Image Matting via Prior-Tokens, [Paper], [Code]

  • (arXiv 2022.03) Feature Selective Transformer for Semantic Image Segmentation, [Paper]

  • (arXiv 2022.03) Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos, [Paper], [Code]

  • (arXiv 2022.03) RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution, [Paper], [Code]

  • (arXiv 2022.03) Single-Stream Multi-Level Alignment for Vision-Language Pretraining, [Paper]

  • (arXiv 2022.03) Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers, [Paper], [Code]

  • (arXiv 2022.03) Collaborative Transformers for Grounded Situation Recognition, [Paper], [Code]

  • (arXiv 2022.03) Object Memory Transformer for Object Goal Navigation, [Paper]

  • (arXiv 2022.03) Brain-inspired Multilayer Perceptron with Spiking Neurons, [Paper], [Code]

  • (arXiv 2022.03) HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network, [Paper], [Code]

  • (arXiv 2022.03) REGTR: End-to-end Point Cloud Correspondences with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Automated Progressive Learning for Efficient Training of Vision Transformers, [Paper]

  • (arXiv 2022.03) Stratified Transformer for 3D Point Cloud Segmentation, [Paper], [Code]

  • (arXiv 2022.03) NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge, [Paper]

  • (arXiv 2022.03) FACIAL EXPRESSION RECOGNITION WITH SWIN TRANSFORMER, [Paper]

  • (arXiv 2022.03) Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness, [Paper]

  • (arXiv 2022.03) Efficient Visual Tracking via Hierarchical Cross-Attention Transformer, [Paper], [Code]

  • (arXiv 2022.03) High-Performance Transformer Tracking, [Paper], [Code]

  • (arXiv 2022.03) RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers, [Paper]

  • (arXiv 2022.03) Multi-modal Multi-label Facial Action Unit Detection with Transformer, [Paper]

  • (arXiv 2022.03) MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection, [Paper], [Code]

  • (arXiv 2022.03) Text to Mesh Without 3D Supervision Using Limit Subdivision, [Paper], [Project]

  • (arXiv 2022.03) GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection, [Paper], [Code]

  • (arXiv 2022.03) CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation, [Paper]

  • (arXiv 2022.03) FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks, [Paper], [Code]

  • (arXiv 2022.03) Vision Transformer Compression with Structured Pruning and Low Rank Approximation, [Paper]

  • (arXiv 2022.03) Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers, [Paper]

  • (arXiv 2022.03) MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection, [Paper]

  • (arXiv 2022.03) Learning Patch-to-Cluster Attention in Vision Transformer, [Paper]

  • (arXiv 2022.03) Visual Prompt Tuning, [Paper]

  • (arXiv 2022.03) Training-free Transformer Architecture Search, [Paper]

  • (arXiv 2022.03) VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, [Paper], [Code]

  • (arXiv 2022.03) METAMORPH: LEARNING UNIVERSAL CONTROLLERS WITH TRANSFORMERS, [Paper], [Project]

  • (arXiv 2022.03) A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning, [Paper]

  • (arXiv 2022.03) Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers, [Paper], [Project]

  • (arXiv 2022.03) Associating Objects with Scalable Transformers for Video Object Segmentation, [Paper], [Project](https://github.com/z-x-yang/AOT)

  • (arXiv 2022.03) HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation, [Paper], [Code]

  • (arXiv 2022.03) Learning to generate line drawings that convey geometry and semantics, [Paper], [Project]

  • (arXiv 2022.03) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection, [Paper], [Code]

  • (arXiv 2022.03) AIMusicGuru: Music Assisted Human Pose Correction, [Paper]

  • (arXiv 2022.03) What to Hide from Your Students: Attention-Guided Masked Image Modeling, [Paper]

  • (arXiv 2022.03) Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer, [Paper]

  • (arXiv 2022.03) ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator, [Paper]

  • (arXiv 2022.03) Keypoints Tracking via Transformer Networks, [Paper], [Code]

  • (arXiv 2022.03) Beyond Fixation: Dynamic Window Visual Transformer, [Paper], [Code]

  • (arXiv 2022.03) Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors, [Paper]

  • (arXiv 2022.03) Self-supervised Video-centralised Transformer for Video Face Clustering, [Paper]

  • (arXiv 2022.03) Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization, [Paper]

  • (arXiv 2022.03) Global Tracking Transformers, [Paper], [Code]

  • (arXiv 2022.03) Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer, [Paper], [Code]

  • (arXiv 2022.03) QS-Craft: Learning to Quantize, Scrabble and Craft for Conditional Human Motion Animation, [Paper]

  • (arXiv 2022.03) Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos, [Paper], [Project]

  • (arXiv 2022.03) GradViT: Gradient Inversion of Vision Transformers, [Paper], [Code]

  • (arXiv 2022.03) Mask Usage Recognition using Vision Transformer with Transfer Learning and Data Augmentation, [Paper]

  • (arXiv 2022.03) Under the Hood of Transformer Networks for Trajectory Forecasting, [Paper]

  • (arXiv 2022.03) Open-Vocabulary DETR with Conditional Matching, [Paper]

  • (arXiv 2022.03) Meta-attention for ViT-backed Continual Learning, [Paper], [Code]

  • (arXiv 2022.03) CNNs and Transformers Perceive Hybrid Images Similar to Humans, [Paper], [Code]

  • (arXiv 2022.03) Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory, [Paper], [Code]

  • (arXiv 2022.03) Affective Feedback Synthesis Towards Multimodal Text and Image Data, [Paper]

  • (arXiv 2022.03) ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers, [Paper]

  • (arXiv 2022.03) CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration, [Paper]

  • (arXiv 2022.03) Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds, [Paper], [Code]

  • (arXiv 2022.03) HIPA: Hierarchical Patch Transformer for Single Image Super Resolution, [Paper]

  • (arXiv 2022.03) DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition, [Paper], [Code]

  • (arXiv 2022.03) MixFormer: End-to-End Tracking with Iterative Mixed Attention, [Paper], [Code]

  • (arXiv 2022.03) PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark, [Paper], [Code]

  • (arXiv 2022.03) Relationformer: A Unified Framework for Image-to-Graph Generation, [Paper], [Code]

  • (arXiv 2022.03) CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning, [Paper], [Code]

  • (arXiv 2022.03) Hyperbolic Vision Transformers: Combining Improvements in Metric Learning, [Paper], [Code]

  • (arXiv 2022.03) MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer, [Paper], [Code]

  • (arXiv 2022.03) Transformer-based HTR for Historical Documents, [Paper]

  • (arXiv 2022.03) simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers, [Paper], [Code]

  • (arXiv 2022.03) End-to-End Human-Gaze-Target Detection with Transformers, [Paper]

  • (arXiv 2022.03) End-to-End Video Text Spotting with Transformer, [Paper], [Code]

  • (arXiv 2022.03) Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation, [Paper], [Code]

  • (arXiv 2022.03) V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer, [Paper]

  • (arXiv 2022.03) LocATe: End-to-end Localization of Actions in 3D with Transformers, [Paper]

  • (arXiv 2022.03) AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder, [Paper]

  • (arXiv 2022.03) ViM: Out-Of-Distribution with Virtual-logit Matching, [Paper], [Code]

  • (arXiv 2022.03) ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer, [Paper]

  • (arXiv 2022.03) Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows, [Paper]

  • (arXiv 2022.03) Vision Transformer with Convolutions Architecture Search, [Paper]

  • (arXiv 2022.03) Cascade Transformers for End-to-End Person Search, [Paper], [Code]

  • (arXiv 2022.03) CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance, [Paper]

  • (arXiv 2022.03) MatchFormer: Interleaving Attention in Transformers for Feature Matching, [Paper], [Code]

  • (arXiv 2022.03) Local-Global Context Aware Transformer for Language-Guided Video Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Three things everyone should know about Vision Transformers, [Paper]

  • (arXiv 2022.03) Are Vision Transformers Robust to Spurious Correlations? [Paper], [Code]

  • (arXiv 2022.03) MUTUAL GENERATIVE TRANSFORMER LEARNING FOR CROSS-VIEW GEO-LOCALIZATION, [Paper]

  • (arXiv 2022.03) DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training, [Paper]

  • (arXiv 2022.03) Semantic-aligned Fusion Transformer for One-shot Object Detection, [Paper]

  • (arXiv 2022.03) UNIMO-2: End-to-End Unified Vision-Language Grounded Learning, [Paper], [Code]

  • (arXiv 2022.03) Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning, [Paper], [Code]

  • (arXiv 2022.03) One-Shot Adaptation of GAN in Just One CLIP, [Paper]

  • (arXiv 2022.03) PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation, [Paper]

  • (arXiv 2022.03) PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer, [Paper]

  • (arXiv 2022.03) Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image, [Paper], [Code]

  • (arXiv 2022.03) Transframer: Arbitrary Frame Prediction with Generative Models, [Paper]

  • (arXiv 2022.03) Towards Data-Efficient Detection Transformers, [Paper], [Code]

  • (arXiv 2022.03) Bi-directional Object-Context Prioritization Learning for Saliency Ranking, [Paper], [Code]

  • (arXiv 2022.03) PATCH-FOOL: ARE VISION TRANSFORMERS ALWAYS ROBUST AGAINST ADVERSARIAL PERTURBATIONS? [Paper], [Code]

  • (arXiv 2022.03) WegFormer: Transformers for Weakly Supervised Semantic Segmentation, [Paper]

  • (arXiv 2022.03) Open Set Recognition using Vision Transformer with an Additional Detection Head, [Paper], [Code]

  • (arXiv 2022.03) UNIFIED VISUAL TRANSFORMER COMPRESSION, [Paper], [Code]

  • (arXiv 2022.03) Towards Practical Certifiable Patch Defense with Vision Transformer, [Paper]

  • (arXiv 2022.03) EDTER: Edge Detection with Transformer, [Paper], [Code]

  • (arXiv 2022.03) ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation, [Paper]

  • (arXiv 2022.03) Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution, [Paper]

  • (arXiv 2022.03) Revitalize Region Feature for Democratizing Video-Language Pre-training, [Paper], [Code]

  • (arXiv 2022.03) Inverted Pyramid Multi-task Transformer for Dense Scene Understanding, [Paper]

  • (arXiv 2022.03) Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Style Transformer for Image Inversion and Editing, [Paper], [Code]

  • (arXiv 2022.03) MotionCLIP: Exposing Human Motion Generation to CLIP Space, [Paper], [Project]

  • (arXiv 2022.03) The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy, [Paper], [Code]

  • (arXiv 2022.03) Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation, [Paper]

  • (arXiv 2022.03) Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning, [Paper], [Code]

  • (arXiv 2022.03) Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting, [Paper]

  • (arXiv 2022.03) DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection, [Paper]

  • (arXiv 2022.03) DATR: Domain-adaptive transformer for multi-domain landmark detection, [Paper]

  • (arXiv 2022.03) EventFormer: AU Event Transformer for Facial Action Unit Event Detection, [Paper]

  • (arXiv 2022.03) Accelerating DETR Convergence via Semantic-Aligned Matching, [Paper], [Code]

  • (arXiv 2022.03) All in One: Exploring Unified Video-Language Pre-training, [Paper], [Code]

  • (arXiv 2022.03) CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment, [Paper]

  • (arXiv 2022.03) EIT: Efficiently Lead Inductive Biases to ViT, [Paper], [Code]

  • (arXiv 2022.03) Self-Promoted Supervision for Few-Shot Transformer, [Paper], [Code]

  • (arXiv 2022.03) MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization, [Paper]

  • (arXiv 2022.03) Disentangled Representation Learning for Text-Video Retrieval, [Paper]

  • (arXiv 2022.03) TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding, [Paper], [Dataset]

  • (arXiv 2022.03) Visualizing and Understanding Patch Interactions in Vision Transformer, [Paper]

  • (arXiv 2022.03) ANTI-OVERSMOOTHING IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS: FROM THEORY TO PRACTICE, [Paper], [Code]

  • (arXiv 2022.03) Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision, [Paper], [Code]

  • (arXiv 2022.03) ActiveMLP: An MLP-like Architecture with Active Token Mixer, [Paper], [Code]

  • (arXiv 2022.03) Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding, [Paper]

  • (arXiv 2022.03) TrueType Transformer: Character and Font Style Recognition in Outline Format, [Paper]

  • (arXiv 2022.03) LOOPITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval, [Paper]

  • (arXiv 2022.03) MVP: Multimodality-guided Visual Pre-training, [Paper]

  • (arXiv 2022.03) DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting, [Paper]

  • (arXiv 2022.03) Multi-Modal Mixup for Robust Fine-tuning, [Paper]

  • (arXiv 2022.03) AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant, [Paper], [Project]

  • (arXiv 2022.03) Coarse-to-Fine Vision Transformer, [Paper], [Code]

  • (arXiv 2022.03) Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers, [Paper]

  • (arXiv 2022.03) WAVEMIX: RESOURCE-EFFICIENT TOKEN MIXING FOR IMAGES, [Paper]

  • (arXiv 2022.03) VOVIT: LOW LATENCY GRAPH-BASED AUDIO-VISUAL VOICE SEPARATION TRANSFORMER, [Paper], [Code]

  • (arXiv 2022.03) Graph Attention Transformer Network for Multi-Label Image Classification, [Paper]

  • (arXiv 2022.03) EDGEFORMER: IMPROVING LIGHT-WEIGHT CONVNETS BY LEARNING FROM VISION TRANSFORMERS, [Paper], [Code]

  • (arXiv 2022.03) Skating-Mixer: Multimodal MLP for Scoring Figure Skating, [Paper]

  • (arXiv 2022.03) Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention, [Paper]

  • (arXiv 2022.03) CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction, [Paper]

  • (arXiv 2022.03) Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning, [Paper]

  • (arXiv 2022.03) ChiTransformer: Towards Reliable Stereo from Cues, [Paper]

  • (arXiv 2022.03) A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection, [Paper], [Code]

  • (arXiv 2022.03) Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction, [Paper]

  • (arXiv 2022.03) CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Multiscale Transformer for Hyperspectral Image Classification, [Paper]

  • (arXiv 2022.03) Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning, [Paper], [Code]

  • (arXiv 2022.03) Autoregressive Image Generation using Residual Quantization, [Paper]

  • (arXiv 2022.03) CONTEXTFORMER: A TRANSFORMER WITH SPATIO-CHANNEL ATTENTION FOR CONTEXT MODELING IN LEARNED IMAGE COMPRESSION, [Paper]

  • (arXiv 2022.03) Patch Similarity Aware Data-Free Quantization for Vision Transformers, [Paper]

  • (arXiv 2022.03) ViT-P: Rethinking Data-efficient Vision Transformers from Locality, [Paper]

  • (arXiv 2022.03) DIT: SELF-SUPERVISED PRE-TRAINING FOR DOCUMENT IMAGE TRANSFORMER, [Paper]

  • (arXiv 2022.03) Towards Efficient and Scalable Sharpness-Aware Minimization, [Paper]

  • (arXiv 2022.03) HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening, [Paper], [Code]

  • (arXiv 2022.03) UVCGAN: UNET VISION TRANSFORMER CYCLE-CONSISTENT GAN FOR UNPAIRED IMAGE-TO-IMAGE TRANSLATION, [Paper], [Code]

  • (arXiv 2022.03) Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, [Paper], [Code]

  • (arXiv 2022.03) PANFORMER: A TRANSFORMER BASED MODEL FOR PAN-SHARPENING, [Paper], [Code]

  • (arXiv 2022.03) Multi-class Token Transformer for Weakly Supervised Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Cross Language Image Matching for Weakly Supervised Semantic Segmentation, [Paper]

  • (arXiv 2022.03) Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2022.03) DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, [Paper], [Code]

  • (arXiv 2022.03) MetaFormer: A Unified Meta Framework for Fine-Grained Recognition, [Paper], [Code]

  • (arXiv 2022.03) Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language, [Paper]

  • (arXiv 2022.03) Knowledge Amalgamation for Object Detection with Transformers, [Paper]

  • (arXiv 2022.03) Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos, [Paper]

  • (arXiv 2022.03) Modeling Coreference Relations in Visual Dialog, [Paper], [Code]

  • (arXiv 2022.03) VITRANSPAD: VIDEO TRANSFORMER USING CONVOLUTION AND SELF-ATTENTION FOR FACE PRESENTATION ATTACK DETECTION, [Paper]

  • (arXiv 2022.03) Multi-Tailed Vision Transformer for Efficient Inference, [Paper]

  • (arXiv 2022.03) Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology, [Paper]

  • (arXiv 2022.03) LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network, [Paper], [Code]

  • (arXiv 2022.03) LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction, [Paper]

  • (arXiv 2022.03) DCT-Former: Efficient Self-Attention with Discrete Cosine Transform, [Paper], [Code]

  • (arXiv 2022.03) Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment, [Paper]

  • (arXiv 2022.03) Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in Point Cloud, [Paper]

  • (arXiv 2022.03) CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP, [Paper]

  • (arXiv 2022.03) MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video, [Paper]

  • (arXiv 2022.03) X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning, [Paper]

  • (arXiv 2022.03) 3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification, [Paper]

  • (arXiv 2022.03) DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation, [Paper]

  • (arXiv 2022.03) D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention, [Paper]

  • (arXiv 2022.03) Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding, [Paper], [Code]

  • (arXiv 2022.03) Self-supervised Transformer for Deepfake Detection, [Paper]

  • (arXiv 2022.03) Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions, [Paper]

  • (arXiv 2022.03) TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration, [Paper], [Code]

  • (arXiv 2022.03) DN-DETR: Accelerate DETR Training by Introducing Query DeNoising, [Paper], [Code]

  • (arXiv 2022.03) Protecting Celebrities with Identity Consistency Transformer, [Paper]

  • (arXiv 2022.03) Masked Visual Pre-training for Motor Control, [Paper], [Project]

  • (arXiv 2022.03) NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, [Paper], [Code]

  • (arXiv 2022.03) Conditional Prompt Learning for Vision-Language Models, [Paper], [Code]

  • (arXiv 2022.03) Lane Detection with Versatile AtrousFormer and Local Semantic Guidance, [Paper]

  • (arXiv 2022.03) Forecasting Characteristic 3D Poses of Human Actions, [Paper], [Code]

2022.02

  • (arXiv 2022.02) Bayesian Structure Learning with Generative Flow Networks, [Paper]

  • (arXiv 2022.02) Towards Unsupervised Domain Adaptation via Domain-Transformer, [Paper]

  • (arXiv 2022.02) An End-to-End Transformer Model for Crowd Localization, [Paper]

  • (arXiv 2022.02) Instantaneous Physiological Estimation using Video Transformers, [Paper], [Code]

  • (arXiv 2022.02) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation, [Paper], [Code]

  • (arXiv 2022.02) ATTENTION ENABLES ZERO APPROXIMATION ERROR, [Paper]

  • (arXiv 2022.02) When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection, [Paper], [Code]

  • (arXiv 2022.02) AUTO-SCALING VISION TRANSFORMERS WITHOUT TRAINING, [Paper], [Code]

  • (arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation, [Paper], [Project]

  • (arXiv 2022.02) LEARNING TO MERGE TOKENS IN VISION TRANSFORMERS, [Paper]

  • (arXiv 2022.02) ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers, [Paper], [Code]

  • (arXiv 2022.02) SELF-SUPERVISED TRANSFORMERS FOR UNSUPERVISED OBJECT DISCOVERY USING NORMALIZED CUT, [Paper], [Project]

  • (arXiv 2022.02) Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis, [Paper]

  • (arXiv 2022.02) CaMEL: Mean Teacher Learning for Image Captioning, [Paper]

  • (arXiv 2022.02) Hierarchical Perceiver, [Paper]

  • (arXiv 2022.02) Movies2Scenes: Learning Scene Representations Using Movie Similarities, [Paper]

  • (arXiv 2022.02) GroupViT: Semantic Segmentation Emerges from Text Supervision, [Paper], [Code]

  • (arXiv 2022.02) Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer, [Paper], [Code]

  • (arXiv 2022.02) Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations, [Paper]

  • (arXiv 2022.02) ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond, [Paper]

  • (arXiv 2022.02) PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths, [Paper], [Code]

  • (arXiv 2022.02) DataMUX: Data Multiplexing for Neural Networks, [Paper], [Code]

  • (arXiv 2022.02) On Guiding Visual Attention with Language Specification, [Paper]

  • (arXiv 2022.02) SPATIO-TEMPORAL OUTDOOR LIGHTING AGGREGATION ON IMAGE SEQUENCES USING TRANSFORMER NETWORKS, [Paper]

  • (arXiv 2022.02) MISINFORMATION DETECTION IN SOCIAL MEDIA VIDEO POSTS, [Paper]

  • (arXiv 2022.02) Can Deep Learning be Applied to Model-Based Multi-Object Tracking? [Paper]

  • (arXiv 2022.02) NOT ALL PATCHES ARE WHAT YOU NEED: EXPEDITING VISION TRANSFORMERS VIA TOKEN REORGANIZATIONS, [Paper], [Code]

  • (arXiv 2022.02) ActionFormer: Localizing Moments of Actions with Transformers, [Paper], [Code]

  • (arXiv 2022.02) One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones, [Paper]

  • (arXiv 2022.02) XAI for Transformers: Better Explanations through Conservative Propagation, [Paper]

  • (arXiv 2022.02) MeshLeTemp: Leveraging the Learnable Vertex-Vertex Relationship to Generalize Human Pose and Mesh Reconstruction for In-the-Wild Scenes, [Paper]

  • (arXiv 2022.02) ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer, [Paper]

  • (arXiv 2022.02) Hyper-relationship Learning Network for Scene Graph Generation, [Paper]

  • (arXiv 2022.02) CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval, [Paper]

  • (arXiv 2022.02) Flowformer: Linearizing Transformers with Conservation Flows, [Paper]

  • (arXiv 2022.02) DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following, [Paper], [Code]

  • (arXiv 2022.02) CATs++: Boosting Cost Aggregation with Convolutions and Transformers, [Paper]

  • (arXiv 2022.02) Geometric Transformer for Fast and Robust Point Cloud Registration, [Paper], [Code]

  • (arXiv 2022.02) I-Tuning: Tuning Language Models with Image for Caption Generation, [Paper]

  • (arXiv 2022.02) Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval, [Paper], [Code]

  • (arXiv 2022.02) Visual Acoustic Matching, [Paper]

  • (arXiv 2022.02) LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling, [Paper]

  • (arXiv 2022.02) BViT: Broad Attention based Vision Transformer, [Paper], [Code]

  • (arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation, [Paper]

  • (arXiv 2022.02) Domain Adaptation via Prompt Learning, [Paper]

  • (arXiv 2022.02) Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs, [Paper], [Code]

  • (arXiv 2022.02) Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework, [Paper], [Project]

  • (arXiv 2022.02) HOW DO VISION TRANSFORMERS WORK? [Paper], [Code]

  • (arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning, [Paper], [Code]

  • (arXiv 2022.02) CLIPasso: Semantically-Aware Object Sketching, [Paper], [Code]

  • (arXiv 2022.02) Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer, [Paper]

  • (arXiv 2022.02) DEEP SOCCER CAPTIONING WITH TRANSFORMER: DATASET, SEMANTICS-RELATED LOSSES, AND MULTI-LEVEL EVALUATION, [Paper], [Project]

  • (arXiv 2022.02) ENTROFORMER: A TRANSFORMER-BASED ENTROPY MODEL FOR LEARNED IMAGE COMPRESSION, [Paper], [Code]

  • (arXiv 2022.02) Image Difference Captioning with Pre-training and Contrastive Learning, [Paper], [Code]

  • (arXiv 2022.02) MaskGIT: Masked Generative Image Transformer, [Paper]

  • (arXiv 2022.02) Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning, [Paper]

  • (arXiv 2022.02) Motion-Aware Transformer For Occluded Person Re-identification, [Paper]

  • (arXiv 2022.02) Conditional Motion In-betweening, [Paper], [Code]

  • (arXiv 2022.02) Memory-based gaze prediction in deep imitation learning for robot manipulation, [Paper]

  • (arXiv 2022.02) Spherical Transformer, [Paper]

  • (arXiv 2022.02) OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context, [Paper]

  • (arXiv 2022.02) The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning, [Paper], [Project]

  • (arXiv 2022.02) DALL-EVAL: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [Paper], [Code]

  • (arXiv 2022.02) Pre-Trained Language Models for Interactive Decision-Making, [Paper]

  • (arXiv 2022.02) TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer, [Paper]

  • (arXiv 2022.02) The devil is in the labels: Semantic segmentation from sentences, [Paper]

  • (arXiv 2022.02) Webly Supervised Concept Expansion for General Purpose Vision Models, [Paper], [Project]

  • (arXiv 2022.02) VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG, [Paper]

  • (arXiv 2022.02) UNIFYING ARCHITECTURES, TASKS, AND MODALITIES THROUGH A SIMPLE SEQUENCE-TO-SEQUENCE LEARNING FRAMEWORK, [Paper], [Code]

  • (arXiv 2022.02) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper]

  • (arXiv 2022.02) TRANSDREAMER: REINFORCEMENT LEARNING WITH TRANSFORMER WORLD MODELS, [Paper]

  • (arXiv 2022.02) Vision-Language Pre-Training with Triple Contrastive Learning, [Paper], [Code]

  • (arXiv 2022.02) Corrupted Image Modeling for Self-Supervised Visual Pre-Training, [Paper]

  • (arXiv 2022.02) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, [Paper], [Code]

  • (arXiv 2022.02) DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators, [Paper]

  • (arXiv 2022.02) Interactron: Embodied Adaptive Object Detection, [Paper]

  • (arXiv 2022.02) Local Feature Matching with Transformers for Low-end Devices: LoFTR Method Adaptation Approach, [Paper], [Code]

  • (arXiv 2022.02) Can Transformers be Strong Treatment Effect Estimators? [Paper]

  • (arXiv 2022.02) Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers, [Paper]

  • (arXiv 2022.02) Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics, [Paper], [Code]

2022.01

  • (arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [Paper]

  • (arXiv 2022.01) DynaMixer: A Vision MLP Architecture with Dynamic Mixing, [Paper]

  • (arXiv 2022.01) VRT: A Video Restoration Transformer, [Paper], [Code]

  • (arXiv 2022.01) DAB-DETR: DYNAMIC ANCHOR BOXES ARE BETTER QUERIES FOR DETR, [Paper], [Code]

  • (arXiv 2022.01) Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations, [Paper]

  • (arXiv 2022.01) MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment, [Paper]

  • (arXiv 2022.01) VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training, [Paper]

  • (arXiv 2022.01) BOAT: Bilateral Local Attention Vision Transformer, [Paper]

  • (arXiv 2022.01) GRAPH SELF-ATTENTION FOR LEARNING GRAPH REPRESENTATION WITH TRANSFORMER, [Paper]

  • (arXiv 2022.01) Aggregating Global Features into Local Vision Transformer, [Paper], [Code]

  • (arXiv 2022.01) Transformer Module Networks for Systematic Generalization in Visual Question Answering, [Paper]

  • (arXiv 2022.01) Generalised Image Outpainting with U-Transformer, [Paper]

  • (arXiv 2022.01) RelTR: Relation Transformer for Scene Graph Generation, [Paper]

  • (arXiv 2022.01) DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer, [Paper]

  • (arXiv 2022.01) Pre-Trained Language Transformers are Universal Image Classifiers, [Paper]

  • (arXiv 2022.01) Explore and Match: End-to-End Video Grounding with Transformer, [Paper]

  • (arXiv 2022.01) TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network, [Paper]

  • (arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals, [Paper]

  • (arXiv 2022.01) ShapeFormer: Transformer-based Shape Completion via Sparse Representation, [Paper], [Project]

  • (arXiv 2022.01) CONVOLUTIONAL XFORMERS FOR VISION, [Paper], [Code]

  • (arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [Paper], [Code]

  • (arXiv 2022.01) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [Paper]

  • (arXiv 2022.01) SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering, [Paper]

  • (arXiv 2022.01) DUAL-TASKS SIAMESE TRANSFORMER FRAMEWORK FOR BUILDING DAMAGE ASSESSMENT, [Paper]

  • (arXiv 2022.01) When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism, [Paper], [Code]

  • (arXiv 2022.01) Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation, [Paper]

  • (arXiv 2022.01) Training Vision Transformers with Only 2040 Images, [Paper]

  • (arXiv 2022.01) Learning To Recognize Procedural Activities with Distant Supervision, [Paper]

  • (arXiv 2022.01) EVALUATING LANGUAGE-BIASED IMAGE CLASSIFICATION BASED ON SEMANTIC REPRESENTATIONS, [Paper]

  • (arXiv 2022.01) A Comprehensive Study of Vision Transformers on Dense Prediction Tasks, [Paper]

  • (arXiv 2022.01) UniFormer: Unifying Convolution and Self-attention for Visual Recognition, [Paper], [Code]

  • (arXiv 2022.01) Patches Are All You Need? [Paper], [Code]

  • (arXiv 2022.01) Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval, [Paper]

  • (arXiv 2022.01) LEARNING TO ACT WITH AFFORDANCE-AWARE MULTIMODAL NEURAL SLAM, [Paper]

  • (arXiv 2022.01) Visual Information Guided Zero-Shot Paraphrase Generation, [Paper]

  • (arXiv 2022.01) TerViT: An Efficient Ternary Vision Transformer, [Paper]

  • (arXiv 2022.01) End-to-end Generative Pretraining for Multimodal Video Captioning, [Paper]

  • (arXiv 2022.01) OMNIVORE: A Single Model for Many Visual Modalities, [Paper], [Project]

  • (arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, [Paper]

  • (arXiv 2022.01) The CLEAR Benchmark: Continual LEArning on Real-World Imagery, [Paper], [Project]

  • (arXiv 2022.01) ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues, [Paper]

  • (arXiv 2022.01) Cross-modal Contrastive Distillation for Instructional Activity Anticipation, [Paper]

  • (arXiv 2022.01) Transformers in Action: Weakly Supervised Action Segmentation, [Paper]

  • (arXiv 2022.01) VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer, [Paper]

  • (arXiv 2022.01) CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks, [Paper]

  • (arXiv 2022.01) Domain Adaptation via Bidirectional Cross-Attention Transformer, [Paper]

  • (arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for Online Inference, [Paper]

  • (arXiv 2022.01) Motion Inbetweening via Deep ∆-Interpolator, [Paper]

  • (arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper]

  • (arXiv 2022.01) GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events, [Paper]

  • (arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [Paper]

  • (arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]

  • (arXiv 2022.01) Disentangled Latent Transformer for Interpretable Monocular Height Estimation, [Paper], [Project]

  • (arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]

  • (arXiv 2022.01) SWINUNET3D - A HIERARCHICAL ARCHITECTURE FOR DEEP TRAFFIC PREDICTION USING SHIFTED WINDOW TRANSFORMERS, [Paper], [Code]

  • (arXiv 2022.01) SWIN-POSE: SWIN TRANSFORMER BASED HUMAN POSE ESTIMATION, [Paper]

  • (arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Project]

  • (arXiv 2022.01) ViT2Hash: Unsupervised Information-Preserving Hashing, [Paper]

  • (arXiv 2022.01) LANGUAGE-DRIVEN SEMANTIC SEGMENTATION, [Paper], [Code]

  • (arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]

  • (arXiv 2022.01) ImageSubject: A Large-scale Dataset for Subject Detection, [Paper]

  • (arXiv 2022.01) Detecting Twenty-thousand Classes using Image-level Supervision, [Paper], [Code]

  • (arXiv 2022.01) Generalized Category Discovery, [Paper], [Code]

  • (arXiv 2022.01) Video Summarization Based on Video-text Modelling, [Paper]

  • (arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]

  • (arXiv 2022.01) QUADTREE ATTENTION FOR VISION TRANSFORMERS, [Paper], [Code]

  • (arXiv 2022.01) A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval, [Paper], [Project]

  • (arXiv 2022.01) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]

  • (arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]

  • (arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]

  • (arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]

  • (arXiv 2022.01) HYPERTRANSFORMER: MODEL GENERATION FOR SUPERVISED AND SEMI-SUPERVISED FEW-SHOT LEARNING, [Paper]

  • (arXiv 2022.01) UNIFORMER: UNIFIED TRANSFORMER FOR EFFICIENT SPATIOTEMPORAL REPRESENTATION LEARNING, [Paper], [Code]

  • (arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Project]

  • (arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]

  • (arXiv 2022.01) CLIP-Event: Connecting Text and Images with Event Structures, [Paper], [Code]

  • (arXiv 2022.01) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training, [Paper]

  • (arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]

  • (arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]

  • (arXiv 2022.01) TransVPR: Transformer-based Place Recognition with Multi-level Attention Aggregation, [Paper]

  • (arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]

  • (arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]

  • (arXiv 2022.01) Stochastic Layers in Vision Transformers, [Paper]

  • (arXiv 2022.01) ERNIE-VILG: UNIFIED GENERATIVE PRE-TRAINING FOR BIDIRECTIONAL VISION-LANGUAGE GENERATION, [Paper]

  • (arXiv 2022.01) InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer, [Paper], [Code]

  • (arXiv 2022.01) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]

  • (arXiv 2022.01) Persformer: A Transformer Architecture for Topological Machine Learning, [Paper]

  • (arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]

  • (arXiv 2022.01) Language as Queries for Referring Video Object Segmentation, [Paper], [Code]

  • (arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]

  • (arXiv 2022.01) A TRANSFORMER-BASED SIAMESE NETWORK FOR CHANGE DETECTION, [Paper], [Code]

  • (arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]

  • (arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]

  • (arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

2021.12

  • (arXiv 2021.12) Multi-Dimensional Model Compression of Vision Transformer, [Paper]

  • (arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]

  • (arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]

  • (arXiv 2021.12) APRIL: Finding the Achilles’ Heel on Privacy for Vision Transformers, [Paper]

  • (arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper]

  • (arXiv 2021.12) Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?, [Paper]

  • (arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]

  • (arXiv 2021.12) A FISTFUL OF WORDS: LEARNING TRANSFERABLE VISUAL MODELS FROM BAG-OF-WORDS SUPERVISION, [Paper]

  • (arXiv 2021.12) StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2, [Paper], [Code]

  • (arXiv 2021.12) A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model, [Paper], [Code]

  • (arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]

  • (arXiv 2021.12) SIMVIT: EXPLORING A SIMPLE VISION TRANSFORMER WITH SLIDING WINDOWS, [Paper], [Code]

  • (arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]

  • (arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]

  • (arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]

  • (arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]

  • (arXiv 2021.12) ViR: the Vision Reservoir, [Paper]

  • (arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]

  • (arXiv 2021.12) Open-Vocabulary Image Segmentation, [Paper]

  • (arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]

  • (arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]

  • (arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]

  • (arXiv 2021.12) Fine-grained Multi-Modal Self-Supervised Learning, [Paper]

  • (arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]

  • (arXiv 2021.12) CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes, [Paper]

  • (arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input Adaptation, [Paper]

  • (arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2021.12) Contrastive Object Detection Using Knowledge Graph Embeddings, [Paper]

  • (arXiv 2021.12) RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality, [Paper], [Code]

  • (arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]

  • (arXiv 2021.12) MPViT : Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]

  • (arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]

  • (arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]

  • (arXiv 2021.12) LOCFORMER: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]

  • (arXiv 2021.12) Tell me what you see: A zero-shot action recognition method based on natural language descriptions, [Paper], [Code]

  • (arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]

  • (arXiv 2021.12) ScanQA: 3D Question Answering for Spatial Scene Understanding, [Paper]

  • (arXiv 2021.12) Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [Paper]

  • (arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]

  • (arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]

  • (arXiv 2021.12) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, [Paper], [Code]

  • (arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]

  • (arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]

  • (arXiv 2021.12) Align and Prompt: Video-and-Language Pre-training with Entity Prompts, [Paper], [Code]

  • (arXiv 2021.12) DATA EFFICIENT LANGUAGE-SUPERVISED ZEROSHOT RECOGNITION WITH OPTIMAL TRANSPORT DISTILLATION, [Paper]

  • (arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]

  • (arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]

  • (arXiv 2021.12) ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources, [Paper]

  • (arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper]

  • (arXiv 2021.12) How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation, [Paper]

  • (arXiv 2021.12) Learning to Prompt for Continual Learning, [Paper], [Code]

  • (arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]

  • (arXiv 2021.12) Dense Video Captioning Using Unsupervised Semantic Information, [Paper], [Code]

  • (arXiv 2021.12) Looking Outside the Box to Ground Language in 3D Scenes, [Paper], [Code]

  • (arXiv 2021.12) RegionCLIP: Region-based Language-Image Pretraining, [Paper], [Code]

  • (arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper]

  • (arXiv 2021.12) Masked Feature Prediction for Self-Supervised Visual Pre-Training, [Paper]

  • (arXiv 2021.12) SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning, [Paper]

  • (arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]

  • (arXiv 2021.12) Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos, [Paper], [Code]

  • (arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]

  • (arXiv 2021.12) QAHOI: Query-Based Anchors for Human-Object Interaction Detection, [Paper], [Code]

  • (arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]

  • (arXiv 2021.12) CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations, [Paper]

  • (arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]

  • (arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]

  • (arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]

  • (arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]

  • (arXiv 2021.12) COMPOSER: Compositional Learning of Group Activity in Videos, [Paper]

  • (arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]

  • (arXiv 2021.12) Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection, [Paper]

  • (arXiv 2021.12) SVIP: Sequence VerIfication for Procedures in Videos, [Paper]

  • (arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]

  • (arXiv 2021.12) VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]

  • (arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]

  • (arXiv 2021.12) PartGlot: Learning Shape Part Segmentation from Language Reference Games, [Paper]

  • (arXiv 2021.12) Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network, [Paper]

  • (arXiv 2021.12) LEARNING SEMANTIC-ALIGNED FEATURE REPRESENTATION FOR TEXT-BASED PERSON SEARCH, [Paper]

  • (arXiv 2021.12) L-Verse: Bidirectional Generation Between Image and Text, [Paper]

  • (arXiv 2021.12) SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY, [Paper]

  • (arXiv 2021.12) Are Vision Transformers Robust to Patch Perturbations? [Paper]

  • (arXiv 2021.12) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]

  • (arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]

  • (arXiv 2021.12) MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning, [Paper]

  • (arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]

  • (arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]

  • (arXiv 2021.12) Rethinking the Two-Stage Framework for Grounded Situation Recognition, [Paper], [Code]

  • (arXiv 2021.12) CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions, [Paper]

  • (arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]

  • (arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]

  • (arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]

  • (arXiv 2021.12) Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training, [Paper], [Code]

  • (arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]

  • (arXiv 2021.12) Grounded Language-Image Pre-training, [Paper], [Code]

  • (arXiv 2021.12) U^2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper]

  • (arXiv 2021.12) ADAPTIVE CHANNEL ENCODING TRANSFORMER FOR POINT CLOUD ANALYSIS, [Paper]

  • (arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]

  • (arXiv 2021.12) VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts, [Paper]

  • (arXiv 2021.12) PointCLIP: Point Cloud Understanding by CLIP, [Paper], [Code]

  • (arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]

  • (arXiv 2021.12) DYNAMIC TOKEN NORMALIZATION IMPROVES VISION TRANSFORMER, [Paper], [Code]

  • (arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2021.12) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]

  • (arXiv 2021.12) Text2Mesh: Text-Driven Neural Stylization for Meshes, [Paper], [Project]

  • (arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]

  • (arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]

  • (arXiv 2021.12) FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization, [Paper], [Code]

  • (arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]

  • (arXiv 2021.12) Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer, [Paper], [Code]

  • (arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]

  • (arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper]

  • (arXiv 2021.12) Transformer based trajectory prediction, [Paper]

  • (arXiv 2021.12) Evaluating Transformers for Lightweight Action Recognition, [Paper]

  • (arXiv 2021.12) Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision, [Paper]

  • (arXiv 2021.12) CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification, [Paper]

  • (arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]

  • (arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]

  • (arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]

  • (arXiv 2021.12) Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning, [Paper]

  • (arXiv 2021.12) AUDIO-VISUAL SYNCHRONISATION IN THE WILD, [Paper], [Project]

  • (arXiv 2021.12) Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs, [Paper]

  • (arXiv 2021.12) Garment4D: Garment Reconstruction from Point Cloud Sequences, [Paper], [Code]

  • (arXiv 2021.12) Locally Shifted Attention With Early Global Integration, [Paper], [Code]

  • (arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]

  • (arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Project]

  • (arXiv 2021.12) HairCLIP: Design Your Hair by Text and Reference Image, [Paper], [Project]

  • (arXiv 2021.12) CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, [Paper], [Code]

  • (arXiv 2021.12) A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code], [Dataset]

  • (arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper], [Code]

  • (arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]

  • (arXiv 2021.12) Fast Point Transformer, [Paper]

  • (arXiv 2021.12) Assistive Tele-op: Leveraging Transformers to Collect Robotic Task Demonstrations, [Paper], [Project]

  • (arXiv 2021.12) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper]

  • (arXiv 2021.12) PatchFormer: An Efficient Point Transformer with Patch Attention, [Paper]

  • (arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]

  • (arXiv 2021.12) MLP Architectures for Vision-and-Language Modeling: An Empirical Study, [Paper], [Code]

  • (arXiv 2021.12) Everything at Once – Multi-modal Fusion Transformer for Video Retrieval, [Paper]

  • (arXiv 2021.12) Prompting Visual-Language Models for Efficient Video Understanding, [Paper], [Project]

  • (arXiv 2021.12) FLAVA: A Foundational Language And Vision Alignment Model, [Paper]

  • (arXiv 2021.12) Embedding Arithmetic for Text-driven Image Transformation, [Paper]

  • (arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper]

  • (arXiv 2021.12) Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos, [Paper], [Project]

  • (arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]

  • (arXiv 2021.12) DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting, [Paper], [Code]

  • (arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]

  • (arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]

  • (arXiv 2021.12) Zero-Shot Text-Guided Object Generation with Dream Fields, [Paper], [Project]

  • (arXiv 2021.12) Video-Text Pre-training with Learned Regions, [Paper], [Code]

  • (arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]

  • (arXiv 2021.12) TCTN: A 3D-Temporal Convolutional Transformer Network for Spatiotemporal Predictive Learning, [Paper]

  • (arXiv 2021.12) DenseCLIP: Extract Free Dense Labels from CLIP, [Paper]

  • (arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper]

  • (arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]

  • (arXiv 2021.12) Object-Centric Unsupervised Image Captioning, [Paper]

  • (arXiv 2021.12) Vision Pair Learning: An Efficient Training Framework for Image Classification, [Paper]

  • (arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]

  • (arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]

  • (arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]

  • (arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]

  • (arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]

  • (arXiv 2021.12) Human-Object Interaction Detection via Weak Supervision, [Paper]

  • (arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]

  • (arXiv 2021.12) CLIPstyler: Image Style Transfer with a Single Text Condition, [Paper]

  • (arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]

  • (arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]

  • (arXiv 2021.12) Object-aware Video-language Pre-training for Retrieval, [Paper], [Code]

2021.11

  • (arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]

  • (arXiv 2021.11) Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model, [Paper]

  • (arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]

  • (arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]

  • (arXiv 2021.11) ADAPTIVE FOURIER NEURAL OPERATORS: EFFICIENT TOKEN MIXERS FOR TRANSFORMERS, [Paper]

  • (arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]

  • (arXiv 2021.11) DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning, [Paper], [Code]

  • (arXiv 2021.11) Ice hockey player identification via transformers, [Paper]

  • (arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]

  • (arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]

  • (arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]

  • (arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]

  • (arXiv 2021.11) DISCRETE REPRESENTATIONS STRENGTHEN VISION TRANSFORMER ROBUSTNESS, [Paper]

  • (arXiv 2021.11) TRAVLR: Now You See It, Now You Don’t! Evaluating Cross-Modal Transfer of Visio-Linguistic Reasoning, [Paper]

  • (arXiv 2021.11) Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling, [Paper]

  • (arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]

  • (arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]

  • (arXiv 2021.11) ZERO-SHOT CERTIFIED DEFENSE AGAINST ADVERSARIAL PATCHES WITH VISION TRANSFORMERS, [Paper]

  • (arXiv 2021.11) PointMixer: MLP-Mixer for Point Cloud Understanding, [Paper]

  • (arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]

  • (arXiv 2021.11) Florence: A New Foundation Model for Computer Vision, [Paper]

  • (arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]

  • (arXiv 2021.11) Learning to Compose Visual Relations, [Paper], [Project]

  • (arXiv 2021.11) REFERENCE-BASED MAGNETIC RESONANCE IMAGE RECONSTRUCTION USING TEXTURE TRANSFORMER, [Paper]

  • (arXiv 2021.11) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval, [Paper]

  • (arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]

  • (arXiv 2021.11) SimMIM: A Simple Framework for Masked Image Modeling, [Paper], [Code]

  • (arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]

  • (arXiv 2021.11) Simple but Effective: CLIP Embeddings for Embodied AI, [Paper]

  • (arXiv 2021.11) ClipCap: CLIP Prefix for Image Captioning, [Paper], [Code]

  • (arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]

  • (arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]

  • (arXiv 2021.11) Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, [Paper], [Code]

  • (arXiv 2021.11) Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning, [Paper], [Code]

  • (arXiv 2021.11) Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement, [Paper], [Code]

  • (arXiv 2021.11) Tracking People with 3D Representations, [Paper], [Code]

  • (arXiv 2021.11) LiT: Zero-Shot Transfer with Locked-image Text Tuning, [Paper]

  • (arXiv 2021.11) FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING, [Paper]

  • (arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code]

  • (arXiv 2021.11) Attention Approximates Sparse Distributed Memory, [Paper]

  • (arXiv 2021.11) SLICED RECURSIVE TRANSFORMER, [Paper], [Code]

  • (arXiv 2021.11) HYBRID BYOL-VIT: EFFICIENT APPROACH TO DEAL WITH SMALL DATASETS, [Paper]

  • (arXiv 2021.11) Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, [Paper], [Code]

  • (arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]

  • (arXiv 2021.11) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis, [Paper], [Code]

  • (arXiv 2021.11) Revisiting spatio-temporal layouts for compositional action recognition, [Paper], [Code]

  • (arXiv 2021.11) PatchGame: Learning to Signal Mid-level Patches in Referential Games, [Paper], [Code]

  • (arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM CONVOLUTION? [Paper]

  • (arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]

  • (arXiv 2021.11) With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, [Paper], [Code]

  • (arXiv 2021.11) IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning, [Paper], [Project]

  • (arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]

  • (arXiv 2021.11) VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval, [Paper]

  • (arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]

  • (arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]

  • (arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]

  • (arXiv 2021.11) ML-Decoder: Scalable and Versatile Classification Head, [Paper], [Code]

  • (arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.11) SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]

  • (arXiv 2021.11) Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization, [Paper]

  • (arXiv 2021.11) Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation, [Paper]

  • (arXiv 2021.11) Sparse is Enough in Scaling Transformers, [Paper]

  • (arXiv 2021.11) An implementation of the “Guess who?” game using CLIP, [Paper], [Code]

  • (arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]

  • (arXiv 2021.11) A Unified Pruning Framework for Vision Transformers, [Paper]

  • (arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]

  • (arXiv 2021.11) AssistSR: Affordance-centric Question-driven Video Segment Retrieval, [Paper], [Code & Data]

  • (arXiv 2021.11) DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation, [Paper], [Code]

  • (arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]

  • (arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]

  • (arXiv 2021.11) CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning, [Paper]

  • (arXiv 2021.11) CRIS: CLIP-Driven Referring Image Segmentation, [Paper]

  • (arXiv 2021.11) Shunted Self-Attention via Multi-Scale Token Aggregation, [Paper], [Code]

  • (arXiv 2021.11) MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning, [Paper]

  • (arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]

  • (arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]

  • (arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]

  • (arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]

  • (arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]

  • (arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper], [Code]

  • (arXiv 2021.11) LAFITE: Towards Language-Free Training for Text-to-Image Generation, [Paper]

  • (arXiv 2021.11) SPARSE DETR: EFFICIENT END-TO-END OBJECT DETECTION WITH LEARNABLE SPARSITY, [Paper], [Code]

  • (arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]

  • (arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]

  • (arXiv 2021.11) Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic, [Paper], [Code]

  • (arXiv 2021.11) Blended Diffusion for Text-driven Editing of Natural Images, [Paper], [Code]

  • (arXiv 2021.11) Mask Transfiner for High-Quality Instance Segmentation, [Paper], [Code]

  • (arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper], [Code]

  • (arXiv 2021.11) Towards Tokenized Human Dynamics Representation, [Paper], [Code]

  • (arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]

  • (arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]

  • (arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]

  • (arXiv 2021.11) MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video, [Paper]

  • (arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]

  • (arXiv 2021.11) Hierarchical Modular Network for Video Captioning, [Paper]

  • (arXiv 2021.11) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, [Paper], [Code]

  • (arXiv 2021.11) An Image Patch is a Wave: Phase-Aware Vision MLP, [Paper]

  • (arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper]

  • (arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]

  • (arXiv 2021.11) Scaling Up Vision-Language Pre-training for Image Captioning, [Paper]

  • (arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]

  • (arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]

  • (arXiv 2021.11) RedCaps: Web-curated image-text data created by the people, for the people, [Paper], [Project]

  • (arXiv 2021.11) EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching, [Paper], [Code]

  • (arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper], [Code]

  • (arXiv 2021.11) Vis-TOP: Visual Transformer Overlay Processor, [Paper]

  • (arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]

  • (arXiv 2021.11) Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints, [Paper]

  • (arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]

  • (arXiv 2021.11) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, [Paper]

  • (arXiv 2021.11) Combined Scaling for Zero-shot Transfer Learning, [Paper]

  • (arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]

  • (arXiv 2021.11) IBOT: IMAGE BERT PRE-TRAINING WITH ONLINE TOKENIZER, [Paper], [Code]

  • (arXiv 2021.11) Masked Autoencoders Are Scalable Vision Learners, [Paper]

  • (arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]

  • (arXiv 2021.11) Are Transformers More Robust Than CNNs?, [Paper], [Code]

  • (arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]

  • (arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]

  • (arXiv 2021.11) VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]

  • (arXiv 2021.11) LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, [Paper], [Project]

  • (arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]

  • (arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]

2021.10

  • (arXiv 2021.10) Visual Keyword Spotting with Attention, [Paper], [Project]

  • (arXiv 2021.10) Learning Co-segmentation by Segment Swapping for Retrieval and Discovery, [Paper], [Data & Code]

  • (arXiv 2021.10) Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal Text-Video Retrieval, [Paper], [Code]

  • (arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.10) Scatterbrain: Unifying Sparse and Low-rank Attention Approximation, [Paper]

  • (arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]

  • (arXiv 2021.10) UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model, [Paper], [Data & Code]

  • (arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]

  • (arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]

  • (arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Project]

  • (arXiv 2021.10) TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation, [Paper]

  • (arXiv 2021.10) TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED EMOTION RECOGNITION, [Paper]

  • (arXiv 2021.10) Contextual Similarity Aggregation with Self-attention for Visual Re-ranking, [Paper], [Code]

  • (arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]

  • (arXiv 2021.10) IMAGE-BASED CLIP-GUIDED ESSENCE TRANSFER, [Paper], [Code]

  • (arXiv 2021.10) Sinkformers: Transformers with Doubly Stochastic Attention, [Paper]

  • (arXiv 2021.10) ILLITERATE DALL·E LEARNS TO COMPOSE, [Paper], [Project], [Code]

  • (arXiv 2021.10) Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering, [Paper]

  • (arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]

  • (arXiv 2021.10) Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation, [Paper]

  • (arXiv 2021.10) TRANSFORMER ACCELERATION WITH DYNAMIC SPARSE ATTENTION, [Paper]

  • (arXiv 2021.10) CLOOB: MODERN HOPFIELD NETWORKS WITH INFOLOOB OUTPERFORM CLIP, [Paper], [Code]

  • (arXiv 2021.10) Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization, [Paper]

  • (arXiv 2021.10) StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects, [Paper], [Project]

  • (arXiv 2021.10) Gophormer: Ego-Graph Transformer for Node Classification, [Paper]

  • (arXiv 2021.10) STRANSGAN: AN EMPIRICAL STUDY ON TRANSFORMER IN GANS, [Paper], [Code]

  • (arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]

  • (arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper], [Code]

  • (arXiv 2021.10) WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP, [Paper], [Code]

  • (arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]

  • (arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]

  • (arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]

  • (arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]

  • (arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]

  • (arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]

  • (arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.10) Leveraging MoCap Data for Human Mesh Recovery, [Paper]

  • (arXiv 2021.10) A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models, [Paper]

  • (arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]

  • (arXiv 2021.10) Multimodal Dialogue Response Generation, [Paper]

  • (arXiv 2021.10) Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals, [Paper]

  • (arXiv 2021.10) COMPOSITIONAL ATTENTION: DISENTANGLING SEARCH AND RETRIEVAL, [Paper], [Code]

  • (arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]

  • (arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper]

  • (arXiv 2021.10) Transformer with a Mixture of Gaussian Keys, [Paper]

  • (arXiv 2021.10) DIFFUSIONCLIP: TEXT-GUIDED IMAGE MANIPULATION USING DIFFUSION MODELS, [Paper]

  • (arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]

  • (arXiv 2021.10) RIPPLE ATTENTION FOR VISUAL PERCEPTION WITH SUB-QUADRATIC COMPLEXITY, [Paper]

  • (arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]

  • (arXiv 2021.10) CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation, [Paper]

  • (arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]

  • (arXiv 2021.10) SPARSE MOES MEET EFFICIENT ENSEMBLES, [Paper]

  • (arXiv 2021.10) Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent? [Paper]

  • (arXiv 2021.10) SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition, [Paper]

  • (arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper]

  • (arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]

  • (arXiv 2021.10) SUPERVISION EXISTS EVERYWHERE: A DATA EFFICIENT CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING PARADIGM, [Paper], [Code]

  • (arXiv 2021.10) CLIP4Caption ++: Multi-CLIP for Video Caption, [Paper]

  • (arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper]

  • (arXiv 2021.10) VECTOR-QUANTIZED IMAGE MODELING WITH IMPROVED VQGAN, [Paper]

  • (arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.10) NVIT: VISION TRANSFORMER COMPRESSION AND PARAMETER REDISTRIBUTION, [Paper]

  • (arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]

  • (arXiv 2021.10) CLIP-Adapter: Better Vision-Language Models with Feature Adapters, [Paper], [Code]

  • (arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Code]

  • (arXiv 2021.10) MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER, [Paper]

  • (arXiv 2021.10) TOKEN POOLING IN VISION TRANSFORMERS, [Paper]

  • (arXiv 2021.10) VIDT: AN EFFICIENT AND EFFECTIVE FULLY TRANSFORMER-BASED OBJECT DETECTOR, [Paper], [Code]

  • (arXiv 2021.10) CLIP4Caption: CLIP for Video Caption, [Paper]

  • (arXiv 2021.10) OBJECT-REGION VIDEO TRANSFORMERS, [Paper], [Code]

  • (arXiv 2021.10) LEVERAGING REDUNDANCY IN ATTENTION WITH REUSE TRANSFORMERS, [Paper]

  • (arXiv 2021.10) Dynamic Inference with Neural Interpreters, [Paper]

  • (arXiv 2021.10) A CLIP-Enhanced Method for Video-Language Understanding, [Paper]

  • (arXiv 2021.10) Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries, [Paper]

  • (arXiv 2021.10) Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection, [Paper]

  • (arXiv 2021.10) Learning Structural Representations for Recipe Generation and Food Retrieval, [Paper]

  • (arXiv 2021.10) A FREE LUNCH FROM VIT: ADAPTIVE ATTENTION MULTI-SCALE FUSION TRANSFORMER FOR FINE-GRAINED VISUAL RECOGNITION, [Paper]

2021.09

  • (arXiv 2021.09) Joint Multimedia Event Extraction from Video and Article, [Paper]

  • (arXiv 2021.09) Long-Range Transformers for Dynamic Spatiotemporal Forecasting, [Paper]

  • (arXiv 2021.09) Visually Grounded Concept Composition, [Paper]

  • (arXiv 2021.09) CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation, [Paper]

  • (arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]

  • (arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]

  • (arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]

  • (arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper], [Code]

  • (arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]

  • (arXiv 2021.09) VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, [Paper], [Code]

  • (arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]

  • (arXiv 2021.09) CLIP-It! Language-Guided Video Summarization, [Paper], [Project]

  • (arXiv 2021.09) MFEVIT: A ROBUST LIGHTWEIGHT TRANSFORMER-BASED NETWORK FOR MULTIMODAL 2D+3D FACIAL EXPRESSION RECOGNITION, [Paper]

  • (arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]

  • (arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]

  • (arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper]

  • (arXiv 2021.09) MLIM: VISION-AND-LANGUAGE MODEL PRE-TRAINING WITH MASKED LANGUAGE AND IMAGE MODELING, [Paper]

  • (arXiv 2021.09) Dense Contrastive Visual-Linguistic Pretraining, [Paper]

  • (arXiv 2021.09) CPT: COLORFUL PROMPT TUNING FOR PRE-TRAINED VISION-LANGUAGE MODELS, [Paper]

  • (arXiv 2021.09) Localizing ∞-shaped fishes: Sketch-guided object localization in the wild, [Paper], [Code]

  • (arXiv 2021.09) CLIPORT: What and Where Pathways for Robotic Manipulation, [Paper], [Project], [Code]

  • (arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]

  • (arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]

  • (arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]

  • (arXiv 2021.09) LOTR: Face Landmark Localization Using Localization Transformer, [Paper]

  • (arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]

  • (arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]

  • (arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]

  • (arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]

  • (arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]

  • (arXiv 2021.09) PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR OBJECT DETECTION, [Paper]

  • (arXiv 2021.09) ActionCLIP: A New Paradigm for Video Action Recognition, [Paper]

  • (arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]

  • (arXiv 2021.09) Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering, [Paper], [Code]

  • (arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]

  • (arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper], [Code]

  • (arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]

  • (arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]

  • (arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper]

  • (arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]

  • (arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]

  • (arXiv 2021.09) Learning to Ground Visual Objects for Visual Dialog, [Paper]

  • (arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]

  • (arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.09) IS ATTENTION BETTER THAN MATRIX DECOMPOSITION? [Paper], [Code]

  • (arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]

  • (arXiv 2021.09) Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization, [Paper]

  • (arXiv 2021.09) Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding, [Paper]

  • (arXiv 2021.09) LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation, [Paper], [Code]

  • (arXiv 2021.09) Panoptic Narrative Grounding, [Paper]

  • (arXiv 2021.09) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, [Paper]

  • (arXiv 2021.09) PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks, [Paper], [Project]

  • (arXiv 2021.09) EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, [Paper]

  • (arXiv 2021.09) Scaled ReLU Matters for Training Vision Transformers, [Paper]

  • (arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]

  • (arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper]

  • (arXiv 2021.09) WHYACT: Identifying Action Reasons in Lifestyle Vlogs, [Paper]

  • (arXiv 2021.09) Zero-Shot Open Set Detection by Extending CLIP, [Paper]

  • (arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]

  • (arXiv 2021.09) Learning to Prompt for Vision-Language Models, [Paper], [Code]

  • (arXiv 2021.09) Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss, [Paper], [Code]

  • (arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]

  • (arXiv 2021.09) ConvMLP: Hierarchical Convolutional MLPs for Vision, [Paper], [Code]

  • (arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]

  • (arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]

  • (arXiv 2021.09) Sparse-MLP: A Fully-MLP Architecture with Conditional Computation, [Paper]

  • (arXiv 2021.09) SORNet: Spatial Object-Centric Representations for Sequential Manipulation, [Paper], [Project]

  • (arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper]

  • (arXiv 2021.09) Weakly Supervised Relative Spatial Reasoning for Visual Question Answering, [Paper], [Code]

  • (arXiv 2021.09) FUSFORMER: A TRANSFORMER-BASED FUSION APPROACH FOR HYPERSPECTRAL IMAGE SUPER-RESOLUTION, [Paper]

  • (arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]

  • (arXiv 2021.09) Learning to Generate Scene Graph from Natural Language Supervision, [Paper], [Code]

  • (arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]

  • (arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]

  • (ICCV 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]

  • (arXiv 2021.09) Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation, [Paper], [Code]

  • (arXiv 2021.09) Joint Graph Learning and Matching for Semantic Feature Correspondence, [Paper]

  • (arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper], [Code]

2021.08

  • (arXiv 2021.08) SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation, [Paper]

  • (arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]

  • (arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]

  • (arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]

  • (arXiv 2021.08) Cross-category Video Highlight Detection via Set-based Learning, [Paper], [Code]

  • (arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]

  • (arXiv 2021.08) SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments, [Paper]

  • (arXiv 2021.08) LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision, [Paper], [Project]

  • (arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]

  • (arXiv 2021.08) SIMVLM: SIMPLE VISUAL LANGUAGE MODEL PRETRAINING WITH WEAK SUPERVISION, [Paper]

  • (arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]

transformer-in-vision's People

Contributors

  • dirtyharrylyl

  • eungbean
