Transformer-in-Vision

Recent Transformer-based computer vision (CV) and related works. Comments and contributions are welcome!

This list is kept up to date.

Resources

Survey

  • (arXiv 2022.03) A Roadmap for Big Model, [Paper]

  • (arXiv 2022.03) Transformers Meet Visual Learning Understanding: A Comprehensive Review, [Paper](https://arxiv.org/pdf/2203.12944.pdf)

  • (arXiv 2022.03) Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work, [Paper], [Project]

  • (arXiv 2022.02) A Survey of Vision-Language Pre-Trained Models, [Paper]

  • (arXiv 2022.02) VLP: A Survey on Vision-Language Pre-training, [Paper]

  • (arXiv 2022.02) Transformer for Graphs: An Overview from Architecture Perspective, [Paper]

  • (arXiv 2022.01) Video Transformers: A Survey, [Paper]

  • (arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP MLP, [Paper]

  • (arXiv 2021.11) A Survey of Visual Transformers, [Paper]

  • (arXiv 2021.09) Survey: Transformer based Video-Language Pre-training, [Paper]

  • (arXiv 2021.06) A Survey of Transformers, [Paper]

  • (arXiv 2021.06) Attention mechanisms and deep learning for machine vision: A survey of the state of the art, [Paper]

  • (arXiv 2021.06) Pre-Trained Models: Past, Present and Future, [Paper]

  • (arXiv 2021.05) Can Attention Enable MLPs To Catch Up With CNNs? [Paper]

  • (arXiv 2021.03) A Practical Survey on Faster and Lighter Transformers, [Paper]

  • (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]

  • (arXiv 2021.01) A Survey on Visual Transformer, [Paper]

  • (arXiv 2020.09) Efficient Transformers: A Survey, [Paper]

  • (arXiv 2020.01) Transformers in Vision: A Survey, [Paper]

Recent Papers

2022.04

  • (arXiv 2022.04) RELVIT: CONCEPT-GUIDED VISION TRANSFORMER FOR VISUAL RELATIONAL REASONING, [Paper]

  • (arXiv 2022.04) Unsupervised Human Action Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints Invariance, [Paper], [Code]

  • (arXiv 2022.04) Learning Future Object Prediction with a Spatiotemporal Detection Transformer, [Paper]

  • (arXiv 2022.04) R^2-Trans: Fine-Grained Visual Categorization with Redundancy Reduction, [Paper], [Code]

  • (arXiv 2022.04) A New Dataset and Transformer for Stereoscopic Video Super-Resolution, [Paper], [Code]

  • (arXiv 2022.04) Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization, [Paper]

  • (arXiv 2022.04) Multi-Scale Features and Parallel Transformers Based Image Quality Assessment, [Paper], [Code]

  • (arXiv 2022.04) BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training, [Paper]

  • (arXiv 2022.04) Human-Object Interaction Detection via Disentangled Transformer, [Paper]

  • (arXiv 2022.04) ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models, [Paper]

  • (arXiv 2022.04) Interactiveness Field in Human-Object Interactions, [Paper], [Code]

  • (arXiv 2022.04) DeiT III: Revenge of the ViT, [Paper]

  • (arXiv 2022.04) Residual Swin Transformer Channel Attention Network for Image Demosaicing, [Paper]

  • (arXiv 2022.04) Neighborhood Attention Transformer, [Paper], [Code]

  • (arXiv 2022.04) MiniViT: Compressing Vision Transformers with Weight Multiplexing, [Paper], [Code]

  • (arXiv 2022.04) ViTOL: Vision Transformer for Weakly Supervised Object Localization, [Paper], [Code]

  • (arXiv 2022.04) What Matters in Language Conditioned Robotic Imitation Learning, [Paper], [Code]

  • (arXiv 2022.04) Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes, [Paper]

  • (arXiv 2022.04) ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension, [Paper]

  • (arXiv 2022.04) Are Multimodal Transformers Robust to Missing Modality? [Paper]

  • (arXiv 2022.04) TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.04) X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks, [Paper]

  • (arXiv 2022.04) Event Transformer, [Paper]

  • (arXiv 2022.04) Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels, [Paper]

  • (arXiv 2022.04) ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation, [Paper], [Code]

  • (arXiv 2022.04) Multimodal Transformer for Nursing Activity Recognition, [Paper], [Code]

  • (arXiv 2022.04) Robust Cross-Modal Representation Learning with Progressive Self-Distillation, [Paper]

  • (arXiv 2022.04) Stripformer: Strip Transformer for Fast Image Deblurring, [Paper]

  • (arXiv 2022.04) No Token Left Behind: Explainability-Aided Image Classification and Generation, [Paper]

  • (arXiv 2022.04) Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition, [Paper], [Code]

  • (arXiv 2022.04) Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation, [Paper], [Code]

  • (arXiv 2022.04) DILEMMA: Self-Supervised Shape and Texture Learning with Transformers, [Paper]

  • (arXiv 2022.04) Learning Trajectory-Aware Transformer for Video Super-Resolution, [Paper], [Code]

  • (arXiv 2022.04) Learning to Induce Causal Structure, [Paper]

  • (arXiv 2022.04) Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection, [Paper], [Code]

  • (arXiv 2022.04) Category-Aware Transformer Network for Better Human-Object Interaction Detection, [Paper]

  • (arXiv 2022.04) Does Robustness on ImageNet Transfer to Downstream Tasks? [Paper]

  • (arXiv 2022.04) POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition, [Paper], [Code]

  • (arXiv 2022.04) Vision Transformers for Single Image Dehazing, [Paper], [Code]

  • (arXiv 2022.04) Underwater Image Enhancement Using Pre-trained Transformer, [Paper]

  • (arXiv 2022.04) Event Transformer. A sparse-aware solution for efficient event data processing, [Paper], [Code]

  • (arXiv 2022.04) PSTR: End-to-End One-Step Person Search With Transformers, [Paper], [Code]

  • (arXiv 2022.04) Adapting CLIP For Phrase Localization Without Further Training, [Paper], [Code]

  • (arXiv 2022.04) FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment, [Paper], [Project]

  • (arXiv 2022.04) DaViT: Dual Attention Vision Transformers, [Paper], [Code]

  • (arXiv 2022.04) Unsupervised Prompt Learning for Vision-Language Models, [Paper], [Code]

  • (arXiv 2022.04) Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer, [Paper], [Project]

  • (arXiv 2022.04) Unified Contrastive Learning in Image-Text-Label Space, [Paper], [Code]

  • (arXiv 2022.04) HunYuan_tvr for Text-Video Retrivial, [Paper]

  • (arXiv 2022.04) LEARNING TO COMPOSE SOFT PROMPTS FOR COMPOSITIONAL ZERO-SHOT LEARNING, [Paper]

  • (arXiv 2022.04) End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation, [Paper], [Code]

  • (arXiv 2022.04) Temporal Alignment Networks for Long-term Video, [Paper], [Code]

  • (arXiv 2022.04) Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection, [Paper], [Code]

  • (arXiv 2022.04) MixFormer: Mixing Features across Windows and Dimensions, [Paper], [Code]

  • (arXiv 2022.04) CM3: A CAUSAL MASKED MULTIMODAL MODEL OF THE INTERNET, [Paper]

  • (arXiv 2022.04) DO AS I CAN, NOT AS I SAY: GROUNDING LANGUAGE IN ROBOTIC AFFORDANCES, [Paper], [Project]

  • (arXiv 2022.04) TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization, [Paper], [Code]

  • (arXiv 2022.04) Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, [Paper], [Project]

  • (arXiv 2022.04) Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition, [Paper]

  • (arXiv 2022.04) Learning Audio-Video Modalities from Image Captions, [Paper]

  • (arXiv 2022.04) Improving Vision Transformers by Revisiting High-frequency Components, [Paper]

  • (arXiv 2022.04) POS-BERT: Point Cloud One-Stage BERT Pre-Training, [Paper], [Code]

  • (arXiv 2022.04) BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation, [Paper], [Code]

  • (arXiv 2022.04) BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning, [Paper]

  • (arXiv 2022.04) TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting, [Paper]

  • (arXiv 2022.04) Long Movie Clip Classification with State-Space Video Models, [Paper], [Code]

  • (arXiv 2022.04) TALLFormer: Temporal Action Localization with Long-memory Transformer, [Paper], [Code]

  • (arXiv 2022.04) MultiMAE: Multi-modal Multi-task Masked Autoencoders, [Paper], [Project]

  • (arXiv 2022.04) “This is my unicorn, Fluffy”: Personalizing frozen vision-language representations, [Paper]

  • (arXiv 2022.04) SE(3)-Equivariant Attention Networks for Shape Reconstruction in Function Space, [Paper]

  • (arXiv 2022.04) Multi-View Transformer for 3D Visual Grounding, [Paper], [Code]

  • (arXiv 2022.04) VISION TRANSFORMER EQUIPPED WITH NEURAL RESIZER ON FACIAL EXPRESSION RECOGNITION TASK, [Paper]

  • (arXiv 2022.04) Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition, [Paper], [Project]

  • (arXiv 2022.04) Detector-Free Weakly Supervised Group Activity Recognition, [Paper]

  • (arXiv 2022.04) Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos, [Paper], [Project]

  • (arXiv 2022.04) What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions, [Paper]

  • (arXiv 2022.04) MaxViT: Multi-Axis Vision Transformer, [Paper]

2022.03

  • (arXiv 2022.03) Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation, [Paper]

  • (arXiv 2022.03) ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval, [Paper]

  • (arXiv 2022.03) ReSTR: Convolution-free Referring Image Segmentation Using Transformers, [Paper], [Project]

  • (arXiv 2022.03) CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation, [Paper]

  • (arXiv 2022.03) Deformable Video Transformer, [Paper]

  • (arXiv 2022.03) End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps, [Paper]

  • (arXiv 2022.03) CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow, [Paper], [Code]

  • (arXiv 2022.03) VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, [Paper], [App]

  • (arXiv 2022.03) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing, [Paper], [Code]

  • (arXiv 2022.03) BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, [Paper], [Code]

  • (arXiv 2022.03) Visual Prompting: Modifying Pixel Space to Adapt Pre-trained Models, [Paper], [Code]

  • (arXiv 2022.03) Bringing Old Films Back to Life, [Paper], [Code]

  • (arXiv 2022.03) Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model, [Paper], [Code]

  • (arXiv 2022.03) SeqTR: A Simple yet Universal Network for Visual Grounding, [Paper], [Code]

  • (arXiv 2022.03) InstaFormer: Instance-Aware Image-to-Image Translation with Transformer, [Paper]

  • (arXiv 2022.03) Omni-DETR: Omni-Supervised Object Detection with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Learning Program Representations for Food Images and Cooking Recipes, [Paper], [Project]

  • (arXiv 2022.03) ITTR: Unpaired Image-to-Image Translation with Transformers, [Paper]

  • (arXiv 2022.03) VPTR: Efficient Transformers for Video Prediction, [Paper], [Code]

  • (arXiv 2022.03) Parameter-efficient Fine-tuning for Vision Transformers, [Paper]

  • (arXiv 2022.03) TubeDETR: Spatio-Temporal Video Grounding with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Exploring Plain Vision Transformer Backbones for Object Detection, [Paper]

  • (arXiv 2022.03) PROMPTDET: EXPAND YOUR DETECTOR VOCABULARY WITH UNCURATED IMAGES, [Paper], [Code]

  • (arXiv 2022.03) Few-Shot Object Detection with Fully Cross-Transformer, [Paper]

  • (arXiv 2022.03) Unified Transformer Tracker for Object Tracking, [Paper]

  • (arXiv 2022.03) X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval, [Paper], [Code]

  • (arXiv 2022.03) Fine-tuning Image Transformers using Learnable Memory, [Paper]

  • (arXiv 2022.03) MAT: Mask-Aware Transformer for Large Hole Image Inpainting, [Paper], [Code]

  • (arXiv 2022.03) mc-BEiT: Multi-choice Discretization for Image BERT Pre-training, [Paper]

  • (arXiv 2022.03) End-to-End Transformer Based Model for Image Captioning, [Paper]

  • (arXiv 2022.03) Hybrid Routing Transformer for Zero-Shot Learning, [Paper]

  • (arXiv 2022.03) TREATMENT LEARNING TRANSFORMER FOR NOISY IMAGE CLASSIFICATION, [Paper]

  • (arXiv 2022.03) Do Vision-Language Pretrained Models Learn Primitive Concepts? [Paper]

  • (arXiv 2022.03) Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs, [Paper]

  • (arXiv 2022.03) SepViT: Separable Vision Transformer, [Paper]

  • (arXiv 2022.03) MatteFormer: Transformer-Based Image Matting via Prior-Tokens, [Paper], [Code]

  • (arXiv 2022.03) Feature Selective Transformer for Semantic Image Segmentation, [Paper]

  • (arXiv 2022.03) Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos, [Paper], [Code]

  • (arXiv 2022.03) RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution, [Paper], [Code]

  • (arXiv 2022.03) Single-Stream Multi-Level Alignment for Vision-Language Pretraining, [Paper]

  • (arXiv 2022.03) Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers, [Paper], [Code]

  • (arXiv 2022.03) Collaborative Transformers for Grounded Situation Recognition, [Paper], [Code]

  • (arXiv 2022.03) Object Memory Transformer for Object Goal Navigation, [Paper]

  • (arXiv 2022.03) Brain-inspired Multilayer Perceptron with Spiking Neurons, [Paper], [Code]

  • (arXiv 2022.03) HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network, [Paper], [Code]

  • (arXiv 2022.03) REGTR: End-to-end Point Cloud Correspondences with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Automated Progressive Learning for Efficient Training of Vision Transformers, [Paper]

  • (arXiv 2022.03) Stratified Transformer for 3D Point Cloud Segmentation, [Paper], [Code]

  • (arXiv 2022.03) NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge, [Paper]

  • (arXiv 2022.03) FACIAL EXPRESSION RECOGNITION WITH SWIN TRANSFORMER, [Paper]

  • (arXiv 2022.03) Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness, [Paper]

  • (arXiv 2022.03) Efficient Visual Tracking via Hierarchical Cross-Attention Transformer, [Paper], [Code]

  • (arXiv 2022.03) High-Performance Transformer Tracking, [Paper], [Code]

  • (arXiv 2022.03) RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers, [Paper]

  • (arXiv 2022.03) Multi-modal Multi-label Facial Action Unit Detection with Transformer, [Paper]

  • (arXiv 2022.03) MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection, [Paper], [Code]

  • (arXiv 2022.03) Text to Mesh Without 3D Supervision Using Limit Subdivision, [Paper], [Project]

  • (arXiv 2022.03) GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection, [Paper], [Code]

  • (arXiv 2022.03) CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation, [Paper]

  • (arXiv 2022.03) FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks, [Paper], [Code]

  • (arXiv 2022.03) Vision Transformer Compression with Structured Pruning and Low Rank Approximation, [Paper]

  • (arXiv 2022.03) Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers, [Paper]

  • (arXiv 2022.03) MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection, [Paper]

  • (arXiv 2022.03) Learning Patch-to-Cluster Attention in Vision Transformer, [Paper]

  • (arXiv 2022.03) Visual Prompt Tuning, [Paper]

  • (arXiv 2022.03) Training-free Transformer Architecture Search, [Paper]

  • (arXiv 2022.03) VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, [Paper], [Code]

  • (arXiv 2022.03) METAMORPH: LEARNING UNIVERSAL CONTROLLERS WITH TRANSFORMERS, [Paper], [Project]

  • (arXiv 2022.03) A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning, [Paper]

  • (arXiv 2022.03) Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers, [Paper], [Project]

  • (arXiv 2022.03) Associating Objects with Scalable Transformers for Video Object Segmentation, [Paper], [Project](https://github.com/z-x-yang/AOT)

  • (arXiv 2022.03) HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation, [Paper], [Code]

  • (arXiv 2022.03) Learning to generate line drawings that convey geometry and semantics, [Paper], [Project]

  • (arXiv 2022.03) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection, [Paper], [Code]

  • (arXiv 2022.03) AIMusicGuru: Music Assisted Human Pose Correction, [Paper]

  • (arXiv 2022.03) What to Hide from Your Students: Attention-Guided Masked Image Modeling, [Paper]

  • (arXiv 2022.03) Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer, [Paper]

  • (arXiv 2022.03) ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator, [Paper]

  • (arXiv 2022.03) Keypoints Tracking via Transformer Networks, [Paper], [Code]

  • (arXiv 2022.03) Beyond Fixation: Dynamic Window Visual Transformer, [Paper], [Code]

  • (arXiv 2022.03) Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors, [Paper]

  • (arXiv 2022.03) Self-supervised Video-centralised Transformer for Video Face Clustering, [Paper]

  • (arXiv 2022.03) Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization, [Paper]

  • (arXiv 2022.03) Global Tracking Transformers, [Paper], [Code]

  • (arXiv 2022.03) Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer, [Paper], [Code]

  • (arXiv 2022.03) QS-Craft: Learning to Quantize, Scrabble and Craft for Conditional Human Motion Animation, [Paper]

  • (arXiv 2022.03) Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos, [Paper], [Project]

  • (arXiv 2022.03) GradViT: Gradient Inversion of Vision Transformers, [Paper], [Code]

  • (arXiv 2022.03) Mask Usage Recognition using Vision Transformer with Transfer Learning and Data Augmentation, [Paper]

  • (arXiv 2022.03) Under the Hood of Transformer Networks for Trajectory Forecasting, [Paper]

  • (arXiv 2022.03) Open-Vocabulary DETR with Conditional Matching, [Paper]

  • (arXiv 2022.03) Meta-attention for ViT-backed Continual Learning, [Paper], [Code]

  • (arXiv 2022.03) CNNs and Transformers Perceive Hybrid Images Similar to Humans, [Paper], [Code]

  • (arXiv 2022.03) Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory, [Paper], [Code]

  • (arXiv 2022.03) Affective Feedback Synthesis Towards Multimodal Text and Image Data, [Paper]

  • (arXiv 2022.03) ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers, [Paper]

  • (arXiv 2022.03) CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration, [Paper]

  • (arXiv 2022.03) Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds, [Paper], [Code]

  • (arXiv 2022.03) HIPA: Hierarchical Patch Transformer for Single Image Super Resolution, [Paper]

  • (arXiv 2022.03) DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition, [Paper], [Code]

  • (arXiv 2022.03) MixFormer: End-to-End Tracking with Iterative Mixed Attention, [Paper], [Code]

  • (arXiv 2022.03) PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark, [Paper], [Code]

  • (arXiv 2022.03) Relationformer: A Unified Framework for Image-to-Graph Generation, [Paper], [Code]

  • (arXiv 2022.03) CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning, [Paper], [Code]

  • (arXiv 2022.03) Hyperbolic Vision Transformers: Combining Improvements in Metric Learning, [Paper], [Code]

  • (arXiv 2022.03) MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer, [Paper], [Code]

  • (arXiv 2022.03) Transformer-based HTR for Historical Documents, [Paper]

  • (arXiv 2022.03) simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers, [Paper], [Code]

  • (arXiv 2022.03) End-to-End Human-Gaze-Target Detection with Transformers, [Paper]

  • (arXiv 2022.03) End-to-End Video Text Spotting with Transformer, [Paper], [Code]

  • (arXiv 2022.03) Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation, [Paper], [Code]

  • (arXiv 2022.03) V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer, [Paper]

  • (arXiv 2022.03) LocATe: End-to-end Localization of Actions in 3D with Transformers, [Paper]

  • (arXiv 2022.03) AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder, [Paper]

  • (arXiv 2022.03) ViM: Out-Of-Distribution with Virtual-logit Matching, [Paper], [Code]

  • (arXiv 2022.03) ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer, [Paper]

  • (arXiv 2022.03) Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows, [Paper]

  • (arXiv 2022.03) Vision Transformer with Convolutions Architecture Search, [Paper]

  • (arXiv 2022.03) Cascade Transformers for End-to-End Person Search, [Paper], [Code]

  • (arXiv 2022.03) CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance, [Paper]

  • (arXiv 2022.03) MatchFormer: Interleaving Attention in Transformers for Feature Matching, [Paper], [Code]

  • (arXiv 2022.03) Local-Global Context Aware Transformer for Language-Guided Video Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Three things everyone should know about Vision Transformers, [Paper]

  • (arXiv 2022.03) Are Vision Transformers Robust to Spurious Correlations? [Paper], [Code]

  • (arXiv 2022.03) MUTUAL GENERATIVE TRANSFORMER LEARNING FOR CROSS-VIEW GEO-LOCALIZATION, [Paper]

  • (arXiv 2022.03) DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training, [Paper]

  • (arXiv 2022.03) Semantic-aligned Fusion Transformer for One-shot Object Detection, [Paper]

  • (arXiv 2022.03) UNIMO-2: End-to-End Unified Vision-Language Grounded Learning, [Paper], [Code]

  • (arXiv 2022.03) Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning, [Paper], [Code]

  • (arXiv 2022.03) One-Shot Adaptation of GAN in Just One CLIP, [Paper]

  • (arXiv 2022.03) PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation, [Paper]

  • (arXiv 2022.03) PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer, [Paper]

  • (arXiv 2022.03) Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image, [Paper], [Code]

  • (arXiv 2022.03) Transframer: Arbitrary Frame Prediction with Generative Models, [Paper]

  • (arXiv 2022.03) Towards Data-Efficient Detection Transformers, [Paper], [Code]

  • (arXiv 2022.03) Bi-directional Object-Context Prioritization Learning for Saliency Ranking, [Paper], [Code]

  • (arXiv 2022.03) PATCH-FOOL: ARE VISION TRANSFORMERS ALWAYS ROBUST AGAINST ADVERSARIAL PERTURBATIONS? [Paper], [Code]

  • (arXiv 2022.03) WegFormer: Transformers for Weakly Supervised Semantic Segmentation, [Paper]

  • (arXiv 2022.03) Open Set Recognition using Vision Transformer with an Additional Detection Head, [Paper], [Code]

  • (arXiv 2022.03) UNIFIED VISUAL TRANSFORMER COMPRESSION, [Paper], [Code]

  • (arXiv 2022.03) Towards Practical Certifiable Patch Defense with Vision Transformer, [Paper]

  • (arXiv 2022.03) EDTER: Edge Detection with Transformer, [Paper], [Code]

  • (arXiv 2022.03) ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation, [Paper]

  • (arXiv 2022.03) Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution, [Paper]

  • (arXiv 2022.03) Revitalize Region Feature for Democratizing Video-Language Pre-training, [Paper], [Code]

  • (arXiv 2022.03) Inverted Pyramid Multi-task Transformer for Dense Scene Understanding, [Paper]

  • (arXiv 2022.03) Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Style Transformer for Image Inversion and Editing, [Paper], [Code]

  • (arXiv 2022.03) MotionCLIP: Exposing Human Motion Generation to CLIP Space, [Paper], [Project]

  • (arXiv 2022.03) The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy, [Paper], [Code]

  • (arXiv 2022.03) Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation, [Paper]

  • (arXiv 2022.03) Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning, [Paper], [Code]

  • (arXiv 2022.03) Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting, [Paper]

  • (arXiv 2022.03) DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection, [Paper]

  • (arXiv 2022.03) DATR: Domain-adaptive transformer for multi-domain landmark detection, [Paper]

  • (arXiv 2022.03) EventFormer: AU Event Transformer for Facial Action Unit Event Detection, [Paper]

  • (arXiv 2022.03) Accelerating DETR Convergence via Semantic-Aligned Matching, [Paper], [Code]

  • (arXiv 2022.03) All in One: Exploring Unified Video-Language Pre-training, [Paper], [Code]

  • (arXiv 2022.03) CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment, [Paper]

  • (arXiv 2022.03) EIT: Efficiently Lead Inductive Biases to ViT, [Paper], [Code]

  • (arXiv 2022.03) Self-Promoted Supervision for Few-Shot Transformer, [Paper], [Code]

  • (arXiv 2022.03) MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization, [Paper]

  • (arXiv 2022.03) Disentangled Representation Learning for Text-Video Retrieval, [Paper]

  • (arXiv 2022.03) TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding, [Paper], [Dataset]

  • (arXiv 2022.03) Visualizing and Understanding Patch Interactions in Vision Transformer, [Paper]

  • (arXiv 2022.03) ANTI-OVERSMOOTHING IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS: FROM THEORY TO PRACTICE, [Paper], [Code]

  • (arXiv 2022.03) Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision, [Paper], [Code]

  • (arXiv 2022.03) ActiveMLP: An MLP-like Architecture with Active Token Mixer, [Paper], [Code]

  • (arXiv 2022.03) Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding, [Paper]

  • (arXiv 2022.03) TrueType Transformer: Character and Font Style Recognition in Outline Format, [Paper]

  • (arXiv 2022.03) LOOPITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval, [Paper]

  • (arXiv 2022.03) MVP: Multimodality-guided Visual Pre-training, [Paper]

  • (arXiv 2022.03) DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting, [Paper]

  • (arXiv 2022.03) Multi-Modal Mixup for Robust Fine-tuning, [Paper]

  • (arXiv 2022.03) AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant, [Paper], [Project]

  • (arXiv 2022.03) Coarse-to-Fine Vision Transformer, [Paper], [Code]

  • (arXiv 2022.03) Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers, [Paper]

  • (arXiv 2022.03) WAVEMIX: RESOURCE-EFFICIENT TOKEN MIXING FOR IMAGES, [Paper]

  • (arXiv 2022.03) VOVIT: LOW LATENCY GRAPH-BASED AUDIO-VISUAL VOICE SEPARATION TRANSFORMER, [Paper], [Code]

  • (arXiv 2022.03) Graph Attention Transformer Network for Multi-Label Image Classification, [Paper]

  • (arXiv 2022.03) EDGEFORMER: IMPROVING LIGHT-WEIGHT CONVNETS BY LEARNING FROM VISION TRANSFORMERS, [Paper], [Code]

  • (arXiv 2022.03) Skating-Mixer: Multimodal MLP for Scoring Figure Skating, [Paper]

  • (arXiv 2022.03) Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention, [Paper]

  • (arXiv 2022.03) CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction, [Paper]

  • (arXiv 2022.03) Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning, [Paper]

  • (arXiv 2022.03) ChiTransformer: Towards Reliable Stereo from Cues, [Paper]

  • (arXiv 2022.03) A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection, [Paper], [Code]

  • (arXiv 2022.03) Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction, [Paper]

  • (arXiv 2022.03) CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Multiscale Transformer for Hyperspectral Image Classification, [Paper]

  • (arXiv 2022.03) Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning, [Paper], [Code]

  • (arXiv 2022.03) Autoregressive Image Generation using Residual Quantization, [Paper]

  • (arXiv 2022.03) CONTEXTFORMER: A TRANSFORMER WITH SPATIO-CHANNEL ATTENTION FOR CONTEXT MODELING IN LEARNED IMAGE COMPRESSION, [Paper]

  • (arXiv 2022.03) Patch Similarity Aware Data-Free Quantization for Vision Transformers, [Paper]

  • (arXiv 2022.03) ViT-P: Rethinking Data-efficient Vision Transformers from Locality, [Paper]

  • (arXiv 2022.03) DIT: SELF-SUPERVISED PRE-TRAINING FOR DOCUMENT IMAGE TRANSFORMER, [Paper]

  • (arXiv 2022.03) Towards Efficient and Scalable Sharpness-Aware Minimization, [Paper]

  • (arXiv 2022.03) HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening, [Paper], [Code]

  • (arXiv 2022.03) UVCGAN: UNET VISION TRANSFORMER CYCLE-CONSISTENT GAN FOR UNPAIRED IMAGE-TO-IMAGE TRANSLATION, [Paper], [Code]

  • (arXiv 2022.03) Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, [Paper], [Code]

  • (arXiv 2022.03) PANFORMER: A TRANSFORMER BASED MODEL FOR PAN-SHARPENING, [Paper], [Code]

  • (arXiv 2022.03) Multi-class Token Transformer for Weakly Supervised Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Cross Language Image Matching for Weakly Supervised Semantic Segmentation, [Paper]

  • (arXiv 2022.03) Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2022.03) DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, [Paper], [Code]

  • (arXiv 2022.03) MetaFormer: A Unified Meta Framework for Fine-Grained Recognition, [Paper], [Code]

  • (arXiv 2022.03) Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language, [Paper]

  • (arXiv 2022.03) Knowledge Amalgamation for Object Detection with Transformers, [Paper]

  • (arXiv 2022.03) Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos, [Paper]

  • (arXiv 2022.03) Modeling Coreference Relations in Visual Dialog, [Paper], [Code]

  • (arXiv 2022.03) VITRANSPAD: VIDEO TRANSFORMER USING CONVOLUTION AND SELF-ATTENTION FOR FACE PRESENTATION ATTACK DETECTION, [Paper]

  • (arXiv 2022.03) Multi-Tailed Vision Transformer for Efficient Inference, [Paper]

  • (arXiv 2022.03) Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology, [Paper]

  • (arXiv 2022.03) LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network, [Paper], [Code]

  • (arXiv 2022.03) LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction, [Paper]

  • (arXiv 2022.03) DCT-Former: Efficient Self-Attention with Discrete Cosine Transform, [Paper], [Code]

  • (arXiv 2022.03) Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment, [Paper]

  • (arXiv 2022.03) Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in Point Cloud, [Paper]

  • (arXiv 2022.03) CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP, [Paper]

  • (arXiv 2022.03) MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video, [Paper]

  • (arXiv 2022.03) X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning, [Paper]

  • (arXiv 2022.03) 3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification, [Paper]

  • (arXiv 2022.03) DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation, [Paper]

  • (arXiv 2022.03) D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention, [Paper]

  • (arXiv 2022.03) Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding, [Paper], [Code]

  • (arXiv 2022.03) Self-supervised Transformer for Deepfake Detection, [Paper]

  • (arXiv 2022.03) Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions, [Paper]

  • (arXiv 2022.03) TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration, [Paper], [Code]

  • (arXiv 2022.03) DN-DETR: Accelerate DETR Training by Introducing Query DeNoising, [Paper], [Code]

  • (arXiv 2022.03) Protecting Celebrities with Identity Consistency Transformer, [Paper]

  • (arXiv 2022.03) Masked Visual Pre-training for Motor Control, [Paper], [Project]

  • (arXiv 2022.03) NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, [Paper], [Code]

  • (arXiv 2022.03) Conditional Prompt Learning for Vision-Language Models, [Paper], [Code]

  • (arXiv 2022.03) Lane Detection with Versatile AtrousFormer and Local Semantic Guidance, [Paper]

  • (arXiv 2022.03) Forecasting Characteristic 3D Poses of Human Actions, [Paper], [Code]

2022.02

  • (arXiv 2022.02) Bayesian Structure Learning with Generative Flow Networks, [Paper]

  • (arXiv 2022.02) Towards Unsupervised Domain Adaptation via Domain-Transformer, [Paper]

  • (arXiv 2022.02) An End-to-End Transformer Model for Crowd Localization, [Paper]

  • (arXiv 2022.02) Instantaneous Physiological Estimation using Video Transformers, [Paper], [Code]

  • (arXiv 2022.02) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation, [Paper], [Code]

  • (arXiv 2022.02) ATTENTION ENABLES ZERO APPROXIMATION ERROR, [Paper]

  • (arXiv 2022.02) When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection, [Paper], [Code]

  • (arXiv 2022.02) AUTO-SCALING VISION TRANSFORMERS WITHOUT TRAINING, [Paper], [Code]

  • (arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation, [Paper], [Project]

  • (arXiv 2022.02) LEARNING TO MERGE TOKENS IN VISION TRANSFORMERS, [Paper]

  • (arXiv 2022.02) ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers, [Paper], [Code]

  • (arXiv 2022.02) SELF-SUPERVISED TRANSFORMERS FOR UNSUPERVISED OBJECT DISCOVERY USING NORMALIZED CUT, [Paper], [Project]

  • (arXiv 2022.02) Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis, [Paper]

  • (arXiv 2022.02) CaMEL: Mean Teacher Learning for Image Captioning, [Paper]

  • (arXiv 2022.02) Hierarchical Perceiver, [Paper]

  • (arXiv 2022.02) Movies2Scenes: Learning Scene Representations Using Movie Similarities, [Paper]

  • (arXiv 2022.02) GroupViT: Semantic Segmentation Emerges from Text Supervision, [Paper], [Code]

  • (arXiv 2022.02) Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer, [Paper], [Code]

  • (arXiv 2022.02) Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations, [Paper]

  • (arXiv 2022.02) ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond, [Paper]

  • (arXiv 2022.02) PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths, [Paper], [Code]

  • (arXiv 2022.02) DataMUX: Data Multiplexing for Neural Networks, [Paper], [Code]

  • (arXiv 2022.02) On Guiding Visual Attention with Language Specification, [Paper]

  • (arXiv 2022.02) SPATIO-TEMPORAL OUTDOOR LIGHTING AGGREGATION ON IMAGE SEQUENCES USING TRANSFORMER NETWORKS, [Paper]

  • (arXiv 2022.02) MISINFORMATION DETECTION IN SOCIAL MEDIA VIDEO POSTS, [Paper]

  • (arXiv 2022.02) Can Deep Learning be Applied to Model-Based Multi-Object Tracking? [Paper]

  • (arXiv 2022.02) NOT ALL PATCHES ARE WHAT YOU NEED: EXPEDITING VISION TRANSFORMERS VIA TOKEN REORGANIZATIONS, [Paper], [Code]

  • (arXiv 2022.02) ActionFormer: Localizing Moments of Actions with Transformers, [Paper], [Code]

  • (arXiv 2022.02) One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones, [Paper]

  • (arXiv 2022.02) XAI for Transformers: Better Explanations through Conservative Propagation, [Paper]

  • (arXiv 2022.02) MeshLeTemp: Leveraging the Learnable Vertex-Vertex Relationship to Generalize Human Pose and Mesh Reconstruction for In-the-Wild Scenes, [Paper]

  • (arXiv 2022.02) ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer, [Paper]

  • (arXiv 2022.02) Hyper-relationship Learning Network for Scene Graph Generation, [Paper]

  • (arXiv 2022.02) CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval, [Paper]

  • (arXiv 2022.02) Flowformer: Linearizing Transformers with Conservation Flows, [Paper]

  • (arXiv 2022.02) DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following, [Paper], [Code]

  • (arXiv 2022.02) CATs++: Boosting Cost Aggregation with Convolutions and Transformers, [Paper]

  • (arXiv 2022.02) Geometric Transformer for Fast and Robust Point Cloud Registration, [Paper], [Code]

  • (arXiv 2022.02) I-Tuning: Tuning Language Models with Image for Caption Generation, [Paper]

  • (arXiv 2022.02) Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval, [Paper], [Code]

  • (arXiv 2022.02) Visual Acoustic Matching, [Paper]

  • (arXiv 2022.02) LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling, [Paper]

  • (arXiv 2022.02) BViT: Broad Attention based Vision Transformer, [Paper], [Code]

  • (arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation, [Paper]

  • (arXiv 2022.02) Domain Adaptation via Prompt Learning, [Paper]

  • (arXiv 2022.02) Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs, [Paper], [Code]

  • (arXiv 2022.02) Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework, [Paper], [Project]

  • (arXiv 2022.02) HOW DO VISION TRANSFORMERS WORK? [Paper], [Code]

  • (arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning, [Paper], [Code]

  • (arXiv 2022.02) CLIPasso: Semantically-Aware Object Sketching, [Paper], [Code]

  • (arXiv 2022.02) Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer, [Paper]

  • (arXiv 2022.02) DEEP SOCCER CAPTIONING WITH TRANSFORMER: DATASET, SEMANTICS-RELATED LOSSES, AND MULTI-LEVEL EVALUATION, [Paper], [Project]

  • (arXiv 2022.02) ENTROFORMER: A TRANSFORMER-BASED ENTROPY MODEL FOR LEARNED IMAGE COMPRESSION, [Paper], [Code]

  • (arXiv 2022.02) Image Difference Captioning with Pre-training and Contrastive Learning, [Paper], [Code]

  • (arXiv 2022.02) MaskGIT: Masked Generative Image Transformer, [Paper]

  • (arXiv 2022.02) Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning, [Paper]

  • (arXiv 2022.02) Motion-Aware Transformer For Occluded Person Re-identification, [Paper]

  • (arXiv 2022.02) Conditional Motion In-betweening, [Paper], [Code]

  • (arXiv 2022.02) Memory-based gaze prediction in deep imitation learning for robot manipulation, [Paper]

  • (arXiv 2022.02) Spherical Transformer, [Paper]

  • (arXiv 2022.02) OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context, [Paper]

  • (arXiv 2022.02) The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning, [Paper], [Project]

  • (arXiv 2022.02) DALL-EVAL: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [Paper], [Code]

  • (arXiv 2022.02) Pre-Trained Language Models for Interactive Decision-Making, [Paper]

  • (arXiv 2022.02) TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer, [Paper]

  • (arXiv 2022.02) The devil is in the labels: Semantic segmentation from sentences, [Paper]

  • (arXiv 2022.02) Webly Supervised Concept Expansion for General Purpose Vision Models, [Paper], [Project]

  • (arXiv 2022.02) VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG, [Paper]

  • (arXiv 2022.02) UNIFYING ARCHITECTURES, TASKS, AND MODALITIES THROUGH A SIMPLE SEQUENCE-TO-SEQUENCE LEARNING FRAMEWORK, [Paper], [Code]

  • (arXiv 2022.02) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper]

  • (arXiv 2022.02) TRANSDREAMER: REINFORCEMENT LEARNING WITH TRANSFORMER WORLD MODELS, [Paper]

  • (arXiv 2022.02) Vision-Language Pre-Training with Triple Contrastive Learning, [Paper], [Code]

  • (arXiv 2022.02) Corrupted Image Modeling for Self-Supervised Visual Pre-Training, [Paper]

  • (arXiv 2022.02) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, [Paper], [Code]

  • (arXiv 2022.02) DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators, [Paper]

  • (arXiv 2022.02) Interactron: Embodied Adaptive Object Detection, [Paper]

  • (arXiv 2022.02) Local Feature Matching with Transformers for Low-end Devices: LoFTR Method Adaptation Approach, [Paper], [Code]

  • (arXiv 2022.02) Can Transformers be Strong Treatment Effect Estimators? [Paper]

  • (arXiv 2022.02) Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers, [Paper]

  • (arXiv 2022.02) Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics, [Paper], [Code]

2022.01

  • (arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [Paper]

  • (arXiv 2022.01) DynaMixer: A Vision MLP Architecture with Dynamic Mixing, [Paper]

  • (arXiv 2022.01) VRT: A Video Restoration Transformer, [Paper], [Code]

  • (arXiv 2022.01) DAB-DETR: DYNAMIC ANCHOR BOXES ARE BETTER QUERIES FOR DETR, [Paper], [Code]

  • (arXiv 2022.01) Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations, [Paper]

  • (arXiv 2022.01) MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment, [Paper]

  • (arXiv 2022.01) VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training, [Paper]

  • (arXiv 2022.01) BOAT: Bilateral Local Attention Vision Transformer, [Paper]

  • (arXiv 2022.01) GRAPH SELF-ATTENTION FOR LEARNING GRAPH REPRESENTATION WITH TRANSFORMER, [Paper]

  • (arXiv 2022.01) Aggregating Global Features into Local Vision Transformer, [Paper], [Code]

  • (arXiv 2022.01) Transformer Module Networks for Systematic Generalization in Visual Question Answering, [Paper]

  • (arXiv 2022.01) Generalised Image Outpainting with U-Transformer, [Paper]

  • (arXiv 2022.01) RelTR: Relation Transformer for Scene Graph Generation, [Paper]

  • (arXiv 2022.01) DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer, [Paper]

  • (arXiv 2022.01) Pre-Trained Language Transformers are Universal Image Classifiers, [Paper]

  • (arXiv 2022.01) Explore and Match: End-to-End Video Grounding with Transformer, [Paper]

  • (arXiv 2022.01) TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network, [Paper]

  • (arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals, [Paper]

  • (arXiv 2022.01) ShapeFormer: Transformer-based Shape Completion via Sparse Representation, [Paper], [Project]

  • (arXiv 2022.01) CONVOLUTIONAL XFORMERS FOR VISION, [Paper], [Code]

  • (arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [Paper], [Code]

  • (arXiv 2022.01) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [Paper]

  • (arXiv 2022.01) SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering, [Paper]

  • (arXiv 2022.01) DUAL-TASKS SIAMESE TRANSFORMER FRAMEWORK FOR BUILDING DAMAGE ASSESSMENT, [Paper]

  • (arXiv 2022.01) When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism, [Paper], [Code]

  • (arXiv 2022.01) Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation, [Paper]

  • (arXiv 2022.01) Training Vision Transformers with Only 2040 Images, [Paper]

  • (arXiv 2022.01) Learning To Recognize Procedural Activities with Distant Supervision, [Paper]

  • (arXiv 2022.01) EVALUATING LANGUAGE-BIASED IMAGE CLASSIFICATION BASED ON SEMANTIC REPRESENTATIONS, [Paper]

  • (arXiv 2022.01) A Comprehensive Study of Vision Transformers on Dense Prediction Tasks, [Paper]

  • (arXiv 2022.01) UniFormer: Unifying Convolution and Self-attention for Visual Recognition, [Paper], [Code]

  • (arXiv 2022.01) Patches Are All You Need? [Paper], [Code]

  • (arXiv 2022.01) Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval, [Paper]

  • (arXiv 2022.01) LEARNING TO ACT WITH AFFORDANCE-AWARE MULTIMODAL NEURAL SLAM, [Paper]

  • (arXiv 2022.01) Visual Information Guided Zero-Shot Paraphrase Generation, [Paper]

  • (arXiv 2022.01) TerViT: An Efficient Ternary Vision Transformer, [Paper]

  • (arXiv 2022.01) End-to-end Generative Pretraining for Multimodal Video Captioning, [Paper]

  • (arXiv 2022.01) OMNIVORE: A Single Model for Many Visual Modalities, [Paper], [Project]

  • (arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, [Paper]

  • (arXiv 2022.01) The CLEAR Benchmark: Continual LEArning on Real-World Imagery, [Paper], [Project]

  • (arXiv 2022.01) ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues, [Paper]

  • (arXiv 2022.01) Cross-modal Contrastive Distillation for Instructional Activity Anticipation, [Paper]

  • (arXiv 2022.01) Transformers in Action: Weakly Supervised Action Segmentation, [Paper]

  • (arXiv 2022.01) VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer, [Paper]

  • (arXiv 2022.01) CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks, [Paper]

  • (arXiv 2022.01) Domain Adaptation via Bidirectional Cross-Attention Transformer, [Paper]

  • (arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for Online Inference, [Paper]

  • (arXiv 2022.01) Motion Inbetweening via Deep ∆-Interpolator, [Paper]

  • (arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper]

  • (arXiv 2022.01) GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events, [Paper]

  • (arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [Paper]

  • (arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]

  • (arXiv 2022.01) Disentangled Latent Transformer for Interpretable Monocular Height Estimation, [Paper], [Project]

  • (arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]

  • (arXiv 2022.01) SWINUNET3D - A HIERARCHICAL ARCHITECTURE FOR DEEP TRAFFIC PREDICTION USING SHIFTED WINDOW TRANSFORMERS, [Paper], [Code]

  • (arXiv 2022.01) SWIN-POSE: SWIN TRANSFORMER BASED HUMAN POSE ESTIMATION, [Paper]

  • (arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Project]

  • (arXiv 2022.01) ViT2Hash: Unsupervised Information-Preserving Hashing, [Paper]

  • (arXiv 2022.01) LANGUAGE-DRIVEN SEMANTIC SEGMENTATION, [Paper], [Code]

  • (arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]

  • (arXiv 2022.01) ImageSubject: A Large-scale Dataset for Subject Detection, [Paper]

  • (arXiv 2022.01) Detecting Twenty-thousand Classes using Image-level Supervision, [Paper], [Code]

  • (arXiv 2022.01) Generalized Category Discovery, [Paper], [Code]

  • (arXiv 2022.01) Video Summarization Based on Video-text Modelling, [Paper]

  • (arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]

  • (arXiv 2022.01) QUADTREE ATTENTION FOR VISION TRANSFORMERS, [Paper], [Code]

  • (arXiv 2022.01) A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval, [Paper], [Project]

  • (arXiv 2022.01) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]

  • (arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]

  • (arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]

  • (arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]

  • (arXiv 2022.01) HYPERTRANSFORMER: MODEL GENERATION FOR SUPERVISED AND SEMI-SUPERVISED FEW-SHOT LEARNING, [Paper]

  • (arXiv 2022.01) UNIFORMER: UNIFIED TRANSFORMER FOR EFFICIENT SPATIOTEMPORAL REPRESENTATION LEARNING, [Paper], [Code]

  • (arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Project]

  • (arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]

  • (arXiv 2022.01) CLIP-Event: Connecting Text and Images with Event Structures, [Paper], [Code]

  • (arXiv 2022.01) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training, [Paper]

  • (arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]

  • (arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]

  • (arXiv 2022.01) TransVPR: Transformer-based Place Recognition with Multi-level Attention Aggregation, [Paper]

  • (arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]

  • (arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]

  • (arXiv 2022.01) Stochastic Layers in Vision Transformers, [Paper]

  • (arXiv 2022.01) ERNIE-VILG: UNIFIED GENERATIVE PRE-TRAINING FOR BIDIRECTIONAL VISION-LANGUAGE GENERATION, [Paper]

  • (arXiv 2022.01) InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer, [Paper], [Code]

  • (arXiv 2022.01) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]

  • (arXiv 2022.01) Persformer: A Transformer Architecture for Topological Machine Learning, [Paper]

  • (arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]

  • (arXiv 2022.01) Language as Queries for Referring Video Object Segmentation, [Paper], [Code]

  • (arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]

  • (arXiv 2022.01) A TRANSFORMER-BASED SIAMESE NETWORK FOR CHANGE DETECTION, [Paper], [Code]

  • (arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]

  • (arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]

  • (arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

2021.12

  • (arXiv 2021.12) Multi-Dimensional Model Compression of Vision Transformer, [Paper]

  • (arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]

  • (arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]

  • (arXiv 2021.12) APRIL: Finding the Achilles’ Heel on Privacy for Vision Transformers, [Paper]

  • (arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper]

  • (arXiv 2021.12) Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?, [Paper]

  • (arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]

  • (arXiv 2021.12) A FISTFUL OF WORDS: LEARNING TRANSFERABLE VISUAL MODELS FROM BAG-OF-WORDS SUPERVISION, [Paper]

  • (arXiv 2021.12) StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2, [Paper], [Code]

  • (arXiv 2021.12) A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model, [Paper], [Code]

  • (arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]

  • (arXiv 2021.12) SIMVIT: EXPLORING A SIMPLE VISION TRANSFORMER WITH SLIDING WINDOWS, [Paper], [Code]

  • (arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]

  • (arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]

  • (arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]

  • (arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]

  • (arXiv 2021.12) ViR: the Vision Reservoir, [Paper]

  • (arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]

  • (arXiv 2021.12) Open-Vocabulary Image Segmentation, [Paper]

  • (arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]

  • (arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]

  • (arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]

  • (arXiv 2021.12) Fine-grained Multi-Modal Self-Supervised Learning, [Paper]

  • (arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]

  • (arXiv 2021.12) CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes, [Paper]

  • (arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input Adaptation, [Paper]

  • (arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2021.12) Contrastive Object Detection Using Knowledge Graph Embeddings, [Paper]

  • (arXiv 2021.12) RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality, [Paper], [Code]

  • (arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]

  • (arXiv 2021.12) MPViT : Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]

  • (arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]

  • (arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]

  • (arXiv 2021.12) LOCFORMER: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]

  • (arXiv 2021.12) Tell me what you see: A zero-shot action recognition method based on natural language descriptions, [Paper], [Code]

  • (arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]

  • (arXiv 2021.12) ScanQA: 3D Question Answering for Spatial Scene Understanding, [Paper]

  • (arXiv 2021.12) Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [Paper]

  • (arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]

  • (arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]

  • (arXiv 2021.12) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, [Paper], [Code]

  • (arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]

  • (arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]

  • (arXiv 2021.12) Align and Prompt: Video-and-Language Pre-training with Entity Prompts, [Paper], [Code]

  • (arXiv 2021.12) DATA EFFICIENT LANGUAGE-SUPERVISED ZEROSHOT RECOGNITION WITH OPTIMAL TRANSPORT DISTILLATION, [Paper]

  • (arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]

  • (arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]

  • (arXiv 2021.12) ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources, [Paper]

  • (arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper]

  • (arXiv 2021.12) How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation, [Paper]

  • (arXiv 2021.12) Learning to Prompt for Continual Learning, [Paper], [Code]

  • (arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]

  • (arXiv 2021.12) Dense Video Captioning Using Unsupervised Semantic Information, [Paper], [Code]

  • (arXiv 2021.12) Looking Outside the Box to Ground Language in 3D Scenes, [Paper], [Code]

  • (arXiv 2021.12) RegionCLIP: Region-based Language-Image Pretraining, [Paper], [Code]

  • (arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper]

  • (arXiv 2021.12) Masked Feature Prediction for Self-Supervised Visual Pre-Training, [Paper]

  • (arXiv 2021.12) SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning, [Paper]

  • (arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]

  • (arXiv 2021.12) Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos, [Paper], [Code]

  • (arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]

  • (arXiv 2021.12) QAHOI: Query-Based Anchors for Human-Object Interaction Detection, [Paper], [Code]

  • (arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]

  • (arXiv 2021.12) CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations, [Paper]

  • (arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]

  • (arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]

  • (arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]

  • (arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]

  • (arXiv 2021.12) COMPOSER: Compositional Learning of Group Activity in Videos, [Paper]

  • (arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]

  • (arXiv 2021.12) Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection, [Paper]

  • (arXiv 2021.12) SVIP: Sequence VerIfication for Procedures in Videos, [Paper]

  • (arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]

  • (arXiv 2021.12) VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]

  • (arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]

  • (arXiv 2021.12) PartGlot: Learning Shape Part Segmentation from Language Reference Games, [Paper]

  • (arXiv 2021.12) Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network, [Paper]

  • (arXiv 2021.12) LEARNING SEMANTIC-ALIGNED FEATURE REPRESENTATION FOR TEXT-BASED PERSON SEARCH, [Paper]

  • (arXiv 2021.12) L-Verse: Bidirectional Generation Between Image and Text, [Paper]

  • (arXiv 2021.12) SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY, [Paper]

  • (arXiv 2021.12) Are Vision Transformers Robust to Patch Perturbations? [Paper]

  • (arXiv 2021.12) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]

  • (arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]

  • (arXiv 2021.12) MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning, [Paper]

  • (arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]

  • (arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]

  • (arXiv 2021.12) Rethinking the Two-Stage Framework for Grounded Situation Recognition, [Paper], [Code]

  • (arXiv 2021.12) CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions, [Paper]

  • (arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]

  • (arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]

  • (arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]

  • (arXiv 2021.12) Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training, [Paper], [Code]

  • (arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]

  • (arXiv 2021.12) Grounded Language-Image Pre-training, [Paper], [Code]

  • (arXiv 2021.12) U^2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper]

  • (arXiv 2021.12) ADAPTIVE CHANNEL ENCODING TRANSFORMER FOR POINT CLOUD ANALYSIS, [Paper]

  • (arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]

  • (arXiv 2021.12) VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts, [Paper]

  • (arXiv 2021.12) PointCLIP: Point Cloud Understanding by CLIP, [Paper], [Code]

  • (arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]

  • (arXiv 2021.12) DYNAMIC TOKEN NORMALIZATION IMPROVES VISION TRANSFORMER, [Paper], [Code]

  • (arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2021.12) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]

  • (arXiv 2021.12) Text2Mesh: Text-Driven Neural Stylization for Meshes, [Paper], [Project]

  • (arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]

  • (arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]

  • (arXiv 2021.12) FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization, [Paper], [Code]

  • (arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]

  • (arXiv 2021.12) Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer, [Paper], [Code]

  • (arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]

  • (arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper]

  • (arXiv 2021.12) Transformer based trajectory prediction, [Paper]

  • (arXiv 2021.12) Evaluating Transformers for Lightweight Action Recognition, [Paper]

  • (arXiv 2021.12) Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision, [Paper]

  • (arXiv 2021.12) CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification, [Paper]

  • (arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]

  • (arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]

  • (arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]

  • (arXiv 2021.12) Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning, [Paper]

  • (arXiv 2021.12) AUDIO-VISUAL SYNCHRONISATION IN THE WILD, [Paper], [Project]

  • (arXiv 2021.12) Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs, [Paper]

  • (arXiv 2021.12) Garment4D: Garment Reconstruction from Point Cloud Sequences, [Paper], [Code]

  • (arXiv 2021.12) Locally Shifted Attention With Early Global Integration, [Paper], [Code]

  • (arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]

  • (arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Project]

  • (arXiv 2021.12) HairCLIP: Design Your Hair by Text and Reference Image, [Paper], [Project]

  • (arXiv 2021.12) CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, [Paper], [Code]

  • (arXiv 2021.12) A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code], [Dataset]

  • (arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper], [Code]

  • (arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]

  • (arXiv 2021.12) Fast Point Transformer, [Paper]

  • (arXiv 2021.12) Assistive Tele-op: Leveraging Transformers to Collect Robotic Task Demonstrations, [Paper], [Project]

  • (arXiv 2021.12) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper]

  • (arXiv 2021.12) PatchFormer: An Efficient Point Transformer with Patch Attention, [Paper]

  • (arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]

  • (arXiv 2021.12) MLP Architectures for Vision-and-Language Modeling: An Empirical Study, [Paper], [Code]

  • (arXiv 2021.12) Everything at Once – Multi-modal Fusion Transformer for Video Retrieval, [Paper]

  • (arXiv 2021.12) Prompting Visual-Language Models for Efficient Video Understanding, [Paper], [Project]

  • (arXiv 2021.12) FLAVA: A Foundational Language And Vision Alignment Model, [Paper]

  • (arXiv 2021.12) Embedding Arithmetic for Text-driven Image Transformation, [Paper]

  • (arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper]

  • (arXiv 2021.12) Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos, [Paper], [Project]

  • (arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]

  • (arXiv 2021.12) DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting, [Paper], [Code]

  • (arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]

  • (arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]

  • (arXiv 2021.12) Zero-Shot Text-Guided Object Generation with Dream Fields, [Paper], [Project]

  • (arXiv 2021.12) Video-Text Pre-training with Learned Regions, [Paper], [Code]

  • (arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]

  • (arXiv 2021.12) TCTN: A 3D-Temporal Convolutional Transformer Network for Spatiotemporal Predictive Learning, [Paper]

  • (arXiv 2021.12) DenseCLIP: Extract Free Dense Labels from CLIP, [Paper]

  • (arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper]

  • (arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]

  • (arXiv 2021.12) Object-Centric Unsupervised Image Captioning, [Paper]

  • (arXiv 2021.12) Vision Pair Learning: An Efficient Training Framework for Image Classification, [Paper]

  • (arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]

  • (arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]

  • (arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]

  • (arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]

  • (arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]

  • (arXiv 2021.12) Human-Object Interaction Detection via Weak Supervision, [Paper]

  • (arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]

  • (arXiv 2021.12) CLIPstyler: Image Style Transfer with a Single Text Condition, [Paper]

  • (arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]

  • (arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]

  • (arXiv 2021.12) Object-aware Video-language Pre-training for Retrieval, [Paper], [Code]

2021.11

  • (arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]

  • (arXiv 2021.11) Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model, [Paper]

  • (arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]

  • (arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]

  • (arXiv 2021.11) ADAPTIVE FOURIER NEURAL OPERATORS: EFFICIENT TOKEN MIXERS FOR TRANSFORMERS, [Paper]

  • (arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]

  • (arXiv 2021.11) DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning, [Paper], [Code]

  • (arXiv 2021.11) Ice hockey player identification via transformers, [Paper]

  • (arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]

  • (arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]

  • (arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]

  • (arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]

  • (arXiv 2021.11) DISCRETE REPRESENTATIONS STRENGTHEN VISION TRANSFORMER ROBUSTNESS, [Paper]

  • (arXiv 2021.11) TRAVLR: Now You See It, Now You Don’t! Evaluating Cross-Modal Transfer of Visio-Linguistic Reasoning, [Paper]

  • (arXiv 2021.11) Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling, [Paper]

  • (arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]

  • (arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]

  • (arXiv 2021.11) ZERO-SHOT CERTIFIED DEFENSE AGAINST ADVERSARIAL PATCHES WITH VISION TRANSFORMERS, [Paper]

  • (arXiv 2021.11) PointMixer: MLP-Mixer for Point Cloud Understanding, [Paper]

  • (arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]

  • (arXiv 2021.11) Florence: A New Foundation Model for Computer Vision, [Paper]

  • (arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]

  • (arXiv 2021.11) Learning to Compose Visual Relations, [Paper], [Project]

  • (arXiv 2021.11) REFERENCE-BASED MAGNETIC RESONANCE IMAGE RECONSTRUCTION USING TEXTURE TRANSFORMER, [Paper]

  • (arXiv 2021.11) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval, [Paper]

  • (arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]

  • (arXiv 2021.11) SimMIM: A Simple Framework for Masked Image Modeling, [Paper], [Code]

  • (arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]

  • (arXiv 2021.11) Simple but Effective: CLIP Embeddings for Embodied AI, [Paper]

  • (arXiv 2021.11) ClipCap: CLIP Prefix for Image Captioning, [Paper], [Code]

  • (arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]

  • (arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]

  • (arXiv 2021.11) Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, [Paper], [Code]

  • (arXiv 2021.11) Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning, [Paper], [Code]

  • (arXiv 2021.11) Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement, [Paper], [Code]

  • (arXiv 2021.11) Tracking People with 3D Representations, [Paper], [Code]

  • (arXiv 2021.11) LiT: Zero-Shot Transfer with Locked-image Text Tuning, [Paper]

  • (arXiv 2021.11) FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING, [Paper]

  • (arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code]

  • (arXiv 2021.11) Attention Approximates Sparse Distributed Memory, [Paper]

  • (arXiv 2021.11) SLICED RECURSIVE TRANSFORMER, [Paper], [Code]

  • (arXiv 2021.11) HYBRID BYOL-VIT: EFFICIENT APPROACH TO DEAL WITH SMALL DATASETS, [Paper]

  • (arXiv 2021.11) Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, [Paper], [Code]

  • (arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]

  • (arXiv 2021.11) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis, [Paper], [Code]

  • (arXiv 2021.11) Revisiting spatio-temporal layouts for compositional action recognition, [Paper], [Code]

  • (arXiv 2021.11) PatchGame: Learning to Signal Mid-level Patches in Referential Games, [Paper], [Code]

  • (arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM CONVOLUTION? [Paper]

  • (arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]

  • (arXiv 2021.11) With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, [Paper], [Code]

  • (arXiv 2021.11) IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning, [Paper], [Project]

  • (arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]

  • (arXiv 2021.11) VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval, [Paper]

  • (arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]

  • (arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]

  • (arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]

  • (arXiv 2021.11) ML-Decoder: Scalable and Versatile Classification Head, [Paper], [Code]

  • (arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.11) SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]

  • (arXiv 2021.11) Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization, [Paper]

  • (arXiv 2021.11) Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation, [Paper]

  • (arXiv 2021.11) Sparse is Enough in Scaling Transformers, [Paper]

  • (arXiv 2021.11) An implementation of the “Guess who?” game using CLIP, [Paper], [Code]

  • (arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]

  • (arXiv 2021.11) A Unified Pruning Framework for Vision Transformers, [Paper]

  • (arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]

  • (arXiv 2021.11) AssistSR: Affordance-centric Question-driven Video Segment Retrieval, [Paper], [Code & Data]

  • (arXiv 2021.11) DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation, [Paper], [Code]

  • (arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]

  • (arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]

  • (arXiv 2021.11) CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning, [Paper]

  • (arXiv 2021.11) CRIS: CLIP-Driven Referring Image Segmentation, [Paper]

  • (arXiv 2021.11) Shunted Self-Attention via Multi-Scale Token Aggregation, [Paper], [Code]

  • (arXiv 2021.11) MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning, [Paper]

  • (arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]

  • (arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]

  • (arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]

  • (arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]

  • (arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]

  • (arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper], [Code]

  • (arXiv 2021.11) LAFITE: Towards Language-Free Training for Text-to-Image Generation, [Paper]

  • (arXiv 2021.11) SPARSE DETR: EFFICIENT END-TO-END OBJECT DETECTION WITH LEARNABLE SPARSITY, [Paper], [Code]

  • (arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]

  • (arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]

  • (arXiv 2021.11) Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic, [Paper], [Code]

  • (arXiv 2021.11) Blended Diffusion for Text-driven Editing of Natural Images, [Paper], [Code]

  • (arXiv 2021.11) Mask Transfiner for High-Quality Instance Segmentation, [Paper], [Code]

  • (arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper], [Code]

  • (arXiv 2021.11) Towards Tokenized Human Dynamics Representation, [Paper], [Code]

  • (arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]

  • (arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]

  • (arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]

  • (arXiv 2021.11) MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video, [Paper]

  • (arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]

  • (arXiv 2021.11) Hierarchical Modular Network for Video Captioning, [Paper]

  • (arXiv 2021.11) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, [Paper], [Code]

  • (arXiv 2021.11) An Image Patch is a Wave: Phase-Aware Vision MLP, [Paper]

  • (arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper]

  • (arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]

  • (arXiv 2021.11) Scaling Up Vision-Language Pre-training for Image Captioning, [Paper]

  • (arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]

  • (arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]

  • (arXiv 2021.11) RedCaps: Web-curated image-text data created by the people, for the people, [Paper], [Project]

  • (arXiv 2021.11) EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching, [Paper], [Code]

  • (arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper], [Code]

  • (arXiv 2021.11) Vis-TOP: Visual Transformer Overlay Processor, [Paper]

  • (arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]

  • (arXiv 2021.11) Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints, [Paper]

  • (arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]

  • (arXiv 2021.11) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, [Paper]

  • (arXiv 2021.11) Combined Scaling for Zero-shot Transfer Learning, [Paper]

  • (arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]

  • (arXiv 2021.11) IBOT: IMAGE BERT PRE-TRAINING WITH ONLINE TOKENIZER, [Paper], [Code]

  • (arXiv 2021.11) Masked Autoencoders Are Scalable Vision Learners, [Paper]

  • (arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]

  • (arXiv 2021.11) Are Transformers More Robust Than CNNs?, [Paper], [Code]

  • (arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]

  • (arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]

  • (arXiv 2021.11) VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]

  • (arXiv 2021.11) LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, [Paper], [Project]

  • (arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]

  • (arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]

2021.10

  • (arXiv 2021.10) Visual Keyword Spotting with Attention, [Paper], [Project]

  • (arXiv 2021.10) Learning Co-segmentation by Segment Swapping for Retrieval and Discovery, [Paper], [Data & Code]

  • (arXiv 2021.10) Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal Text-Video Retrieval, [Paper], [Code]

  • (arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.10) Scatterbrain: Unifying Sparse and Low-rank Attention Approximation, [Paper]

  • (arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]

  • (arXiv 2021.10) UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model, [Paper], [Data & Code]

  • (arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]

  • (arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]

  • (arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Project]

  • (arXiv 2021.10) TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation, [Paper]

  • (arXiv 2021.10) TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED EMOTION RECOGNITION, [Paper]

  • (arXiv 2021.10) Contextual Similarity Aggregation with Self-attention for Visual Re-ranking, [Paper], [Code]

  • (arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]

  • (arXiv 2021.10) IMAGE-BASED CLIP-GUIDED ESSENCE TRANSFER, [Paper], [Code]

  • (arXiv 2021.10) Sinkformers: Transformers with Doubly Stochastic Attention, [Paper]

  • (arXiv 2021.10) ILLITERATE DALL·E LEARNS TO COMPOSE, [Paper], [Project], [Code]

  • (arXiv 2021.10) Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering, [Paper]

  • (arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]

  • (arXiv 2021.10) Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation, [Paper]

  • (arXiv 2021.10) TRANSFORMER ACCELERATION WITH DYNAMIC SPARSE ATTENTION, [Paper]

  • (arXiv 2021.10) CLOOB: MODERN HOPFIELD NETWORKS WITH INFOLOOB OUTPERFORM CLIP, [Paper], [Code]

  • (arXiv 2021.10) Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization, [Paper]

  • (arXiv 2021.10) StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects, [Paper], [Project]

  • (arXiv 2021.10) Gophormer: Ego-Graph Transformer for Node Classification, [Paper]

  • (arXiv 2021.10) STRANSGAN: AN EMPIRICAL STUDY ON TRANSFORMER IN GANS, [Paper], [Code]

  • (arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]

  • (arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper], [Code]

  • (arXiv 2021.10) WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP, [Paper], [Code]

  • (arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]

  • (arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]

  • (arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]

  • (arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]

  • (arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]

  • (arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]

  • (arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.10) Leveraging MoCap Data for Human Mesh Recovery, [Paper]

  • (arXiv 2021.10) A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models, [Paper]

  • (arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]

  • (arXiv 2021.10) Multimodal Dialogue Response Generation, [Paper]

  • (arXiv 2021.10) Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals, [Paper]

  • (arXiv 2021.10) COMPOSITIONAL ATTENTION: DISENTANGLING SEARCH AND RETRIEVAL, [Paper], [Code]

  • (arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]

  • (arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper]

  • (arXiv 2021.10) Transformer with a Mixture of Gaussian Keys, [Paper]

  • (arXiv 2021.10) DIFFUSIONCLIP: TEXT-GUIDED IMAGE MANIPULATION USING DIFFUSION MODELS, [Paper]

  • (arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]

  • (arXiv 2021.10) RIPPLE ATTENTION FOR VISUAL PERCEPTION WITH SUB-QUADRATIC COMPLEXITY, [Paper]

  • (arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]

  • (arXiv 2021.10) CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation, [Paper]

  • (arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]

  • (arXiv 2021.10) SPARSE MOES MEET EFFICIENT ENSEMBLES, [Paper]

  • (arXiv 2021.10) Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent? [Paper]

  • (arXiv 2021.10) SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition, [Paper]

  • (arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper]

  • (arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]

  • (arXiv 2021.10) SUPERVISION EXISTS EVERYWHERE: A DATA EFFICIENT CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING PARADIGM, [Paper], [Code]

  • (arXiv 2021.10) CLIP4Caption ++: Multi-CLIP for Video Caption, [Paper]

  • (arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper]

  • (arXiv 2021.10) VECTOR-QUANTIZED IMAGE MODELING WITH IMPROVED VQGAN, [Paper]

  • (arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.10) NVIT: VISION TRANSFORMER COMPRESSION AND PARAMETER REDISTRIBUTION, [Paper]

  • (arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]

  • (arXiv 2021.10) CLIP-Adapter: Better Vision-Language Models with Feature Adapters, [Paper], [Code]

  • (arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Code]

  • (arXiv 2021.10) MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER, [Paper]

  • (arXiv 2021.10) TOKEN POOLING IN VISION TRANSFORMERS, [Paper]

  • (arXiv 2021.10) VIDT: AN EFFICIENT AND EFFECTIVE FULLY TRANSFORMER-BASED OBJECT DETECTOR, [Paper], [Code]

  • (arXiv 2021.10) CLIP4Caption: CLIP for Video Caption, [Paper]

  • (arXiv 2021.10) OBJECT-REGION VIDEO TRANSFORMERS, [Paper], [Code]

  • (arXiv 2021.10) LEVERAGING REDUNDANCY IN ATTENTION WITH REUSE TRANSFORMERS, [Paper]

  • (arXiv 2021.10) Dynamic Inference with Neural Interpreters, [Paper]

  • (arXiv 2021.10) A CLIP-Enhanced Method for Video-Language Understanding, [Paper]

  • (arXiv 2021.10) Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries, [Paper]

  • (arXiv 2021.10) Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection, [Paper]

  • (arXiv 2021.10) Learning Structural Representations for Recipe Generation and Food Retrieval, [Paper]

  • (arXiv 2021.10) A FREE LUNCH FROM VIT: ADAPTIVE ATTENTION MULTI-SCALE FUSION TRANSFORMER FOR FINE-GRAINED VISUAL RECOGNITION, [Paper]

2021.09

  • (arXiv 2021.09) Joint Multimedia Event Extraction from Video and Article, [Paper]

  • (arXiv 2021.09) Long-Range Transformers for Dynamic Spatiotemporal Forecasting, [Paper]

  • (arXiv 2021.09) Visually Grounded Concept Composition, [Paper]

  • (arXiv 2021.09) CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation, [Paper]

  • (arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]

  • (arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]

  • (arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]

  • (arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper], [Code]

  • (arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]

  • (arXiv 2021.09) VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, [Paper], [Code]

  • (arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]

  • (arXiv 2021.09) CLIP-It! Language-Guided Video Summarization, [Paper], [Project]

  • (arXiv 2021.09) MFEVIT: A ROBUST LIGHTWEIGHT TRANSFORMER-BASED NETWORK FOR MULTIMODAL 2D+3D FACIAL EXPRESSION RECOGNITION, [Paper]

  • (arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]

  • (arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]

  • (arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper]

  • (arXiv 2021.09) MLIM: VISION-AND-LANGUAGE MODEL PRE-TRAINING WITH MASKED LANGUAGE AND IMAGE MODELING, [Paper]

  • (arXiv 2021.09) Dense Contrastive Visual-Linguistic Pretraining, [Paper]

  • (arXiv 2021.09) CPT: COLORFUL PROMPT TUNING FOR PRE-TRAINED VISION-LANGUAGE MODELS, [Paper]

  • (arXiv 2021.09) Localizing ∞-shaped fishes: Sketch-guided object localization in the wild, [Paper], [Code]

  • (arXiv 2021.09) CLIPORT: What and Where Pathways for Robotic Manipulation, [Paper], [Project], [Code]

  • (arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]

  • (arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]

  • (arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]

  • (arXiv 2021.09) LOTR: Face Landmark Localization Using Localization Transformer, [Paper]

  • (arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]

  • (arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]

  • (arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]

  • (arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]

  • (arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]

  • (arXiv 2021.09) PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR OBJECT DETECTION, [Paper]

  • (arXiv 2021.09) ActionCLIP: A New Paradigm for Video Action Recognition, [Paper]

  • (arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]

  • (arXiv 2021.09) Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering, [Paper], [Code]

  • (arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]

  • (arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper], [Code]

  • (arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]

  • (arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]

  • (arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper]

  • (arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]

  • (arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]

  • (arXiv 2021.09) Learning to Ground Visual Objects for Visual Dialog, [Paper]

  • (arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]

  • (arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.09) IS ATTENTION BETTER THAN MATRIX DECOMPOSITION? [Paper], [Code]

  • (arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]

  • (arXiv 2021.09) Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization, [Paper]

  • (arXiv 2021.09) Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding, [Paper]

  • (arXiv 2021.09) LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation, [Paper], [Code]

  • (arXiv 2021.09) Panoptic Narrative Grounding, [Paper]

  • (arXiv 2021.09) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, [Paper]

  • (arXiv 2021.09) PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks, [Paper], [Project]

  • (arXiv 2021.09) EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, [Paper]

  • (arXiv 2021.09) Scaled ReLU Matters for Training Vision Transformers, [Paper]

  • (arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]

  • (arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper]

  • (arXiv 2021.09) WHYACT: Identifying Action Reasons in Lifestyle Vlogs, [Paper]

  • (arXiv 2021.09) Zero-Shot Open Set Detection by Extending CLIP, [Paper]

  • (arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]

  • (arXiv 2021.09) Learning to Prompt for Vision-Language Models, [Paper], [Code]

  • (arXiv 2021.09) Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss, [Paper], [Code]

  • (arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]

  • (arXiv 2021.09) ConvMLP: Hierarchical Convolutional MLPs for Vision, [Paper], [Code]

  • (arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]

  • (arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]

  • (arXiv 2021.09) Sparse-MLP: A Fully-MLP Architecture with Conditional Computation, [Paper]

  • (arXiv 2021.09) SORNet: Spatial Object-Centric Representations for Sequential Manipulation, [Paper], [Project]

  • (arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper]

  • (arXiv 2021.09) Weakly Supervised Relative Spatial Reasoning for Visual Question Answering, [Paper], [Code]

  • (arXiv 2021.09) FUSFORMER: A TRANSFORMER-BASED FUSION APPROACH FOR HYPERSPECTRAL IMAGE SUPER-RESOLUTION, [Paper]

  • (arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]

  • (arXiv 2021.09) Learning to Generate Scene Graph from Natural Language Supervision, [Paper], [Code]

  • (arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]

  • (arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]

  • (ICCV 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]

  • (arXiv 2021.09) Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation, [Paper], [Code]

  • (arXiv 2021.09) Joint Graph Learning and Matching for Semantic Feature Correspondence, [Paper]

  • (arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper], [Code]

2021.08

  • (arXiv 2021.08) SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation, [Paper]

  • (arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]

  • (arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]

  • (arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]

  • (arXiv 2021.08) Cross-category Video Highlight Detection via Set-based Learning, [Paper], [Code]

  • (arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]

  • (arXiv 2021.08) SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments, [Paper]

  • (arXiv 2021.08) LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision, [Paper], [Project]

  • (arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]

  • (arXiv 2021.08) SIMVLM: SIMPLE VISUAL LANGUAGE MODEL PRETRAINING WITH WEAK SUPERVISION, [Paper]

  • (arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]

transformer-in-vision's People

Contributors

  • dirtyharrylyl

  • eungbean
