- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models, arXiv 08/2023 [paper] [code]
- OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation, arXiv 08/2023 [paper] [code]
- RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension, arXiv 08/2023 [paper] [code]
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities, arXiv 08/2023 [paper] [code]
- LISA: Reasoning Segmentation via Large Language Model, arXiv 08/2023 [paper] [code]
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs, arXiv 07/2023 [paper] [code]
- GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest, arXiv 07/2023 [paper] [code]
- Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic, arXiv 07/2023 [paper] [code]
- Kosmos-2: Grounding Multimodal Large Language Models to the World, arXiv 06/2023 [paper] [code]
- VisorGPT: Learning Visual Prior via Generative Pre-Training, arXiv 05/2023 [paper] [code]
- VideoChat: Chat-Centric Video Understanding, arXiv 05/2023 [paper] [code]
- Contextual Object Detection with Multimodal Large Language Models, arXiv 05/2023 [paper] [code]
- Caption Anything: Interactive Image Description with Diverse Multimodal Controls, arXiv 05/2023 [paper] [code]
- ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System, arXiv 04/2023 [paper]
- What does CLIP know about a red circle? Visual prompt engineering for VLMs, arXiv 04/2023 [paper]
- PolyFormer: Referring Image Segmentation as Sequential Polygon Generation, arXiv 02/2023 [paper] [code]
- RefCOCO/RefCOCOg/RefCOCO+
- Visual Genome
- Flickr30k