Related papers: Pixel Aligned Language Models

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Visual grounding seeks to localize the image region corresponding to a free-form text description. Recently, the strong multimodal capabilities of Large Vision-Language Models (LVLMs) have driven substantial improvements in visual…

Computer Vision and Pattern Recognition · Computer Science 2025-03-11 Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang

Attention-Based Keyword Localisation in Speech using Visual Grounding

Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model…

Computation and Language · Computer Science 2021-06-24 Kayode Olaleye, Herman Kamper

An Introduction to Vision-Language Modeling

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models…

Machine Learning · Computer Science 2024-05-28 Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, Pietro Astolfi, Reyhane Askari Hemmat, Jun Chen, Kushal Tirumala, Rim Assouel, Mazda Moayeri, Arjang Talattof, Kamalika Chaudhuri, Zechun Liu, Xilun Chen, Quentin Garrido, Karen Ullrich, Aishwarya Agrawal, Kate Saenko, Asli Celikyilmaz, Vikas Chandra

Learning Visual Grounding from Generative Vision and Language Model

Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Shijie Wang, Dahun Kim, Ali Taalimi, Chen Sun, Weicheng Kuo

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed…

Computer Vision and Pattern Recognition · Computer Science 2024-03-07 Navid Rajabi, Jana Kosecka

A Picture is Worth a Thousand Words: Language Models Plan from Pixels

Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments. In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text…

Computation and Language · Computer Science 2023-03-17 Anthony Z. Liu, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee

Expressing Visual Relationships via Language

Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single image captioning. However, in various real applications (e.g., image editing, difference interpretation,…

Computation and Language · Computer Science 2019-06-20 Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, Mohit Bansal

Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models

Recent studies show that deep vision-only and language-only models--trained on disjoint modalities--nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network…

Computer Vision and Pattern Recognition · Computer Science 2025-09-26 Zoe Wanying He, Sean Trott, Meenakshi Khosla

Vision language models have difficulty recognizing virtual objects

Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question…

Computer Vision and Pattern Recognition · Computer Science 2025-05-16 Tyler Tran, Sangeet Khemlani, J. G. Trafton

Visually-Augmented Language Modeling

Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes. However, current large-scale pre-trained language models rely on text-only self-supervised training with massive text data, which…

Computation and Language · Computer Science 2023-02-28 Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei

Learning to Generate Grounded Visual Captions without Localization Supervision

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is whether the model uses the correct image regions to output particular words, or if the…

Computer Vision and Pattern Recognition · Computer Science 2020-07-21 Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira

A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

The development of Large Vision-Language Models (LVLMs) is striving to catch up with the success of Large Language Models (LLMs), yet it faces more challenges to be resolved. Very recent works enable LVLMs to localize object-level visual…

Computer Vision and Pattern Recognition · Computer Science 2024-03-20 Zhipeng Huang, Zhizheng Zhang, Zheng-Jun Zha, Yan Lu, Baining Guo

Talking Points: Describing and Localizing Pixels

Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through…

Computer Vision and Pattern Recognition · Computer Science 2025-10-17 Matan Rusanovsky, Shimon Malnick, Shai Avidan

Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception

Multimodal Large Language Model (MLLMs) leverages Large Language Models as a cognitive framework for diverse visual-language tasks. Recent efforts have been made to equip MLLMs with visual perceiving and grounding capabilities. However,…

Computer Vision and Pattern Recognition · Computer Science 2024-03-26 Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, Xuansong Xie

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing…

Computer Vision and Pattern Recognition · Computer Science 2024-04-12 Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

Large language models (LLMs) have made significant advancements in natural language understanding. However, through that enormous semantic representation that the LLM has learnt, is it somehow possible for it to understand images as well?…

Computer Vision and Pattern Recognition · Computer Science 2024-07-12 Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, Yong Jae Lee

Physically Grounded Vision-Language Models for Robotic Manipulation

Recent advances in vision-language models (VLMs) have led to improved performance on tasks such as visual question answering and image captioning. Consequently, these models are now well-positioned to reason about the physical world,…

Robotics · Computer Science 2024-03-05 Jensen Gao, Bidipta Sarkar, Fei Xia, Ted Xiao, Jiajun Wu, Brian Ichter, Anirudha Majumdar, Dorsa Sadigh

Towards Understanding Visual Grounding in Visual Language Models

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Georgios Pantazopoulos, Eda B. Özyiğit

3D-LLM: Injecting the 3D World into Large Language Models

Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves…

Computer Vision and Pattern Recognition · Computer Science 2023-07-25 Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan