Related papers: Multimodal Function Vectors for Spatial Relations

200 papers

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Jing Bi, Junjia Guo, Yunlong Tang, Lianggong Bruce Wen, Zhang Liu, Chenliang Xu

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one…

Computer Vision and Pattern Recognition · Computer Science 2024-12-23 Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig

We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning…

Computation and Language · Computer Science 2024-02-27 Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau

Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM,…

Computer Vision and Pattern Recognition · Computer Science 2024-11-12 Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, Yuan-Fang Li

Despite excelling on multimodal benchmarks, vision-language models (VLMs) largely remain a black box. In this paper, we propose a novel interpretability framework to systematically analyze the internal mechanisms of VLMs, focusing on the…

Artificial Intelligence · Computer Science 2025-12-12 Yanbei Jiang, Xueqi Ma, Shu Liu, Sarah Monazam Erfani, Tongliang Liu, James Bailey, Jey Han Lau, Krista A. Ehinger

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently…

Computation and Language · Computer Science 2023-11-29 Utsav Garg, Erhan Bas

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing…

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks. Despite strong performance, LMMs' generative outputs are not specialized for vision-language classification tasks (i.e.,…

Computer Vision and Pattern Recognition · Computer Science 2025-06-10 Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig

Multimodal Large Language Models (MLLMs) utilize multimodal contexts consisting of text, images, or videos to solve various multimodal tasks. However, we find that changing the order of multimodal input can cause the model's performance to…

Artificial Intelligence · Computer Science 2024-10-23 Zhijie Tan, Xu Chu, Weiping Li, Tong Mo

This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Guoyuan An, JaeYoon Kim, SungEui Yoon

Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Ideally, these models should focus on key visual information relevant to the text token. However,…

Computer Vision and Pattern Recognition · Computer Science 2025-03-06 Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang

Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant…

Computation and Language · Computer Science 2025-10-14 Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, Manling Li

Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of…

Artificial Intelligence · Computer Science 2026-05-08 Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin, Xiaoshuai Hao, Kun Wang, Wendong Wang

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning.…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang

Representing relations between concepts is a core prerequisite for intelligent systems to make sense of the world. Recent work using causal mediation analysis has shown that a small set of attention heads encodes task representation in…

Computation and Language · Computer Science 2026-01-14 Andrea Kang, Yingnian Wu, Hongjing Lu

Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff

The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs)…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel

The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications in understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving…

Computation and Language · Computer Science 2024-11-22 Mingxu Tao, Quzhe Huang, Kun Xu, Liwei Chen, Yansong Feng, Dongyan Zhao

Large multimodal models (LMMs) show strong visual-linguistic reasoning but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like human…

Artificial Intelligence · Computer Science 2026-04-10 Baining Zhao, Ziyou Wang, Jianjie Fang, Zile Zhou, Yanggang Xu, Yatai Ji, Jiacheng Xu, Qian Zhang, Weichen Zhang, Chen Gao, Xinlei Chen

In this work, we propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs) through test-time optimization of a learnable latent variable. We observe that attention, as the core module of MLLMs,…

Computer Vision and Pattern Recognition · Computer Science 2025-01-08 Mingrui Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Guannan Jiang, Xiaoshuai Sun, Rongrong Ji