Related papers: Multimodal Function Vectors for Spatial Relations

Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Jing Bi, Junjia Guo, Yunlong Tang, Lianggong Bruce Wen, Zhang Liu, Chenliang Xu

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one…

Computer Vision and Pattern Recognition · Computer Science 2024-12-23 Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig

Function Vectors in Large Language Models

We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning…

Computation and Language · Computer Science 2024-02-27 Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM,…

Computer Vision and Pattern Recognition · Computer Science 2024-11-12 Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, Yuan-Fang Li

Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules

Despite excelling on multimodal benchmarks, vision-language models (VLMs) largely remain a black box. In this paper, we propose a novel interpretability framework to systematically analyze the internal mechanisms of VLMs, focusing on the…

Artificial Intelligence · Computer Science 2025-12-12 Yanbei Jiang, Xueqi Ma, Shu Liu, Sarah Monazam Erfani, Tongliang Liu, James Bailey, Jey Han Lau, Krista A. Ehinger

On the Performance of Multimodal Language Models

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently…

Computation and Language · Computer Science 2023-11-29 Utsav Garg, Erhan Bas

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing…

Computer Vision and Pattern Recognition · Computer Science 2025-11-04 Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu

Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks. Despite strong performance, LMMs' generative outputs are not specialized for vision-language classification tasks (i.e.,…

Computer Vision and Pattern Recognition · Computer Science 2025-06-10 Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig

Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) utilize multimodal contexts consisting of text, images, or videos to solve various multimodal tasks. However, we find that changing the order of multimodal input can cause the model's performance to…

Artificial Intelligence · Computer Science 2024-10-23 Zhijie Tan, Xu Chu, Weiping Li, Tong Mo

Large Language Models Facilitate Vision Reflection in Image Classification

This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Guoyuan An, JaeYoon Kim, SungEui Yoon

See What You Are Told: Visual Attention Sink in Large Multimodal Models

Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Ideally, these models should focus on key visual information relevant to the text token. However,…

Computer Vision and Pattern Recognition · Computer Science 2025-03-06 Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang

Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant…

Computation and Language · Computer Science 2025-10-14 Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, Manling Li

Large Vision-Language Models Get Lost in Attention

Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of…

Artificial Intelligence · Computer Science 2026-05-08 Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin, Xiaoshuai Hao, Kun Wang, Wendong Wang

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning.…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang

Relational Knowledge Distillation Using Fine-tuned Function Vectors

Representing relations between concepts is a core prerequisite for intelligent systems to make sense of the world. Recent work using causal mediation analysis has shown that a small set of attention heads encodes task representation in…

Computation and Language · Computer Science 2026-01-14 Andrea Kang, Yingnian Wu, Hongjing Lu

Towards Understanding Multimodal Fine-Tuning: Spatial Features

Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs)…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel

Probing Multimodal Large Language Models for Global and Local Semantic Representations

The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications in understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving…

Computation and Language · Computer Science 2024-11-22 Mingxu Tao, Quzhe Huang, Kun Xu, Liwei Chen, Yansong Feng, Dongyan Zhao

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

Large multimodal models (LMMs) show strong visual-linguistic reasoning but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like human…

Artificial Intelligence · Computer Science 2026-04-10 Baining Zhao, Ziyou Wang, Jianjie Fang, Zile Zhou, Yanggang Xu, Yatai Ji, Jiacheng Xu, Qian Zhang, Weichen Zhang, Chen Gao, Xinlei Chen

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

In this work, we propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs) through test-time optimization of a learnable latent variable. We observe that attention, as the core module of MLLMs,…

Computer Vision and Pattern Recognition · Computer Science 2025-01-08 Mingrui Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Guannan Jiang, Xiaoshuai Sun, Rongrong Ji