Related papers: EGM: Efficient Visual Grounding Language Models

Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?

Despite significant progress in multimodal language models (LMs), it remains unclear whether visual grounding enhances their understanding of embodied knowledge compared to text-only models. To address this question, we propose a novel…

Computation and Language · Computer Science 2025-10-21 Zhihui Yang, Yupei Wang, Kaijie Mo, Zhe Zhao, Renfen Hu

Towards Understanding Visual Grounding in Visual Language Models

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Georgios Pantazopoulos, Eda B. Özyiğit

E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer

To build scalable models for challenging real-world tasks, it is important to learn from diverse, multi-modal data in various forms (e.g., videos, text, and images). Among the existing works, a plethora of them have focused on leveraging…

Computer Vision and Pattern Recognition · Computer Science 2023-11-30 Jacob Zhiyuan Fang, Skyler Zheng, Vasu Sharma, Robinson Piramuthu

Event-Priori-Based Vision-Language Model for Efficient Visual Understanding

Large Language Model (LLM)-based Vision-Language Models (VLMs) have substantially extended the boundaries of visual understanding capabilities. However, their high computational demands hinder deployment on resource-constrained edge…

Computer Vision and Pattern Recognition · Computer Science 2025-06-10 Haotong Qin, Cheng Hu, Michele Magno

Learning Visual Grounding from Generative Vision and Language Model

Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Shijie Wang, Dahun Kim, Ali Taalimi, Chen Sun, Weicheng Kuo

EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models

While large vision-language models (VLMs) are increasingly adopted as the perceptual backbone for embodied agents, existing benchmarks often rely on question-answering or multiple-choice formats. These protocols allow models to exploit…

Computer Vision and Pattern Recognition · Computer Science 2026-05-19 Haozhe Shan, Xiancong Ren, Han Dong, Haoyuan Shi, Yingji Zhang, Jiayu Hu, Yi Zhang, Yong Dai, Bin Shen, Lizhen Qu, Zenglin Xu, Xiaozhu Ju

Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models

Large Vision-Language Models (LVLMs) offer remarkable benefits for a variety of vision-language tasks. However, a challenge hindering their application in real-world scenarios, particularly regarding safety, robustness, and reliability, is…

Computer Vision and Pattern Recognition · Computer Science 2024-01-17 Jiaying Lu, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Baochen Sun, Carl Yang, Jie Yang

Towards General Continuous Memory for Vision-Language Models

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science 2025-07-09 Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special…

Computer Vision and Pattern Recognition · Computer Science 2025-12-15 Weitai Kang, Jason Kuen, Mengwei Ren, Zijun Wei, Yan Yan, Kangning Liu

A Survey on Efficient Vision-Language Models

Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high…

Computer Vision and Pattern Recognition · Computer Science 2025-07-03 Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, Nirmalya Roy

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

By combining natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning…

Computer Vision and Pattern Recognition · Computer Science 2025-12-23 Siming Yan, Min Bai, Weifeng Chen, Xiong Zhou, Qixing Huang, Li Erran Li

Visual Generation Tuning

Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations,…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang

Vision-Language Models for Edge Networks: A Comprehensive Survey

Vision Large Language Models (VLMs) combine visual understanding with natural language processing, enabling tasks like image captioning, visual question answering, and video analysis. While VLMs show impressive capabilities across domains…

Computer Vision and Pattern Recognition · Computer Science 2025-06-18 Ahmed Sharshar, Latif U. Khan, Waseem Ullah, Mohsen Guizani

Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision

Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we…

Computer Vision and Pattern Recognition · Computer Science 2025-10-17 Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang

Eye Gaze Tells You Where to Compute: Gaze-Driven Efficient VLMs

Vision-Language Models (VLMs) deliver impressive performance in understanding visual content with language instructions. However, redundancy in vision tokens results in the degenerated inference efficiency of VLMs, which hinders real-time…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Qinyu Chen, Jiawen Qi

Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-17 Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao

Visual-ERM: Reward Modeling for Visual Equivalence

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs)…

Computer Vision and Pattern Recognition · Computer Science 2026-05-12 Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang

eP-ALM: Efficient Perceptual Augmentation of Language Models

Large Language Models (LLMs) have so far impressed the world, with unprecedented capabilities that emerge in models at large scales. On the vision side, transformer models (i.e., ViT) are following the same trend, achieving the best…

Computer Vision and Pattern Recognition · Computer Science 2023-10-30 Mustafa Shukor, Corentin Dancette, Matthieu Cord

Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts

Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. There…

Computer Vision and Pattern Recognition · Computer Science 2025-01-24 Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen