Related papers: Visual-Redundancy-Controlled Parallel Decoding for…

A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs…

Computer Vision and Pattern Recognition · Computer Science 2025-11-20 Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu

Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit…

Machine Learning · Computer Science 2026-05-29 Heqiang Qi, Wei Huang, Mingyuan Bai, Xiangming Meng

Enhancing Vision-Language Model Reliability with Uncertainty-Guided Dropout Decoding

Large vision-language models (LVLMs) excel at multimodal tasks but are prone to misinterpreting visual inputs, often resulting in hallucinations and unreliable outputs. We present DROPOUT DECODING, a novel inference-time approach that…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits…

Computation and Language · Computer Science 2026-04-22 Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin

Diffusion Large Language Models for Visual Speech Recognition

Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the…

Artificial Intelligence · Computer Science 2026-05-28 Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro

ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing

Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Xingwei Ma, Shiyang Feng, Bo Zhang, Bin Wang

Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding

Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet, these models remain vulnerable to hallucination outputs that appear plausible but…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Zahra Mahdavi, Zahra Khodakaramimaghsoud, Hooman Khaloo, Sina Bakhshandeh Taleshani, Erfan Hashemi, Javad Mirzapour Kaleybar, Omid Nejati Manzari

Revealing Multi-View Hallucination in Large Vision-Language Models

Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Wooje Park, Insu Lee, Soohyun Kim, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim

Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Large Language Models (LLMs) have strong instruction-following capability to interpret and execute tasks as directed by human commands. Multimodal Large Language Models (MLLMs) have inferior instruction-following ability compared to LLMs.…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang, Kun Gai, Zhen Lei

RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Souvik Kundu, Peter A. Beerel

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders

Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show…

Computer Vision and Pattern Recognition · Computer Science 2026-02-16 Yizhou Wang, Song Mao, Yang Chen, Yufan Shen, Yinqiao Yan, Pinlong Cai, Ding Wang, Guohang Yan, Zhi Yu, Xuming Hu, Botian Shi

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite…

Computer Vision and Pattern Recognition · Computer Science 2026-05-20 Sujung Hong, Chanyong Yoon, Seong Jae Hwang

Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah

Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning

Although Large Language Models (LLMs) excel in reasoning and generation for language tasks, they are not specifically designed for multimodal challenges. Training Multimodal Large Language Models (MLLMs), however, is resource-intensive and…

Computer Vision and Pattern Recognition · Computer Science 2025-02-18 Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang

Token Sequence Compression for Efficient Multimodal Computing

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency…

Computer Vision and Pattern Recognition · Computer Science 2025-04-28 Yasmine Omri, Parth Shroff, Thierry Tambe

Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Rui Xu, Yunke Wang, Yong Luo, Bo Du

Deferred Commitment Decoding for Diffusion Language Models

Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based…

Computation and Language · Computer Science 2026-01-21 Yingte Shu, Yuchuan Tian, Chao Xu, Yunhe Wang, Hanting Chen

Beyond Intermediate States: Explaining Visual Redundancy through Language

Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, Guang Chen

Residual Context Diffusion Language Models

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking"…

Computation and Language · Computer Science 2026-02-02 Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu