Related papers: Visual-Redundancy-Controlled Parallel Decoding for…

A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs…

Computer Vision and Pattern Recognition · Computer Science 2025-11-20 Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu

Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit…

Machine Learning · Computer Science 2026-05-29 Heqiang Qi, Wei Huang, Mingyuan Bai, Xiangming Meng

Enhancing Vision-Language Model Reliability with Uncertainty-Guided Dropout Decoding

Large vision-language models (LVLMs) excel at multimodal tasks but are prone to misinterpreting visual inputs, often resulting in hallucinations and unreliable outputs. We present DROPOUT DECODING, a novel inference-time approach that…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits…

Computation and Language · Computer Science 2026-04-22 Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin

Diffusion Large Language Models for Visual Speech Recognition

Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the…

Artificial Intelligence · Computer Science 2026-05-28 Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro

ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing

Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Xingwei Ma, Shiyang Feng, Bo Zhang, Bin Wang

Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding

Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet, these models remain vulnerable to hallucination outputs that appear plausible but…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Zahra Mahdavi, Zahra Khodakaramimaghsoud, Hooman Khaloo, Sina Bakhshandeh Taleshani, Erfan Hashemi, Javad Mirzapour Kaleybar, Omid Nejati Manzari

Revealing Multi-View Hallucination in Large Vision-Language Models

Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Wooje Park, Insu Lee, Soohyun Kim, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim

Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Large Language Models (LLMs) have strong instruction-following capability to interpret and execute tasks as directed by human commands. Multimodal Large Language Models (MLLMs) have inferior instruction-following ability compared to LLMs.…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang, Kun Gai, Zhen Lei

RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Souvik Kundu, Peter A. Beerel

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders

Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show…

Computer Vision and Pattern Recognition · Computer Science 2026-02-16 Yizhou Wang, Song Mao, Yang Chen, Yufan Shen, Yinqiao Yan, Pinlong Cai, Ding Wang, Guohang Yan, Zhi Yu, Xuming Hu, Botian Shi

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite…

Computer Vision and Pattern Recognition · Computer Science 2026-05-20 Sujung Hong, Chanyong Yoon, Seong Jae Hwang

Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah

Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning

Although Large Language Models (LLMs) excel in reasoning and generation for language tasks, they are not specifically designed for multimodal challenges. Training Multimodal Large Language Models (MLLMs), however, is resource-intensive and…

Computer Vision and Pattern Recognition · Computer Science 2025-02-18 Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang

Token Sequence Compression for Efficient Multimodal Computing

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency…

Computer Vision and Pattern Recognition · Computer Science 2025-04-28 Yasmine Omri, Parth Shroff, Thierry Tambe

Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Rui Xu, Yunke Wang, Yong Luo, Bo Du

Deferred Commitment Decoding for Diffusion Language Models

Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based…

Computation and Language · Computer Science 2026-01-21 Yingte Shu, Yuchuan Tian, Chao Xu, Yunhe Wang, Hanting Chen

Beyond Intermediate States: Explaining Visual Redundancy through Language

Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, Guang Chen

Residual Context Diffusion Language Models

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking"…

Computation and Language · Computer Science 2026-02-02 Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu