Related papers: Visual-Redundancy-Controlled Parallel Decoding for…

200 papers

Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs…

Computer Vision and Pattern Recognition · Computer Science 2025-11-20 Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit…

Machine Learning · Computer Science 2026-05-29 Heqiang Qi, Wei Huang, Mingyuan Bai, Xiangming Meng

Large vision-language models (LVLMs) excel at multimodal tasks but are prone to misinterpreting visual inputs, often resulting in hallucinations and unreliable outputs. We present DROPOUT DECODING, a novel inference-time approach that…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits…

Computation and Language · Computer Science 2026-04-22 Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin

Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the…

Artificial Intelligence · Computer Science 2026-05-28 Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro

Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Xingwei Ma, Shiyang Feng, Bo Zhang, Bin Wang

Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet, these models remain vulnerable to hallucination outputs that appear plausible but…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Zahra Mahdavi, Zahra Khodakaramimaghsoud, Hooman Khaloo, Sina Bakhshandeh Taleshani, Erfan Hashemi, Javad Mirzapour Kaleybar, Omid Nejati Manzari

Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Wooje Park, Insu Lee, Soohyun Kim, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim

Large Language Models (LLMs) have strong instruction-following capability to interpret and execute tasks as directed by human commands. Multimodal Large Language Models (MLLMs) have inferior instruction-following ability compared to LLMs.…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang, Kun Gai, Zhen Lei

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Souvik Kundu, Peter A. Beerel

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show…

Computer Vision and Pattern Recognition · Computer Science 2026-02-16 Yizhou Wang, Song Mao, Yang Chen, Yufan Shen, Yinqiao Yan, Pinlong Cai, Ding Wang, Guohang Yan, Zhi Yu, Xuming Hu, Botian Shi

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite…

Computer Vision and Pattern Recognition · Computer Science 2026-05-20 Sujung Hong, Chanyong Yoon, Seong Jae Hwang

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah

Although Large Language Models (LLMs) excel in reasoning and generation for language tasks, they are not specifically designed for multimodal challenges. Training Multimodal Large Language Models (MLLMs), however, is resource-intensive and…

Computer Vision and Pattern Recognition · Computer Science 2025-02-18 Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency…

Computer Vision and Pattern Recognition · Computer Science 2025-04-28 Yasmine Omri, Parth Shroff, Thierry Tambe

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Rui Xu, Yunke Wang, Yong Luo, Bo Du

Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based…

Computation and Language · Computer Science 2026-01-21 Yingte Shu, Yuchuan Tian, Chao Xu, Yunhe Wang, Hanting Chen

Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, Guang Chen

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking"…
