中文
相关论文

相关论文: Visual-Redundancy-Controlled Parallel Decoding for…

200 篇论文

Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs…

计算机视觉与模式识别 · 计算机科学 2025-11-20 Duo Li , Zuhao Yang , Xiaoqin Zhang , Ling Shao , Shijian Lu

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit…

机器学习 · 计算机科学 2026-05-29 Heqiang Qi , Wei Huang , Mingyuan Bai , Xiangming Meng

Large vision-language models (LVLMs) excel at multimodal tasks but are prone to misinterpreting visual inputs, often resulting in hallucinations and unreliable outputs. We present DROPOUT DECODING, a novel inference-time approach that…

计算机视觉与模式识别 · 计算机科学 2025-12-30 Yixiong Fang , Ziran Yang , Zhaorun Chen , Zhuokai Zhao , Jiawei Zhou

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits…

计算与语言 · 计算机科学 2026-04-22 Zhenbang Du , Kejing Xia , Xinrui Zhong , Yonggan Fu , Nicolai Oswald , Binfei Ji , Brucek Khailany , Pavlo Molchanov , Yingyan Lin

Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the…

人工智能 · 计算机科学 2026-05-28 Jeong Hun Yeo , Chae Won Kim , Hyeongseop Rha , Yong Man Ro

Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic…

计算机视觉与模式识别 · 计算机科学 2025-12-30 Xingwei Ma , Shiyang Feng , Bo Zhang , Bin Wang

Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet, these models remain vulnerable to hallucination outputs that appear plausible but…

Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from…

计算机视觉与模式识别 · 计算机科学 2026-03-26 Wooje Park , Insu Lee , Soohyun Kim , Jaeyun Jang , Minyoung Noh , Kyuhong Shim , Byonghyo Shim

Large Language Models (LLMs) have strong instruction-following capability to interpret and execute tasks as directed by human commands. Multimodal Large Language Models (MLLMs) have inferior instruction-following ability compared to LLMs.…

计算机视觉与模式识别 · 计算机科学 2024-11-26 Te Yang , Jian Jia , Xiangyu Zhu , Weisong Zhao , Bo Wang , Yanhua Cheng , Yan Li , Shengyuan Liu , Quan Chen , Peng Jiang , Kun Gai , Zhen Lei

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive…

计算机视觉与模式识别 · 计算机科学 2025-11-18 Jingqi Xu , Jingxi Lu , Chenghao Li , Sreetama Sarkar , Souvik Kundu , Peter A. Beerel

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

计算与语言 · 计算机科学 2025-10-06 Wenrui Bao , Zhiben Chen , Dan Xu , Yuzhang Shang

Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show…

计算机视觉与模式识别 · 计算机科学 2026-02-16 Yizhou Wang , Song Mao , Yang Chen , Yufan Shen , Yinqiao Yan , Pinlong Cai , Ding Wang , Guohang Yan , Zhi Yu , Xuming Hu , Botian Shi

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite…

计算机视觉与模式识别 · 计算机科学 2026-05-20 Sujung Hong , Chanyong Yoon , Seong Jae Hwang

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an…

计算机视觉与模式识别 · 计算机科学 2026-03-27 Vishal Narnaware , Animesh Gupta , Kevin Zhai , Zhenyi Wang , Mubarak Shah

Although Large Language Models (LLMs) excel in reasoning and generation for language tasks, they are not specifically designed for multimodal challenges. Training Multimodal Large Language Models (MLLMs), however, is resource-intensive and…

计算机视觉与模式识别 · 计算机科学 2025-02-18 Yuqi Pang , Bowen Yang , Haoqin Tu , Yun Cao , Zeyu Zhang

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency…

计算机视觉与模式识别 · 计算机科学 2025-04-28 Yasmine Omri , Parth Shroff , Thierry Tambe

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial…

计算机视觉与模式识别 · 计算机科学 2026-03-03 Rui Xu , Yunke Wang , Yong Luo , Bo Du

Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based…

计算与语言 · 计算机科学 2026-01-21 Yingte Shu , Yuchuan Tian , Chao Xu , Yunhe Wang , Hanting Chen

Multi-modal Large Langue Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token…

计算机视觉与模式识别 · 计算机科学 2025-03-27 Dingchen Yang , Bowen Cao , Anran Zhang , Weibo Gu , Winston Hu , Guang Chen

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking"…

‹ 上一页 1 2 3 10 下一页 ›