Related papers: Attention-aware Inference Optimizations for Large …

A-VL: Adaptive Attention for Large Vision-Language Models

The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention…

Artificial Intelligence · Computer Science 2025-02-10 Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiangyang Li

Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime…

Computation and Language · Computer Science 2026-04-15 Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li

Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of…

Machine Learning · Computer Science 2026-04-14 Surendra Pathak, Bo Han

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification

The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase,…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang

Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse…

Computer Vision and Pattern Recognition · Computer Science 2026-02-16 Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan

Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference

Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution…

Computer Vision and Pattern Recognition · Computer Science 2025-12-25 Putu Indah Githa Cahyani, Komang David Dananjaya Suartana, Novanto Yudistira

Attention Guided Alignment in Efficient Vision-Language Models

Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs) to integrate visual and textual information. This paper presents a comprehensive analysis of…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Shweta Mahajan, Hoang Le, Hyojin Park, Farzad Farhadzadeh, Munawar Hayat, Fatih Porikli

D-Attn: Decomposed Attention for Large Vision-and-Language Models

Large vision-and-language models (LVLMs) have traditionally integrated visual and textual tokens by concatenating them into a single homogeneous input for large language models (LLMs), thereby maximally preserving the pre-trained language…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen

AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms

Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the landscape of attention variants expands, so too do the challenges of optimizing their…

Computation and Language · Computer Science 2025-02-24 Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen

PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks. However, their inference efficiency is constrained by the large number of visual tokens processed during decoding. To address…

Computer Vision and Pattern Recognition · Computer Science 2025-02-21 Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang

Towards General Continuous Memory for Vision-Language Models

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science 2025-07-09 Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

Despite achieving remarkable performance on various vision-language tasks, Transformer-based Vision-Language Models (VLMs) suffer from redundancy in inputs and parameters, significantly hampering their efficiency in real-world applications.…

Computation and Language · Computer Science 2024-02-27 Zekun Wang, Jingchang Chen, Wangchunshu Zhou, Haichao Zhu, Jiafeng Liang, Liping Shan, Ming Liu, Dongliang Xu, Qing Yang, Bing Qin

Large Language Model Partitioning for Low-Latency Inference at the Edge

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

Attention Debiasing for Token Pruning in Vision Language Models

Vision-language models (VLMs) typically encode substantially more visual tokens than text tokens, resulting in significant token redundancy. Pruning uninformative visual tokens is therefore crucial for improving computational efficiency,…

Computer Vision and Pattern Recognition · Computer Science 2026-01-19 Kai Zhao, Wubang Yuan, Yuchen Lin, Liting Ruan, Xiaofeng Lu, Deng-Ping Fan, Ming-Ming Cheng, Dan Zeng

A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs

Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large…

Computer Vision and Pattern Recognition · Computer Science 2024-12-06 Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, Yang You

Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation

Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators.…

Machine Learning · Computer Science 2025-04-11 Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu

AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models

Vision-language models (VLMs) show remarkable performance in multimodal tasks. However, excessively long multimodal inputs lead to oversized Key-Value (KV) caches, resulting in significant memory consumption and I/O bottlenecks. Previous KV…

Computation and Language · Computer Science 2025-01-28 Zunhai Su, Wang Shen, Linge Li, Zhe Chen, Hanyu Wei, Huangqi Yu, Kehong Yuan

AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs

Multimodal Large Language Models (MLLMs) have demonstrated substantial value in unified text-image understanding and reasoning, primarily by converting images into sequences of patch-level tokens that align with their architectural…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Xinliang Zhang, Lei Zhu, Hangzhou He, Shuang Zeng, Ourui Fu, Jiakui Hu, Zhengjian Yao, Yanye Lu

Vision Remember: Recovering Visual Information in Efficient LVLM with Vision Feature Resampling

The computational expense of redundant vision tokens in Large Vision-Language Models (LVLMs) has led many existing methods to compress them via a vision projector. However, this compression may lose visual information that is crucial for…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Ze Feng, Jiang-jiang Liu, Sen Yang, Lingyu Xiao, Zhibin Quan, Zhenhua Feng, Wankou Yang, Jingdong Wang

LatentLLM: Attention-Aware Joint Tensor Compression

Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension…

Machine Learning · Computer Science 2025-05-27 Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang, Pu Wang, Matthew Brand