Related papers: Input-Adaptive Visual Preprocessing for Efficient …

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-24 · Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye

A-VL: Adaptive Attention for Large Vision-Language Models

The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention…

Artificial Intelligence · Computer Science · 2025-02-10 · Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiangyang Li

AdaFV: Rethinking of Visual-Language alignment for VLM acceleration

The success of VLMs often relies on the dynamic high-resolution schema that adaptively augments the input images to multiple crops, so that the details of the images can be retained. However, such approaches result in a large number of…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-04 · Jiayi Han, Liang Du, Yiwen Wu, Xiangguo Zhou, Hongwei Du, Weibo Zheng

FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference

Vision-language Models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency due to autoregressive decoding. In this work, we…

Machine Learning · Computer Science · 2025-10-28 · Divya Jyoti Bajpai, Manjesh Kumar Hanawal

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Pre-trained vision-language models (VLMs) have achieved impressive results in a range of vision-language tasks. However, popular VLMs usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and…

Computation and Language · Computer Science · 2022-10-17 · Tiannan Wang, Wangchunshu Zhou, Yan Zeng, Xinsong Zhang

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Over the past few years, the advancement of Multimodal Large Language Models (MLLMs) has captured the wide interest of researchers, leading to numerous innovations to enhance MLLMs' comprehension. In this paper, we present AdaptVision, a…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-02 · Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li

FastVLM: Efficient Vision Encoding for Vision Language Models

Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-19 · Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari

Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-26 · Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Recent advancements in multimodal fusion have witnessed the remarkable success of vision-language (VL) models, which excel in various multimodal applications such as image captioning and visual question answering. However, building VL…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-24 · Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Yonggang Wen

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, the inference computation and memory increase progressively with the generation of output tokens during…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-24 · Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Yao Hu, Shaohui Lin

Rethinking Visual Intelligence: Insights from Video Pretraining

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-04 · Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro

Harnessing Input-Adaptive Inference for Efficient VLN

An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-14 · Dongwoo Kang, Akhil Perincherry, Zachary Coalson, Aiden Gabriel, Stefan Lee, Sanghyun Hong

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders,…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-30 · Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang

Infusing fine-grained visual knowledge to Vision-Language Models

Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs) capable of generating representations (embeddings) effective for a wide variety of visual and multimodal tasks. However, these pretrained embeddings…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-19 · Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, Ondřej Chum

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-13 · Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han

AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering

The advancement of Multimodal Large Language Models (MLLMs) has driven significant progress in Visual Question Answering (VQA), evolving from Single to Multi Image VQA (MVQA). However, the increased number of images in MVQA inevitably…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-26 · Kang Zeng, Guojin Zhong, Jintao Cheng, Jin Yuan, Zhiyong Li

NVILA: Efficient Frontier Visual Language Models

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimize efficiency…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-28 · Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu

DocVLM: Make Your VLM an Efficient Reader

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-13 · Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman

Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning

Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-03 · Xuchen Li, Xuzhao Li, Jiahui Gao, Renjie Pi, Shiyu Hu, Wentao Zhang

AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance

Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering, image captioning and so on, but their inference cost remains a significant challenge due to the large number…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-06 · Weichen Zhang, Zhui Zhu, Ningbo Li, Shilong Tao, Kebin Liu, Yunhao Liu