
Related papers: Input-Adaptive Visual Preprocessing for Efficient …

200 papers

Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye

The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention…

Artificial Intelligence · Computer Science 2025-02-10 Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiangyang Li

The success of VLMs often relies on the dynamic high-resolution schema that adaptively augments the input images to multiple crops, so that the details of the images can be retained. However, such approaches result in a large number of…

Computer Vision and Pattern Recognition · Computer Science 2025-02-04 Jiayi Han, Liang Du, Yiwen Wu, Xiangguo Zhou, Hongwei Du, Weibo Zheng

Vision-language Models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency due to autoregressive decoding. In this work, we…

Machine Learning · Computer Science 2025-10-28 Divya Jyoti Bajpai, Manjesh Kumar Hanawal

Pre-trained vision-language models (VLMs) have achieved impressive results in a range of vision-language tasks. However, popular VLMs usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and…

Computation and Language · Computer Science 2022-10-17 Tiannan Wang, Wangchunshu Zhou, Yan Zeng, Xinsong Zhang

Over the past few years, the advancement of Multimodal Large Language Models (MLLMs) has captured the wide interest of researchers, leading to numerous innovations to enhance MLLMs' comprehension. In this paper, we present AdaptVision, a…

Computer Vision and Pattern Recognition · Computer Science 2024-09-02 Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li

Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high…

Computer Vision and Pattern Recognition · Computer Science 2025-05-19 Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari

Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu

Recent advancements in multimodal fusion have witnessed the remarkable success of vision-language (VL) models, which excel in various multimodal applications such as image captioning and visual question answering. However, building VL…

Computer Vision and Pattern Recognition · Computer Science 2024-10-24 Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Yonggang Wen

Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, the inference computation and memory increase progressively with the generation of output tokens during…

Computer Vision and Pattern Recognition · Computer Science 2025-03-24 Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Yao Hu, Shaohui Lin

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the…

Computer Vision and Pattern Recognition · Computer Science 2025-11-04 Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro

An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate…

Computer Vision and Pattern Recognition · Computer Science 2025-08-14 Dongwoo Kang, Akhil Perincherry, Zachary Coalson, Aiden Gabriel, Stefan Lee, Sanghyun Hong

Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-30 Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang

Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs) capable of generating representations (embeddings) effective for a wide variety of visual and multimodal tasks. However, these pretrained embeddings…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, Ondřej Chum

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with…

Computer Vision and Pattern Recognition · Computer Science 2025-10-13 Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han

The advancement of Multimodal Large Language Models (MLLMs) has driven significant progress in Visual Question Answering (VQA), evolving from Single to Multi Image VQA (MVQA). However, the increased number of images in MVQA inevitably…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Kang Zeng, Guojin Zhong, Jintao Cheng, Jin Yuan, Zhiyong Li

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimize efficiency…

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman

Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Xuchen Li, Xuzhao Li, Jiahui Gao, Renjie Pi, Shiyu Hu, Wentao Zhang

Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering, image captioning and so on, but their inference cost remains a significant challenge due to the large number…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Weichen Zhang, Zhui Zhu, Ningbo Li, Shilong Tao, Kebin Liu, Yunhao Liu