Related papers: NVILA: Efficient Frontier Visual Language Models

A Survey on Efficient Vision-Language Models

Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high…

Computer Vision and Pattern Recognition · Computer Science 2025-07-03 Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, Nirmalya Roy

ViLA: Efficient Video-Language Alignment for Video Question Answering

In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable…

Computer Vision and Pattern Recognition · Computer Science 2024-10-02 Xijun Wang, Junbang Liang, Chun-Kai Wang, Kenan Deng, Yu Lou, Ming Lin, Shan Yang

A Survey on Efficient Vision-Language-Action Models

Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. Despite their remarkable performance, foundational VLAs are hindered by the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Zheng Wang, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen

VILA: On Pre-training for Visual Language Models

Visual language models (VLMs) rapidly progressed with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but lacks an in-depth study of the visual…

Computer Vision and Pattern Recognition · Computer Science 2024-05-20 Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Pre-trained vision-language models (VLMs) have achieved impressive results in a range of vision-language tasks. However, popular VLMs usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and…

Computation and Language · Computer Science 2022-10-17 Tiannan Wang, Wangchunshu Zhou, Yan Zeng, Xinsong Zhang

NVLM: Open Frontier-Class Multimodal LLMs

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g.,…

Computation and Language · Computer Science 2024-10-24 Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference

Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution…

Computer Vision and Pattern Recognition · Computer Science 2025-12-25 Putu Indah Githa Cahyani, Komang David Dananjaya Suartana, Novanto Yudistira

EdgeVLA: Efficient Vision-Language-Action Models

Vision-Language Models (VLMs) have emerged as a promising approach to address the data scarcity challenge in robotics, enabling the development of generalizable visuomotor control policies. While models like OpenVLA showcase the potential…

Robotics · Computer Science 2025-07-21 Paweł Budzianowski, Wesley Maa, Matthew Freed, Jingxiang Mo, Winston Hsiao, Aaron Xie, Tomasz Młoduchowski, Viraj Tipnis, Benjamin Bolte

FastVLM: Efficient Vision Encoding for Vision Language Models

Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high…

Computer Vision and Pattern Recognition · Computer Science 2025-05-19 Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang

TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during…

Robotics · Computer Science 2025-05-14 Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian Tang

Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due…

Robotics · Computer Science 2025-10-24 Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng

LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation

Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model's inherent structure or…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu, Hua Huang, Jing Liu

Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models

Vision language models (VLMs) demonstrate impressive capabilities in visual question answering and image captioning, acting as a crucial link between visual and language models. However, existing open-source VLMs heavily rely on pretrained…

Computer Vision and Pattern Recognition · Computer Science 2024-07-24 Aristeidis Panos, Rahaf Aljundi, Daniel Olmeda Reino, Richard E Turner

MiniVLM: A Smaller and Faster Vision-Language Model

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has been…

Computer Vision and Pattern Recognition · Computer Science 2021-08-11 Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu

Rethinking Visual Intelligence: Insights from Video Pretraining

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the…

Computer Vision and Pattern Recognition · Computer Science 2025-11-04 Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro

FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding

Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common…

Computer Vision and Pattern Recognition · Computer Science 2025-04-30 Yanan Guo, Wenhui Dong, Jun Song, Shiding Zhu, Xuan Zhang, Hanqing Yang, Yingbo Wang, Yang Du, Xianing Chen, Bo Zheng

cVLA: Towards Efficient Camera-Space VLAs

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive…

Robotics · Computer Science 2025-12-23 Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

Task-Aware Resolution Optimization for Visual Large Language Models

Real-world vision-language applications demand varying levels of perceptual granularity. However, most existing visual large language models (VLLMs), such as LLaVA, pre-assume a fixed resolution for downstream tasks, which leads to subpar…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Weiqing Luo, Zhen Tan, Yifan Li, Xinyu Zhao, Kwonjoon Lee, Behzad Dariush, Tianlong Chen

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen