Related papers: NVILA: Efficient Frontier Visual Language Models

200 papers

Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-03 · Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, Nirmalya Roy

In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-02 · Xijun Wang, Junbang Liang, Chun-Kai Wang, Kenan Deng, Yu Lou, Ming Lin, Shan Yang

Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. Despite their remarkable performance, foundational VLAs are hindered by the…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-03 · Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Zheng Wang, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen

Visual language models (VLMs) have rapidly progressed with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but these efforts lack an in-depth study of the visual…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-20 · Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han

Pre-trained vision-language models (VLMs) have achieved impressive results in a range of vision-language tasks. However, popular VLMs usually consist of hundreds of millions of parameters, which brings challenges for fine-tuning and…

Computation and Language · Computer Science · 2022-10-17 · Tiannan Wang, Wangchunshu Zhou, Yan Zeng, Xinsong Zhang

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g.,…

Computation and Language · Computer Science · 2024-10-24 · Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-25 · Putu Indah Githa Cahyani, Komang David Dananjaya Suartana, Novanto Yudistira

Vision-Language Models (VLMs) have emerged as a promising approach to address the data scarcity challenge in robotics, enabling the development of generalizable visuomotor control policies. While models like OpenVLA showcase the potential…

Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-19 · Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari

The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-13 · Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang

Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during…

Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due…

Robotics · Computer Science · 2025-10-24 · Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng

Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model's inherent structure or…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-23 · Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu, Hua Huang, Jing Liu

Vision language models (VLMs) demonstrate impressive capabilities in visual question answering and image captioning, acting as a crucial link between visual and language models. However, existing open-source VLMs heavily rely on pretrained…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-24 · Aristeidis Panos, Rahaf Aljundi, Daniel Olmeda Reino, Richard E Turner

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has been…

Computer Vision and Pattern Recognition · Computer Science · 2021-08-11 · Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-04 · Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro

Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-30 · Yanan Guo, Wenhui Dong, Jun Song, Shiding Zhu, Xuan Zhang, Hanqing Yang, Yingbo Wang, Yang Du, Xianing Chen, Bo Zheng

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive…

Robotics · Computer Science · 2025-12-23 · Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

Real-world vision-language applications demand varying levels of perceptual granularity. However, most existing visual large language models (VLLMs), such as LLaVA, pre-assume a fixed resolution for downstream tasks, which leads to subpar…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-14 · Weiqing Luo, Zhen Tan, Yifan Li, Xinyu Zhao, Kwonjoon Lee, Behzad Dariush, Tianlong Chen

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-18 · Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen