Related papers: Task-Aware Resolution Optimization for Visual Larg…

Effectiveness Assessment of Recent Large Vision-Language Models

The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the effectiveness of these models in both specialized and general tasks warrants further investigation.…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan

LVLM-Aided Alignment of Task-Specific Vision Models

In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do…

Computer Vision and Pattern Recognition · Computer Science 2026-05-05 Alexander Koebler, Lukas Kuhn, Ingo Thon, Florian Buettner

Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models

Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Junbo Niu, Yuanhong Zheng, Ziyang Miao, Hejun Dong, Chunjiang Ge, Hao Liang, Ma Lu, Bohan Zeng, Qiahao Zheng, Conghui He, Wentao Zhang

Rethinking VLMs and LLMs for Image Classification

Visual Language Models (VLMs) are now increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly in terms of improved interactivity and open-ended responsiveness. While these are remarkable…

Machine Learning · Computer Science 2024-10-22 Avi Cooper, Keizo Kato, Chia-Hsien Shih, Hiroaki Yamane, Kasper Vinken, Kentaro Takemoto, Taro Sunagawa, Hao-Wei Yeh, Jin Yamanaka, Ian Mason, Xavier Boix

Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks

Having revolutionized natural language processing (NLP) applications, large language models (LLMs) are expanding into the realm of multimodal inputs. Owing to their ability to interpret images, multimodal LLMs (MLLMs) have been primarily…

Computer Vision and Pattern Recognition · Computer Science 2024-02-14 Jusung Lee, Sungguk Cha, Younghyun Lee, Cheoljong Yang

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs…

Computer Vision and Pattern Recognition · Computer Science 2024-10-02 Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

The development of Large Vision-Language Models (LVLMs) is striving to catch up with the success of Large Language Models (LLMs), yet it faces more challenges to be resolved. Very recent works enable LVLMs to localize object-level visual…

Computer Vision and Pattern Recognition · Computer Science 2024-03-20 Zhipeng Huang, Zhizheng Zhang, Zheng-Jun Zha, Yan Lu, Baining Guo

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing…

Computer Vision and Pattern Recognition · Computer Science 2024-04-12 Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Large vision language models (VLMs) combine large language models with vision encoders, demonstrating promise across various tasks. However, they often underperform in task-specific applications due to domain gaps between pre-training and…

Computer Vision and Pattern Recognition · Computer Science 2024-10-10 Yang Bai, Yang Zhou, Jun Zhou, Rick Siow Mong Goh, Daniel Shu Wei Ting, Yong Liu

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and…

Computer Vision and Pattern Recognition · Computer Science 2024-05-31 Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh

A Survey on Efficient Vision-Language Models

Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high…

Computer Vision and Pattern Recognition · Computer Science 2025-07-03 Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, Nirmalya Roy

VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making

Recent research looks to harness the general knowledge and reasoning of large language models (LLMs) into agents that accomplish user-specified goals in interactive environments. Vision-language models (VLMs) extend LLMs to multi-modal data…

Machine Learning · Computer Science 2025-05-07 Jake Grigsby, Yuke Zhu, Michael Ryoo, Juan Carlos Niebles

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Recent advancements in multi-modal large language models (MLLMs) have led to substantial improvements in visual understanding, primarily driven by sophisticated modality alignment strategies. However, predominant approaches prioritize…

Computer Vision and Pattern Recognition · Computer Science 2024-08-29 Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Fanyi Wang, Yanchun Xie, Yi-Jie Huang, Yaqian Li

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends…

Computer Vision and Pattern Recognition · Computer Science 2022-03-04 Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, PengChuan Zhang, Lei Zhang

ReplanVLM: Replanning Robotic Tasks with Visual Language Models

Large language models (LLMs) have gained increasing popularity in robotic task planning due to their exceptional abilities in text analytics and generation, as well as their broad knowledge of the world. However, they fall short in decoding…

Robotics · Computer Science 2024-08-01 Aoran Mei, Guo-Niu Zhu, Huaxiang Zhang, Zhongxue Gan

How Well Can Vision Language Models See Image Details?

Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains…

Computer Vision and Pattern Recognition · Computer Science 2024-08-08 Chenhui Gou, Abdulwahab Felemban, Faizan Farooq Khan, Deyao Zhu, Jianfei Cai, Hamid Rezatofighi, Mohamed Elhoseiny

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and…

Computation and Language · Computer Science 2024-06-18 Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

Recent advancements in multimodal techniques open exciting possibilities for models excelling in diverse tasks involving text, audio, and image processing. Models like GPT-4V, blending computer vision and language modeling, excel in complex…

Computation and Language · Computer Science 2023-10-20 Xiang Zhang, Senyu Li, Zijun Wu, Ning Shi

NVILA: Efficient Frontier Visual Language Models

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimize efficiency…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu

Enhancing Advanced Visual Reasoning Ability of Large Language Models

Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning, challenging models' advanced reasoning ability. Traditional Vision-Language Models (VLMs) perform well in visual perception tasks…

Computer Vision and Pattern Recognition · Computer Science 2024-09-24 Zhiyuan Li, Dongnan Liu, Chaoyi Zhang, Heng Wang, Tengfei Xue, Weidong Cai