Related papers: Task-Aware Resolution Optimization for Visual Larg…

200 papers

The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the model's effectiveness in both specialized and general tasks warrants further investigation.…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-29 · Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan

In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-05 · Alexander Koebler, Lukas Kuhn, Ingo Thon, Florian Buettner

Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-17 · Junbo Niu, Yuanhong Zheng, Ziyang Miao, Hejun Dong, Chunjiang Ge, Hao Liang, Ma Lu, Bohan Zeng, Qiahao Zheng, Conghui He, Wentao Zhang

Visual Language Models (VLMs) are now increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly in terms of improved interactivity and open-ended responsiveness. While these are remarkable…

Having revolutionized natural language processing (NLP) applications, large language models (LLMs) are expanding into the realm of multimodal inputs. Owing to their ability to interpret images, multimodal LLMs (MLLMs) have been primarily…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-14 · Jusung Lee, Sungguk Cha, Younghyun Lee, Cheoljong Yang

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-02 · Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan

The development of Large Vision-Language Models (LVLMs) is striving to catch up with the success of Large Language Models (LLMs), yet it faces more challenges to be resolved. Very recent works enable LVLMs to localize object-level visual…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-20 · Zhipeng Huang, Zhizheng Zhang, Zheng-Jun Zha, Yan Lu, Baining Guo

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-12 · Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

Large vision language models (VLMs) combine large language models with vision encoders, demonstrating promise across various tasks. However, they often underperform in task-specific applications due to domain gaps between pre-training and…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-10 · Yang Bai, Yang Zhou, Jun Zhou, Rick Siow Mong Goh, Daniel Shu Wei Ting, Yong Liu

Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-31 · Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh

Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-03 · Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, Nirmalya Roy

Recent research looks to harness the general knowledge and reasoning of large language models (LLMs) into agents that accomplish user-specified goals in interactive environments. Vision-language models (VLMs) extend LLMs to multi-modal data…

Machine Learning · Computer Science · 2025-05-07 · Jake Grigsby, Yuke Zhu, Michael Ryoo, Juan Carlos Niebles

Recent advancements in multi-modal large language models (MLLMs) have led to substantial improvements in visual understanding, primarily driven by sophisticated modality alignment strategies. However, predominant approaches prioritize…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-29 · Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Fanyi Wang, Yanchun Xie, Yi-Jie Huang, Yaqian Li

This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-04 · Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, PengChuan Zhang, Lei Zhang

Large language models (LLMs) have gained increasing popularity in robotic task planning due to their exceptional abilities in text analytics and generation, as well as their broad knowledge of the world. However, they fall short in decoding…

Robotics · Computer Science · 2024-08-01 · Aoran Mei, Guo-Niu Zhu, Huaxiang Zhang, Zhongxue Gan

Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-08 · Chenhui Gou, Abdulwahab Felemban, Faizan Farooq Khan, Deyao Zhu, Jianfei Cai, Hamid Rezatofighi, Mohamed Elhoseiny

Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and…

Computation and Language · Computer Science · 2024-06-18 · Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang

Recent advancements in multimodal techniques open exciting possibilities for models excelling in diverse tasks involving text, audio, and image processing. Models like GPT-4V, blending computer vision and language modeling, excel in complex…

Computation and Language · Computer Science · 2023-10-20 · Xiang Zhang, Senyu Li, Zijun Wu, Ning Shi

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimize efficiency…

Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning, challenging models' advanced reasoning ability. Traditional Vision-Language Models (VLMs) perform well in visual perception tasks…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-24 · Zhiyuan Li, Dongnan Liu, Chaoyi Zhang, Heng Wang, Tengfei Xue, Weidong Cai