English
Related papers

Related papers: Pixel Aligned Language Models

200 papers

Visual grounding seeks to localize the image region corresponding to a free-form text description. Recently, the strong multimodal capabilities of Large Vision-Language Models (LVLMs) have driven substantial improvements in visual…

Computer Vision and Pattern Recognition · Computer Science 2025-03-11 Seil Kang , Jinyeong Kim , Junhyeok Kim , Seong Jae Hwang

Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model…

Computation and Language · Computer Science 2021-06-24 Kayode Olaleye , Herman Kamper

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models…

Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Shijie Wang , Dahun Kim , Ali Taalimi , Chen Sun , Weicheng Kuo

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed…

Computer Vision and Pattern Recognition · Computer Science 2024-03-07 Navid Rajabi , Jana Kosecka

Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments. In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text…

Computation and Language · Computer Science 2023-03-17 Anthony Z. Liu , Lajanugen Logeswaran , Sungryull Sohn , Honglak Lee

Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single image captioning. However, in various real applications (e.g., image editing, difference interpretation,…

Computation and Language · Computer Science 2019-06-20 Hao Tan , Franck Dernoncourt , Zhe Lin , Trung Bui , Mohit Bansal

Recent studies show that deep vision-only and language-only models--trained on disjoint modalities--nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network…

Computer Vision and Pattern Recognition · Computer Science 2025-09-26 Zoe Wanying He , Sean Trott , Meenakshi Khosla

Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question…

Computer Vision and Pattern Recognition · Computer Science 2025-05-16 Tyler Tran , Sangeet Khemlani , J. G. Trafton

Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes. However, current large-scale pre-trained language models rely on text-only self-supervised training with massive text data, which…

Computation and Language · Computer Science 2023-02-28 Weizhi Wang , Li Dong , Hao Cheng , Haoyu Song , Xiaodong Liu , Xifeng Yan , Jianfeng Gao , Furu Wei

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is whether the model uses the correct image regions to output particular words, or if the…

Computer Vision and Pattern Recognition · Computer Science 2020-07-21 Chih-Yao Ma , Yannis Kalantidis , Ghassan AlRegib , Peter Vajda , Marcus Rohrbach , Zsolt Kira

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner , Jacob T. Emmerson , Adriana Kovashka

The development of Large Vision-Language Models (LVLMs) is striving to catch up with the success of Large Language Models (LLMs), yet it faces more challenges to be resolved. Very recent works enable LVLMs to localize object-level visual…

Computer Vision and Pattern Recognition · Computer Science 2024-03-20 Zhipeng Huang , Zhizheng Zhang , Zheng-Jun Zha , Yan Lu , Baining Guo

Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through…

Computer Vision and Pattern Recognition · Computer Science 2025-10-17 Matan Rusanovsky , Shimon Malnick , Shai Avidan

Multimodal Large Language Model (MLLMs) leverages Large Language Models as a cognitive framework for diverse visual-language tasks. Recent efforts have been made to equip MLLMs with visual perceiving and grounding capabilities. However,…

Computer Vision and Pattern Recognition · Computer Science 2024-03-26 Junwen He , Yifan Wang , Lijun Wang , Huchuan Lu , Jun-Yan He , Jin-Peng Lan , Bin Luo , Xuansong Xie

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing…

Computer Vision and Pattern Recognition · Computer Science 2024-04-12 Kanchana Ranasinghe , Satya Narayan Shukla , Omid Poursaeed , Michael S. Ryoo , Tsung-Yu Lin

Large language models (LLMs) have made significant advancements in natural language understanding. However, through that enormous semantic representation that the LLM has learnt, is it somehow possible for it to understand images as well?…

Computer Vision and Pattern Recognition · Computer Science 2024-07-12 Mu Cai , Zeyi Huang , Yuheng Li , Utkarsh Ojha , Haohan Wang , Yong Jae Lee

Recent advances in vision-language models (VLMs) have led to improved performance on tasks such as visual question answering and image captioning. Consequently, these models are now well-positioned to reason about the physical world,…

Robotics · Computer Science 2024-03-05 Jensen Gao , Bidipta Sarkar , Fei Xia , Ted Xiao , Jiajun Wu , Brian Ichter , Anirudha Majumdar , Dorsa Sadigh

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Georgios Pantazopoulos , Eda B. Özyiğit

Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves…

Computer Vision and Pattern Recognition · Computer Science 2023-07-25 Yining Hong , Haoyu Zhen , Peihao Chen , Shuhong Zheng , Yilun Du , Zhenfang Chen , Chuang Gan
‹ Prev 1 2 3 10 Next ›