Related papers: DeViL: Decoding Vision features into Language

200 papers

Decoding human visual neural representations is a challenging task with great scientific significance in revealing vision-processing mechanisms and developing brain-like intelligent machines. Most existing methods are difficult to…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Changde Du, Kaicheng Fu, Jinpeng Li, Huiguang He

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and…

Computer Vision and Pattern Recognition · Computer Science 2015-04-15 Andrej Karpathy, Li Fei-Fei

Interpretability is an important property for visual models as it helps researchers and users understand the internal mechanism of a complex model. However, generating semantic explanations about the learned representation is challenging…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Yu Yang, Seungbae Kim, Jungseock Joo

Neural networks are widely adopted to solve complex and challenging tasks. Especially in high-stakes decision-making, understanding their reasoning process is crucial, yet proves challenging for modern deep networks. Feature visualization…

Computer Vision and Pattern Recognition · Computer Science 2026-02-18 Ada Gorgun, Bernt Schiele, Jonas Fischer

The analysis of vision-based deep neural networks (DNNs) is highly desirable but it is very challenging due to the difficulty of expressing formal specifications for vision tasks and the lack of efficient verification procedures. In this…

Machine Learning · Computer Science 2024-04-12 Ravi Mangal, Nina Narodytska, Divya Gopinath, Boyue Caroline Hu, Anirban Roy, Susmit Jha, Corina Pasareanu

Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have…

Computer Vision and Pattern Recognition · Computer Science 2025-05-08 Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian

Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff

Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified…

Artificial Intelligence · Computer Science 2025-08-14 Zixian Guo, Ming Liu, Qilong Wang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring…

Computer Vision and Pattern Recognition · Computer Science 2022-04-07 Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H. S. Torr

Multimodal large language models (MLLMs) are rapidly expanding from general video understanding to finer-grained understanding such as spatio-temporal video grounding (STVG) and reasoning. In these tasks, an MLLM must localize the…

Computer Vision and Pattern Recognition · Computer Science 2026-05-12 Shida Gao, Feng Xue, Xiangfeng Wang, Anlong Ming, Zhaowen Lin, Haiyang Zhang, Teng Long, Nicu Sebe, Yihua Shao, Haozhe Wang, Wei Wang

Human vision possesses a special type of visual processing system called peripheral vision. Partitioning the entire visual field into multiple contour regions based on the distance to the center of our gaze, the peripheral vision provides…

Computer Vision and Pattern Recognition · Computer Science 2022-10-14 Juhong Min, Yucheng Zhao, Chong Luo, Minsu Cho

In this paper, we propose Describe-and-Dissect (DnD), a novel method to describe the roles of hidden neurons in vision networks. DnD utilizes recent advancements in multimodal deep learning to produce complex natural language descriptions,…

Computer Vision and Pattern Recognition · Computer Science 2025-02-20 Nicholas Bai, Rahul A. Iyer, Tuomas Oikarinen, Akshay Kulkarni, Tsui-Wei Weng

This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Guoyuan An, JaeYoon Kim, SungEui Yoon

Large Vision Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains…

Computer Vision and Pattern Recognition · Computer Science 2026-05-14 Beomsik Cho, Jaehyung Kim

Visual perception and language understanding are fundamental components of human intelligence, enabling them to understand and reason about objects and their interactions. It is crucial for machines to have this capacity to reason using…

Computer Vision and Pattern Recognition · Computer Science 2022-09-27 Thao Minh Le

In the realms of computer vision and natural language processing, Multimodal Large Language Models (MLLMs) have become indispensable tools, proficient in generating textual responses based on visual inputs. Despite their advancements, our…

Computer Vision and Pattern Recognition · Computer Science 2025-08-15 YiFan Zhang, Yang Shi, Weichen Yu, Qingsong Wen, Xue Wang, Wenjing Yang, Zhang Zhang, Liang Wang, Rong Jin

Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as…

Computer Vision and Pattern Recognition · Computer Science 2025-03-20 Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal

Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors. We hypothesize that this is due to VLMs adopting pre-trained vision backbones, specifically vision…

Computer Vision and Pattern Recognition · Computer Science 2025-03-05 Ian Covert, Tony Sun, James Zou, Tatsunori Hashimoto

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question-answering and visual dialogue. For such tasks, one successful approach is to condition…

Computer Vision and Pattern Recognition · Computer Science 2018-10-15 Florian Strub, Mathieu Seurin, Ethan Perez, Harm de Vries, Jérémie Mary, Philippe Preux, Aaron Courville, Olivier Pietquin