Related papers: DeViL: Decoding Vision features into Language

Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features

Decoding human visual neural representations is a challenging task with great scientific significance in revealing vision-processing mechanisms and developing brain-like intelligent machines. Most existing methods are difficult to…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Changde Du, Kaicheng Fu, Jinpeng Li, Huiguang He

Deep Visual-Semantic Alignments for Generating Image Descriptions

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and…

Computer Vision and Pattern Recognition · Computer Science 2015-04-15 Andrej Karpathy, Li Fei-Fei

Explaining Deep Convolutional Neural Networks via Latent Visual-Semantic Filter Attention

Interpretability is an important property for visual models as it helps researchers and users understand the internal mechanism of a complex model. However, generating semantic explanations about the learned representation is challenging…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Yu Yang, Seungbae Kim, Jungseock Joo

VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow

Neural networks are widely adopted to solve complex and challenging tasks. Especially in high-stakes decision-making, understanding their reasoning process is crucial, yet proves challenging for modern deep networks. Feature visualization…

Computer Vision and Pattern Recognition · Computer Science 2026-02-18 Ada Gorgun, Bernt Schiele, Jonas Fischer

Concept-based Analysis of Neural Networks via Vision-Language Models

The analysis of vision-based deep neural networks (DNNs) is highly desirable but it is very challenging due to the difficulty of expressing formal specifications for vision tasks and the lack of efficient verification procedures. In this…

Machine Learning · Computer Science 2024-04-12 Ravi Mangal, Nina Narodytska, Divya Gopinath, Boyue Caroline Hu, Anirban Roy, Susmit Jha, Corina Pasareanu

DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have…

Computer Vision and Pattern Recognition · Computer Science 2025-05-08 Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian

Towards Understanding Multimodal Fine-Tuning: Spatial Features

Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff

Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving

Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified…

Artificial Intelligence · Computer Science 2025-08-14 Zixian Guo, Ming Liu, Qilong Wang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring…

Computer Vision and Pattern Recognition · Computer Science 2022-04-07 Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H. S. Torr

Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding

Multimodal large language models (MLLMs) are rapidly expanding from general video understanding to finer-grained understanding such as spatio-temporal video grounding (STVG) and reasoning. In these tasks, an MLLM must localize the…

Computer Vision and Pattern Recognition · Computer Science 2026-05-12 Shida Gao, Feng Xue, Xiangfeng Wang, Anlong Ming, Zhaowen Lin, Haiyang Zhang, Teng Long, Nicu Sebe, Yihua Shao, Haozhe Wang, Wei Wang

Peripheral Vision Transformer

Human vision possesses a special type of visual processing systems called peripheral vision. Partitioning the entire visual field into multiple contour regions based on the distance to the center of our gaze, the peripheral vision provides…

Computer Vision and Pattern Recognition · Computer Science 2022-10-14 Juhong Min, Yucheng Zhao, Chong Luo, Minsu Cho

Interpreting Neurons in Deep Vision Networks with Language Models

In this paper, we propose Describe-and-Dissect (DnD), a novel method to describe the roles of hidden neurons in vision networks. DnD utilizes recent advancements in multimodal deep learning to produce complex natural language descriptions,…

Computer Vision and Pattern Recognition · Computer Science 2025-02-20 Nicholas Bai, Rahul A. Iyer, Tuomas Oikarinen, Akshay Kulkarni, Tsui-Wei Weng

Large Language Models Facilitate Vision Reflection in Image Classification

This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Guoyuan An, JaeYoon Kim, SungEui Yoon

Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding

Large Vision Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains…

Computer Vision and Pattern Recognition · Computer Science 2026-05-14 Beomsik Cho, Jaehyung Kim

Deep Neural Networks for Visual Reasoning

Visual perception and language understanding are fundamental components of human intelligence, enabling them to understand and reason about objects and their interactions. It is crucial for machines to have this capacity to reason using…

Computer Vision and Pattern Recognition · Computer Science 2022-09-27 Thao Minh Le

Debiasing Multimodal Large Language Models via Penalization of Language Priors

In the realms of computer vision and natural language processing, Multimodal Large Language Models (MLLMs) have become indispensable tools, proficient in generating textual responses based on visual inputs. Despite their advancements, our…

Computer Vision and Pattern Recognition · Computer Science 2025-08-15 YiFan Zhang, Yang Shi, Weichen Yu, Qingsong Wen, Xue Wang, Wenjing Yang, Zhang Zhang, Liang Wang, Rong Jin

FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability

Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as…

Computer Vision and Pattern Recognition · Computer Science 2025-03-20 Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal

Locality Alignment Improves Vision-Language Models

Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors. We hypothesize that this is due to VLMs adopting pre-trained vision backbones, specifically vision…

Computer Vision and Pattern Recognition · Computer Science 2025-03-05 Ian Covert, Tony Sun, James Zou, Tatsunori Hashimoto

A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

Visual Reasoning with Multi-hop Feature Modulation

Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question-answering and visual dialogue. For such tasks, one successful approach is to condition…

Computer Vision and Pattern Recognition · Computer Science 2018-10-15 Florian Strub, Mathieu Seurin, Ethan Perez, Harm de Vries, Jérémie Mary, Philippe Preux, Aaron Courville, Olivier Pietquin