Related papers: Language-Informed Visual Concept Learning

Bridging the gap to real-world language-grounded visual concept learning

Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Whie Jung, Semin Kim, Junee Kim, Seunghoon Hong

Analyzing Encoded Concepts in Transformer Language Models

We propose a novel framework, ConceptX, to analyze how latent concepts are encoded in representations learned within pre-trained language models. It uses clustering to discover the encoded concepts and explains them by aligning with a large…

Computation and Language · Computer Science 2022-06-28 Hassan Sajjad, Nadir Durrani, Fahim Dalvi, Firoj Alam, Abdul Rafae Khan, Jia Xu

Language Model as Visual Explainer

In this paper, we present Language Model as Visual Explainer (LVX), a systematic approach for interpreting the internal workings of vision models using a tree-structured linguistic explanation, without the need for model training. Central to…

Computer Vision and Pattern Recognition · Computer Science 2024-12-12 Xingyi Yang, Xinchao Wang

Enhancing Vision Models for Text-Heavy Content Understanding and Interaction

Interacting with and understanding text-heavy visual content spanning multiple images is a major challenge for traditional vision models. This paper focuses on enhancing vision models' capability to comprehend and learn from images…

Computer Vision and Pattern Recognition · Computer Science 2024-08-31 Adithya TG, Adithya SK, Abhinav R Bharadwaj, Abhiram HA, Surabhi Narayan

Learning Multi-Modal Word Representation Grounded in Visual Context

Representing the semantics of words is a long-standing problem for the natural language processing community. Most methods compute word semantics given their textual context in large corpora. More recently, researchers attempted to…

Computation and Language · Computer Science 2017-11-10 Éloi Zablocki, Benjamin Piwowarski, Laure Soulier, Patrick Gallinari

Visual Storytelling via Predicting Anchor Word Embeddings in the Stories

We propose a learning model for the task of visual storytelling. The main idea is to predict anchor word embeddings from the images and use the embeddings and the image features jointly to generate narrative sentences. We use the embeddings…

Computer Vision and Pattern Recognition · Computer Science 2020-01-15 Bowen Zhang, Hexiang Hu, Fei Sha

Language with Vision: a Study on Grounded Word and Sentence Embeddings

Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many…

Computation and Language · Computer Science 2023-11-01 Hassan Shahmohammadi, Maria Heitmeier, Elnaz Shafaei-Bajestan, Hendrik P. A. Lensch, Harald Baayen

Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

In natural language processing, most models try to learn semantic representations merely from texts. The learned representations encode the distributional semantics but fail to connect to any knowledge about the physical world. In contrast,…

Computation and Language · Computer Science 2021-11-16 Yizhen Zhang, Minkyu Choi, Kuan Han, Zhongming Liu

Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it…

Computer Vision and Pattern Recognition · Computer Science 2017-10-17 Tanmay Gupta, Kevin Shih, Saurabh Singh, Derek Hoiem

Visual Lexicon: Rich Image Features in Language Space

We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike…

Computer Vision and Pattern Recognition · Computer Science 2024-12-10 XuDong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid

Learning Visual Representations via Language-Guided Sampling

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an…

Computer Vision and Pattern Recognition · Computer Science 2023-03-30 Mohamed El Banani, Karan Desai, Justin Johnson

Brain encoding models based on multimodal transformers can transfer across language and vision

Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain…

Computation and Language · Computer Science 2023-05-23 Jerry Tang, Meng Du, Vy A. Vo, Vasudev Lal, Alexander G. Huth

Learning language through pictures

We propose Imaginet, a model of learning visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with shared word embeddings, and uses a multi-task…

Computation and Language · Computer Science 2015-06-22 Grzegorz Chrupała, Ákos Kádár, Afra Alishahi

Linearly Mapping from Image to Text Space

The extent to which text-only language models (LMs) learn to represent features of the non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught to caption images when a vision model's parameters are…

Computation and Language · Computer Science 2023-03-10 Jack Merullo, Louis Castricato, Carsten Eickhoff, Ellie Pavlick

Context Matters: Learning Global Semantics via Object-Centric Representation

Recent advances in language modeling have witnessed the rise of highly desirable emergent capabilities, such as reasoning and in-context learning. However, vision models have yet to exhibit comparable progress in these areas. In this paper,…

Computer Vision and Pattern Recognition · Computer Science 2025-10-10 Jike Zhong, Yuxiang Lai, Xiaofeng Yang, Konstantinos Psounis

Understanding Guided Image Captioning Performance across Domains

Image captioning models generally lack the capability to take into account user interest, and usually default to global descriptions that try to balance readability, informativeness, and information overload. On the other hand, VQA models…

Computer Vision and Pattern Recognition · Computer Science 2021-11-12 Edwin G. Ng, Bo Pang, Piyush Sharma, Radu Soricut

Concept Decomposition for Visual Exploration and Inspiration

A creative idea is often born from transforming, combining, and modifying ideas from existing visual examples capturing various concepts. However, one cannot simply copy the concept as a whole, and inspiration is achieved by examining…

Computer Vision and Pattern Recognition · Computer Science 2023-06-01 Yael Vinker, Andrey Voynov, Daniel Cohen-Or, Ariel Shamir

A Concept-Based Explainability Framework for Large Multimodal Models

Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs…

Machine Learning · Computer Science 2024-12-03 Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord

Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

Text-to-Image (T2I) models excel at synthesizing concepts such as nouns, appearances, and styles. To enable customized content creation based on a few example images of a concept, methods such as Textual Inversion and DreamBooth invert the…

Computer Vision and Pattern Recognition · Computer Science 2024-09-30 Saman Motamed, Danda Pani Paudel, Luc Van Gool

Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the…

Computer Vision and Pattern Recognition · Computer Science 2025-07-14 Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao