Related papers: K-LITE: Learning Transferable Visual Models with E…

If you can describe it, they can see it: Cross-Modal Learning of Visual Concepts from Textual Descriptions

Humans can visualize new and unknown concepts from their natural language description, based on their experience and previous knowledge. Inspired by this, we present a way to extend this ability to Vision-Language Models (VLMs), teaching…

Computer Vision and Pattern Recognition · Computer Science 2025-12-18 Carlo Alberto Barbano, Luca Molinaro, Massimiliano Ciranni, Emanuele Aiello, Vito Paolo Pastore, Marco Grangetto

Learning Transferable Visual Models From Natural Language Supervision

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any…

Computer Vision and Pattern Recognition · Computer Science 2021-03-02 Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample…

Computer Vision and Pattern Recognition · Computer Science 2023-05-12 Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, Vicente Ordonez

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models

Pre-trained vision-language models, e.g., CLIP, working with manually designed prompts have demonstrated great capacity of transfer learning. Recently, learnable prompts achieve state-of-the-art performance, which however are prone to…

Computer Vision and Pattern Recognition · Computer Science 2023-08-23 Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, Feng Zheng

ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents

Vision-language models (VLMs) have shown remarkable general capabilities, yet embodied agents built on them fail at complex tasks, often skipping critical steps, proposing invalid actions, and repeating mistakes. These failures arise from a…

Artificial Intelligence · Computer Science 2026-03-26 Bingqing Wei, Zhongyu Xia, Dingai Liu, Xiaoyu Zhou, Zhiwei Lin, Yongtao Wang

Boosting Audio-visual Zero-shot Learning with Large Language Models

Audio-visual zero-shot learning aims to recognize unseen classes based on paired audio-visual sequences. Recent methods mainly focus on learning multi-modal features aligned with class names to enhance the generalization ability to unseen…

Computer Vision and Pattern Recognition · Computer Science 2024-04-25 Haoxing Chen, Yaohui Li, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Huijia Zhu, Weiqiang Wang

Knowledge-aware Zero-Shot Learning: Survey and Perspective

Zero-shot learning (ZSL) which aims at predicting classes that have never appeared during the training using external knowledge (a.k.a. side information) has been widely investigated. In this paper we present a literature review towards ZSL…

Artificial Intelligence · Computer Science 2021-05-11 Jiaoyan Chen, Yuxia Geng, Zhuo Chen, Ian Horrocks, Jeff Z. Pan, Huajun Chen

Exploring External Knowledge for Accurate modeling of Visual and Language Problems

The interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. The success can be partly attributed to the advancements of deep neural networks made in the sub-fields of AI such as…

Computer Vision and Pattern Recognition · Computer Science 2023-02-20 Xuewen Yang

LATTE: Learning to Think with Vision Specialists

While open-source vision-language models perform well on simple question-answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer

Pre-trained language models are still far from human performance in tasks that need understanding of properties (e.g. appearance, measurable quantity) and affordances of everyday objects in the real world since the text lacks such…

Computation and Language · Computer Science 2022-03-18 Woojeong Jin, Dong-Ho Lee, Chenguang Zhu, Jay Pujara, Xiang Ren

SimLTD: Simple Supervised and Semi-Supervised Long-Tailed Object Detection

While modern visual recognition systems have made significant advancements, many continue to struggle with the open problem of learning from few exemplars. This paper focuses on the task of object detection in the setting where object…

Computer Vision and Pattern Recognition · Computer Science 2025-06-10 Phi Vu Tran

Revisiting Classifier: Transferring Vision-Language Models for Video Recognition

Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research. Along with the growth of computational capacity, we now have open-source vision-language pre-trained…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Wenhao Wu, Zhun Sun, Wanli Ouyang

Zero-Shot Cross-Lingual Transfer with Meta Learning

Learning what to share between tasks has been a topic of great importance recently, as strategic sharing of knowledge has been shown to improve downstream task performance. This is particularly important for multilingual applications, as…

Computation and Language · Computer Science 2020-10-06 Farhad Nooralahzadeh, Giannis Bekoulis, Johannes Bjerva, Isabelle Augenstein

Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime

Self-supervision has shown outstanding results for natural language processing, and more recently, for image recognition. Simultaneously, vision transformers and their variants have emerged as a promising and scalable alternative to…

Computer Vision and Pattern Recognition · Computer Science 2022-02-01 Prarthana Bhattacharyya, Chenge Li, Xiaonan Zhao, István Fehérvári, Jason Sun

Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners

Prompt tuning, which involves training a small set of parameters, effectively enhances the pre-trained Vision-Language Models (VLMs) to downstream tasks. However, they often come at the cost of flexibility and adaptability when the tuned…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Mushui Liu, Bozheng Li, Yunlong Yu

EKTVQA: Generalized use of External Knowledge to empower Scene Text in Text-VQA

The open-ended question answering task of Text-VQA often requires reading and reasoning about rarely seen or completely unseen scene-text content of an image. We address this zero-shot nature of the problem by proposing the generalized use…

Computer Vision and Pattern Recognition · Computer Science 2022-07-18 Arka Ujjal Dey, Ernest Valveny, Gaurav Harit

The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them; instead, our model learns by simply looking at images and…

Computer Vision and Pattern Recognition · Computer Science 2019-04-30 Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, Jiajun Wu

Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

Zero-shot learning (ZSL) recognizes the unseen classes by conducting visual-semantic interactions to transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Shiming Chen, Wenjin Hou, Salman Khan, Fahad Shahbaz Khan

Cross-Modal Concept Learning and Inference for Vision-Language Models

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the…

Computer Vision and Pattern Recognition · Computer Science 2023-07-31 Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He