Related papers: Evolving Interpretable Visual Classifiers with Lar…

Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Shiming Chen, Bowen Duan, Salman Khan, Fahad Shahbaz Khan

ECOR: Explainable CLIP for Object Recognition

Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. Their open vocabulary feature enhances their value. However, their…

Computer Vision and Pattern Recognition · Computer Science 2024-04-22 Ali Rasekh, Sepehr Kazemi Ranjbar, Milad Heidari, Wolfgang Nejdl

Visual Classification via Description from Large Language Models

Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for…

Computer Vision and Pattern Recognition · Computer Science 2022-12-02 Sachit Menon, Carl Vondrick

What's in a Name? Beyond Class Indices for Image Recognition

Existing machine learning models demonstrate excellent performance in image object recognition after training on a large-scale dataset under full supervision. However, these models only learn to map an image to a predefined class index,…

Computer Vision and Pattern Recognition · Computer Science 2024-07-30 Kai Han, Xiaohu Huang, Yandong Li, Sagar Vaze, Jie Li, Xuhui Jia

Delving into the Openness of CLIP

Contrastive Language-Image Pre-training (CLIP) formulates image classification as an image-to-text matching task, i.e., matching images to the corresponding natural language descriptions instead of discrete category IDs. This allows for…

Computer Vision and Pattern Recognition · Computer Science 2023-05-09 Shuhuai Ren, Lei Li, Xuancheng Ren, Guangxiang Zhao, Xu Sun

Enhancing Visual Classification using Comparative Descriptors

The performance of vision-language models (VLMs), such as CLIP, in visual classification tasks, has been enhanced by leveraging semantic knowledge from large language models (LLMs), including GPT. Recent studies have shown that in zero-shot…

Computer Vision and Pattern Recognition · Computer Science 2024-11-12 Hankyeol Lee, Gawon Seo, Wonseok Choi, Geunyoung Jung, Kyungwoo Song, Jiyoung Jung

Cross-Modal Concept Learning and Inference for Vision-Language Models

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the…

Computer Vision and Pattern Recognition · Computer Science 2023-07-31 Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He

Explaining CLIP Zero-shot Predictions Through Concepts

Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Onat Ozdemir, Anders Christensen, Stephan Alaniz, Zeynep Akata, Emre Akbas

Semantically-Prompted Language Models Improve Visual Descriptions

Language-vision models like CLIP have made significant strides in vision tasks, such as zero-shot image classification (ZSIC). However, generating specific and expressive visual descriptions remains challenging; descriptions produced by…

Computer Vision and Pattern Recognition · Computer Science 2024-11-25 Michael Ogezi, Bradley Hauer, Grzegorz Kondrak

Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models

Vision-language models (VLMs), such as CLIP, have gained popularity for their strong open vocabulary classification performance, but they are prone to assigning high confidence scores to misclassifications, limiting their reliability in…

Computer Vision and Pattern Recognition · Computer Science 2025-12-09 Zhenxiang Lin, Maryam Haghighat, Will Browne, Dimity Miller

Learning Concise and Descriptive Attributes for Visual Recognition

Recent advances in foundation models present new opportunities for interpretable visual recognition -- one can first query Large Language Models (LLMs) to obtain a set of attributes that describe each class, then apply vision-language…

Computer Vision and Pattern Recognition · Computer Science 2023-08-08 An Yan, Yu Wang, Yiwu Zhong, Chengyu Dong, Zexue He, Yujie Lu, William Wang, Jingbo Shang, Julian McAuley

Enhancing Image Retrieval : A Comprehensive Study on Photo Search using the CLIP Mode

Photo search, the task of retrieving images based on textual queries, has witnessed significant advancements with the introduction of CLIP (Contrastive Language-Image Pretraining) model. CLIP leverages a vision-language pre training…

Computer Vision and Pattern Recognition · Computer Science 2024-01-25 Naresh Kumar Lahajal, Harini S

PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts

Vision-language models like CLIP are widely used in zero-shot image classification due to their ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like…

Computer Vision and Pattern Recognition · Computer Science 2024-03-19 Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, Furong Huang

Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions

The zero-shot performance of existing vision-language models (VLMs) such as CLIP is limited by the availability of large-scale, aligned image and text datasets in specific domains. In this work, we leverage two complementary sources of…

Computer Vision and Pattern Recognition · Computer Science 2024-04-05 Oindrila Saha, Grant Van Horn, Subhransu Maji

Large Language Models are Good Prompt Learners for Low-Shot Image Classification

Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g. CLIP. Prompt learning methods built with VL…

Computer Vision and Pattern Recognition · Computer Science 2024-04-04 Zhaoheng Zheng, Jingmin Wei, Xuefeng Hu, Haidong Zhu, Ram Nevatia

Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models

Fine-grained image classification, particularly in zero/few-shot scenarios, presents a significant challenge for vision-language models (VLMs), such as CLIP. These models often struggle with the nuanced task of distinguishing between…

Computation and Language · Computer Science 2024-05-21 Canshi Wei

Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries

Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Kevin Robbins, Xiaotong Liu, Yu Wu, Le Sun, Grady McPeak, Abby Stylianou, Robert Pless

Open Vocabulary Multi-Label Video Classification

Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan, Ashish Tawari, Son Tran, Mubarak Shah, Benjamin Yao, Trishul Chilimbi

Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the…

Computer Vision and Pattern Recognition · Computer Science 2024-12-06 Shaunak Halbe, Junjiao Tian, K J Joseph, James Seale Smith, Katherine Stevo, Vineeth N Balasubramanian, Zsolt Kira

Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

Large-scale vision-language models like CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling in recognizing what objects are present. However, they struggle with pixel-level recognition tasks like…

Computer Vision and Pattern Recognition · Computer Science 2024-10-01 Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim