Related papers: Language-Driven Visual Consensus for Zero-Shot Sem…

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while…

Computer Vision and Pattern Recognition · Computer Science 2024-06-07 Yunheng Li , ZhongYu Li , Quansheng Zeng , Qibin Hou , Ming-Ming Cheng

Zero-Shot Semantic Segmentation via Spatial and Multi-Scale Aware Visual Class Embedding

Fully supervised semantic segmentation technologies bring a paradigm shift in scene understanding. However, the burden of expensive labeling cost remains as a challenge. To solve the cost problem, recent studies proposed language model…

Computer Vision and Pattern Recognition · Computer Science 2021-12-21 Sungguk Cha , Yooseung Wang

Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding

To bridge the gap between supervised semantic segmentation and real-world applications that acquires one model to recognize arbitrary new concepts, recent zero-shot segmentation attracts a lot of attention by exploring the relationships…

Computer Vision and Pattern Recognition · Computer Science 2022-11-01 Quande Liu , Youpeng Wen , Jianhua Han , Chunjing Xu , Hang Xu , Xiaodan Liang

Zero-shot Object Detection Through Vision-Language Embedding Alignment

Recent approaches have shown that training deep neural networks directly on large-scale image-text pair collections enables zero-shot transfer on various recognition tasks. One central issue is how this can be generalized to object…

Computer Vision and Pattern Recognition · Computer Science 2022-08-30 Johnathan Xie , Shuai Zheng

CLIP-S$^4$: Language-Guided Self-Supervised Semantic Segmentation

Existing semantic segmentation approaches are often limited by costly pixel-wise annotations and predefined classes. In this work, we present CLIP-S$^4$ that leverages self-supervised pixel representation learning and vision-language models…

Computer Vision and Pattern Recognition · Computer Science 2023-05-03 Wenbin He , Suphanut Jamonnak , Liang Gou , Liu Ren

Learning Visually Consistent Label Embeddings for Zero-Shot Learning

In this work, we propose a zero-shot learning method to effectively model knowledge transfer between classes via jointly learning visually consistent word vectors and label embedding model in an end-to-end manner. The main idea is to…

Computer Vision and Pattern Recognition · Computer Science 2019-05-17 Berkan Demirel , Ramazan Gokberk Cinbis , Nazli Ikizler-Cinbis

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language…

Computer Vision and Pattern Recognition · Computer Science 2026-02-19 Huadong Tang , Youpeng Zhao , Yan Huang , Min Xu , Jun Wang , Qiang Wu

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Bangzheng Li , Fei Wang , Wenxuan Zhou , Nan Xu , Ben Zhou , Sheng Zhang , Hoifung Poon , Muhao Chen

ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency

Recently, great success has been made in learning visual representations from text supervision, facilitating the emergence of text-supervised semantic segmentation. However, existing works focus on pixel grouping and cross-modal semantic…

Computer Vision and Pattern Recognition · Computer Science 2023-02-22 Pengzhen Ren , Changlin Li , Hang Xu , Yi Zhu , Guangrun Wang , Jianzhuang Liu , Xiaojun Chang , Xiaodan Liang

Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments

Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work…

Computer Vision and Pattern Recognition · Computer Science 2025-10-30 Manjunath Prasad Holenarasipura Rajiv , B. M. Vidyavathi

LOSC: LiDAR Open-voc Segmentation Consolidator

We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet, resulting point labels are…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Nermin Samet , Gilles Puy , Renaud Marlet

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good…

Computer Vision and Pattern Recognition · Computer Science 2023-11-29 Lukas Hoyer , David Joseph Tan , Muhammad Ferjad Naeem , Luc Van Gool , Federico Tombari

Exploring Open-Vocabulary Semantic Segmentation without Human Labels

Semantic segmentation is a crucial task in computer vision that involves segmenting images into semantically meaningful regions at the pixel level. However, existing approaches often rely on expensive human annotations as supervision for…

Computer Vision and Pattern Recognition · Computer Science 2023-06-02 Jun Chen , Deyao Zhu , Guocheng Qian , Bernard Ghanem , Zhicheng Yan , Chenchen Zhu , Fanyi Xiao , Mohamed Elhoseiny , Sean Chang Culatana

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Pre-trained vision-language models, e.g. CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions involve either…

Computer Vision and Pattern Recognition · Computer Science 2024-12-05 Siyu Jiao , Hongguang Zhu , Jiannan Huang , Yao Zhao , Yunchao Wei , Humphrey Shi

Image Recognition with Vision and Language Embeddings of VLMs

Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Illia Volkov , Nikita Kisel , Klara Janouskova , Jiri Matas

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most of the existing CLIP-alike works usually adopt relatively large image…

Computer Vision and Pattern Recognition · Computer Science 2023-12-04 Ying Nie , Wei He , Kai Han , Yehui Tang , Tianyu Guo , Fanyi Du , Yunhe Wang

MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation

CLIP (Contrastive Language-Image Pretraining) is well-developed for open-vocabulary zero-shot image-level recognition, while its applications in pixel-level tasks are less investigated, where most efforts directly adopt CLIP features…

Computer Vision and Pattern Recognition · Computer Science 2023-04-17 Jie Guo , Qimeng Wang , Yan Gao , Xiaolong Jiang , Xu Tang , Yao Hu , Baochang Zhang

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Feng Liang , Bichen Wu , Xiaoliang Dai , Kunpeng Li , Yinan Zhao , Hang Zhang , Peizhao Zhang , Peter Vajda , Diana Marculescu

Delving into Shape-aware Zero-shot Semantic Segmentation

Thanks to the impressive progress of large-scale vision-language pretraining, recent recognition models can classify arbitrary objects in a zero-shot and open-set manner, with a surprisingly high accuracy. However, translating this success…

Computer Vision and Pattern Recognition · Computer Science 2023-04-18 Xinyu Liu , Beiwen Tian , Zhen Wang , Rui Wang , Kehua Sheng , Bo Zhang , Hao Zhao , Guyue Zhou

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Recently, the contrastive language-image pre-training, e.g., CLIP, has demonstrated promising results on various downstream tasks. The pre-trained model can capture enriched visual concepts for images by learning from a large scale of…

Computer Vision and Pattern Recognition · Computer Science 2023-06-21 Huaishao Luo , Junwei Bao , Youzheng Wu , Xiaodong He , Tianrui Li