Learning Visual Proxy for Compositional Zero-Shot Learning

Computer Vision and Pattern Recognition 2025-09-03 v4

Abstract

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions by leveraging knowledge from seen compositions. Current methods align textual prototypes with visual features via Vision-Language Models (VLMs), but they suffer from two limitations: (1) modality gaps hinder the discrimination of semantically similar pairs, and (2) single-modal textual prototypes lack fine-grained visual cues. In this paper, we introduce Visual Proxy Learning, a method that reduces modality gaps and enhances compositional generalization. We initialize visual proxies for attributes, objects, and their compositions using text representations, then optimize the visual space so that the proxies capture fine-grained visual cues. Additionally, we propose Cross-Modal Joint Learning (CMJL), which imposes cross-modal constraints between the text-image space and the fine-grained visual space, improving generalization to unseen compositions and the discrimination of similar pairs. Experiments on four CZSL benchmarks show state-of-the-art performance in closed-world scenarios and competitive results in open-world settings, demonstrating the effectiveness of our approach for compositional generalization.
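
The two ideas named in the abstract, proxies initialized from text representations and a cross-modal constraint between the two spaces, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the scoring function or the form of the constraint, so cosine-similarity logits, the temperature of 0.07, and a KL-divergence consistency term are assumptions made here, and every name and embedding value is hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(feature, prototypes, temperature=0.07):
    """Temperature-scaled cosine similarity between one feature and each prototype."""
    f = l2_normalize(feature)
    return [
        sum(a * b for a, b in zip(f, l2_normalize(p))) / temperature
        for p in prototypes
    ]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); an assumed stand-in for a cross-modal constraint."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical text embeddings for three compositions; the first two are
# deliberately close to mimic a semantically similar pair.
text_prototypes = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [0.0, 1.0, 0.1]]

# Visual proxies start as copies of the text prototypes (assumed initialization).
visual_proxies = [list(p) for p in text_prototypes]

# Hypothetical image feature.
image_feature = [0.95, 0.05, 0.25]

p_text = softmax(cosine_logits(image_feature, text_prototypes))
p_proxy = softmax(cosine_logits(image_feature, visual_proxies))
consistency_at_init = kl_divergence(p_proxy, p_text)

# Stand-in for training: move one proxy onto the image feature so the
# visual branch diverges from the text branch.
visual_proxies[1] = list(image_feature)
p_proxy = softmax(cosine_logits(image_feature, visual_proxies))
consistency_after_update = kl_divergence(p_proxy, p_text)

print(consistency_at_init, consistency_after_update)
```

In this sketch the two branches agree exactly at initialization, so the consistency term is zero. It becomes positive once the proxies move away from the text prototypes, which is the kind of quantity a cross-modal constraint could penalize during training.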

Cite

@article{arxiv.2501.13859,
  title  = {Learning Visual Proxy for Compositional Zero-Shot Learning},
  author = {Shiyu Zhang and Cheng Yan and Yang Liu and Chenchen Jing and Lei Zhou and Wenjun Wang},
  journal= {arXiv preprint arXiv:2501.13859},
  year   = {2025}
}