Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning

Computer Vision and Pattern Recognition 2024-09-02 v1

Abstract

To transfer knowledge from seen attribute-object compositions to recognize unseen ones, recent compositional zero-shot learning (CZSL) methods mainly focus on designing optimal classification branches to identify the elements, which has made the three-branch architecture popular. However, these methods conflate the underlying relationships among the branches in terms of consistency and diversity. Specifically, feeding the same highest-level features to all three branches makes it harder to distinguish classes that are superficially similar. Furthermore, a single branch may focus on suboptimal regions when spatial information is not shared among the personalized branches. To address these issues, we propose a novel method called Focus-Consistent Multi-Level Aggregation (FOMA). Our method incorporates a Multi-Level Feature Aggregation (MFA) module that generates personalized features for each branch based on the image content. Additionally, a Focus-Consistent Constraint encourages all branches to focus consistently on the informative regions, thereby implicitly exchanging spatial information among them. Extensive experiments on three benchmark datasets (UT-Zappos, C-GQA, and Clothing16K) demonstrate that FOMA outperforms state-of-the-art (SOTA) methods.
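The abstract names two mechanisms without giving their equations: content-dependent aggregation of multi-level features for each branch, and a constraint that keeps the branches' spatial focus consistent. The sketch below shows one plausible shape for each under stated assumptions; it is not the FOMA implementation. The dot-product gate, the per-branch `branch_query` vectors, the pairwise squared-L2 consistency loss, and all function names are hypothetical choices made for illustration, and the level features are assumed to be already projected to a common dimension.

```python
import math
from itertools import combinations


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_levels(level_feats, branch_query):
    # Hypothetical content-conditioned aggregation for ONE branch.
    # level_feats: L feature vectors (one per backbone level), assumed already
    # projected to a common dimension D.
    # branch_query: hypothetical learned vector of length D for this branch.
    scores = [sum(q * x for q, x in zip(branch_query, f)) for f in level_feats]
    weights = softmax(scores)  # depends on the image content through level_feats
    dim = len(level_feats[0])
    fused = [sum(w * f[d] for w, f in zip(weights, level_feats)) for d in range(dim)]
    return fused, weights


def focus_consistency_loss(attn_logits_per_branch):
    # Hypothetical constraint: turn each branch's spatial logits (one score per
    # location) into a distribution, then average the squared L2 distance over
    # all branch pairs. The loss is zero when every branch attends identically.
    maps = [softmax(logits) for logits in attn_logits_per_branch]
    pairs = list(combinations(maps, 2))
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in pairs)
    return total / len(pairs)


# Toy usage: 3 levels with D = 2, and three branches
# (attribute, object, composition), each with its own hypothetical query.
levels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = {"attr": [2.0, 0.0], "obj": [0.0, 2.0], "comp": [0.0, 0.0]}
branch_feats = {name: aggregate_levels(levels, q)[0] for name, q in queries.items()}

# Spatial logits over 4 locations (a flattened 2x2 map) for each branch.
aligned = [[3.0, 0.0, 0.0, 0.0]] * 3
diverged = [[3.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
print(focus_consistency_loss(aligned))  # 0.0
print(focus_consistency_loss(diverged) > 0)  # True
```

Because the level weights are computed from the features themselves, two images can send different mixes of low- and high-level features to the same branch, which is the kind of per-branch personalization the abstract motivates. In a training loop, a term like `focus_consistency_loss` would be added to the classification losses so that branches attending to different locations are penalized.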

Cite

@article{arxiv.2408.17083,
  title  = {Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning},
  author = {Fengyuan Dai and Siteng Huang and Min Zhang and Biao Gong and Donglin Wang},
  journal= {arXiv preprint arXiv:2408.17083},
  year   = {2024}
}

Comments

Compositional Zero-Shot Learning