Visual Zero-Shot E-Commerce Product Attribute Value Extraction

Jiaying Gong; Ming Cheng; Hongda Shen; Pierre-Yves Vandenbussche; Janet Jenq; Hoda Eldardiry

Subjects: Information Retrieval; Computer Vision and Pattern Recognition | 2025-02-25 (v1)

Abstract

Existing zero-shot product attribute value (aspect) extraction approaches in the e-commerce industry rely on uni-modal or multi-modal models that require sellers to provide detailed textual inputs (product descriptions). However, typing product descriptions manually is time-consuming and frustrating for sellers. We therefore propose a cross-modal zero-shot attribute value generation framework (ViOC-AG) based on CLIP, which requires only product images as input. ViOC-AG follows a text-only training process in which a task-customized text decoder is trained with the frozen CLIP text encoder to alleviate the modality gap and task disconnection. During zero-shot inference, product aspects are generated by the frozen CLIP image encoder connected to the trained task-customized text decoder. OCR tokens and outputs from a frozen prompt-based LLM correct the decoded outputs for out-of-domain attribute values. Experiments show that ViOC-AG significantly outperforms other fine-tuned vision-language models for zero-shot attribute value extraction.
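
The train-on-text, infer-on-image swap described in the abstract can be illustrated with a toy sketch. This is not the ViOC-AG implementation: the two encoders are random stand-ins rather than CLIP, a nearest-prototype lookup stands in for the generative text decoder, and every name here (`encode_text`, `encode_image`, `train_decoder`, `decode`, `MODALITY_GAP`, the attribute values) is hypothetical. It only shows that a decoder fitted on text embeddings can be applied to image embeddings when both encoders map into a shared space.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64
VALUES = ["red", "blue", "cotton", "leather"]  # hypothetical attribute values

# Hypothetical shared semantics: one latent vector per attribute value.
CONCEPTS = {v: rng.normal(size=DIM) for v in VALUES}
# Constant offset added to image embeddings to mimic a modality gap.
MODALITY_GAP = 0.2 * rng.normal(size=DIM)


def _unit(x):
    """Scale a vector to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x)


def encode_text(value):
    """Stand-in for a frozen text encoder (an assumption, not CLIP)."""
    return _unit(CONCEPTS[value] + 0.1 * rng.normal(size=DIM))


def encode_image(value):
    """Stand-in for a frozen image encoder applied to a photo showing `value`."""
    return _unit(CONCEPTS[value] + 0.1 * rng.normal(size=DIM) + MODALITY_GAP)


def train_decoder(values, samples_per_value=8):
    """Text-only training: fit one prototype per value from text embeddings."""
    return {
        v: _unit(np.mean([encode_text(v) for _ in range(samples_per_value)], axis=0))
        for v in values
    }


def decode(embedding, prototypes):
    """Return the value whose prototype has the highest cosine similarity."""
    return max(prototypes, key=lambda v: float(embedding @ prototypes[v]))


# Training never sees an image embedding; inference never sees text.
prototypes = train_decoder(VALUES)
predictions = {v: decode(encode_image(v), prototypes) for v in VALUES}
print(predictions)
```

In this toy setting the modality gap is a small constant offset, so the text-trained prototypes still separate the image embeddings. The abstract attributes the handling of the real gap to the task-customized decoder, with OCR tokens and a frozen LLM correcting outputs for out-of-domain values; neither correction step is modeled here.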

Cite

@article{arxiv.2502.15979,
  title  = {Visual Zero-Shot E-Commerce Product Attribute Value Extraction},
  author = {Jiaying Gong and Ming Cheng and Hongda Shen and Pierre-Yves Vandenbussche and Janet Jenq and Hoda Eldardiry},
  journal= {arXiv preprint arXiv:2502.15979},
  year   = {2025}
}

Comments

10 pages, 4 figures, accepted for publication in NAACL 2025 Industry Track
