
Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics

Computer Vision and Pattern Recognition 2022-08-02 v1

Abstract

The language modality within the vision-language pretraining framework is innately discretized, endowing each word in the language vocabulary with a semantic meaning. In contrast, the visual modality is inherently continuous and high-dimensional, which potentially hinders both the alignment and the fusion of the vision and language modalities. We therefore propose to "discretize" the visual representation by jointly learning a codebook that imbues each visual token with a semantic meaning. We then use these discretized visual semantics as self-supervised ground truths for building our Masked Image Modeling objective, a counterpart of Masked Language Modeling, which has proven successful for language models. To optimize the codebook, we extend the formulation of VQ-VAE, which provides a theoretical guarantee. Experiments validate the effectiveness of our approach across common vision-language benchmarks.
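To make the idea of "discretizing" visual features concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the standard VQ-VAE quantization step (nearest-neighbour lookup in a codebook) and a simple random masking scheme. The function names `quantize` and `masked_image_targets`, the toy codebook, and the 50% mask ratio are all hypothetical choices made for illustration.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(patch_features, codebook):
    """Map each continuous patch feature to the index of its nearest codebook entry.

    This is the nearest-neighbour lookup used in standard VQ-VAE; the paper's
    extended formulation may differ.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(feat, codebook[k]))
        for feat in patch_features
    ]


def masked_image_targets(code_ids, mask_ratio=0.5, seed=0):
    """Choose patch positions to mask; their code indices act as prediction targets.

    Returns a dict {masked position: codebook index}, analogous to the
    {masked position: word id} targets of Masked Language Modeling.
    """
    rng = random.Random(seed)
    n_masked = max(1, int(len(code_ids) * mask_ratio))
    masked_positions = sorted(rng.sample(range(len(code_ids)), n_masked))
    return {pos: code_ids[pos] for pos in masked_positions}


# Toy data (hypothetical): three 2-D codebook entries and four patch features.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
patch_features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

code_ids = quantize(patch_features, codebook)  # [0, 1, 2, 0]
targets = masked_image_targets(code_ids, mask_ratio=0.5)
```

In a real model the codebook entries would be learned jointly with the encoder, and a network would be trained to predict each masked patch's code index from the surrounding visual and textual context. The sketch only shows how continuous features become discrete labels.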

Cite

@article{arxiv.2208.00475,
  title  = {Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics},
  author = {Xiaoyuan Guo and Jiali Duan and C. -C. Jay Kuo and Judy Wawira Gichoya and Imon Banerjee},
  journal= {arXiv preprint arXiv:2208.00475},
  year   = {2022}
}

Comments

7 pages, 4 figures, ICPR 2022. arXiv admin note: text overlap with arXiv:2203.00048
