Compositional Generalization in Image Captioning

Mitja Nikolaus; Mostafa Abdou; Matthew Lamm; Rahul Aralikatte; Desmond Elliott

doi:10.18653/v1/K19-1009

2019-11-12 (v2). Subjects: Computation and Language; Computer Vision and Pattern Recognition; Machine Learning

Abstract

Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models generalize poorly on this task. To address this, we propose a multi-task model that combines caption generation with image-sentence ranking and uses a decoding mechanism that re-ranks the captions according to their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts than state-of-the-art captioning models.
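
The re-ranking step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that candidate captions and the image have already been embedded in a shared vector space and that similarity is measured by cosine similarity. The abstract specifies neither, so the function names (`cosine_similarity`, `rerank_captions`) and the toy two-dimensional embeddings below are hypothetical and do not reproduce the paper's model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank_captions(image_embedding, candidates):
    """Order candidate captions by similarity to the image, most similar first.

    `candidates` is a list of (caption, caption_embedding) pairs. In a real
    system the embeddings would come from a trained image-sentence ranking
    model; here they are supplied by hand (an assumption for illustration).
    """
    scored = [
        (cosine_similarity(image_embedding, embedding), caption)
        for caption, embedding in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored]


# Toy example: made-up embeddings, chosen so the second caption lies closer to the image.
image_embedding = [1.0, 0.0]
candidates = [
    ("a black dog on a couch", [0.2, 0.9]),
    ("a white dog on a couch", [0.9, 0.1]),
]
ranked = rerank_captions(image_embedding, candidates)
```

In this toy setup the caption whose embedding lies closest to the image embedding is moved to the front, which is the general idea of letting a ranking signal override the generator's own ordering of candidates.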

Keywords

image captioning; image retrieval; image representation learning

Cite

@article{arxiv.1909.04402,
  title  = {Compositional Generalization in Image Captioning},
  author = {Mitja Nikolaus and Mostafa Abdou and Matthew Lamm and Rahul Aralikatte and Desmond Elliott},
  journal= {arXiv preprint arXiv:1909.04402},
  year   = {2019}
}

Comments

To appear at CoNLL 2019, EMNLP
