Generating captions without looking beyond objects

Computer Vision and Pattern Recognition; Computation and Language. 2016-10-19 (v2)

Abstract

This paper explores new evaluation perspectives for image captioning and introduces a noun translation task that achieves comparable image caption generation performance by translating from a set of nouns to captions. This implies that, in image captioning, all word categories other than nouns can be supplied by a powerful language model without sacrificing performance on n-gram precision. The paper also investigates lower and upper bounds on how much individual word categories in the captions contribute to the final BLEU score. Large potential improvements exist for nouns, verbs, and prepositions.
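For readers unfamiliar with the n-gram precision that BLEU is built on, the following is a minimal sketch of clipped (modified) n-gram precision for one candidate caption against one or more reference captions. It is an illustration only, not the authors' evaluation code: the function names `ngrams` and `modified_precision` and the whitespace tokenization are assumptions, and full BLEU additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against reference captions.

    Each candidate n-gram count is clipped to the maximum number of times
    that n-gram occurs in any single reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens

    # Maximum count of each n-gram over all references.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)

    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


if __name__ == "__main__":
    # Hypothetical captions, tokenized by whitespace for illustration.
    candidate = "a dog runs on the grass".split()
    references = ["a dog is running on the grass".split()]
    for n in (1, 2):
        print(n, modified_precision(candidate, references, n))
```

Because the measure counts matching word sequences, a caption whose nouns are correct and whose remaining words are filled in fluently can still score well, which is the property the abstract relies on.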

Cite

@article{arxiv.1610.03708,
  title  = {Generating captions without looking beyond objects},
  author = {Hendrik Heuer and Christof Monz and Arnold W. M. Smeulders},
  journal= {arXiv preprint arXiv:1610.03708},
  year   = {2016}
}

Comments

This paper was presented at the ECCV 2016 2nd Workshop on Storytelling with Images and Videos (VisStory).
