Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER

Machine Learning 2023-08-08 v1 Artificial Intelligence Computation and Language Computer Vision and Pattern Recognition

Abstract

The challenge posed by multimodal named entity recognition (MNER) is mainly two-fold: (1) bridging the semantic gap between text and image and (2) matching each entity with its associated object in the image. Existing methods fail to capture the implicit entity-object relations because the corresponding annotations are lacking. In this paper, we propose a bidirectional generative alignment method named BGA-MNER to tackle these issues. BGA-MNER consists of image2text and text2image generation with respect to entity-salient content in the two modalities. It jointly optimizes the bidirectional reconstruction objectives, which aligns the implicit entity-object relations under these direct and powerful constraints. Furthermore, image-text pairs usually contain unmatched components that are noisy for generation, so a stage-refined context sampler is proposed to extract the matched cross-modal content for generation. Extensive experiments on two benchmarks demonstrate that our method achieves state-of-the-art performance without image input during inference.

Cite

@article{arxiv.2308.02570,
  title  = {Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER},
  author = {Feng Chen and Jiajia Liu and Kaixiang Ji and Wang Ren and Jian Wang and Jingdong Wang},
  journal= {arXiv preprint arXiv:2308.02570},
  year   = {2023}
}