MIVC: Multiple Instance Visual Component for Visual-Language Models

Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language. 2023-12-29, v1

Abstract

Vision-language models have been explored across a wide range of tasks and achieve satisfactory performance. However, how to consolidate entity understanding from a varying number of images and align it with pre-trained language models for generative tasks remains under-explored. In this paper, we propose MIVC, a general multiple instance visual component that bridges the gap between various image inputs and off-the-shelf vision-language models by aggregating visual representations in a permutation-invariant fashion through a neural network. We show that MIVC can be plugged into visual-language models to consistently improve performance on visual question answering, classification, and captioning tasks on a publicly available e-commerce dataset with multiple images per product. Furthermore, we show that the component provides insight into the contribution of each image to the downstream tasks.
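
To make the idea of permutation-invariant aggregation concrete, the following is a minimal sketch of attention-based pooling over a variable number of image embeddings. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the aggregator, so the attention form (a `tanh` projection followed by a softmax), the function name `attention_pool`, and the parameters `V` and `w` are all hypothetical. The random values stand in for learned weights and for the output of a visual encoder.

```python
import math
import random


def attention_pool(embeddings, V, w):
    """Pool n image embeddings (each of length d) into one vector.

    V (hidden x d) and w (length hidden) are hypothetical learned
    parameters; this attention form is an assumption, not the paper's.
    Returns (pooled vector of length d, list of n attention weights).
    """
    # Score each image independently of its position in the list.
    scores = []
    for h in embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wk * uk for wk, uk in zip(w, hidden)))

    # Numerically stable softmax over the images.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of embeddings: reordering the images does not change it.
    d = len(embeddings[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, embeddings)) for j in range(d)]
    return pooled, weights


# Stand-in data: 5 images of one product, embedding size 4 (assumed sizes).
rng = random.Random(0)
d, hidden = 4, 3
V = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(hidden)]
w = [rng.uniform(-1, 1) for _ in range(hidden)]
images = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

pooled, weights = attention_pool(images, V, w)
# Same images in reverse order: the pooled vector should match.
pooled_rev, weights_rev = attention_pool(images[::-1], V, w)
```

In this sketch the pooled vector has the same size regardless of how many images are supplied, so it could stand in for a single-image representation passed to a downstream model. The per-image `weights` sum to one and give one possible reading of how much each image contributes, in the spirit of the interpretability claim above.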

Cite

@article{arxiv.2312.17109,
  title  = {MIVC: Multiple Instance Visual Component for Visual-Language Models},
  author = {Wenyi Wu and Qi Li and Wenliang Zhong and Junzhou Huang},
  journal= {arXiv preprint arXiv:2312.17109},
  year   = {2023}
}

Comments

Accepted at WACV 2024
