Visual Set Program Synthesizer

Multimedia 2026-03-18 v1 Computation and Language Symbolic Computation

Abstract

A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.
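
To make the generate-then-execute idea concrete, the following is a minimal Python sketch of how a symbolic set program for the soda query might be run by a separate executor. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the program language, so the operation names (`filter`, `argmin`, `count`, `get`), the `SceneObject` fields, and the shelf contents are all hypothetical, and the program is written by hand where the proposed system would have a model generate it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """One detected item. Hypothetical fields; a real system would fill
    these from a perception module grounded in the image."""
    name: str
    category: str
    sugar_g: float


# Hypothetical scene standing in for the detected shelf contents.
SHELF = [
    SceneObject("Cola Classic", "soda", 39.0),
    SceneObject("Lemon Fizz", "soda", 26.0),
    SceneObject("Orange Juice", "juice", 21.0),
    SceneObject("Diet Cola", "soda", 0.0),
]

# Hand-written stand-in for a synthesized program answering
# "Which soda has the least sugar?". Each step is (operation, *arguments).
PROGRAM = [
    ("filter", "category", "soda"),  # keep only the sodas
    ("argmin", "sugar_g"),           # pick the one with the least sugar
    ("get", "name"),                 # return its name as the answer
]


def execute(program, scene):
    """Run each step in order, threading the intermediate value through."""
    value = list(scene)
    for op, *args in program:
        if op == "filter":
            attr, target = args
            value = [obj for obj in value if getattr(obj, attr) == target]
        elif op == "argmin":
            (attr,) = args
            value = min(value, key=lambda obj: getattr(obj, attr))
        elif op == "count":
            value = len(value)
        elif op == "get":
            (attr,) = args
            value = getattr(value, attr)
        else:
            raise ValueError(f"unknown operation: {op}")
    return value


print(execute(PROGRAM, SHELF))  # prints "Diet Cola"
```

Because every intermediate set is an ordinary Python value, each step can be inspected on its own, which is the kind of transparency the abstract attributes to program-driven reasoning.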

Cite

@article{arxiv.2603.15997,
  title  = {Visual Set Program Synthesizer},
  author = {Zehua Cheng and Wei Dai and Wenhu Zhang and Thomas Lukasiewicz and Jiahao Sun},
  journal= {arXiv preprint arXiv:2603.15997},
  year   = {2026}
}

Comments

10 pages, IEEE International Conference on Multimedia and Expo 2026
