Visual Program Distillation with Template-Based Augmentation

Computer Vision and Pattern Recognition 2025-11-05 v4 Computation and Language

Abstract

Adapting visual programming, or prompting large language models (LLMs) to generate executable code for visual tasks like visual question answering (VQA), to specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that can be used for models with at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs, with the added benefit of much faster inference.
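As a rough illustration of the template/argument decoupling mentioned in the abstract, the sketch below splits a toy visual program into a template with numbered slots and a list of arguments, then refills the template with new arguments to produce synthetic variants. It is not the authors' implementation: the program syntax (`count`, `find`), the `<ARGi>` slot notation, the helpers `decouple` and `fill`, and the choice to treat quoted string literals as the arguments are all assumptions made for this example.

```python
import re

# Assumption: every double-quoted string literal in a program is an "argument".
_ARG = re.compile(r'"([^"]*)"')


def decouple(program):
    """Split a program into a template with <ARGi> slots and its argument list."""
    args = []

    def _slot(match):
        args.append(match.group(1))
        return f"<ARG{len(args) - 1}>"

    template = _ARG.sub(_slot, program)
    return template, args


def fill(template, args):
    """Re-insert arguments (as quoted strings) into a template's slots."""
    program = template
    for i, arg in enumerate(args):
        program = program.replace(f"<ARG{i}>", f'"{arg}"')
    return program


# Hypothetical toy program for a counting question.
program = 'answer = count(find(image, "dog"))'
template, args = decouple(program)

# Augmentation: reuse the same template (skill) with different arguments.
augmented = [fill(template, [obj]) for obj in ["cat", "bicycle"]]
```

Here `template` captures the reusable skill (count the instances of some object) while `args` holds the question-specific content, so a single example program can yield several synthetic ones.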

Cite

@article{arxiv.2412.08564,
  title  = {Visual Program Distillation with Template-Based Augmentation},
  author = {Michal Shlapentokh-Rothman and Yu-Xiong Wang and Derek Hoiem},
  journal= {arXiv preprint arXiv:2412.08564},
  year   = {2025}
}

Comments

EMNLP Camera Ready
