Computation & Language · arXiv:2605.29496

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

Abstract

Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2 points. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0 points; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
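As a rough illustration of the two interventions named above, the following minimal sketch contrasts a plain token-mean SFT loss with a segment-balanced loss, and mixes an outcome reward with a perception reward. It is not the paper's implementation: the abstract does not specify the dynamic reweighting rule or the reward formulation, so the static per-segment balancing, the function names (`token_mean_loss`, `segment_balanced_loss`, `combined_reward`), the segment labels, and the mixing coefficient `alpha` are all assumptions made for illustration.

```python
from collections import defaultdict


def token_mean_loss(token_losses):
    """Standard SFT objective: average the loss over all tokens."""
    return sum(token_losses) / len(token_losses)


def segment_balanced_loss(token_losses, segments):
    """Average the per-segment mean losses so each segment counts equally.

    Assumption: every token carries a label such as "perception" or
    "reasoning". This static balancing stands in for the paper's dynamic
    reweighting, whose exact form is not given in the abstract.
    """
    if len(token_losses) != len(segments):
        raise ValueError("token_losses and segments must have equal length")
    grouped = defaultdict(list)
    for loss, segment in zip(token_losses, segments):
        grouped[segment].append(loss)
    segment_means = [sum(v) / len(v) for v in grouped.values()]
    return sum(segment_means) / len(segment_means)


def combined_reward(outcome_reward, perception_reward, alpha=0.5):
    """Hypothetical perception-aware reward: a convex mix of both signals.

    alpha is an illustrative mixing coefficient, not a value from the paper.
    """
    return (1.0 - alpha) * outcome_reward + alpha * perception_reward


# Toy chain of thought: 2 perception tokens followed by 8 reasoning tokens.
losses = [2.0, 2.0] + [1.0] * 8
labels = ["perception"] * 2 + ["reasoning"] * 8

plain = token_mean_loss(losses)                   # perception holds 2/10 of the weight
balanced = segment_balanced_loss(losses, labels)  # perception holds 1/2 of the weight
reward = combined_reward(outcome_reward=1.0, perception_reward=0.0)
print(plain, balanced, reward)
```

In this toy case the perception tokens account for one fifth of the plain token-mean objective but one half of the balanced one, which is the kind of shift in training signal that reweighting is meant to produce.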

Comments: Project: https://asymmetric-vlm-post-training.github.io/

Cite

@article{arxiv.2605.29496,
  title  = {On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training},
  author = {Xueqing Wu and Yu-Chi Lin and Kai-Wei Chang and Nanyun Peng},
  journal= {arXiv preprint arXiv:2605.29496},
  year   = {2026}
}