VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing
Abstract
Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain running from representation dynamics to causal control attribution to behavioral manifestation. Specifically, it combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.
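As an illustration of the CKA measurement mentioned in the abstract, the following is a minimal sketch of linear CKA between two sets of activations. The abstract does not say which CKA variant or kernel VLA-Trace uses, so the linear form here is an assumption, and the function name `linear_cka` and the array shapes are hypothetical choices made for illustration.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    x has shape (n_samples, dim_x) and y has shape (n_samples, dim_y);
    rows must correspond to the same inputs. Assumption: the linear
    kernel is used, which the abstract does not confirm.
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

Under this sketch, a checkpoint-drift comparison would pass activations from the same layer of the pretrained and finetuned checkpoints on identical inputs, while a cross-modal comparison would pass vision-token and language-token activations from the same model. Scores close to 1 indicate similar representations up to rotation and isotropic scaling.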
Cite
@article{arxiv.2605.30117,
title = {VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing},
author = {Haoyuan Shi and Xiancong Ren and Yingji Zhang and Qinfan Zhang and Jiayu Hu and Haozhe Shan and Han Dong and Jinpeng Lu and Yinda Chen and Yi Zhang and Yong Dai and Xiaozhu Ju},
journal = {arXiv preprint arXiv:2605.30117},
year = {2026}
}