Understanding Transformer-based Vision Models through Inversion

Jan Rathjens; Shirin Reyhanian; David Kappel; Laurenz Wiskott

Computer Vision and Pattern Recognition; 2025-08-15; v4; Artificial Intelligence; Machine Learning; Neural and Evolutionary Computing

Abstract

Understanding the mechanisms underlying deep neural networks remains a fundamental challenge in machine learning and computer vision. One promising yet only preliminarily explored approach is feature inversion, which attempts to reconstruct images from intermediate representations using trained inverse neural networks. In this study, we revisit feature inversion and introduce a novel, modular variation that enables significantly more efficient application of the technique. We demonstrate how our method can be systematically applied to two large-scale transformer-based vision models, Detection Transformer and Vision Transformer, and how the reconstructed images can be interpreted qualitatively in a meaningful way. We further evaluate our method quantitatively, thereby uncovering underlying mechanisms for representing image features that emerge in the two transformer architectures. Our analysis reveals key insights into how these models encode contextual shape and image details, how their layers correlate, and how robust they are to color perturbations. These findings contribute to a deeper understanding of transformer-based vision models and their internal representations. The code for reproducing our experiments is available at github.com/wiskott-lab/inverse-tvm.
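
As a rough illustration of the idea of feature inversion described above, the following toy sketch trains one small inverse module per frozen forward stage and then chains the inverse modules to reconstruct an input from a deeper representation. It is not the authors' implementation. Reading "modular" as "one separately trained inverse per stage" is an assumption made here for illustration, and the names `stage1`, `stage2`, and `AffineInverse`, as well as the affine stages themselves, are hypothetical stand-ins for real model components.

```python
import random


def stage1(x):
    """Hypothetical frozen early stage of a forward model (elementwise affine)."""
    return [2.0 * v + 1.0 for v in x]


def stage2(h):
    """Hypothetical frozen later stage of a forward model (elementwise affine)."""
    return [-0.5 * v + 0.5 for v in h]


class AffineInverse:
    """Toy inverse module: learns a * z + b to undo a single forward stage."""

    def __init__(self):
        self.a = 0.0
        self.b = 0.0

    def __call__(self, z):
        return [self.a * v + self.b for v in z]

    def fit(self, inputs, targets, lr=0.05, steps=2000):
        # Full-batch gradient descent on the mean squared reconstruction error.
        n = len(inputs)
        for _ in range(steps):
            grad_a = 0.0
            grad_b = 0.0
            for z, t in zip(inputs, targets):
                err = self.a * z + self.b - t
                grad_a += 2.0 * err * z / n
                grad_b += 2.0 * err / n
            self.a -= lr * grad_a
            self.b -= lr * grad_b
        return self


# Training data: random "pixel" values and their intermediate representations.
rng = random.Random(0)
pixels = [rng.uniform(-1.0, 1.0) for _ in range(64)]
hidden = stage1(pixels)
deep = stage2(hidden)

# Each inverse module is trained on its own, targeting the preceding stage.
inv1 = AffineInverse().fit(hidden, pixels)  # stage-1 output -> input
inv2 = AffineInverse().fit(deep, hidden)    # stage-2 output -> stage-1 output

# Chain the inverse modules to reconstruct an unseen "image" from the deep layer.
image = [0.3, -0.7, 0.9, 0.0]
reconstruction = inv1(inv2(stage2(stage1(image))))
max_error = max(abs(r - p) for r, p in zip(reconstruction, image))
print("max reconstruction error:", max_error)
```

In this toy setting the stages are exactly invertible, so the reconstruction error shrinks toward zero. With real vision models, information is discarded along the way, and what the reconstruction retains or loses is precisely what makes the inverted images informative.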

Keywords

image representation learning; vision transformer; image reconstruction

Cite

@article{arxiv.2412.06534,
  title  = {Understanding Transformer-based Vision Models through Inversion},
  author = {Jan Rathjens and Shirin Reyhanian and David Kappel and Laurenz Wiskott},
  journal= {arXiv preprint arXiv:2412.06534},
  year   = {2025}
}
