Differentiable Hierarchical Visual Tokenization

Computer Vision and Pattern Recognition 2025-11-05 v1

Abstract

Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.

Cite

@article{arxiv.2511.02652,
  title  = {Differentiable Hierarchical Visual Tokenization},
  author = {Marius Aasan and Martine Hjelkrem-Tan and Nico Catalano and Changkyu Choi and Adín Ramírez Rivera},
  journal= {arXiv preprint arXiv:2511.02652},
  year   = {2025}
}

Comments

NeurIPS 2025 Spotlight
