Related papers: Differentiable Hierarchical Visual Tokenization

A Spitting Image: Modular Superpixel Tokenization in Vision Transformers

Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization independent of the semantic content of an image. We propose a modular superpixel tokenization strategy which decouples tokenization and…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Marius Aasan, Odd Kolbjørnsen, Anne Schistad Solberg, Adín Ramirez Rivera

Boosting vision transformers for image retrieval

Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional…

Computer Vision and Pattern Recognition · Computer Science 2022-10-24 Chull Hwan Song, Jooyoung Yoon, Shunghyun Choi, Yannis Avrithis

Patch Is Not All You Need

Vision Transformers have achieved great success in computer vision, delivering exceptional performance across various tasks. However, their inherent reliance on sequential input enforces the manual partitioning of images into patch…

Computer Vision and Pattern Recognition · Computer Science 2023-08-22 Changzhen Li, Jie Zhang, Yang Wei, Zhilong Ji, Jinfeng Bai, Shiguang Shan

Vision Transformers with Mixed-Resolution Tokenization

Vision Transformer models process input images by dividing them into a spatially regular grid of equal-size patches. Conversely, Transformers were originally introduced over natural language sequences, where each token represents a subword…

Computer Vision and Pattern Recognition · Computer Science 2023-04-28 Tomer Ronen, Omer Levy, Avram Golbert

Robust Visual Tracking via Hierarchical Convolutional Features

In this paper, we propose to exploit the rich hierarchical features of deep convolutional neural networks to improve the accuracy and robustness of visual tracking. Deep neural networks trained on object recognition datasets consist of…

Computer Vision and Pattern Recognition · Computer Science 2018-08-14 Chao Ma, Jia-Bin Huang, Xiaokang Yang, Ming-Hsuan Yang

Spectral Image Tokenizer

Image tokenizers map images to sequences of discrete tokens, and are a crucial component of autoregressive transformer-based image generation. The tokens are typically associated with spatial locations in the input image, arranged in raster…

Computer Vision and Pattern Recognition · Computer Science 2025-06-12 Carlos Esteves, Mohammed Suhail, Ameesh Makadia

Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance;…

Computer Vision and Pattern Recognition · Computer Science 2020-11-23 Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, Peter Vajda

Scalable Vision Transformers with Hierarchical Pooling

The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, the routine of the current ViT model is to maintain a…

Computer Vision and Pattern Recognition · Computer Science 2021-08-19 Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai

Wavelet-Based Image Tokenizer for Vision Transformers

Non-overlapping patch-wise convolution is the default image tokenizer for all state-of-the-art vision Transformer (ViT) models. Even though many ViT variants have been proposed to improve its efficiency and accuracy, little research on…

Computer Vision and Pattern Recognition · Computer Science 2024-05-30 Zhenhai Zhu, Radu Soricut

Vision Transformers: From Semantic Segmentation to Dense Prediction

The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representation at full receptive field per layer across all the image…

Computer Vision and Pattern Recognition · Computer Science 2024-08-05 Li Zhang, Jiachen Lu, Sixiao Zheng, Xinxuan Zhao, Xiatian Zhu, Yanwei Fu, Tao Xiang, Jianfeng Feng, Philip H. S. Torr

Convolutional Transformer-Based Image Compression

In this paper, we present a novel transformer-based architecture for end-to-end image compression. Our architecture incorporates blocks that effectively capture local dependencies between tokens, eliminating the need for positional encoding…

Image and Video Processing · Electrical Eng. & Systems 2024-09-09 Bouzid Arezki, Fangchen Feng, Anissa Mokraoui

Analyzing Vision Transformers for Image Classification in Class Embedding Space

Despite the growing use of transformer models in computer vision, a mechanistic understanding of these networks is still needed. This work introduces a method to reverse-engineer Vision Transformers trained to solve image classification…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig

Exploring vision transformer layer choosing for semantic segmentation

Extensive work has demonstrated the effectiveness of Vision Transformers. The plain Vision Transformer tends to obtain multi-scale features by selecting fixed layers, or the last layer of features aiming to achieve higher performance in…

Computer Vision and Pattern Recognition · Computer Science 2023-05-10 Fangjian Lin, Yizhe Ma, Shengwei Tian

Vision Transformers Need Better Token Interaction

Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Linxiang Su

Disentangling Visual Transformers: Patch-level Interpretability for Image Classification

Visual transformers have achieved remarkable performance in image classification tasks, but this performance gain has come at the cost of interpretability. One of the main obstacles to the interpretation of transformers is the…

Computer Vision and Pattern Recognition · Computer Science 2025-04-25 Guillaume Jeanneret, Loïc Simon, Frédéric Jurie

Adaptive Length Image Tokenization via Recurrent Allocation

Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational…

Computer Vision and Pattern Recognition · Computer Science 2024-11-05 Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman

HIPA: Hierarchical Patch Transformer for Single Image Super Resolution

Transformer-based architectures start to emerge in single image super resolution (SISR) and have achieved promising performance. Most existing Vision Transformers divide images into the same number of patches with a fixed size, which may…

Computer Vision and Pattern Recognition · Computer Science 2023-06-21 Qing Cai, Yiming Qian, Jinxing Li, Jun Lv, Yee-Hong Yang, Feng Wu, David Zhang

Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Jaihyun Lew, Soohyuk Jang, Jaehoon Lee, Seungryong Yoo, Eunji Kim, Saehyung Lee, Jisoo Mok, Siwon Kim, Sungroh Yoon

Vision Transformer with Progressive Sampling

Transformers with powerful global relation modeling abilities have been introduced to fundamental computer vision tasks recently. As a typical example, the Vision Transformer (ViT) directly applies a pure transformer architecture on image…

Computer Vision and Pattern Recognition · Computer Science 2021-08-05 Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip Torr, Wayne Zhang, Dahua Lin

Discrete Representations Strengthen Vision Transformer Robustness

Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on…

Computer Vision and Pattern Recognition · Computer Science 2022-04-05 Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa