Related papers: Disentangling Visual Transformers: Patch-level Int…

Interpretability-Aware Vision Transformer

Vision Transformers (ViTs) have become prominent models for solving various vision tasks. However, the interpretability of ViTs has not kept pace with their promising performance. While there has been a surge of interest in developing…

Computer Vision and Pattern Recognition · Computer Science 2025-05-02 Yao Qiang, Chengyin Li, Prashant Khanduri, Dongxiao Zhu

Nested-TNT: Hierarchical Vision Transformers with Multi-Scale Feature Processing

Transformer has been applied in the field of computer vision due to its excellent performance in natural language processing, surpassing traditional convolutional neural networks and achieving new state-of-the-art. ViT divides an image into…

Computer Vision and Pattern Recognition · Computer Science 2024-04-23 Yuang Liu, Zhiheng Qiu, Xiaokai Qin

A Disentangling Invertible Interpretation Network for Explaining Latent Representations

Neural networks have greatly boosted performance in computer vision by learning powerful representations of input data. The drawback of end-to-end training for maximal overall performance is black-box models whose hidden representations…

Computer Vision and Pattern Recognition · Computer Science 2020-04-29 Patrick Esser, Robin Rombach, Björn Ommer

Interpretable Vision Transformers in Image Classification via SVDA

Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos

Visualizing and Understanding Patch Interactions in Vision Transformer

Vision Transformer (ViT) has become a leading tool in various computer vision tasks, owing to its unique self-attention mechanism that learns visual representations explicitly through cross-patch information interactions. Despite having…

Computer Vision and Pattern Recognition · Computer Science 2022-03-14 Jie Ma, Yalong Bai, Bineng Zhong, Wei Zhang, Ting Yao, Tao Mei

Hierarchical Vision Transformer Enhanced by Graph Convolutional Network for Image Classification

Vision Transformer (ViT) has brought new breakthroughs to the field of image classification by introducing the self-attention mechanism, and Graph Convolutional Networks (GCN) have been proposed and successfully applied in data representation…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Haibin Jiao

Scalable Vision Transformers with Hierarchical Pooling

The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, the routine of the current ViT model is to maintain a…

Computer Vision and Pattern Recognition · Computer Science 2021-08-19 Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai

Less is More: Pay Less Attention in Vision Transformers

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works…

Computer Vision and Pattern Recognition · Computer Science 2021-12-24 Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai

Mechanistic Interpretability of Fine-Tuned Vision Transformers on Distorted Images: Decoding Attention Head Behavior for Transparent and Trustworthy AI

Mechanistic interpretability improves the safety, reliability, and robustness of large AI models. This study examined individual attention heads in vision transformers (ViTs) fine tuned on distorted 2D spectrogram images containing non…

Machine Learning · Computer Science 2025-03-25 Nooshin Bahador

Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding

Hierarchical structures are popular in recent vision transformers; however, they require sophisticated designs and massive datasets to work well. In this paper, we explore the idea of nesting basic local transformers on non-overlapping…

Computer Vision and Pattern Recognition · Computer Science 2022-01-03 Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Sercan O. Arik, Tomas Pfister

Hierarchical Vision Transformer with Prototypes for Interpretable Medical Image Classification

Explainability is a highly demanded requirement for applications in high-risk areas such as medicine. Vision Transformers have mainly been limited to attention extraction to provide insight into the model's reasoning. Our approach combines…

Computer Vision and Pattern Recognition · Computer Science 2025-02-14 Luisa Gallée, Catharina Silvia Lisson, Meinrad Beer, Michael Götz

A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis

We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully connected layer to incorporate class information to make predictions, we investigate a…

Computer Vision and Pattern Recognition · Computer Science 2024-06-17 Dipanjyoti Paul, Arpita Chowdhury, Xinqi Xiong, Feng-Ju Chang, David Carlyn, Samuel Stevens, Kaiya L. Provost, Anuj Karpatne, Bryan Carstens, Daniel Rubenstein, Charles Stewart, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao

Vision Transformers: From Semantic Segmentation to Dense Prediction

The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representation at full receptive field per layer across all the image…

Computer Vision and Pattern Recognition · Computer Science 2024-08-05 Li Zhang, Jiachen Lu, Sixiao Zheng, Xinxuan Zhao, Xiatian Zhu, Yanwei Fu, Tao Xiang, Jianfeng Feng, Philip H. S. Torr

ITTR: Unpaired Image-to-Image Translation with Transformers

Unpaired image-to-image translation is to translate an image from a source domain to a target domain without paired training data. By utilizing CNN in extracting local semantics, various techniques have been developed to improve the…

Computer Vision and Pattern Recognition · Computer Science 2022-03-31 Wanfeng Zheng, Qiang Li, Guoxin Zhang, Pengfei Wan, Zhongyuan Wang

MHITNet: a minimize network with a hierarchical context-attentional filter for segmenting medical ct images

In the field of medical CT image processing, convolutional neural networks (CNNs) have been the dominant technique. Encoder-decoder CNNs utilise locality for efficiency, but they cannot simulate distant pixel interactions properly. Recent…

Image and Video Processing · Electrical Eng. & Systems 2022-11-03 Hongyang He, Feng Ziliang, Yuanhang Zheng, Shudong Huang, HaoBing Gao

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in…

Computer Vision and Pattern Recognition · Computer Science 2021-08-24 Chun-Fu Chen, Quanfu Fan, Rameswar Panda

A Transformer-in-Transformer Network Utilizing Knowledge Distillation for Image Recognition

This paper presents a novel knowledge distillation neural architecture leveraging efficient transformer networks for effective image classification. Natural images display intricate arrangements encompassing numerous extraneous elements.…

Computer Vision and Pattern Recognition · Computer Science 2025-02-25 Dewan Tauhid Rahman, Yeahia Sarker, Antar Mazumder, Md. Shamim Anower

Transformer Interpretability from Perspective of Attention and Gradient

Although researchers' attention is more focused on the performance of Transformer models, the interpretation of Transformer can never be ignored. Gradient is widely utilized in Transformer interpretation. From the perspective of attention…

Artificial Intelligence · Computer Science 2026-05-13 Yongjin Cui, Xiaohui Fan, Huajun Chen

Interpreting vision transformers via residual replacement model

How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers, extracted via sparse autoencoders, and by…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang

A Hybrid Vision Transformer Approach for Mathematical Expression Recognition

One of the crucial challenges taken in document analysis is mathematical expression recognition. Unlike text recognition which only focuses on one-dimensional structure images, mathematical expression recognition is a much more complicated…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Anh Duy Le, Van Linh Pham, Vinh Loi Ly, Nam Quan Nguyen, Huu Thang Nguyen, Tuan Anh Tran