Related papers: Holistically Explainable Vision Transformers

B-Cos Aligned Transformers Learn Human-Interpretable Features

Vision Transformers (ViTs) and Swin Transformers (Swin) are currently state-of-the-art in computational pathology. However, domain experts are still reluctant to use these models due to their lack of interpretability. This is not…

Computer Vision and Pattern Recognition · Computer Science 2024-01-19 Manuel Tran, Amal Lahiani, Yashin Dicente Cid, Melanie Boxberg, Peter Lienemann, Christian Matek, Sophia J. Wagner, Fabian J. Theis, Eldad Klaiman, Tingying Peng

B-cos Alignment for Inherently Interpretable CNNs and Vision Transformers

We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training. For this, we propose to replace the linear transformations in DNNs by our novel B-cos…

Computer Vision and Pattern Recognition · Computer Science 2024-01-17 Moritz Böhle, Navdeeppal Singh, Mario Fritz, Bernt Schiele

Making Vision Transformers Truly Shift-Equivariant

For computer vision, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs' output remains sensitive to small spatial shifts in the input, i.e.,…

Computer Vision and Pattern Recognition · Computer Science 2023-11-30 Renan A. Rojas-Gomez, Teck-Yian Lim, Minh N. Do, Raymond A. Yeh

You Only Need Less Attention at Each Stage in Vision Transformers

The advent of Vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computer vision. ViTs capture the global information of images through self-attention modules, which perform dot product computations among…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Shuoxi Zhang, Hanpeng Liu, Stephen Lin, Kun He

B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable

B-cos Networks have been shown to be effective for obtaining highly human interpretable explanations of model decisions by architecturally enforcing stronger alignment between inputs and weight. B-cos variants of convolutional networks…

Computer Vision and Pattern Recognition · Computer Science 2025-01-28 Shreyash Arya, Sukrut Rao, Moritz Böhle, Bernt Schiele

Interpretability-Aware Vision Transformer

Vision Transformers (ViTs) have become prominent models for solving various vision tasks. However, the interpretability of ViTs has not kept pace with their promising performance. While there has been a surge of interest in developing…

Computer Vision and Pattern Recognition · Computer Science 2025-05-02 Yao Qiang, Chengyin Li, Prashant Khanduri, Dongxiao Zhu

B-cos Networks: Alignment is All We Need for Interpretability

We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training. For this, we propose to replace the linear transforms in DNNs by our B-cos transform. As we…

Computer Vision and Pattern Recognition · Computer Science 2022-05-23 Moritz Böhle, Mario Fritz, Bernt Schiele

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng

ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain

Transformer design is the de facto standard for natural language processing tasks. The success of the transformer design in natural language processing has lately piqued the interest of researchers in the domain of computer vision. When…

Computer Vision and Pattern Recognition · Computer Science 2024-02-29 Md Sohag Mia, Abu Bakor Hayat Arnob, Abdu Naim, Abdullah Al Bary Voban, Md Shariful Islam

Interpreting vision transformers via residual replacement model

How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers, extracted via sparse autoencoders, and by…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Jinyeong Kim, Junhyeok Kim, Yumin Shim, Joohyeok Kim, Sunyoung Jung, Seong Jae Hwang

Explainability of Vision Transformers: A Comprehensive Review and New Perspectives

Transformers have had a significant impact on natural language processing and have recently demonstrated their potential in computer vision. They have shown promising results over convolution neural networks in fundamental computer vision…

Computer Vision and Pattern Recognition · Computer Science 2023-11-14 Rojina Kashefi, Leili Barekatain, Mohammad Sabokrou, Fatemeh Aghaeipoor

Towards Evaluating Explanations of Vision Transformers for Medical Imaging

As deep learning models increasingly find applications in critical domains such as medical imaging, the need for transparent and trustworthy decision-making becomes paramount. Many explainability methods provide insights into how these…

Computer Vision and Pattern Recognition · Computer Science 2023-11-09 Piotr Komorowski, Hubert Baniecki, Przemysław Biecek

Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects

Though vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings, they exhibit surprising failures when performing tasks involving visual relations. This begs the question: how do ViTs attempt to perform…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Michael A. Lepori, Alexa R. Tartaglini, Wai Keen Vong, Thomas Serre, Brenden M. Lake, Ellie Pavlick

ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline

Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are…

Computer Vision and Pattern Recognition · Computer Science 2026-04-03 Juan Manuel Hernandez, Mariana Fernandez-Espinosa, Denis Parra, Diego Gomez-Zara

Less is More: Pay Less Attention in Vision Transformers

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works…

Computer Vision and Pattern Recognition · Computer Science 2021-12-24 Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai

Toward Transformer-Based Object Detection

Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first…

Computer Vision and Pattern Recognition · Computer Science 2020-12-21 Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

Transformers are increasingly dominating multi-modal reasoning tasks, such as visual question answering, achieving state-of-the-art results thanks to their ability to contextualize information using the self-attention and co-attention…

Computer Vision and Pattern Recognition · Computer Science 2021-03-30 Hila Chefer, Shir Gur, Lior Wolf

ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers

As Vision Transformers (ViTs) are increasingly adopted in sensitive vision applications, there is a growing demand for improved interpretability. This has led to efforts to forward-align these models with carefully annotated abstract,…

Computer Vision and Pattern Recognition · Computer Science 2025-02-05 Sanchit Sinha, Guangzhi Xiong, Aidong Zhang

A Comprehensive Survey of Transformers for Computer Vision

As a special type of transformer, Vision Transformers (ViTs) are used for various computer vision (CV) applications, such as image recognition. There are several potential problems with convolutional neural networks (CNNs) that can be solved…

Computer Vision and Pattern Recognition · Computer Science 2022-11-14 Sonain Jamil, Md. Jalil Piran, Oh-Jin Kwon

eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation

Recently vision transformer models have become prominent models for a range of vision tasks. These models, however, are usually opaque with weak feature interpretability. Moreover, there is no method currently built for an intrinsically…

Computer Vision and Pattern Recognition · Computer Science 2022-07-13 Lu Yu, Wei Xiang, Juan Fang, Yi-Ping Phoebe Chen, Lianhua Chi