Related papers: Exploring Vision Transformers as Diffusion Learner…

ConvNets vs. Transformers: Whose Visual Representations are More Transferable?

Vision transformers have attracted much attention from computer vision researchers as they are not restricted to the spatial inductive bias of ConvNets. However, although Transformer-based backbones have achieved much progress on ImageNet…

Computer Vision and Pattern Recognition · Computer Science · 2021-08-18 · Hong-Yu Zhou, Chixiang Lu, Sibei Yang, Yizhou Yu

Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

Recently, diffusion transformers have gained wide attention for their excellent performance in text-to-image and text-to-video models, emphasizing the need for transformers as a backbone for diffusion models. Transformer-based models have…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-16 · Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M. Patel

All are Worth Words: A ViT Backbone for Diffusion Models

Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-28 · Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, Jun Zhu

A Survey on Visual Transformer

Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-11 · Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao

ViT-5: Vision Transformers for The Mid-2020s

This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-10 · Feng Wang, Sucheng Ren, Tiezheng Zhang, Predrag Neskovic, Anand Bhattad, Cihang Xie, Alan Yuille

Toward Transformer-Based Object Detection

Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first…

Computer Vision and Pattern Recognition · Computer Science · 2020-12-21 · Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk

Interpretability-Aware Vision Transformer

Vision Transformers (ViTs) have become prominent models for solving various vision tasks. However, the interpretability of ViTs has not kept pace with their promising performance. While there has been a surge of interest in developing…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-02 · Yao Qiang, Chengyin Li, Prashant Khanduri, Dongxiao Zhu

Vision Transformers are Robust Learners

Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision achieving state-of-the-art…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-07 · Sayak Paul, Pin-Yu Chen

Do text-free diffusion models learn discriminative visual representations?

While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which addresses both families of tasks simultaneously. We…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-25 · Soumik Mukhopadhyay, Matthew Gwilliam, Yosuke Yamaguchi, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Tianyi Zhou, Jun Ohya, Abhinav Shrivastava

Vision Transformer for Contrastive Clustering

Vision Transformer (ViT) has shown its advantages over the convolutional neural network (CNN) with its ability to capture global long-range dependencies for visual representation learning. Besides ViT, contrastive learning is another…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-12 · Hua-Bao Ling, Bowen Zhu, Dong Huang, Ding-Hua Chen, Chang-Dong Wang, Jian-Huang Lai

Learning Data Representations with Joint Diffusion Models

Joint machine learning models that allow synthesizing and classifying data often offer uneven performance between those tasks or are unstable to train. In this work, we depart from a set of empirical observations that indicate the…

Machine Learning · Computer Science · 2023-04-06 · Kamil Deja, Tomasz Trzcinski, Jakub M. Tomczak

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations…

Machine Learning · Computer Science · 2024-11-15 · Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen

Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations

Large-scale pretraining of visual representations has led to state-of-the-art performance on a range of benchmark computer vision tasks, yet the benefits of these techniques at extreme scale in complex production systems have been relatively…

Computer Vision and Pattern Recognition · Computer Science · 2021-08-13 · Josh Beal, Hao-Yu Wu, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk

Vision Transformer with Deformable Attention

Transformers have recently shown superior performances on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher representation power over their CNN counterparts. Nevertheless, simply…

Computer Vision and Pattern Recognition · Computer Science · 2022-05-25 · Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang

Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels

Vision Transformers (ViT) have recently demonstrated the significant potential of transformer architectures for computer vision. To what extent can image-based deep reinforcement learning also benefit from ViT architectures, as compared to…

Machine Learning · Computer Science · 2022-05-17 · Tianxin Tao, Daniele Reda, Michiel van de Panne

Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding

Transformers have become a common foundation across deep learning, yet 3D scene understanding still relies on specialized backbones with strong domain priors. This keeps the field isolated from the broader Transformer ecosystem, limiting…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-22 · Kadir Yilmaz, Adrian Kruse, Tristan Höfer, Daan de Geus, Bastian Leibe

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-22 · Carmelo Scribano, Mohammad Mahdi, Nedyalko Prisadnikov, Yuqian Fu, Giorgia Franchini, Danda Pani Paudel, Marko Bertogna, Luc Van Gool

Vision Transformers for Dense Prediction

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into…

Computer Vision and Pattern Recognition · Computer Science · 2021-03-26 · René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

MonoFormer: One Transformer for Both Diffusion and Autoregression

Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data to use autoregression for…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-25 · Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang

GenTron: Diffusion Transformers for Image and Video Generation

In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-04 · Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua