Related papers: What Do Self-Supervised Vision Transformers Learn?

A Theoretical Analysis of Self-Supervised Learning for Vision Transformers

Self-supervised learning has become a cornerstone in computer vision, primarily divided into reconstruction-based methods like masked autoencoders (MAE) and discriminative methods such as contrastive learning (CL). Recent empirical…

Machine Learning · Computer Science · 2025-02-06 · Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang

Revealing the Dark Secrets of Masked Image Modeling

Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from…

Computer Vision and Pattern Recognition · Computer Science · 2022-05-30 · Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, Yue Cao

Teaching Matters: Investigating the Role of Supervision in Vision Transformers

Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained through…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-07 · Matthew Walmer, Saksham Suri, Kamal Gupta, Abhinav Shrivastava

SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation

This paper represents a neat yet effective framework, named SemanticMIM, to integrate the advantages of masked image modeling (MIM) and contrastive learning (CL) for general visual representation. We conduct a thorough comparative analysis…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-18 · Yike Yuan, Huanzhang Dou, Fengjun Guo, Xi Li

Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations

Recently, both Contrastive Learning (CL) and Mask Image Modeling (MIM) demonstrate that self-supervision is powerful to learn good representations. However, naively combining them is far from success. In this paper, we start by making the…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-01 · Ziyu Jiang, Yinpeng Chen, Mengchen Liu, Dongdong Chen, Xiyang Dai, Lu Yuan, Zicheng Liu, Zhangyang Wang

Do Vision Transformers See Like Convolutional Neural Networks?

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-07 · Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy

Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning

Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones. However, in 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-09 · Bin Ren, Guofeng Mei, Danda Pani Paudel, Weijie Wang, Yawei Li, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe

Understanding Masked Image Modeling via Learning Occlusion Invariant Feature

Recently, Masked Image Modeling (MIM) achieves great success in self-supervised visual recognition. However, as a reconstruction-based framework, it is still an open question to understand how MIM works, since MIM appears very different…

Computer Vision and Pattern Recognition · Computer Science · 2022-08-09 · Xiangwen Kong, Xiangyu Zhang

Masked Image Modeling with Denoising Contrast

Since the development of self-supervised visual representation learning from contrastive learning to masked image modeling (MIM), there is no significant difference in essence, that is, how to design proper pretext tasks for vision…

Computer Vision and Pattern Recognition · Computer Science · 2023-01-31 · Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, Xiaohu Qie

Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners

Representation learning has been evolving from traditional supervised training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous works have demonstrated their pros and cons in specific scenarios, i.e., CL and supervised…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-29 · Bowen Shi, Xiaopeng Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian

What do Vision Transformers Learn? A Visual Exploration

Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional…

Computer Vision and Pattern Recognition · Computer Science · 2022-12-14 · Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein

MimCo: Masked Image Modeling Pre-training with Contrastive Teacher

Recent masked image modeling (MIM) has received much attention in self-supervised learning (SSL), which requires the target model to recover the masked part of the input image. Although MIM-based pre-training methods achieve new…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-21 · Qiang Zhou, Chaohui Yu, Hao Luo, Zhibin Wang, Hao Li

Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations

Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their…

Machine Learning · Computer Science · 2023-04-27 · Shashank Shekhar, Florian Bordes, Pascal Vincent, Ari Morcos

Transformed Multi-view 3D Shape Features with Contrastive Learning

This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-24 · Márcus Vinícius Lobo Costa, Sherlon Almeida da Silva, Bárbara Caroline Benato, Leo Sampaio Ferraz Ribeiro, Moacir Antonelli Ponti

Improvements to Self-Supervised Representation Learning for Masked Image Modeling

This paper explores improvements to the masked image modeling (MIM) paradigm. The MIM paradigm enables the model to learn the main object features of the image by masking the input image and predicting the masked part by the unmasked part.…

Computer Vision and Pattern Recognition · Computer Science · 2022-05-24 · Jiawei Mao, Xuesong Yin, Yuanqi Chang, Honggu Zhou

Vision Transformer for Contrastive Clustering

Vision Transformer (ViT) has shown its advantages over the convolutional neural network (CNN) with its ability to capture global long-range dependencies for visual representation learning. Besides ViT, contrastive learning is another…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-12 · Hua-Bao Ling, Bowen Zhu, Dong Huang, Ding-Hua Chen, Chang-Dong Wang, Jian-Huang Lai

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost as compared to convolutional neural networks (CNNs), one reason is that ViTs' attention measures global similarities and thus has a…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-26 · Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, Yingyan Celine Lin

Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation

Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text…

Computer Vision and Pattern Recognition · Computer Science · 2022-08-25 · Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, Baining Guo

Do Self-Supervised and Supervised Methods Learn Similar Visual Representations?

Despite the success of a number of recent techniques for visual self-supervised deep learning, there has been limited investigation into the representations that are ultimately learned. By leveraging recent advances in the comparison of…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-06 · Tom George Grigg, Dan Busbridge, Jason Ramapuram, Russ Webb

Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks

Self-Supervised Learning (SSL) for Vision Transformers (ViTs) has recently demonstrated considerable potential as a pre-training strategy for a variety of computer vision tasks, including image classification and segmentation, both in…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-22 · Yannis Kaltampanidis, Alexandros Doumanoglou, Dimitrios Zarpalas