Related papers: What Do Self-Supervised Vision Transformers Learn?

200 papers

Self-supervised learning has become a cornerstone in computer vision, primarily divided into reconstruction-based methods like masked autoencoders (MAE) and discriminative methods such as contrastive learning (CL). Recent empirical…

Machine Learning · Computer Science · 2025-02-06 · Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang

Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from…

Computer Vision and Pattern Recognition · Computer Science · 2022-05-30 · Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, Yue Cao

Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained through…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-07 · Matthew Walmer, Saksham Suri, Kamal Gupta, Abhinav Shrivastava

This paper represents a neat yet effective framework, named SemanticMIM, to integrate the advantages of masked image modeling (MIM) and contrastive learning (CL) for general visual representation. We conduct a thorough comparative analysis…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-18 · Yike Yuan, Huanzhang Dou, Fengjun Guo, Xi Li

Recently, both Contrastive Learning (CL) and Mask Image Modeling (MIM) demonstrate that self-supervision is powerful to learn good representations. However, naively combining them is far from success. In this paper, we start by making the…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-01 · Ziyu Jiang, Yinpeng Chen, Mengchen Liu, Dongdong Chen, Xiyang Dai, Lu Yuan, Zicheng Liu, Zhangyang Wang

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-07 · Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy

Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones. However, in 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-09 · Bin Ren, Guofeng Mei, Danda Pani Paudel, Weijie Wang, Yawei Li, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe

Recently, Masked Image Modeling (MIM) achieves great success in self-supervised visual recognition. However, as a reconstruction-based framework, it is still an open question to understand how MIM works, since MIM appears very different…

Computer Vision and Pattern Recognition · Computer Science · 2022-08-09 · Xiangwen Kong, Xiangyu Zhang

Since the development of self-supervised visual representation learning from contrastive learning to masked image modeling (MIM), there is no significant difference in essence, that is, how to design proper pretext tasks for vision…

Computer Vision and Pattern Recognition · Computer Science · 2023-01-31 · Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, Xiaohu Qie

Representation learning has been evolving from traditional supervised training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous works have demonstrated their pros and cons in specific scenarios, i.e., CL and supervised…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-29 · Bowen Shi, Xiaopeng Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian

Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional…

Computer Vision and Pattern Recognition · Computer Science · 2022-12-14 · Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein

Recent masked image modeling (MIM) has received much attention in self-supervised learning (SSL), which requires the target model to recover the masked part of the input image. Although MIM-based pre-training methods achieve new…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-21 · Qiang Zhou, Chaohui Yu, Hao Luo, Zhibin Wang, Hao Li

Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their…

Machine Learning · Computer Science · 2023-04-27 · Shashank Shekhar, Florian Bordes, Pascal Vincent, Ari Morcos

This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-24 · Márcus Vinícius Lobo Costa, Sherlon Almeida da Silva, Bárbara Caroline Benato, Leo Sampaio Ferraz Ribeiro, Moacir Antonelli Ponti

This paper explores improvements to the masked image modeling (MIM) paradigm. The MIM paradigm enables the model to learn the main object features of the image by masking the input image and predicting the masked part by the unmasked part.…

Computer Vision and Pattern Recognition · Computer Science · 2022-05-24 · Jiawei Mao, Xuesong Yin, Yuanqi Chang, Honggu Zhou

Vision Transformer (ViT) has shown its advantages over the convolutional neural network (CNN) with its ability to capture global long-range dependencies for visual representation learning. Besides ViT, contrastive learning is another…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-12 · Hua-Bao Ling, Bowen Zhu, Dong Huang, Ding-Hua Chen, Chang-Dong Wang, Jian-Huang Lai

Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost as compared to convolutional neural networks (CNNs), one reason is that ViTs' attention measures global similarities and thus has a…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-26 · Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, Yingyan Celine Lin

Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text…

Computer Vision and Pattern Recognition · Computer Science · 2022-08-25 · Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, Baining Guo

Despite the success of a number of recent techniques for visual self-supervised deep learning, there has been limited investigation into the representations that are ultimately learned. By leveraging recent advances in the comparison of…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-06 · Tom George Grigg, Dan Busbridge, Jason Ramapuram, Russ Webb

Self-Supervised Learning (SSL) for Vision Transformers (ViTs) has recently demonstrated considerable potential as a pre-training strategy for a variety of computer vision tasks, including image classification and segmentation, both in…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-22 · Yannis Kaltampanidis, Alexandros Doumanoglou, Dimitrios Zarpalas