Related papers: Patch-level Representation Learning for Self-super…

Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks showing impressive performances when trained on big datasets. However, on relatively small datasets, ViTs are less accurate given their lack of…

Computer Vision and Pattern Recognition · Computer Science 2022-10-14 Guglielmo Camporese, Elena Izzo, Lamberto Ballan

Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime

Self-supervision has shown outstanding results for natural language processing, and more recently, for image recognition. Simultaneously, vision transformers and their variants have emerged as a promising and scalable alternative to…

Computer Vision and Pattern Recognition · Computer Science 2022-02-01 Prarthana Bhattacharyya, Chenge Li, Xiaonan Zhao, István Fehérvári, Jason Sun

A Survey of the Self Supervised Learning Mechanisms for Vision Transformers

Advances in deep learning are re-defining how visual data is processed and understood by machines. Vision Transformers (ViTs) have recently demonstrated prominent performance in computer vision related tasks. However, their performance…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Asifullah Khan, Anabia Sohail, Mustansar Fiaz, Mehdi Hassan, Tariq Habib Afridi, Sibghat Ullah Marwat, Farzeen Munir, Safdar Ali, Hannan Naseem, Muhammad Zaigham Zaheer, Kamran Ali, Tangina Sultana, Ziaurrehman Tanoli, Naeem Akhter

Self-supervised pretraining for an iterative image size agnostic vision transformer

Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational…

Computer Vision and Pattern Recognition · Computer Science 2026-04-23 Nedyalko Prisadnikov, Danda Pani Paudel, Yuqian Fu, Luc Van Gool

Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks

Self-Supervised Learning (SSL) for Vision Transformers (ViTs) has recently demonstrated considerable potential as a pre-training strategy for a variety of computer vision tasks, including image classification and segmentation, both in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Yannis Kaltampanidis, Alexandros Doumanoglou, Dimitrios Zarpalas

Self-supervised Vision Transformers for Joint SAR-optical Representation Learning

Self-supervised learning (SSL) has attracted much interest in remote sensing and earth observation due to its ability to learn task-agnostic representations without human annotation. While most of the existing SSL works in remote sensing…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Yi Wang, Conrad M Albrecht, Xiao Xiang Zhu

Analyzing Local Representations of Self-supervised Vision Transformers

In this paper, we present a comparative analysis of various self-supervised Vision Transformers (ViTs), focusing on their local representative power. Inspired by large language models, we examine the abilities of ViTs to perform various…

Computer Vision and Pattern Recognition · Computer Science 2024-03-22 Ani Vanyan, Alvard Barseghyan, Hakob Tamazyan, Vahan Huroyan, Hrant Khachatrian, Martin Danelljan

Self-supervised structured object representation learning

Self-supervised learning (SSL) has emerged as a powerful technique for learning visual representations. While recent SSL approaches achieve strong results in global image understanding, they are limited in capturing the structured…

Computer Vision and Pattern Recognition · Computer Science 2025-08-28 Oussama Hadjerci, Antoine Letienne, Mohamed Abbas Hedjazi, Adel Hafiane

Emerging Properties in Self-Supervised Vision Transformers

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this…

Computer Vision and Pattern Recognition · Computer Science 2021-05-25 Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency

Visual domain adaptation (DA) seeks to transfer trained models to unseen, unlabeled domains across distribution shift, but approaches typically focus on adapting convolutional neural network architectures initialized with supervised…

Computer Vision and Pattern Recognition · Computer Science 2022-06-17 Viraj Prabhu, Sriram Yenamandra, Aaditya Singh, Judy Hoffman

Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

We address the task of weakly-supervised few-shot image classification and segmentation, by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT…

Computer Vision and Pattern Recognition · Computer Science 2023-07-10 Dahyun Kang, Piotr Koniusz, Minsu Cho, Naila Murray

Patch-Wise Self-Supervised Visual Representation Learning: A Fine-Grained Approach

Self-supervised visual representation learning traditionally focuses on image-level instance discrimination. Our study introduces an innovative, fine-grained dimension by integrating patch-level discrimination into these methodologies. This…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Ali Javidani, Mohammad Amin Sadeghi, Babak Nadjar Araabi

Self-Promoted Supervision for Few-Shot Transformer

The few-shot learning ability of vision transformers (ViTs) is rarely investigated though heavily desired. In this work, we empirically find that with the same few-shot learning frameworks, e.g., Meta-Baseline, replacing the widely used CNN…

Computer Vision and Pattern Recognition · Computer Science 2022-06-10 Bowen Dong, Pan Zhou, Shuicheng Yan, Wangmeng Zuo

Vision Transformers: From Semantic Segmentation to Dense Prediction

The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representation at full receptive field per layer across all the image…

Computer Vision and Pattern Recognition · Computer Science 2024-08-05 Li Zhang, Jiachen Lu, Sixiao Zheng, Xinxuan Zhao, Xiatian Zhu, Yanwei Fu, Tao Xiang, Jianfeng Feng, Philip H. S. Torr

Representation Separation for Semantic Segmentation with Vision Transformers

Vision transformers (ViTs) encoding an image as a sequence of patches bring new paradigms for semantic segmentation. We present an efficient framework of representation separation at the local-patch level and global-region level for semantic…

Computer Vision and Pattern Recognition · Computer Science 2024-10-28 Yuanduo Hong, Huihui Pan, Weichao Sun, Xinghu Yu, Huijun Gao

Unsupervised Semantic Segmentation Facilitates Model Understanding

Self-supervised learning (SSL) has produced a diverse landscape of vision transformers (ViTs) whose pretrained representations support a wide range of downstream tasks. Towards a better understanding of these models, a body of work has…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Xiaoyan Yu, Lisa Mais, Jannik Franzen, Peter Hirsch, Nick Lechtenbörger, Andreas Mardt, Dagmar Kainmüller

Vision Transformers with Natural Language Semantics

Tokens or patches within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP). Typically, ViT tokens are associated with rectangular image patches that lack specific…

Computer Vision and Pattern Recognition · Computer Science 2024-02-29 Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro

Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision Transformers

Self-supervised learning (SSL) with vision transformers (ViTs) has proven effective for representation learning as demonstrated by the impressive performance on various downstream tasks. Despite these successes, existing ViT-based SSL…

Computer Vision and Pattern Recognition · Computer Science 2024-06-21 Chaitanya Devaguptapu, Sumukh Aithal, Shrinivas Ramasubramanian, Moyuru Yamada, Manohar Kaul

Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis

This paper investigates the effectiveness of self-supervised pre-trained vision transformers (ViTs) compared to supervised pre-trained ViTs and conventional neural networks (ConvNets) for detecting facial deepfake images and videos. It…

Computer Vision and Pattern Recognition · Computer Science 2024-08-12 Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

Semantic Concentration for Self-Supervised Dense Representations Learning

Recent advances in image-level self-supervised learning (SSL) have made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon that patches from…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Peisong Wen, Qianqian Xu, Siran Dai, Runmin Cong, Qingming Huang