Related papers: Finding Distributed Object-Centric Properties in S…

Emergence of Human-Like Attention in Self-Supervised Vision Transformers: an eye-tracking study

Many models of visual attention have been proposed so far. Traditional bottom-up models, like saliency models, fail to replicate human gaze patterns, and deep gaze prediction models lack biological plausibility due to their reliance on…

Neurons and Cognition · Quantitative Biology 2025-05-28 Takuto Yamamoto, Hirosato Akahoshi, Shigeru Kitazawa

Oh-A-DINO: Understanding and Enhancing Attribute-Level Information in Self-Supervised Object-Centric Representations

Object-centric understanding is fundamental to human vision and required for complex reasoning. Traditional methods define slot-based bottlenecks to learn object properties explicitly, while recent self-supervised vision models like DINO…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Stefan Sylvius Wagner, Stefan Harmeling

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Shuangrui Ding, Rui Qian, Haohang Xu, Dahua Lin, Hongkai Xiong

Human-like Object Grouping in Self-supervised Vision Transformers

Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Hossein Adeli, Seoyoung Ahn, Andrew Luo, Mengmi Zhang, Nikolaus Kriegeskorte, Gregory Zelinsky

Unified Local and Global Attention Interaction Modeling for Vision Transformers

We present a novel method that extends the self-attention mechanism of a vision transformer (ViT) for more accurate object detection across diverse datasets. ViTs show strong capability for image understanding tasks such as object…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Tan Nguyen, Coy D. Heldermon, Corey Toler-Franklin

Masked Multi-Query Slot Attention for Unsupervised Object Discovery

Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods…

Computer Vision and Pattern Recognition · Computer Science 2024-11-07 Rishav Pramanik, José-Fabian Villa-Vásquez, Marco Pedersoli

Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut

Transformers trained with self-supervised learning using self-distillation loss (DINO) have been shown to produce attention maps that highlight salient foreground objects. In this paper, we demonstrate a graph-based approach that uses the…

Computer Vision and Pattern Recognition · Computer Science 2022-03-25 Yangtao Wang, Xi Shen, Shell Hu, Yuan Yuan, James Crowley, Dominique Vaufreydaz

Upsampling DINOv2 features for unsupervised vision tasks and weakly supervised materials segmentation

The features of self-supervised vision transformers (ViTs) contain strong semantic and positional information relevant to downstream tasks like object localization and segmentation. Recent works combine these features with traditional…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Ronan Docherty, Antonis Vamvakeros, Samuel J. Cooper

Analyzing Local Representations of Self-supervised Vision Transformers

In this paper, we present a comparative analysis of various self-supervised Vision Transformers (ViTs), focusing on their local representative power. Inspired by large language models, we examine the abilities of ViTs to perform various…

Computer Vision and Pattern Recognition · Computer Science 2024-03-22 Ani Vanyan, Alvard Barseghyan, Hakob Tamazyan, Vahan Huroyan, Hrant Khachatrian, Martin Danelljan

Emerging Properties in Self-Supervised Vision Transformers

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this…

Computer Vision and Pattern Recognition · Computer Science 2021-05-25 Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng

Patch-level Representation Learning for Self-supervised Vision Transformers

Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advantages of the…

Computer Vision and Pattern Recognition · Computer Science 2022-07-20 Sukmin Yun, Hankook Lee, Jaehyung Kim, Jinwoo Shin

Deep ViT Features as Dense Visual Descriptors

We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT),…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Shir Amir, Yossi Gandelsman, Shai Bagon, Tali Dekel

Learning Object Focused Attention

We propose an adaptation to the training of Vision Transformers (ViTs) that allows for an explicit modeling of objects during the attention computation. This is achieved by adding a new branch to selected attention layers that computes an…

Computer Vision and Pattern Recognition · Computer Science 2025-04-14 Vivek Trivedy, Amani Almalki, Longin Jan Latecki

DADO: A Depth-Attention framework for Object Discovery

Unsupervised object discovery, the task of identifying and localizing objects in images without human-annotated labels, remains a significant challenge and a growing focus in computer vision. In this work, we introduce a novel model, DADO…

Computer Vision and Pattern Recognition · Computer Science 2025-10-09 Federico Gonzalez, Estefania Talavera, Petia Radeva

Cross-DINO: Cross the Deep MLP and Transformer for Small Object Detection

Small Object Detection (SOD) poses significant challenges due to limited information and the model's low class prediction score. While Transformer-based detectors have shown promising performance, their potential for SOD remains largely…

Computer Vision and Pattern Recognition · Computer Science 2025-05-29 Guiping Cao, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang, Yaowei Wang

Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores…

Computer Vision and Pattern Recognition · Computer Science 2026-01-22 Yihao Li, Saeed Salehi, Lyle Ungar, Konrad P. Kording

Refiner: Refining Self-attention for Vision Transformers

Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more…

Computer Vision and Pattern Recognition · Computer Science 2021-06-08 Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, Jiashi Feng

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse…

Computer Vision and Pattern Recognition · Computer Science 2025-09-17 Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara

ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections

The Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers, resulting in the vanishing of low-level visual features. However, such features can be helpful to accurately represent and identify…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Anxhelo Diko, Danilo Avola, Marco Cascio, Luigi Cinque