Related papers: Peripheral Vision Transformer
Traditionally, the vision community has devised algorithms to estimate the distance between an original image and images that have been subject to perturbations. Inspiration was usually taken from the human visual perceptual system and how…
Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this…
Large vision and language models learned directly through image-text associations often lack detailed visual substantiation, whereas image segmentation tasks are treated separately from recognition, supervisedly learned without…
Computer graphics seeks to deliver compelling images, generated within a computing budget, targeted at a specific display device, and ultimately viewed by an individual user. The foveated nature of human vision offers an opportunity to…
Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are…
We present a conceptually simple, flexible, and universal visual perception head for variant visual tasks, e.g., classification, object detection, instance segmentation and pose estimation, and different frameworks, such as one-stage or…
Understanding the mechanisms underlying deep neural networks remains a fundamental challenge in machine learning and computer vision. One promising, yet only preliminarily explored approach, is feature inversion, which attempts to…
Over the years, computer vision researchers have spent an immense amount of effort on designing image features for the visual object recognition task. We propose to incorporate this valuable experience to guide the task of training deep…
What are the roles of central and peripheral vision in human scene recognition? Larson and Loschky (2009) showed that peripheral vision contributes more than central vision in obtaining maximum scene recognition accuracy. However, central…
Human perception is structured around objects which form the basis for our higher-level cognition and impressive systematic generalization abilities. Yet most work on representation learning focuses on feature learning without even…
The widespread use of deep neural networks has achieved substantial success in many tasks. However, there still exists a huge gap between the operating mechanism of deep learning models and human-understandable decision making, so that…
Human visual perception carves a scene at its physical joints, decomposing the world into objects, which are selectively attended, tracked, and predicted as we engage our surroundings. Object representations emancipate perception from the…
Holistic methods using CNNs and margin-based losses have dominated research on face recognition. In this work, we depart from this setting in two ways: (a) we employ the Vision Transformer as an architecture for training a very strong…
Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to…
Transformers with powerful global relation modeling abilities have been introduced to fundamental computer vision tasks recently. As a typical example, the Vision Transformer (ViT) directly applies a pure transformer architecture on image…
This work presents a novel method of exploring human brain-visual representations, with a view towards replicating these processes in machines. The core idea is to learn plausible computational and biological representations by correlating…
Saliency Prediction aims to predict the attention distribution of human eyes given an RGB image. Most of the recent state-of-the-art methods are based on deep image feature representations from traditional CNNs. However, the traditional…
Psychovisual models suggest human vision decouples low-level feature extraction from higher cognition by first forming intermediate abstractions. In contrast, deep learning-based vision models routinely extract and aggregate features using…
Vision transformer (ViT) has recently shown its strong capability in achieving comparable results to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits the same architecture from the natural…
The human visual system is an intricate network of brain regions that enables us to recognize the world around us. Despite its abundant lateral and feedback connections, object processing is commonly viewed and studied as a feedforward…