Related papers: Efficient Large-Scale Visual Representation Learni…

e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce

Understanding vision and language representations of product content is vital for search and recommendation applications in e-commerce. As a backbone for online shopping platforms and inspired by the recent success in representation…

Machine Learning · Computer Science 2022-08-23 Wonyoung Shin, Jonghun Park, Taekang Woo, Yongwoo Cho, Kwangjin Oh, Hwanjun Song

A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation

Vision-transformers (ViTs) and large-scale convolution-neural-networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Alon Kaya, Igal Bilik, Inna Stainvas

Deep Learning based Large Scale Visual Recommendation and Search for E-Commerce

In this paper, we present a unified end-to-end approach to build a large scale Visual Search and Recommendation system for e-commerce. Previous works have targeted these problems in isolation. We believe a more effective and elegant…

Computer Vision and Pattern Recognition · Computer Science 2017-03-08 Devashish Shankar, Sujay Narumanchi, H A Ananya, Pramod Kompalli, Krishnendu Chaudhury

Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations

Large-scale pretraining of visual representations has led to state-of-the-art performance on a range of benchmark computer vision tasks, yet the benefits of these techniques at extreme scale in complex production systems has been relatively…

Computer Vision and Pattern Recognition · Computer Science 2021-08-13 Josh Beal, Hao-Yu Wu, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk

Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition

In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good tradeoffs between speed and accuracy. This is achieved by using a multi-branch network, which has…

Computer Vision and Pattern Recognition · Computer Science 2019-08-01 Chun-Fu Chen, Quanfu Fan, Neil Mallinar, Tom Sercu, Rogerio Feris

Fusing Deep Convolutional Networks for Large Scale Visual Concept Classification

Deep learning architectures are showing great promise in various computer vision domains including image classification, object detection, event detection and action recognition. In this study, we investigate various aspects of…

Computer Vision and Pattern Recognition · Computer Science 2016-08-08 Hilal Ergun, Mustafa Sert

Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels

Vision Transformers (ViT) have recently demonstrated the significant potential of transformer architectures for computer vision. To what extent can image-based deep reinforcement learning also benefit from ViT architectures, as compared to…

Machine Learning · Computer Science 2022-05-17 Tianxin Tao, Daniele Reda, Michiel van de Panne

Do Vision Transformers See Like Convolutional Neural Networks?

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This…

Computer Vision and Pattern Recognition · Computer Science 2022-03-07 Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy

Efficient Self-supervised Vision Transformers for Representation Learning

This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with…

Computer Vision and Pattern Recognition · Computer Science 2022-07-08 Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao

Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training

Learning efficient and expressive visual representation has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision…

Computer Vision and Pattern Recognition · Computer Science 2026-03-23 Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang

Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition

We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, for periocular recognition. These architectures have demonstrated significant success in various computer vision tasks…

Computer Vision and Pattern Recognition · Computer Science 2024-10-08 Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Prayag Tiwari, Josef Bigun

Searching for Efficient Multi-Stage Vision Transformers

Vision Transformer (ViT) demonstrates that Transformer for natural language processing can be applied to computer vision tasks and result in comparable performance to convolutional neural networks (CNN), which have been studied and adopted…

Computer Vision and Pattern Recognition · Computer Science 2021-09-03 Yi-Lun Liao, Sertac Karaman, Vivienne Sze

ConvNets vs. Transformers: Whose Visual Representations are More Transferable?

Vision transformers have attracted much attention from computer vision researchers as they are not restricted to the spatial inductive bias of ConvNets. However, although Transformer-based backbones have achieved much progress on ImageNet…

Computer Vision and Pattern Recognition · Computer Science 2021-08-18 Hong-Yu Zhou, Chixiang Lu, Sibei Yang, Yizhou Yu

Adapting Vision-Language Models for E-commerce Understanding at Scale

E-commerce product understanding demands by nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is…

Computer Vision and Pattern Recognition · Computer Science 2026-02-13 Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi

Image Recognition with Online Lightweight Vision Transformer: A Survey

The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu, Shibiao Xu

Efficient Training of Visual Transformers with Small Datasets

Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger…

Computer Vision and Pattern Recognition · Computer Science 2021-11-16 Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, Marco De Nadai

Transformed Multi-view 3D Shape Features with Contrastive Learning

This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods…

Computer Vision and Pattern Recognition · Computer Science 2025-10-24 Márcus Vinícius Lobo Costa, Sherlon Almeida da Silva, Bárbara Caroline Benato, Leo Sampaio Ferraz Ribeiro, Moacir Antonelli Ponti

The Surprising Effectiveness of Representation Learning for Visual Imitation

While visual imitation learning offers one of the most effective ways of learning from visual demonstrations, generalizing from them requires either hundreds of diverse demonstrations, task specific priors, or large, hard-to-train…

Robotics · Computer Science 2021-12-07 Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pandian Arunachalam, Lerrel Pinto

Remote Sensing Image Classification Using Deep Ensemble Learning

Remote sensing imagery plays a crucial role in many applications and requires accurate computerized classification techniques. Reliable classification is essential for transforming raw imagery into structured and usable information. While…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Niful Islam, Md. Rayhan Ahmed, Nur Mohammad Fahad, Salekul Islam, A. K. M. Muzahidul Islam, Saddam Mukta, Swakkhar Shatabda

An attention-driven hierarchical multi-scale representation for visual recognition

Convolutional Neural Networks (CNNs) have revolutionized the understanding of visual content. This is mainly due to their ability to break down an image into smaller pieces, extract multi-scale localized features and compose them to…

Computer Vision and Pattern Recognition · Computer Science 2021-10-26 Zachary Wharton, Ardhendu Behera, Asish Bera