Related papers: RepNeXt: A Fast Multi-Scale CNN using Structural R…

RepViT: Revisiting Mobile CNN From ViT Perspective

Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency, compared with lightweight Convolutional Neural Networks (CNNs), on resource-constrained mobile devices. Researchers have discovered many…

Computer Vision and Pattern Recognition · Computer Science 2024-03-15 Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding

FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

The recent amalgamation of transformer and convolutional designs has led to steady improvements in accuracy and efficiency of the models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the…

Computer Vision and Pattern Recognition · Computer Science 2023-08-21 Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, Anurag Ranjan

A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation

Vision-transformers (ViTs) and large-scale convolution-neural-networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Alon Kaya, Igal Bilik, Inna Stainvas

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are…

Computer Vision and Pattern Recognition · Computer Science 2022-03-07 Sachin Mehta, Mohammad Rastegari

RapidNet: Multi-Level Dilated Convolution Based Mobile Backbone

Vision transformers (ViTs) have dominated computer vision in recent years. However, ViTs are computationally expensive and not well suited for mobile devices; this led to the prevalence of convolutional neural network (CNN) and ViT-based…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Mustafa Munir, Md Mostafijur Rahman, Radu Marculescu

Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment

Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Janek Haberer, Jon Eike Wilhelm, Olaf Landsiedel

FMViT: A multiple-frequency mixing Vision Transformer

The transformer model has gained widespread adoption in computer vision tasks in recent times. However, due to the quadratic time and memory complexity of self-attention, which is proportional to the number of input tokens, most existing…

Computer Vision and Pattern Recognition · Computer Science 2023-11-13 Wei Tan, Yifeng Geng, Xuansong Xie

LiteNeXt: A Novel Lightweight ConvMixer-based Model with Self-embedding Representation Parallel for Medical Image Segmentation

The emergence of deep learning techniques has advanced the image segmentation task, especially for medical images. Many neural network models have been introduced in the last decade bringing the automated segmentation accuracy close to…

Image and Video Processing · Electrical Eng. & Systems 2025-03-11 Ngoc-Du Tran, Thi-Thao Tran, Quang-Huy Nguyen, Manh-Hung Vu, Van-Truong Pham

Recurrent Vision Transformer for Solving Visual Reasoning Problems

Although convolutional neural networks (CNNs) showed remarkable results in many vision tasks, they are still strained by simple yet challenging visual reasoning problems. Inspired by the recent success of the Transformer network in computer…

Computer Vision and Pattern Recognition · Computer Science 2021-11-30 Nicola Messina, Giuseppe Amato, Fabio Carrara, Claudio Gennaro, Fabrizio Falchi

Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling

There are two de facto standard architectures in recent computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Strong inductive biases of convolutions help the model learn sample effectively, but such strong…

Computer Vision and Pattern Recognition · Computer Science 2022-10-05 Yunsung Lee, Gyuseong Lee, Kwangrok Ryoo, Hyojun Go, Jihye Park, Seungryong Kim

RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact signifying as model size increases. This finding highlights a…

Computer Vision and Pattern Recognition · Computer Science 2025-06-03 Xuwei Xu, Yang Li, Yudong Chen, Jiajun Liu, Sen Wang

Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN

In the past decade, convolutional neural networks (CNNs) have shown prominence for semantic segmentation. Although CNN models have very impressive performance, the ability to capture global representation is still insufficient, which…

Computer Vision and Pattern Recognition · Computer Science 2023-02-22 Guoan Xu, Juncheng Li, Guangwei Gao, Huimin Lu, Jian Yang, Dong Yue

Rethinking Vision Transformers for MobileNet Size and Speed

With the success of Vision Transformers (ViTs) in computer vision tasks, recent arts try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches are proposed to accelerate…

Computer Vision and Pattern Recognition · Computer Science 2023-09-06 Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, Jian Ren

CNN and ViT Efficiency Study on Tiny ImageNet and DermaMNIST Datasets

This study evaluates the trade-offs between convolutional and transformer-based architectures on both medical and general-purpose image classification benchmarks. We use ResNet-18 as our baseline and introduce a fine-tuning strategy applied…

Computer Vision and Pattern Recognition · Computer Science 2026-02-16 Aidar Amangeldi, Angsar Taigonyrov, Muhammad Huzaifa Jawad, Chinedu Emmanuel Mbonu

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) can not perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and…

Computer Vision and Pattern Recognition · Computer Science 2022-08-17 Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan

CMUNeXt: An Efficient Medical Image Segmentation Network based on Large Kernel and Skip Fusion

The U-shaped architecture has emerged as a crucial paradigm in the design of medical image segmentation networks. However, due to the inherent local limitations of convolution, a fully convolutional segmentation network with U-shaped…

Image and Video Processing · Electrical Eng. & Systems 2023-08-04 Fenghe Tang, Jianrui Ding, Lingtao Wang, Chunping Ning, S. Kevin Zhou

Towards Robust Vision Transformer

Recent advances on Vision Transformer (ViT) and its improved variants have shown that self-attention-based networks surpass traditional Convolutional Neural Networks (CNNs) in most vision tasks. However, existing ViTs focus on the standard…

Computer Vision and Pattern Recognition · Computer Science 2022-05-24 Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, Hui Xue

AudioRepInceptionNeXt: A lightweight single-stream architecture for efficient audio recognition

Recent research has successfully adapted vision-based convolutional neural network (CNN) architectures for audio recognition tasks using Mel-Spectrograms. However, these CNNs have high computational costs and memory requirements, limiting…

Sound · Computer Science 2024-04-23 Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po

VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are two dominant models for image analysis. While CNNs excel at extracting multi-scale features and ViTs effectively capture global dependencies, both suffer from high…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Shicheng Yin, Kaixuan Yin, Weixing Chen, Enbo Huang, Yang Liu

CondenseNeXt: An Ultra-Efficient Deep Neural Network for Embedded Systems

Due to the advent of modern embedded systems and mobile devices with constrained resources, there is a great demand for incredibly efficient deep neural networks for machine learning purposes. There is also a growing concern of privacy and…

Computer Vision and Pattern Recognition · Computer Science 2021-12-02 Priyank Kalgaonkar, Mohamed El-Sharkawy