Related papers: RapidNet: Multi-Level Dilated Convolution Based Mo…

RepViT: Revisiting Mobile CNN From ViT Perspective

Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency, compared with lightweight Convolutional Neural Networks (CNNs), on resource-constrained mobile devices. Researchers have discovered many…

Computer Vision and Pattern Recognition · Computer Science 2024-03-15 Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are…

Computer Vision and Pattern Recognition · Computer Science 2022-03-07 Sachin Mehta, Mohammad Rastegari

A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation

Vision-transformers (ViTs) and large-scale convolution-neural-networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Alon Kaya, Igal Bilik, Inna Stainvas

nnMobileNet: Rethinking CNN for Retinopathy Research

Over the past few decades, convolutional neural networks (CNNs) have been at the forefront of the detection and tracking of various retinal diseases (RD). Despite their success, the emergence of vision transformers (ViT) in the 2020s has…

Image and Video Processing · Electrical Eng. & Systems 2024-04-17 Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xin Li, Natasha Lepore, Oana M. Dumitrascu, Yalin Wang

Rethinking Vision Transformers for MobileNet Size and Speed

With the success of Vision Transformers (ViTs) in computer vision tasks, recent work tries to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches are proposed to accelerate…

Computer Vision and Pattern Recognition · Computer Science 2023-09-06 Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, Jian Ren

HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs

The hybrid deep models of Vision Transformer (ViT) and Convolution Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model…

Computer Vision and Pattern Recognition · Computer Science 2024-03-19 Ting Yao, Yehao Li, Yingwei Pan, Tao Mei

MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications

Traditionally, convolutional neural networks (CNN) and vision transformers (ViT) have dominated computer vision. However, recently proposed vision graph neural networks (ViG) provide a new avenue for exploration. Unfortunately, for mobile…

Computer Vision and Pattern Recognition · Computer Science 2023-07-04 Mustafa Munir, William Avery, Radu Marculescu

EfficientFormer: Vision Transformers at MobileNet Speed

Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism,…

Computer Vision and Pattern Recognition · Computer Science 2022-10-12 Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren

ConTNet: Why not use convolution and transformer at the same time?

Although convolutional networks (ConvNets) have enjoyed great success in computer vision (CV), they struggle to capture global information crucial to dense prediction tasks such as object detection and segmentation. In this work, we…

Computer Vision and Pattern Recognition · Computer Science 2021-05-12 Haotian Yan, Zhe Li, Weijian Li, Changhu Wang, Ming Wu, Chuang Zhang

RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization

In the realm of resource-constrained mobile vision tasks, the pursuit of efficiency and performance consistently drives innovation in lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). While ViTs excel at…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Mingshu Zhao, Yi Luo, Yong Ouyang

FMViT: A multiple-frequency mixing Vision Transformer

The transformer model has gained widespread adoption in computer vision tasks in recent times. However, due to the quadratic time and memory complexity of self-attention, which is proportional to the number of input tokens, most existing…

Computer Vision and Pattern Recognition · Computer Science 2023-11-13 Wei Tan, Yifeng Geng, Xuansong Xie

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and…

Computer Vision and Pattern Recognition · Computer Science 2022-08-17 Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan

EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers

Self-attention based models such as vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision. Despite increasingly stronger variants with ever-higher…

Computer Vision and Pattern Recognition · Computer Science 2022-07-25 Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, Brais Martinez

Transformed CNNs: recasting pre-trained convolutional layers with self-attention

Vision Transformers (ViT) have recently emerged as a powerful alternative to convolutional networks (CNNs). Although hybrid models attempt to bridge the gap between these two architectures, the self-attention layers they rely on induce a…

Machine Learning · Computer Science 2021-06-11 Stéphane d'Ascoli, Levent Sagun, Giulio Biroli, Ari Morcos

BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models

With the increasing popularity and the increasing size of vision transformers (ViTs), there has been an increasing interest in making them more efficient and less computationally costly for deployment on edge devices with limited computing…

Computer Vision and Pattern Recognition · Computer Science 2023-07-04 Phuoc-Hoan Charles Le, Xinlin Li

ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices

Neural Architecture Search (NAS) has shown promising performance in the automatic design of vision transformers (ViT) exceeding 1G FLOPs. However, designing lightweight and low-latency ViT models for diverse mobile devices remains a big…

Computer Vision and Pattern Recognition · Computer Science 2023-03-22 Chen Tang, Li Lyna Zhang, Huiqiang Jiang, Jiahang Xu, Ting Cao, Quanlu Zhang, Yuqing Yang, Zhi Wang, Mao Yang

Searching for Efficient Multi-Stage Vision Transformers

Vision Transformer (ViT) demonstrates that Transformer for natural language processing can be applied to computer vision tasks and result in comparable performance to convolutional neural networks (CNN), which have been studied and adopted…

Computer Vision and Pattern Recognition · Computer Science 2021-09-03 Yi-Lun Liao, Sertac Karaman, Vivienne Sze

MoViNets: Mobile Video Networks for Efficient Video Recognition

We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, Boqing Gong

MedViT: A Robust Vision Transformer for Generalized Medical Image Classification

Convolutional Neural Networks (CNNs) have advanced existing medical systems for automatic disease diagnosis. However, there are still concerns about the reliability of deep medical diagnosis systems against the potential threats of…

Computer Vision and Pattern Recognition · Computer Science 2023-03-21 Omid Nejati Manzari, Hamid Ahmadabadi, Hossein Kashiani, Shahriar B. Shokouhi, Ahmad Ayatollahi

CMT: Convolutional Neural Networks Meet Vision Transformers

Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, Chang Xu