Related papers: Im2win: Memory Efficient Convolution On SIMD Archi…

Im2win: An Efficient Convolution Paradigm on GPU

Convolution is the most time-consuming operation in deep neural network operations, so its performance is critical to the overall performance of the neural network. The commonly used methods for convolution on GPU include the general matrix…

Neural and Evolutionary Computing · Computer Science 2023-06-27 Shuai Lu, Jun Chu, Luanzheng Guo, Xu T. Liu

High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures

Convolution is the core component within deep neural networks and it is computationally intensive and time consuming. Tensor data layouts significantly impact convolution operations in terms of memory access and computational efficiency.…

Machine Learning · Computer Science 2024-08-02 Xiang Fu, Xinpeng Zhang, Jixiang Ma, Peng Zhao, Shuai Lu, Xu T. Liu

MEC: Memory-efficient Convolution for Deep Neural Network

Convolution is a critical component in modern deep neural networks, thus several algorithms for convolution have been developed. Direct convolution is simple but suffers from poor performance. As an alternative, multiple indirect methods…

Machine Learning · Computer Science 2017-06-22 Minsik Cho, Daniel Brand

The Indirect Convolution Algorithm

Deep learning frameworks commonly implement convolution operators with GEMM-based algorithms. In these algorithms, convolution is implemented on top of matrix-matrix multiplication (GEMM) functions, provided by highly optimized BLAS…

Computer Vision and Pattern Recognition · Computer Science 2019-07-05 Marat Dukhan

High Performance and Portable Convolution Operators for ARM-based Multicore Processors

The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of networks. One of these…

Performance · Computer Science 2020-05-14 Pablo San Juan, Adrián Castelló, Manuel F. Dolz, Pedro Alonso-Jordá, Enrique S. Quintana-Ortí

Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions

Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference. A traditional approach to compute convolutions is known as the Im2Col + BLAS method. This paper proposes SConv:…

Computer Vision and Pattern Recognition · Computer Science 2023-03-09 Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo

Low-memory GEMM-based convolution algorithms for deep neural networks

Deep neural networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. A common approach to implementing DNNs is to recast the most computationally expensive operations as…

Computer Vision and Pattern Recognition · Computer Science 2017-09-12 Andrew Anderson, Aravind Vasudevan, Cormac Keane, David Gregg

Characterizing and Demystifying the Implicit Convolution Algorithm on Commercial Matrix-Multiplication Accelerators

Many of today's deep neural network accelerators, e.g., Google's TPU and NVIDIA's tensor core, are built around accelerating the general matrix multiplication (i.e., GEMM). However, supporting convolution on GEMM-based accelerators is not…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-11 Yangjie Zhou, Mengtian Yang, Cong Guo, Jingwen Leng, Yun Liang, Quan Chen, Minyi Guo, Yuhao Zhu

I/O Lower Bounds for Auto-tuning of Convolutions in CNNs

Convolution is the most time-consuming part in the computation of convolutional neural networks (CNNs), which have achieved great successes in numerous applications. Due to the complex data dependency and the increase in the amount of model…

Machine Learning · Computer Science 2021-01-01 Xiaoyang Zhang, Junmin Xiao, Guangming Tan

3D-TrIM: A Memory-Efficient Spatial Computing Architecture for Convolution Workloads

The Von Neumann bottleneck, which relates to the energy cost of moving data from memory to on-chip core and vice versa, is a serious challenge in state-of-the-art AI architectures, like Convolutional Neural Networks' (CNNs) accelerators.…

Hardware Architecture · Computer Science 2025-02-27 Cristian Sestito, Ahmed J. Abdelmaksoud, Shady Agwa, Themis Prodromakis

SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution

We present a novel approach for accelerating convolutions during inference for CPU-based architectures. The most common method of computation involves packing the image into the columns of a matrix (im2col) and performing general matrix…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Amir Ofir, Gil Ben-Artzi

An Energy-Efficient Edge Computing Paradigm for Convolution-based Image Upsampling

A novel energy-efficient edge computing paradigm is proposed for real-time deep learning-based image upsampling applications. State-of-the-art deep learning solutions for image upsampling are currently trained using either resize or…

Computer Vision and Pattern Recognition · Computer Science 2021-07-27 Ian Colbert, Ken Kreutz-Delgado, Srinjoy Das

Computing-In-Memory Dataflow for Minimal Buffer Traffic

Computing-In-Memory (CIM) offers a potential solution to the memory wall issue and can achieve high energy efficiency by minimizing data movement, making it a promising architecture for edge AI devices. Lightweight models like MobileNet and…

Hardware Architecture · Computer Science 2025-08-21 Choongseok Song, Doo Seok Jeong

Parallel Multi Channel Convolution using General Matrix Multiplication

Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally intensive parts of CNNs are the convolutional layers, which convolve…

Computer Vision and Pattern Recognition · Computer Science 2017-07-04 Aravind Vasudevan, Andrew Anderson, David Gregg

High Performance Zero-Memory Overhead Direct Convolutions

The computation of convolution layers in deep neural networks typically relies on high performance routines that trade space for time by using additional memory (either for packing purposes or required as part of the algorithm) to improve…

Machine Learning · Computer Science 2018-09-28 Jiyuan Zhang, Franz Franchetti, Tze Meng Low

ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation

Convolution is a compute-intensive operation placed at the heart of Convolution Neural Networks (CNNs). It has led to the development of many high-performance algorithms, such as Im2col-GEMM, Winograd, and Direct-Convolution. However, the…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Lucas Alvarenga, Victor Ferrari, Rafael Souza, Marcio Pereira, Guido Araujo

Optimizing Winograd Convolution on ARMv8 processors

As Convolutional Neural Networks (CNNs) gain prominence in deep learning, algorithms like Winograd Convolution have been introduced to enhance computational efficiency. However, existing implementations often face challenges such as high…

Performance · Computer Science 2024-12-30 Haoyuan Gui, Xiaoyu Zhang, Chong Zhang, Zitong Su, Huiyuan Li

LR-CNN: Lightweight Row-centric Convolutional Neural Network Training for Memory Reduction

In the last decade, Convolutional Neural Network with a multi-layer architecture has advanced rapidly. However, training its complex network is very space-consuming, since a lot of intermediate data are preserved across layers, especially…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-23 Zhigang Wang, Hangyu Yang, Ning Wang, Chuanfei Xu, Jie Nie, Zhiqiang Wei, Yu Gu, Ge Yu

BP-Im2col: Implicit Im2col Supporting AI Backpropagation on Systolic Arrays

State-of-the-art systolic array-based accelerators adopt the traditional im2col algorithm to accelerate the inference of convolutional layers. However, traditional im2col cannot efficiently support AI backpropagation. Backpropagation in…

Hardware Architecture · Computer Science 2022-09-21 Jianchao Yang, Mei Wen, Junzhong Shen, Yasong Cao, Minjin Tang, Renyu Yang, Jiawei Fei, Chunyuan Zhang

Accelerating Transposed Convolutions on FPGA-based Edge Devices

Transposed Convolutions (TCONV) enable the up-scaling mechanism within generative Artificial Intelligence (AI) models. However, the predominant Input-Oriented Mapping (IOM) method for implementing TCONV has complex output mapping,…

Hardware Architecture · Computer Science 2025-07-11 Jude Haris, José Cano