Related papers: Accelerating Machine Learning Primitives on Commod…

Sliding Window Sum Algorithms for Deep Neural Networks

Sliding window sums are widely used for string indexing, hashing and time series analysis. We have developed a family of generic vectorized sliding sum algorithms that provide a speedup of $O(P/w)$ for window size $w$ and number of…

Machine Learning · Computer Science 2023-05-29 Roman Snytsar

Low-memory GEMM-based convolution algorithms for deep neural networks

Deep neural networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. A common approach to implementing DNNs is to recast the most computationally expensive operations as…

Computer Vision and Pattern Recognition · Computer Science 2017-09-12 Andrew Anderson, Aravind Vasudevan, Cormac Keane, David Gregg

ZNNi - Maximizing the Inference Throughput of 3D Convolutional Networks on Multi-Core CPUs and GPUs

Sliding window convolutional networks (ConvNets) have become a popular approach to computer vision problems such as image segmentation, and object detection and localization. Here we consider the problem of inference, the application of a…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-21 Aleksandar Zlateski, Kisuk Lee, H. Sebastian Seung

Efficient Distributed Learning over Decentralized Networks with Convoluted Support Vector Machine

This paper addresses the problem of efficiently classifying high-dimensional data over decentralized networks. Penalized support vector machines (SVMs) are widely used for high-dimensional classification tasks. However, the double…

Machine Learning · Statistics 2025-03-11 Canyi Chen, Nan Qiao, Liping Zhu

NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques

Quantization has emerged to be an effective way to significantly boost the performance of deep neural networks (DNNs) by utilizing low-bit computations. Despite having lower numerical precision, quantized DNNs are able to reduce both memory…

Machine Learning · Computer Science 2019-11-15 Wenlei Bao, Li-Wen Chang, Yang Chen, Ke Deng, Amit Agarwal, Emad Barsoum, Abe Taha

Parallel approach to sliding window sums

Sliding window sums are widely used in bioinformatics applications, including sequence assembly, k-mer generation, hashing and compression. New vector algorithms which utilize the advanced vector extension (AVX) instructions available on…

Data Structures and Algorithms · Computer Science 2019-09-04 Roman Snytsar, Yatish Turakhia

PIM-DRAM: Accelerating Machine Learning Workloads using Processing in Commodity DRAM

Deep Neural Networks (DNNs) have transformed the field of machine learning and are widely deployed in many applications involving image, video, speech and natural language processing. The increasing compute demands of DNNs have been widely…

Machine Learning · Computer Science 2021-08-17 Sourjya Roy, Mustafa Ali, Anand Raghunathan

The Indirect Convolution Algorithm

Deep learning frameworks commonly implement convolution operators with GEMM-based algorithms. In these algorithms, convolution is implemented on top of matrix-matrix multiplication (GEMM) functions, provided by highly optimized BLAS…

Computer Vision and Pattern Recognition · Computer Science 2019-07-05 Marat Dukhan

Accelerating Generative Neural Networks on Unmodified Deep Learning Processors -- A Software Approach

Generative neural networks are a new category of neural networks that have been widely utilized in applications such as content generation, unsupervised learning, segmentation and pose estimation. They typically involve massive…

Machine Learning · Computer Science 2020-04-30 Dawen Xu, Ying Wang, Kaijie Tu, Cheng Liu, Bingsheng He, Lei Zhang

SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution

We present a novel approach for accelerating convolutions during inference for CPU-based architectures. The most common method of computation involves packing the image into the columns of a matrix (im2col) and performing general matrix…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Amir Ofir, Gil Ben-Artzi

Efficient and Generic 1D Dilated Convolution Layer for Deep Learning

Convolutional neural networks (CNNs) have found many applications in tasks involving two-dimensional (2D) data, such as image classification and image processing. Therefore, 2D convolution layers have been heavily optimized on CPUs and…

Machine Learning · Computer Science 2021-04-19 Narendra Chaudhary, Sanchit Misra, Dhiraj Kalamkar, Alexander Heinecke, Evangelos Georganas, Barukh Ziv, Menachem Adelman, Bharat Kaul

Sliding-Window Merging for Compacting Patch-Redundant Layers in LLMs

Depth-wise pruning accelerates LLM inference in resource-constrained scenarios but suffers from performance degradation due to direct removal of entire Transformer layers. This paper reveals "Patch-like" redundancy across layers via…

Computer Vision and Pattern Recognition · Computer Science 2026-03-20 Xuan Ding, Rui Sun, Yunjian Zhang, Xiu Yan, Yueqi Zhou, Kaihao Huang, Suzhong Fu, Angelica I Aviles-Rivero, Chuanlong Xie, Yao Zhu

Evaluation of Convolution Primitives for Embedded Neural Networks on 32-bit Microcontrollers

Deploying neural networks on constrained hardware platforms such as 32-bit microcontrollers is a challenging task because of the large memory, computing and energy requirements of their inference process. To tackle these issues, several…

Machine Learning · Computer Science 2023-03-21 Baptiste Nguyen, Pierre-Alain Moellic, Sylvain Blayac

Compiler-Level Matrix Multiplication Optimization for Deep Learning

An important linear algebra routine, GEneral Matrix Multiplication (GEMM), is a fundamental operator in deep learning. Compilers need to translate these routines into low-level code optimized for specific hardware. Compiler-level…

Machine Learning · Computer Science 2019-09-25 Huaqing Zhang, Xiaolin Cheng, Hui Zang, Dae Hoon Park

Im2win: An Efficient Convolution Paradigm on GPU

Convolution is the most time-consuming operation in deep neural network operations, so its performance is critical to the overall performance of the neural network. The commonly used methods for convolution on GPU include the general matrix…

Neural and Evolutionary Computing · Computer Science 2023-06-27 Shuai Lu, Jun Chu, Luanzheng Guo, Xu T. Liu

VW-SDK: Efficient Convolutional Weight Mapping Using Variable Windows for Processing-In-Memory Architectures

With their high energy efficiency, processing-in-memory (PIM) arrays are increasingly used for convolutional neural network (CNN) inference. In PIM-based CNN inference, the computational latency and energy are dependent on how the CNN…

Machine Learning · Computer Science 2021-12-22 Johnny Rhe, Sungmin Moon, Jong Hwan Ko

High-Performance Deep Learning via a Single Building Block

Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each…

Machine Learning · Computer Science 2019-06-19 Evangelos Georganas, Kunal Banerjee, Dhiraj Kalamkar, Sasikanth Avancha, Anand Venkat, Michael Anderson, Greg Henry, Hans Pabst, Alexander Heinecke

Throughput-Distortion Computation Of Generic Matrix Multiplication: Toward A Computation Channel For Digital Signal Processing Systems

The generic matrix multiply (GEMM) function is the core element of high-performance linear algebra libraries used in many computationally-demanding digital signal processing (DSP) systems. We propose an acceleration technique for GEMM based…

Mathematical Software · Computer Science 2015-05-30 Davide Anastasia, Yiannis Andreopoulos

A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication

The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-15 Yufan Xia, Marco De La Pierre, Amanda S. Barnard, Giuseppe Maria Junior Barca

Learning-Augmented Frequency Estimation in Sliding Windows

We show how to utilize machine learning approaches to improve sliding window algorithms for approximate frequency estimation problems, under the "algorithms with predictions" framework. In this dynamic environment, previous…

Data Structures and Algorithms · Computer Science 2024-09-19 Rana Shahout, Ibrahim Sabek, Michael Mitzenmacher