Related papers: Optimizing CNN Model Inference on CPUs

Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs

Leveraging large data sets, deep Convolutional Neural Networks (CNNs) achieve state-of-the-art recognition accuracy. Due to the substantial compute and memory operations, however, they require significant execution time. The massive…

Distributed, Parallel, and Cluster Computing · Computer Science · 2016-10-13 · Chao Li, Yi Yang, Min Feng, Srimat Chakradhar, Huiyang Zhou

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Modern deep learning applications urge to push the model inference taking place at the edge devices for multiple reasons such as achieving shorter latency, relieving the burden of the network connecting to the cloud, and protecting user…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-07-05 · Leyuan Wang, Zhi Chen, Yizhi Liu, Yao Wang, Lianmin Zheng, Mu Li, Yida Wang

Optimizing CNN Using HPC Tools

This paper optimizes the Convolutional Neural Network (CNN) algorithm using high-performance computing (HPC) technologies. It uses multi-core processors, GPUs, and parallel computing frameworks like OpenMPI and CUDA to speed up CNN model…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-03-11 · Shahrin Rahman

msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML

AI spans from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are decisive to fit within an MCU's tiny memory budget e.g., 128kB of RAM. However, inference latency must…

Machine Learning · Computer Science · 2025-10-20 · Zhaolan Huang, Emmanuel Baccelli

Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs

Deploying deep learning models in cloud clusters provides efficient and prompt inference services to accommodate the widespread application of deep learning. These clusters are usually equipped with host CPUs and accelerators with distinct…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-07-24 · Zinuo Cai, Hao Wang, Tao Song, Yang Hua, Ruhui Ma, Haibing Guan

Accelerating CNN inference on FPGAs: A Survey

Convolutional Neural Networks (CNNs) are currently adopted to solve an ever greater number of problems, ranging from speech recognition to image classification and segmentation. The large amount of processing required by CNNs calls for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-06-06 · Kamel Abdelouahab, Maxime Pelcat, Jocelyn Serot, François Berry

Deep Learning Models on CPUs: A Methodology for Efficient Training

GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when…

Machine Learning · Computer Science · 2023-06-21 · Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R. Canchi, Zhongwei Teng, Jules White, Douglas C. Schmidt

Inference Performance Optimization for Large Language Models on CPUs

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry.…

Artificial Intelligence · Computer Science · 2024-07-11 · Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Towards CPU Performance Prediction: New Challenge Benchmark Dataset and Novel Approach

The server central processing unit (CPU) market continues to exhibit robust demand due to the rising global need for computing power. Against this backdrop, CPU benchmark performance prediction is crucial for architecture designers. It…

Performance · Computer Science · 2024-10-29 · Xiaoman Liu

Inference Acceleration for Large Language Models on CPUs

In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-06-13 · Ditto PS, Jithin VG, Adarsh MS

Split CNN Inference on Networked Microcontrollers

Running deep neural networks on microcontroller units (MCUs) is severely constrained by limited memory resources. While TinyML techniques reduce model size and computation, they often fail in practice due to excessive peak Random Access…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-12 · Junyu Lu, Shashwath Suresh, Hao Liu, Qi Hong, Qing Wang

Balanced segmentation of CNNs for multi-TPU inference

In this paper, we propose different alternatives for convolutional neural networks (CNNs) segmentation, addressing inference processes on computing architectures composed by multiple Edge TPUs. Specifically, we compare the inference…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-03-04 · Jorge Villarrubia, Luis Costero, Francisco D. Igual, Katzalin Olcoz

Optimizing Convolutional Neural Networks for Embedded Systems by Means of Neuroevolution

Automated design methods for convolutional neural networks (CNNs) have recently been developed in order to increase the design productivity. We propose a neuroevolution method capable of evolving and optimizing CNNs with respect to the…

Neural and Evolutionary Computing · Computer Science · 2019-10-16 · Filip Badan, Lukas Sekanina

Neural Network Compression Framework for fast model inference

In this work we present a new framework for neural networks compression with fine-tuning, which we called Neural Network Compression Framework (NNCF). It leverages recent advances of various network compression methods and implements some…

Computer Vision and Pattern Recognition · Computer Science · 2021-01-01 · Alexander Kozlov, Ivan Lazarevich, Vasily Shamporov, Nikolay Lyalyushkin, Yury Gorbachev

cuConv: A CUDA Implementation of Convolution for CNN Inference

Convolutions are the core operation of deep learning applications based on Convolutional Neural Networks (CNNs). Current GPU architectures are highly efficient for training and deploying deep CNNs, and hence, these are largely used in…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-10-28 · Marc Jordà, Pedro Valero-Lara, Antonio J. Peña

Computation-Performance Optimization of Convolutional Neural Networks with Redundant Kernel Removal

Deep Convolutional Neural Networks (CNNs) are widely employed in modern computer vision algorithms, where the input image is convolved iteratively by many kernels to extract the knowledge behind it. However, with the depth of convolutional…

Computer Vision and Pattern Recognition · Computer Science · 2018-04-11 · Chih-Ting Liu, Yi-Heng Wu, Yu-Sheng Lin, Shao-Yi Chien

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks. Techniques to do in-situ arithmetic in…

Hardware Architecture · Computer Science · 2018-05-11 · Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das

On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems

Convolutional Neural Networks (CNNs) have shown a great deal of success in diverse application domains including computer vision, speech recognition, and natural language processing. However, as the size of datasets and the depth of neural…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-12-07 · Wonje Choi, Karthi Duraisamy, Ryan Gary Kim, Janardhan Rao Doppa, Partha Pratim Pande, Diana Marculescu, Radu Marculescu

Memory-Efficient Dataflow Inference for Deep CNNs on FPGA

Custom dataflow Convolutional Neural Network (CNN) inference accelerators on FPGA are tailored to a specific CNN topology and store parameters in On-Chip Memory (OCM), resulting in high energy efficiency and low inference latency. However,…

Hardware Architecture · Computer Science · 2020-11-17 · Lucian Petrica, Tobias Alonso, Mairin Kroes, Nicholas Fraser, Sorin Cotofana, Michaela Blott

Optimizing Temporal Convolutional Network inference on FPGA-based accelerators

Convolutional Neural Networks are extensively used in a wide range of applications, commonly including computer vision tasks like image and video classification, recognition, and segmentation. Recent research results demonstrate that…

Signal Processing · Electrical Eng. & Systems · 2020-05-11 · Marco Carreras, Gianfranco Deriu, Luigi Raffo, Luca Benini, Paolo Meloni