Related papers: DGEMM performance is data-dependent

Input-Dependent Power Usage in GPUs

GPUs are known to be power-hungry, and due to the boom in artificial intelligence, they are currently the major contributors to the high power demands of upcoming datacenters. Most GPU usage in these popular workloads consist of large…

Artificial Intelligence · Computer Science 2024-09-30 Theo Gregersen , Pratyush Patel , Esha Choukse

Energy Complexity of Software in Embedded Systems

The importance of low power consumption is widely acknowledged due to the increasing use of portable devices, which require minimizing the consumption of energy. The energy in a computational system depends heavily on the software being…

Adaptation and Self-Organizing Systems · Physics 2024-04-15 Kostas Zotos , Andreas Litke , Alexander Chatzigeorgiou , Spyros Nikolaidis , George Stephanides

Understanding the Impact of Input Entropy on FPU, CPU, and GPU Power

Power is increasingly becoming a limiting resource in high-performance, GPU-accelerated computing systems. Understanding the range and sources of power variation is essential in setting realistic bounds on rack and system peak power, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-20 Sridutt Bhalachandra , Brian Austin , Samuel Williams , Nicholas J. Wright

Performance Analysis of Matrix Multiplication for Deep Learning on the Edge

The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. In order to tackle this, we employ a simulator to estimate the performance of the…

Hardware Architecture · Computer Science 2024-03-13 Cristian Ramírez , Adrián Castelló , Héctor Martínez , Enrique S. Quintana-Ortí

High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines

Recent architectures integrate high-performance and power-efficient matrix engines. These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning. Several techniques have been…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-13 Yuki Uchino , Katsuhisa Ozaki , Toshiyuki Imamura

Batched matrix operations on distributed GPUs with application in theoretical physics

One of the most important and commonly used operations in many linear algebra functions is matrix-matrix multiplication (GEMM), which is also a key component in obtaining high performance of many scientific codes. It is a computationally…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-18 Nenad Mijić , Davor Davidović

NeuralMatrix: Compute the Entire Neural Networks with Linear Matrix Operations for Efficient Inference

The inherent diversity of computation types within the deep neural network (DNN) models often requires a variety of specialized units in hardware processors, which limits computational efficiency, increasing both inference latency and power…

Machine Learning · Computer Science 2024-08-21 Ruiqi Sun , Siwei Ye , Jie Zhao , Xin He , Jianzhe Lin , Yiran Li , An Zou

Throughput-Distortion Computation Of Generic Matrix Multiplication: Toward A Computation Channel For Digital Signal Processing Systems

The generic matrix multiply (GEMM) function is the core element of high-performance linear algebra libraries used in many computationally-demanding digital signal processing (DSP) systems. We propose an acceleration technique for GEMM based…

Mathematical Software · Computer Science 2015-05-30 Davide Anastasia , Yiannis Andreopoulos

Optimising Resource Management for Embedded Machine Learning

Machine learning inference is increasingly being executed locally on mobile and embedded platforms, due to the clear advantages in latency, privacy and connectivity. In this paper, we present approaches for online resource management in…

Computer Vision and Pattern Recognition · Computer Science 2021-05-11 Lei Xun , Long Tran-Thanh , Bashir M Al-Hashimi , Geoff V. Merrett

Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators

DL inference queries play an important role in diverse internet services and a large fraction of datacenter cycles are spent on processing DL inference queries. Specifically, the matrix-matrix multiplication (GEMM) operations of…

Hardware Architecture · Computer Science 2020-12-02 Benjamin Y. Cho , Jeageun Jung , Mattan Erez

tuGEMM: Area-Power-Efficient Temporal Unary GEMM Architecture for Low-Precision Edge AI

General matrix multiplication (GEMM) is a ubiquitous computing kernel/algorithm for data processing in diverse applications, including artificial intelligence (AI) and deep learning (DL). Recent shift towards edge computing has inspired…

Hardware Architecture · Computer Science 2024-12-25 Harideep Nair , Prabhu Vellaisamy , Albert Chen , Joseph Finn , Anna Li , Manav Trivedi , John Paul Shen

Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-21 Qiao Zhang , Rabab Alomairy , Dali Wang , Zhuowei Gu , Qinglei Cao

Matrix Element Regression with Deep Neural Networks -- breaking the CPU barrier

The Matrix Element Method (MEM) is a powerful method to extract information from measured events at collider experiments. Compared to multivariate techniques built on large sets of experimental data, the MEM does not rely on an…

High Energy Physics - Experiment · Physics 2021-04-07 Florian Bury , Christophe Delaere

Optimizing Irregular-Shaped Matrix-Matrix Multiplication on Multi-Core DSPs

General Matrix Multiplication (GEMM) has a wide range of applications in scientific simulation and artificial intelligence. Although traditional libraries can achieve high performance on large regular-shaped GEMMs, they often behave not…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-12 Shangfei Yin , Qinglin Wang , Ruochen Hao , Tianyang Zhou , Songzhu Mei , Jie Liu

IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers

GEneral Matrix Multiply (GEMM) is a central operation in deep learning and corresponds to the largest chunk of the compute footprint. Therefore, improving its efficiency is an active topic of ongoing research. A popular strategy is the use…

Machine Learning · Computer Science 2024-03-13 Zhanpeng Zeng , Karthikeyan Sankaralingam , Vikas Singh

MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications

Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet such demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-13 Bo Fang , Xinyi Li , Harvey Dam , Cheng Tan , Siva Kumar Sastry Hari , Timothy Tsai , Ignacio Laguna , Dingwen Tao , Ganesh Gopalakrishnan , Prashant Nair , Kevin Barker , Ang Li

Implementation of digital MemComputing using standard electronic components

Digital MemComputing machines (DMMs), which employ nonlinear dynamical systems with memory (time non-locality), have proven to be a robust and scalable unconventional computing approach for solving a wide variety of combinatorial…

Emerging Technologies · Computer Science 2024-07-16 Yuan-Hang Zhang , Massimiliano Di Ventra

Toward matrix multiplication for deep learning inference on the Xilinx Versal

The remarkable positive impact of Deep Neural Networks on many Artificial Intelligence (AI) tasks has led to the development of various high performance algorithms as well as specialized processors and accelerators. In this paper we address…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-16 Jie Lei , José Flich , Enrique S. Quintana-Ortí

Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs

General Matrix Multiplication (GEMM) is a fundamental operation widely used in scientific computations. Its performance and accuracy significantly impact the performance and accuracy of applications that depend on it. One such application…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-12 Fumiya Kono , Naohito Nakasato , Maho Nakata

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential for the performance of these systems. While…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-03 Shixun Wu , Yujia Zhai , Jinyang Liu , Jiajun Huang , Zizhe Jian , Bryan M. Wong , Zizhong Chen