Related papers: Accelerating Bandwidth-Bound Deep Learning Inferen…

An Experimental Exploration of In-Memory Computing for Multi-Layer Perceptrons

In modern computer architectures, the performance of many memory-bound workloads (e.g., machine learning, graph processing, databases) is limited by the data movement bottleneck that emerges when transferring large amounts of data between…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-12 Pedro Carrinho, Hamid Moghadaspour, Oscar Ferraz, João Dinis Ferreira, Yann Falevoz, Vitor Silva, Gabriel Falcao

CD-PIM: A High-Bandwidth and Compute-Efficient LPDDR5-Based PIM for Low-Batch LLM Acceleration on Edge-Device

Edge deployment of low-batch large language models (LLMs) faces critical memory bandwidth bottlenecks when executing memory-intensive general matrix-vector multiplication (GEMV) operations. While digital processing-in-memory (PIM)…

Hardware Architecture · Computer Science 2026-01-21 Ye Lin, Chao Fang, Xiaoyong Song, Qi Wu, Anying Jiang, Yichuan Bai, Li Du

TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

Recent studies from several hyperscalers point to embedding layers as the most memory-intensive deep learning (DL) algorithm being deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of…

Machine Learning · Computer Science 2019-08-27 Youngeun Kwon, Yunjae Lee, Minsoo Rhu

LP-GEMM: Integrating Layout Propagation into GEMM Operations

In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-07 César Guedes Carneiro, Lucas Alvarenga, Guido Araujo, Sandro Rigo

Membrane: Accelerating Database Analytics with Bank-Level DRAM-PIM Filtering

In-memory database query processing frequently involves substantial data transfers between the CPU and memory, leading to inefficiencies due to the von Neumann bottleneck. Processing-in-Memory (PIM) architectures offer a viable solution to…

Hardware Architecture · Computer Science 2025-04-10 Akhil Shekar, Kevin Gaffney, Martin Prammer, Khyati Kiyawat, Lingxi Wu, Helena Caminal, Zhenxing Fan, Yimin Gao, Ashish Venkat, José F. Martínez, Jignesh Patel, Kevin Skadron

High-Performance Deep Learning via a Single Building Block

Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each…

Machine Learning · Computer Science 2019-06-19 Evangelos Georganas, Kunal Banerjee, Dhiraj Kalamkar, Sasikanth Avancha, Anand Venkat, Michael Anderson, Greg Henry, Hans Pabst, Alexander Heinecke

HPIM: Heterogeneous Processing-In-Memory-based Accelerator for Large Language Models Inference

The deployment of large language models (LLMs) presents significant challenges due to their enormous memory footprints, low arithmetic intensity, and stringent latency requirements, particularly during the autoregressive decoding stage.…

Hardware Architecture · Computer Science 2025-11-03 Cenlin Duan, Jianlei Yang, Rubing Yang, Yikun Wang, Yiou Wang, Lingkun Long, Yingjie Qi, Xiaolin He, Ao Zhou, Xueyan Wang, Weisheng Zhao

AME-PIM: Can Memory be Your Next Tensor Accelerator?

High Bandwidth Memory with Processing-in-Memory (HBM-PIM) offers an opportunity to reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited instruction sets and require…

Hardware Architecture · Computer Science 2026-05-01 Emanuele Venieri, Simone Manoni, Alberto Florian, Jaehyun Park, Kyomin Sohn, Andrea Bartolini

MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness

Large language models (LLMs) face significant inference latency due to inefficiencies in GEMM operations, weight access, and KV cache access, especially in real-time scenarios. This highlights the need for a versatile compute-memory…

Hardware Architecture · Computer Science 2025-09-15 Huizheng Wang, Zichuan Wang, Zhiheng Yue, Yousheng Long, Taiquan Wei, Jianxun Yang, Yang Wang, Chao Li, Shaojun Wei, Yang Hu, Shouyi Yin

PIM-DRAM: Accelerating Machine Learning Workloads using Processing in Commodity DRAM

Deep Neural Networks (DNNs) have transformed the field of machine learning and are widely deployed in many applications involving image, video, speech and natural language processing. The increasing compute demands of DNNs have been widely…

Machine Learning · Computer Science 2021-08-17 Sourjya Roy, Mustafa Ali, Anand Raghunathan

MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration

General matrix-vector multiplication (GeMV) remains a critical latency bottleneck in large language model (LLM) inference, even with quantized low-bit models. Processing-Using-DRAM (PUD), an analog in-DRAM computing technique, has the…

Hardware Architecture · Computer Science 2025-09-24 Tatsuya Kubo, Daichi Tokuda, Tomoya Nagatani, Masayuki Usui, Lei Qu, Ting Cao, Shinya Takamaeda-Yamazaki

UPMEM Unleashed: Software Secrets for Speed

Developing kernels for Processing-In-Memory (PIM) platforms poses unique challenges in data management and parallel programming on limited processing units. Although software development kits (SDKs) for PIM, such as the UPMEM SDK, provide…

Hardware Architecture · Computer Science 2025-10-21 Krystian Chmielewski, Jarosław Ławnicki, Uladzislau Lukyanau, Tadeusz Kobus, Maciej Maciejewski

UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture

Deep Learning Recommendation Models (DLRMs) have gained popularity in recommendation systems due to their effectiveness in handling large-scale recommendation tasks. The embedding layers of DLRMs have become the performance bottleneck due…

Information Retrieval · Computer Science 2024-10-10 Sitian Chen, Haobin Tan, Amelie Chi Zhou, Yusen Li, Pavan Balaji

Toward matrix multiplication for deep learning inference on the Xilinx Versal

The remarkable positive impact of Deep Neural Networks on many Artificial Intelligence (AI) tasks has led to the development of various high performance algorithms as well as specialized processors and accelerators. In this paper we address…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-16 Jie Lei, José Flich, Enrique S. Quintana-Ortí

Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more…

Hardware Architecture · Computer Science 2024-04-18 Endri Taka, Dimitrios Gourounas, Andreas Gerstlauer, Diana Marculescu, Aman Arora

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched…

Hardware Architecture · Computer Science 2024-06-21 Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, Sanghyeon Lee, Hyungkyu Ham, Gwangsun Kim, Divya Mahajan, Jongse Park

The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference

Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic.…

Computation and Language · Computer Science 2025-06-16 Héctor Martínez, Adrián Castelló, Francisco D. Igual, Enrique S. Quintana-Ortí

A Fresh Perspective on DNN Accelerators by Performing Holistic Analysis Across Paradigms

Traditional computers with von Neumann architecture are unable to meet the latency and scalability challenges of Deep Neural Network (DNN) workloads. Various DNN accelerators based on Conventional compute Hardware Accelerator (CHA),…

Hardware Architecture · Computer Science 2022-08-11 Tom Glint, Chandan Kumar Jha, Manu Awasthi, Joycee Mekie

An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from…

Hardware Architecture · Computer Science 2023-09-07 Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, Onur Mutlu

DL-PIM: Improving Data Locality in Processing-in-Memory Systems

PIM architectures aim to reduce data transfer costs between processors and memory by integrating processing units within memory layers. Prior PIM architectures have shown potential to improve energy efficiency and performance. However, such…

Hardware Architecture · Computer Science 2025-10-10 Parker Hao Tian, Zahra Yousefijamarani, Alaa Alameldeen