Related papers: Accelerating Bandwidth-Bound Deep Learning Inferen…

200 papers

In modern computer architectures, the performance of many memory-bound workloads (e.g., machine learning, graph processing, databases) is limited by the data movement bottleneck that emerges when transferring large amounts of data between…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-08-12 · Pedro Carrinho, Hamid Moghadaspour, Oscar Ferraz, João Dinis Ferreira, Yann Falevoz, Vitor Silva, Gabriel Falcao

Edge deployment of low-batch large language models (LLMs) faces critical memory bandwidth bottlenecks when executing memory-intensive general matrix-vector multiplication (GEMV) operations. While digital processing-in-memory (PIM)…

Hardware Architecture · Computer Science · 2026-01-21 · Ye Lin, Chao Fang, Xiaoyong Song, Qi Wu, Anying Jiang, Yichuan Bai, Li Du

Recent studies from several hyperscalers point to embedding layers as the most memory-intensive deep learning (DL) algorithm being deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of…

Machine Learning · Computer Science · 2019-08-27 · Youngeun Kwon, Yunjae Lee, Minsoo Rhu

In Scientific Computing and modern Machine Learning (ML) workloads, sequences of dependent General Matrix Multiplications (GEMMs) often dominate execution time. While state-of-the-art BLAS libraries aggressively optimize individual GEMM…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-07 · César Guedes Carneiro, Lucas Alvarenga, Guido Araujo, Sandro Rigo

In-memory database query processing frequently involves substantial data transfers between the CPU and memory, leading to inefficiencies due to the von Neumann bottleneck. Processing-in-Memory (PIM) architectures offer a viable solution to…

Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each…

The deployment of large language models (LLMs) presents significant challenges due to their enormous memory footprints, low arithmetic intensity, and stringent latency requirements, particularly during the autoregressive decoding stage.…

Hardware Architecture · Computer Science · 2025-11-03 · Cenlin Duan, Jianlei Yang, Rubing Yang, Yikun Wang, Yiou Wang, Lingkun Long, Yingjie Qi, Xiaolin He, Ao Zhou, Xueyan Wang, Weisheng Zhao

High Bandwidth Memory with Processing-in-Memory (HBM-PIM) offers an opportunity to reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited instruction sets and require…

Hardware Architecture · Computer Science · 2026-05-01 · Emanuele Venieri, Simone Manoni, Alberto Florian, Jaehyun Park, Kyomin Sohn, Andrea Bartolini

Large language models (LLMs) face significant inference latency due to inefficiencies in GEMM operations, weight access, and KV cache access, especially in real-time scenarios. This highlights the need for a versatile compute-memory…

Hardware Architecture · Computer Science · 2025-09-15 · Huizheng Wang, Zichuan Wang, Zhiheng Yue, Yousheng Long, Taiquan Wei, Jianxun Yang, Yang Wang, Chao Li, Shaojun Wei, Yang Hu, Shouyi Yin

Deep Neural Networks (DNNs) have transformed the field of machine learning and are widely deployed in many applications involving image, video, speech and natural language processing. The increasing compute demands of DNNs have been widely…

Machine Learning · Computer Science · 2021-08-17 · Sourjya Roy, Mustafa Ali, Anand Raghunathan

General matrix-vector multiplication (GeMV) remains a critical latency bottleneck in large language model (LLM) inference, even with quantized low-bit models. Processing-Using-DRAM (PUD), an analog in-DRAM computing technique, has the…

Hardware Architecture · Computer Science · 2025-09-24 · Tatsuya Kubo, Daichi Tokuda, Tomoya Nagatani, Masayuki Usui, Lei Qu, Ting Cao, Shinya Takamaeda-Yamazaki

Developing kernels for Processing-In-Memory (PIM) platforms poses unique challenges in data management and parallel programming on limited processing units. Although software development kits (SDKs) for PIM, such as the UPMEM SDK, provide…

Hardware Architecture · Computer Science · 2025-10-21 · Krystian Chmielewski, Jarosław Ławnicki, Uladzislau Lukyanau, Tadeusz Kobus, Maciej Maciejewski

Deep Learning Recommendation Models (DLRMs) have gained popularity in recommendation systems due to their effectiveness in handling large-scale recommendation tasks. The embedding layers of DLRMs have become the performance bottleneck due…

Information Retrieval · Computer Science · 2024-10-10 · Sitian Chen, Haobin Tan, Amelie Chi Zhou, Yusen Li, Pavan Balaji

The remarkable positive impact of Deep Neural Networks on many Artificial Intelligence (AI) tasks has led to the development of various high performance algorithms as well as specialized processors and accelerators. In this paper we address…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-02-16 · Jie Lei, José Flich, Enrique S. Quintana-Ortí

FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more…

Hardware Architecture · Computer Science · 2024-04-18 · Endri Taka, Dimitrios Gourounas, Andreas Gerstlauer, Diana Marculescu, Aman Arora

Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched…

Hardware Architecture · Computer Science · 2024-06-21 · Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, Sanghyeon Lee, Hyungkyu Ham, Gwangsun Kim, Divya Mahajan, Jongse Park

Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic.…

Computation and Language · Computer Science · 2025-06-16 · Héctor Martínez, Adrián Castelló, Francisco D. Igual, Enrique S. Quintana-Ortí

Traditional computers with von Neumann architecture are unable to meet the latency and scalability challenges of Deep Neural Network (DNN) workloads. Various DNN accelerators based on Conventional compute Hardware Accelerator (CHA),…

Hardware Architecture · Computer Science · 2022-08-11 · Tom Glint, Chandan Kumar Jha, Manu Awasthi, Joycee Mekie

Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from…

Hardware Architecture · Computer Science · 2023-09-07 · Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, Onur Mutlu

PIM architectures aim to reduce data transfer costs between processors and memory by integrating processing units within memory layers. Prior PIM architectures have shown potential to improve energy efficiency and performance. However, such…

Hardware Architecture · Computer Science · 2025-10-10 · Parker Hao Tian, Zahra Yousefijamarani, Alaa Alameldeen