Related papers: PAPI: Exploiting Dynamic Parallelism in Large Lang…

HPIM: Heterogeneous Processing-In-Memory-based Accelerator for Large Language Models Inference

The deployment of large language models (LLMs) presents significant challenges due to their enormous memory footprints, low arithmetic intensity, and stringent latency requirements, particularly during the autoregressive decoding stage.…

Hardware Architecture · Computer Science 2025-11-03 Cenlin Duan, Jianlei Yang, Rubing Yang, Yikun Wang, Yiou Wang, Lingkun Long, Yingjie Qi, Xiaolin He, Ao Zhou, Xueyan Wang, Weisheng Zhao

PIM-AI: A Novel Architecture for High-Efficiency LLM Inference

Large Language Models (LLMs) have become essential in a variety of applications due to their advanced language understanding and generation capabilities. However, their computational and memory requirements pose significant challenges to…

Hardware Architecture · Computer Science 2024-12-02 Cristobal Ortega, Yann Falevoz, Renaud Ayrignac

PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System

The widespread adoption of Large Language Models (LLMs) has exponentially increased the demand for efficient serving systems. With growing requests and context lengths, key-value (KV)-related operations, including attention computation and…

Hardware Architecture · Computer Science 2026-02-13 Lian Liu, Shixin Zhao, Yutian Zhou, Yintao He, Mengdi Wang, Yinhe Han, Ying Wang

LEAP: LLM Inference on Scalable PIM-NoC Architecture with Balanced Dataflow and Fine-Grained Parallelism

Large language model (LLM) inference has been a prevalent demand in daily life and industries. The large tensor sizes and computing complexities in LLMs have brought challenges to memory, computing, and databus. This paper proposes a…

Hardware Architecture · Computer Science 2025-09-19 Yimin Wang, Yue Jiet Chong, Xuanyao Fong

CompAir: Synergizing Complementary PIMs and In-Transit NoC Computation for Efficient LLM Acceleration

The rapid advancement of Large Language Models (LLMs) has revolutionized various aspects of human life, yet their immense computational and energy demands pose significant challenges for efficient inference. The memory wall, the growing…

Hardware Architecture · Computer Science 2025-09-18 Hongyi Li, Songchen Ma, Huanyu Qu, Weihao Zhang, Jia Chen, Junfeng Lin, Fengbin Tu, Rong Zhao

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Autoregressive decoding of large language models (LLMs) is memory bandwidth bounded, resulting in high latency and significant wastes of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding…

Machine Learning · Computer Science 2024-02-06 Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang

PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System

The expansion of long-context Large Language Models (LLMs) creates significant memory system challenges. While Processing-in-Memory (PIM) is a promising accelerator, we identify that it suffers from critical inefficiencies when scaled to…

Hardware Architecture · Computer Science 2025-12-29 Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, Woongkyu Lee, Minjae Lee, Gyeonggeun Jung, Hyungdeok Lee, Yousub Jung, Jaehan Park, Yosub Song, Byeongsu Yang, Haerang Choi, Guhyun Kim, Jongsoon Won, Woojae Shin, Changhyun Kim, Gyeongcheol Shin, Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi

Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models

Large language models (LLMs) are renowned for their extensive linguistic knowledge and strong generalization capabilities, but their high computational demands make them unsuitable for resource-constrained environments. In contrast, small…

Computation and Language · Computer Science 2025-06-10 Kyeonghyun Kim, Jinhee Jang, Juhwan Choi, Yoonji Lee, Kyohoon Jin, YoungBin Kim

Revealing the Parallel Multilingual Learning within Large Language Models

In this study, we reveal an in-context learning (ICL) capability of multilingual large language models (LLMs): by translating the input to several languages, we provide Parallel Input in Multiple Languages (PiM) to LLMs, which significantly…

Computation and Language · Computer Science 2025-06-04 Yongyu Mu, Peinan Feng, Zhiquan Cao, Yuzhang Wu, Bei Li, Chenglong Wang, Tong Xiao, Kai Song, Tongran Liu, Chunliang Zhang, Jingbo Zhu

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched…

Hardware Architecture · Computer Science 2024-06-21 Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, Sanghyeon Lee, Hyungkyu Ham, Gwangsun Kim, Divya Mahajan, Jongse Park

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Diver: Large Language Model Decoding with Span-Level Mutual Information Verification

Large language models (LLMs) have shown impressive capabilities in adapting to various tasks when provided with task-specific instructions. However, LLMs using standard decoding strategies often struggle with deviations from the inputs.…

Computation and Language · Computer Science 2024-06-05 Jinliang Lu, Chen Wang, Jiajun Zhang

P3-LLM: An Integrated NPU-PIM Accelerator for Edge LLM Inference Using Hybrid Numerical Formats

The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine neural processing…

Hardware Architecture · Computer Science 2026-05-05 Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah

PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition

In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential decoding process, which autoregressively generates all labels…

Computation and Language · Computer Science 2024-11-22 Jinghui Lu, Ziwei Yang, Yanjie Wang, Xuejing Liu, Brian Mac Namee, Can Huang

New Solutions on LLM Acceleration, Optimization, and Application

Large Language Models (LLMs) have become extremely potent instruments with exceptional capacities for comprehending and producing human-like text in a wide range of applications. However, the increasing size and complexity of LLMs present…

Machine Learning · Computer Science 2024-06-18 Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-19 Stephen Mell, David Mell, Konstantinos Kallas, Steve Zdancewic, Osbert Bastani

Large Process Models: A Vision for Business Process Management in the Age of Generative AI

The continued success of Large Language Models (LLMs) and other generative artificial intelligence approaches highlights the advantages that large information corpora can have over rigidly defined symbolic models, but also serves as a…

Software Engineering · Computer Science 2025-01-20 Timotheus Kampik, Christian Warmuth, Adrian Rebmann, Ron Agam, Lukas N. P. Egger, Andreas Gerber, Johannes Hoffart, Jonas Kolk, Philipp Herzig, Gero Decker, Han van der Aa, Artem Polyvyanyy, Stefanie Rinderle-Ma, Ingo Weber, Matthias Weidlich

L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference

Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. While HBM-based acceleration offers high bandwidth, its capacity…

Hardware Architecture · Computer Science 2025-04-25 Qingyuan Liu, Liyan Chen, Yanning Yang, Haocheng Wang, Dong Du, Zhigang Mao, Naifeng Jing, Yubin Xia, Haibo Chen

Hardware Acceleration of LLMs: A comprehensive survey and comparison

Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. In this paper, we present a comprehensive survey of…

Hardware Architecture · Computer Science 2024-09-06 Nikoletta Koilia, Christoforos Kachris

Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these…

Machine Learning · Computer Science 2025-10-01 Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan