English
Related papers

Related papers: PAPI: Exploiting Dynamic Parallelism in Large Lang…

200 papers

The deployment of large language models (LLMs) presents significant challenges due to their enormous memory footprints, low arithmetic intensity, and stringent latency requirements, particularly during the autoregressive decoding stage.…

Hardware Architecture · Computer Science 2025-11-03 Cenlin Duan , Jianlei Yang , Rubing Yang , Yikun Wang , Yiou Wang , Lingkun Long , Yingjie Qi , Xiaolin He , Ao Zhou , Xueyan Wang , Weisheng Zhao

Large Language Models (LLMs) have become essential in a variety of applications due to their advanced language understanding and generation capabilities. However, their computational and memory requirements pose significant challenges to…

Hardware Architecture · Computer Science 2024-12-02 Cristobal Ortega , Yann Falevoz , Renaud Ayrignac

The widespread adoption of Large Language Models (LLMs) has exponentially increased the demand for efficient serving systems. With growing requests and context lengths, key-value (KV)-related operations, including attention computation and…

Hardware Architecture · Computer Science 2026-02-13 Lian Liu , Shixin Zhao , Yutian Zhou , Yintao He , Mengdi Wang , Yinhe Han , Ying Wang

Large language model (LLM) inference has been a prevalent demand in daily life and industries. The large tensor sizes and computing complexities in LLMs have brought challenges to memory, computing, and databus. This paper proposes a…

Hardware Architecture · Computer Science 2025-09-19 Yimin Wang , Yue Jiet Chong , Xuanyao Fong

The rapid advancement of Large Language Models (LLMs) has revolutionized various aspects of human life, yet their immense computational and energy demands pose significant challenges for efficient inference. The memory wall, the growing…

Hardware Architecture · Computer Science 2025-09-18 Hongyi Li , Songchen Ma , Huanyu Qu , Weihao Zhang , Jia Chen , Junfeng Lin , Fengbin Tu , Rong Zhao

Autoregressive decoding of large language models (LLMs) is memory bandwidth bounded, resulting in high latency and significant wastes of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding…

Machine Learning · Computer Science 2024-02-06 Yichao Fu , Peter Bailis , Ion Stoica , Hao Zhang

The expansion of long-context Large Language Models (LLMs) creates significant memory system challenges. While Processing-in-Memory (PIM) is a promising accelerator, we identify that it suffers from critical inefficiencies when scaled to…

Large language models (LLMs) are renowned for their extensive linguistic knowledge and strong generalization capabilities, but their high computational demands make them unsuitable for resource-constrained environments. In contrast, small…

Computation and Language · Computer Science 2025-06-10 Kyeonghyun Kim , Jinhee Jang , Juhwan Choi , Yoonji Lee , Kyohoon Jin , YoungBin Kim

In this study, we reveal an in-context learning (ICL) capability of multilingual large language models (LLMs): by translating the input to several languages, we provide Parallel Input in Multiple Languages (PiM) to LLMs, which significantly…

Computation and Language · Computer Science 2025-06-04 Yongyu Mu , Peinan Feng , Zhiquan Cao , Yuzhang Wu , Bei Li , Chenglong Wang , Tong Xiao , Kai Song , Tongran Liu , Chunliang Zhang , Jingbo Zhu

Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched…

Hardware Architecture · Computer Science 2024-06-21 Guseul Heo , Sangyeop Lee , Jaehong Cho , Hyunmin Choi , Sanghyeon Lee , Hyungkyu Ham , Gwangsun Kim , Divya Mahajan , Jongse Park

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao , Zhiben Chen , Dan Xu , Yuzhang Shang

Large language models (LLMs) have shown impressive capabilities in adapting to various tasks when provided with task-specific instructions. However, LLMs using standard decoding strategies often struggle with deviations from the inputs.…

Computation and Language · Computer Science 2024-06-05 Jinliang Lu , Chen Wang , Jiajun Zhang

The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine neural processing…

Hardware Architecture · Computer Science 2026-05-05 Yuzong Chen , Chao Fang , Xilai Dai , Yuheng Wu , Thierry Tambe , Marian Verhelst , Mohamed S. Abdelfattah

In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential decoding process, which autoregressively generates all labels…

Computation and Language · Computer Science 2024-11-22 Jinghui Lu , Ziwei Yang , Yanjie Wang , Xuejing Liu , Brian Mac Namee , Can Huang

Large Language Models (LLMs) have become extremely potent instruments with exceptional capacities for comprehending and producing human-like text in a wide range of applications. However, the increasing size and complexity of LLMs present…

Machine Learning · Computer Science 2024-06-18 Yingbing Huang , Lily Jiaxin Wan , Hanchen Ye , Manvi Jha , Jinghua Wang , Yuhong Li , Xiaofan Zhang , Deming Chen

Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-19 Stephen Mell , David Mell , Konstantinos Kallas , Steve Zdancewic , Osbert Bastani

The continued success of Large Language Models (LLMs) and other generative artificial intelligence approaches highlights the advantages that large information corpora can have over rigidly defined symbolic models, but also serves as a…

Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. While HBM-based acceleration offers high bandwidth, its capacity…

Hardware Architecture · Computer Science 2025-04-25 Qingyuan Liu , Liyan Chen , Yanning Yang , Haocheng Wang , Dong Du , Zhigang Mao , Naifeng Jing , Yubin Xia , Haibo Chen

Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. In this paper, we present a comprehensive survey of…

Hardware Architecture · Computer Science 2024-09-06 Nikoletta Koilia , Christoforos Kachris

The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these…

Machine Learning · Computer Science 2025-10-01 Hao Mark Chen , Wayne Luk , Ka Fai Cedric Yiu , Rui Li , Konstantin Mishchenko , Stylianos I. Venieris , Hongxiang Fan
‹ Prev 1 2 3 10 Next ›