English
Related papers

Related papers: Splitwise: Efficient generative LLM inference usin…

200 papers

Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail…

Hardware Architecture · Computer Science 2025-05-08 Asad Aali , Adney Cardoza , Melissa Capo

The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements,…

Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and…

Machine Learning · Computer Science 2025-11-07 Mingyu Sung , Vikas Palakonda , Suhwan Im , Sunghwan Moon , Il-Min Kim , Sangseok Yun , Jae-Mo Kang

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari , Yuang Jiang , Leandros Tassiulas

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during…

Computation and Language · Computer Science 2024-11-21 Sean Welleck , Amanda Bertsch , Matthew Finlayson , Hailey Schoelkopf , Alex Xie , Graham Neubig , Ilia Kulikov , Zaid Harchaoui

Reasoning in large language models (LLMs) tends to produce substantially longer token generation sequences than simpler language modeling tasks. This extended generation length reflects the multi-step, compositional nature of reasoning and…

Computation and Language · Computer Science 2025-04-24 Yash Akhauri , Anthony Fei , Chi-Chih Chang , Ahmed F. AbouElhamayed , Yueying Li , Mohamed S. Abdelfattah

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science 2024-03-05 Juntao Zhao , Borui Wan , Yanghua Peng , Haibin Lin , Chuan Wu

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper , Sehoon Kim , Hiva Mohammadzadeh , Hasan Genc , Kurt Keutzer , Amir Gholami , Sophia Shao

Transformer based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference becomes hot topic in real applications. However, LLMs are usually complicatedly designed in model structure with…

Hardware Architecture · Computer Science 2024-06-25 Hui Wu , Yi Gan , Feng Yuan , Jing Ma , Wei Zhu , Yutao Xu , Hong Zhu , Yuhua Zhu , Xiaoli Liu , Jinghui Gu , Peng Zhao

Performing inference on large volumes of samples with large language models (LLMs) can be computationally and financially costly in industry and real-world use. We propose batch prompting, a simple yet effective prompting approach that…

Computation and Language · Computer Science 2023-10-25 Zhoujun Cheng , Jungo Kasai , Tao Yu

Deploying large language models (LLMs) on edge devices is challenging due to their limited memory and power resources. Cloud-only inference reduces device burden but introduces high latency and cost. Static edge-cloud partitions optimize a…

Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer,…

Computation and Language · Computer Science 2025-11-25 Lingkun Long , Rubing Yang , Yushi Huang , Desheng Hui , Ao Zhou , Jianlei Yang

Large language models~(LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can…

Computational Engineering, Finance, and Science · Computer Science 2024-11-26 Wenxiang Lin , Xinglin Pan , Shaohuai Shi , Xuan Wang , Xiaowen Chu

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis , Ramin Khalili , Iordanis Koutsopoulos

With the wide adoption of language models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead large prompts and therefore, compute…

Information Retrieval · Computer Science 2026-04-06 Cornelius Kummer , Lena Jurkschat , Michael Färber , Sahar Vahdati

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV…

Machine Learning · Computer Science 2026-03-16 Donglin Yu

As large language models (LLMs) grow in popularity for their diverse capabilities, improving the efficiency of their inference systems has become increasingly critical. Batching LLM requests is a critical step in scheduling the inference…

Computation and Language · Computer Science 2024-12-09 Ozgur Guldogan , Jackson Kunde , Kangwook Lee , Ramtin Pedarsani

Large Language Models (LLMs) are rapidly becoming critical infrastructure for enterprise applications, driving unprecedented demand for GPU-based inference services. A key operational challenge arises from the two-phase nature of LLM…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-04 Ruihan Lin , Zezhen Ding , Zean Han , Jiheng Zhang

The quantization of large language models (LLMs) is crucial for deploying them on devices with limited computational resources. While advanced quantization algorithms offer improved performance compared to the basic linear quantization,…

Machine Learning · Computer Science 2025-03-12 Jaewoo Song , Fangzhen Lin

Large Language Model (LLM) inference uses an autoregressive manner to generate one token at a time, which exhibits notably lower operational intensity compared to earlier Machine Learning (ML) models such as encoder-only transformers and…

Hardware Architecture · Computer Science 2025-05-06 Yufeng Gu , Alireza Khadem , Sumanth Umesh , Ning Liang , Xavier Servot , Onur Mutlu , Ravi Iyer , Reetuparna Das
‹ Prev 1 2 3 10 Next ›