Related papers: Splitwise: Efficient generative LLM inference usin…

Splitwiser: Efficient LM inference with constrained resources

Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail…

Hardware Architecture · Computer Science 2025-05-08 Asad Aali, Adney Cardoza, Melissa Capo

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements,…

Performance · Computer Science 2024-01-18 Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He

Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing

Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and…

Machine Learning · Computer Science 2025-11-07 Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, Jae-Mo Kang

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari, Yuang Jiang, Leandros Tassiulas

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during…

Computation and Language · Computer Science 2024-11-21 Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui

SplitReason: Learning To Offload Reasoning

Reasoning in large language models (LLMs) tends to produce substantially longer token generation sequences than simpler language modeling tasks. This extended generation length reflects the multi-step, compositional nature of reasoning and…

Computation and Language · Computer Science 2025-04-24 Yash Akhauri, Anthony Fei, Chi-Chih Chang, Ahmed F. AbouElhamayed, Yueying Li, Mohamed S. Abdelfattah

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science 2024-03-05 Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

SPEED: Speculative Pipelined Execution for Efficient Decoding

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

Efficient LLM inference solution on Intel GPU

Transformer-based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference has become a hot topic in real applications. However, LLMs are usually complicatedly designed in model structure with…

Hardware Architecture · Computer Science 2024-06-25 Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao

Batch Prompting: Efficient Inference with Large Language Model APIs

Performing inference on large volumes of samples with large language models (LLMs) can be computationally and financially costly in industry and real-world use. We propose batch prompting, a simple yet effective prompting approach that…

Computation and Language · Computer Science 2023-10-25 Zhoujun Cheng, Jungo Kasai, Tao Yu

Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL

Deploying large language models (LLMs) on edge devices is challenging due to their limited memory and power resources. Cloud-only inference reduces device burden but introduces high latency and cost. Static edge-cloud partitions optimize a…

Machine Learning · Computer Science 2025-12-30 Abolfazl Younesi, Abbas Shabrang Maryan, Elyas Oustad, Zahra Najafabadi Samani, Mohsen Ansari, Thomas Fahringer

SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer,…

Computation and Language · Computer Science 2025-11-25 Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can…

Computational Engineering, Finance, and Science · Computer Science 2024-11-26 Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu

Large Language Model Partitioning for Low-Latency Inference at the Edge

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

With the wide adoption of language models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and therefore, compute…

Information Retrieval · Computer Science 2026-04-06 Cornelius Kummer, Lena Jurkschat, Michael Färber, Sahar Vahdati

Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV…

Machine Learning · Computer Science 2026-03-16 Donglin Yu

Multi-Bin Batching for Increasing LLM Inference Throughput

As large language models (LLMs) grow in popularity for their diverse capabilities, improving the efficiency of their inference systems has become increasingly critical. Batching LLM requests is a critical step in scheduling the inference…

Computation and Language · Computer Science 2024-12-09 Ozgur Guldogan, Jackson Kunde, Kangwook Lee, Ramtin Pedarsani

Large-Scale LLM Inference with Heterogeneous Workloads: Prefill-Decode Contention and Asymptotically Optimal Control

Large Language Models (LLMs) are rapidly becoming critical infrastructure for enterprise applications, driving unprecedented demand for GPU-based inference services. A key operational challenge arises from the two-phase nature of LLM…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-04 Ruihan Lin, Zezhen Ding, Zean Han, Jiheng Zhang

SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs

The quantization of large language models (LLMs) is crucial for deploying them on devices with limited computational resources. While advanced quantization algorithms offer improved performance compared to the basic linear quantization,…

Machine Learning · Computer Science 2025-03-12 Jaewoo Song, Fangzhen Lin

PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference

Large Language Model (LLM) inference uses an autoregressive manner to generate one token at a time, which exhibits notably lower operational intensity compared to earlier Machine Learning (ML) models such as encoder-only transformers and…

Hardware Architecture · Computer Science 2025-05-06 Yufeng Gu, Alireza Khadem, Sumanth Umesh, Ning Liang, Xavier Servot, Onur Mutlu, Ravi Iyer, Reetuparna Das