Related papers: Communication Compression for Tensor Parallel LLM …

Towards Low-bit Communication for Tensor Parallel LLM Inference

Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, they will need to be…

Artificial Intelligence · Computer Science 2024-11-13 Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush

Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference

The ever-increasing sizes of large language models necessitate distributed solutions for fast inference that exploit multi-dimensional parallelism, where computational loads are split across various accelerators such as GPU clusters.…

Artificial Intelligence · Computer Science 2024-12-12 Qingyuan Li, Bo Zhang, Liang Ye, Yifan Zhang, Wei Wu, Yerui Sun, Lin Ma, Yuchen Xie

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs…

Computation and Language · Computer Science 2023-12-07 Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu

When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a…

Computation and Language · Computer Science 2025-02-24 Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue

Distributed On-Device LLM Inference With Over-the-Air Computation

Large language models (LLMs) have achieved remarkable success across various artificial intelligence tasks. However, their enormous sizes and computational demands pose significant challenges for the deployment on edge devices. To address…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-19 Kai Zhang, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief

Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

With the wide adoption of language models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and, therefore, compute…

Information Retrieval · Computer Science 2026-04-06 Cornelius Kummer, Lena Jurkschat, Michael Färber, Sahar Vahdati

Characterizing Communication Patterns in Distributed Large Language Model Inference

Large Language Models (LLMs) built on transformer architectures have transformed natural language processing, achieving remarkable performance across diverse applications. While distributed inference frameworks enable practical deployment…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-22 Lang Xu, Kaushik Kandadi Suresh, Quentin Anthony, Nawras Alnaasan, Dhabaleswar K. Panda

Communication-Efficient Distributed On-Device LLM Inference Over Wireless Networks

Large language models (LLMs) have demonstrated remarkable success across various application domains, but their enormous sizes and computational demands pose significant challenges for deployment on resource-constrained edge devices. To…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-20 Kai Zhang, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

Handling communication overhead in large-scale tensor-parallel training remains a critical challenge due to the dense, near-zero distributions of intermediate tensors, which exacerbate errors under frequent communication and introduce…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-28 Man Liu, Xingchen Liu, Xingjian Tian, Bing Lu, Shengkay Lyu, Shengquan Yin, Wenjing Huang, Zheng Wei, Hairui Zhao, Guangming Tan, Dingwen Tao

Compressing Large Language Models with Automated Sub-Network Search

Large Language Models (LLMs) demonstrate exceptional reasoning abilities, enabling strong generalization across diverse tasks such as commonsense reasoning and instruction following. However, as LLMs scale, inference costs become…

Computation and Language · Computer Science 2025-02-06 Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein

LatentLLM: Attention-Aware Joint Tensor Compression

Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension…

Machine Learning · Computer Science 2025-05-27 Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang, Pu Wang, Matthew Brand

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the…

Machine Learning · Computer Science 2024-04-10 Georgy Tyukin

Does compressing activations help model parallel training?

Large-scale Transformer models are known for their exceptional performance in a range of tasks, but training them can be difficult due to the requirement for communication-intensive model parallelism. One way to improve training speed is to…

Machine Learning · Computer Science 2023-01-09 Song Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman

Tensor-Parallelism with Partially Synchronized Activations

Training and inference of Large Language Models (LLMs) with tensor-parallelism requires substantial communication to synchronize activations. Our findings suggest that with a few minor adjustments to current practices, LLMs can be trained…

Machine Learning · Computer Science 2025-12-02 Itay Lamprecht, Asaf Karnieli, Yair Hanani, Niv Giladi, Daniel Soudry

Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization

Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference poses challenges…

Machine Learning · Computer Science 2024-06-21 Jungi Lee, Wonbeom Lee, Jaewoong Sim

An Empirical Study on Prompt Compression for Large Language Models

Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression…

Computation and Language · Computer Science 2025-05-02 Zheng Zhang, Jinyi Li, Yihuai Lan, Xiang Wang, Hao Wang

FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization

Large language models (LLMs) have demonstrated state-of-the-art performance across various tasks. However, the latency of inference and the large GPU memory consumption of LLMs restrict their deployment performance. Recently, there have…

Machine Learning · Computer Science 2024-02-29 Yi Zhang, Fei Yang, Shuang Peng, Fangyu Wang, Aimin Pan

Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the…

Computation and Language · Computer Science 2026-04-23 Zihao Xu, John Harvill, Ziwei Fan, Yizhou Sun, Hao Ding, Hao Wang

LightThinker: Thinking Step-by-Step Compression

Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we…

Computation and Language · Computer Science 2025-09-24 Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang

ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition

Recent advancements in Large Language Models (LLMs) have demonstrated impressive capabilities as their scale expands to billions of parameters. Deploying these large-scale models on resource-constrained platforms presents significant…

Hardware Architecture · Computer Science 2025-05-15 Keran Zheng, Yinting Huang, Zhewen Yu, Christos-Savvas Bouganis