Artificial Intelligence · Computer Science
Towards Low-bit Communication for Tensor Parallel LLM Inference
Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush
2024-11-13
Artificial Intelligence · Computer Science
Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference
Qingyuan Li, Bo Zhang, Liang Ye, Yifan Zhang +4
2024-12-12
Computation and Language · Computer Science
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang +1
2023-12-07
Computation and Language · Computer Science
When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models
Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du +2
2025-02-24
Distributed, Parallel, and Cluster Computing · Computer Science
Distributed On-Device LLM Inference With Over-the-Air Computation
Kai Zhang, Hengtao He, Shenghui Song, Jun Zhang +1
2025-02-19
Information Retrieval · Computer Science
Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
Cornelius Kummer, Lena Jurkschat, Michael Färber, Sahar Vahdati
2026-04-06
Distributed, Parallel, and Cluster Computing · Computer Science
Characterizing Communication Patterns in Distributed Large Language Model Inference
Lang Xu, Kaushik Kandadi Suresh, Quentin Anthony, Nawras Alnaasan +1
2025-07-22
Distributed, Parallel, and Cluster Computing · Computer Science
Communication-Efficient Distributed On-Device LLM Inference Over Wireless Networks
Kai Zhang, Hengtao He, Shenghui Song, Jun Zhang +1
2025-03-20
Distributed, Parallel, and Cluster Computing · Computer Science
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
Man Liu, Xingchen Liu, Xingjian Tian, Bing Lu +7
2026-04-28
Computation and Language · Computer Science
Compressing Large Language Models with Automated Sub-Network Search
Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein
2025-02-06
Machine Learning · Computer Science
LatentLLM: Attention-Aware Joint Tensor Compression
Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang +3
2025-05-27
Machine Learning · Computer Science
Does compressing activations help model parallel training?
Song Bian, Dacheng Li, Hongyi Wang, Eric P. Xing +1
2023-01-09
Machine Learning · Computer Science
Tensor-Parallelism with Partially Synchronized Activations
Itay Lamprecht, Asaf Karnieli, Yair Hanani, Niv Giladi +1
2025-12-02
Computation and Language · Computer Science
An Empirical Study on Prompt Compression for Large Language Models
Zheng Zhang, Jinyi Li, Yihuai Lan, Xiang Wang +1
2025-05-02
Machine Learning · Computer Science
FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
Yi Zhang, Fei Yang, Shuang Peng, Fangyu Wang +1
2024-02-29
Computation and Language · Computer Science
Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models
Zihao Xu, John Harvill, Ziwei Fan, Yizhou Sun +2
2026-04-23
Computation and Language · Computer Science
LightThinker: Thinking Step-by-Step Compression
Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo +5
2025-09-24
Hardware Architecture · Computer Science
ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition
Keran Zheng, Yinting Huang, Zhewen Yu, Christos-Savvas Bouganis
2025-05-15
Machine Learning · Computer Science
EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices
Arnab Sanyal, Gourav Datta, Prithwish Mukherjee, Sandeep P. Chinchali +1
2026-05-05
Computation and Language · Computer Science
CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks
Andrei Tomut, Saeed S. Jahromi, Abhijoy Sarkar, Uygar Kurt +14
2025-06-03
Distributed, Parallel, and Cluster Computing · Computer Science
Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems
Haowei Yang, Yu Tian, Zhongheng Yang, Zhao Wang +2
2025-06-25
Machine Learning · Computer Science
On the Compressibility of Quantized Large Language Models
Yu Mao, Weilan Wang, Hongchao Du, Nan Guan +1
2024-05-07