Artificial Intelligence · Computer Science
Inference Performance Optimization for Large Language Models on CPUs
Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li +6
2024-07-11
Machine Learning · Computer Science
Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization
Jinhao Li, Jiaming Xu, Shiyao Li, Shan Huang +3
2024-11-12
Machine Learning · Computer Science
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen +9
2024-03-05
Distributed, Parallel, and Cluster Computing · Computer Science
Large Language Model Partitioning for Low-Latency Inference at the Edge
Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos
2025-05-06
Artificial Intelligence · Computer Science
Towards Low-bit Communication for Tensor Parallel LLM Inference
Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush
2024-11-13
Machine Learning · Computer Science
Communication Compression for Tensor Parallel LLM Inference
Jan Hansen-Palmus, Michael Truong Le, Oliver Hausdörfer, Alok Verma
2026-01-07
Operating Systems · Computer Science
RTP-LLM: High-Performance Alibaba LLM Inference Engine
Boyu Tan, Jiarui Guo, Zongwei Lv, Hanbo Sun +25
2026-05-29
Machine Learning · Computer Science
Efficient LLM Inference on CPUs
Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo +1
2023-12-08
Distributed, Parallel, and Cluster Computing · Computer Science
Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks
Burak Topcu, Musa Oguzhan Cim, Poovaiah Palangappa, Meena Arunachalam +1
2026-03-09
Machine Learning · Computer Science
LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin +1
2024-03-05
Machine Learning · Computer Science
FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
Yi Zhang, Fei Yang, Shuang Peng, Fangyu Wang +1
2024-02-29
Machine Learning · Computer Science
Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang
2025-03-14
Distributed, Parallel, and Cluster Computing · Computer Science
Parallax: Efficient LLM Inference Service over Decentralized Environment
Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao +5
2025-10-01
Machine Learning · Computer Science
FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai +3
2025-11-04
Machine Learning · Computer Science
Distributed Inference and Fine-tuning of Large Language Models Over The Internet
Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk +4
2023-12-14
Distributed, Parallel, and Cluster Computing · Computer Science
Characterizing Communication Patterns in Distributed Large Language Model Inference
Lang Xu, Kaushik Kandadi Suresh, Quentin Anthony, Nawras Alnaasan +1
2025-07-22
Distributed, Parallel, and Cluster Computing · Computer Science
Enabling Dynamic Sparsity in Quantized LLM Inference
Rongxiang Wang, Kangyuan Shu, Felix Xiaozhu Lin
2025-11-07
Machine Learning · Computer Science
LatentLLM: Attention-Aware Joint Tensor Compression
Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang +3
2025-05-27
Computation and Language · Computer Science
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs
Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan +2
2024-07-08
Machine Learning · Computer Science
SparQ Attention: Bandwidth-Efficient LLM Inference
Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake +2
2024-09-05