English
Related papers

Related papers: TP-Aware Dequantization

200 papers

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry.…

Artificial Intelligence · Computer Science 2024-07-11 Pujiang He , Shan Zhou , Wenhuan Huang , Changqing Li , Duyi Wang , Bin Guo , Chen Meng , Sheng Gui , Weifei Yu , Yi Xie

Large language models (LLMs) have demonstrated impressive abilities in various domains while the inference cost is expensive. Many previous studies exploit quantization methods to reduce LLM inference cost by reducing latency and memory…

Machine Learning · Computer Science 2024-11-12 Jinhao Li , Jiaming Xu , Shiyao Li , Shan Huang , Jun Liu , Yaoxiu Lian , Guohao Dai

Large language models (LLMs) have grown beyond the memory capacity of single GPU devices, necessitating quantization techniques for practical deployment. While NF4 (4-bit NormalFloat) quantization enables 4$\times$ memory reduction,…

Machine Learning · Computer Science 2026-04-06 Xiangbo Qi , Chaoyi Jiang , Murali Annavaram

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6…

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis , Ramin Khalili , Iordanis Koutsopoulos

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang , Anshumali Shrivastava

As large language models (LLMs) grow in size and deployment scale, quantization has become an essential technique for reducing memory footprint and improving inference efficiency. However, existing quantization toolkits often lack…

Machine Learning · Computer Science 2025-12-01 Dong Liu , Yanxuan Yu

Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference poses challenges…

Machine Learning · Computer Science 2024-06-21 Jungi Lee , Wonbeom Lee , Jaewoong Sim

Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, they will need to be…

Artificial Intelligence · Computer Science 2024-11-13 Harry Dong , Tyler Johnson , Minsik Cho , Emad Soroush

Large Language Models (LLMs) have pushed the frontier of artificial intelligence but are comprised of hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators…

Machine Learning · Computer Science 2026-01-07 Jan Hansen-Palmus , Michael Truong Le , Oliver Hausdörfer , Alok Verma

Large Language Models (LLMs) have revolutionized AI applications, but deploying them at scale presents significant challenges. We present RTP-LLM, a high-performance inference engine for industrial-scale LLM deployment, successfully…

Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which…

Machine Learning · Computer Science 2023-12-08 Haihao Shen , Hanwen Chang , Bo Dong , Yu Luo , Hengyu Meng

Breakthroughs in the generative AI domain have fueled an explosion of large language model (LLM)-powered applications, whose workloads fundamentally consist of sequences of inferences through transformer architectures. Within this rapidly…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-09 Burak Topcu , Musa Oguzhan Cim , Poovaiah Palangappa , Meena Arunachalam , Mahmut Taylan Kandemir

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science 2024-03-05 Juntao Zhao , Borui Wan , Yanghua Peng , Haibin Lin , Chuan Wu

Large language models (LLMs) have demonstrated state-of-the-art performance across various tasks. However, the latency of inference and the large GPU memory consumption of LLMs restrict their deployment performance. Recently, there have…

Machine Learning · Computer Science 2024-02-29 Yi Zhang , Fei Yang , Shuang Peng , Fangyu Wang , Aimin Pan

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor…

Machine Learning · Computer Science 2025-03-14 Shaobo Ma , Chao Fang , Haikuo Shao , Zhongfeng Wang

Deploying a large language model (LLM) inference service remains costly because centralized serving depends on specialized GPU clusters and high-bandwidth interconnects in datacenters. An appealing alternative is to leverage collaborative…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-01 Chris Tong , Youhe Jiang , Gufeng Chen , Tianyi Zhao , Sibian Lu , Wenjie Qu , Eric Yang , Lynn Ai , Binhang Yuan

Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade…

Machine Learning · Computer Science 2025-11-04 Hao Zhang , Aining Jia , Weifeng Bu , Yushu Cai , Kai Sheng , Hao Chen , Xin He

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them…

Large Language Models (LLMs) built on transformer architectures have transformed natural language processing, achieving remarkable performance across diverse applications. While distributed inference frameworks enable practical deployment…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-22 Lang Xu , Kaushik Kandadi Suresh , Quentin Anthony , Nawras Alnaasan , Dhabaleswar K. Panda
‹ Prev 1 2 3 10 Next ›