TP-Aware Dequantization

Adnan Hoque; Mudhakar Srivatsa; Chih-Chieh Yang; Raghu Ganti

Distributed, Parallel, and Cluster Computing · Machine Learning · 2024-02-08 (v1)

Abstract

In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs). Our contribution is an optimized inference deployment scheme that addresses the current limitations of state-of-the-art quantization kernels when used in conjunction with Tensor Parallel (TP). Our method preserves data locality in GPU memory access patterns and exploits a priori knowledge of TP to reduce global communication. We demonstrate a speedup of up to 1.81x over existing methods for Llama-70B and up to 1.78x for IBM WatsonX's Granite-20B MLP layer problem sizes on NVIDIA A100 and H100 DGX systems for a variety of TP settings.
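
The abstract does not spell out the mechanism, so the following is only an illustrative sketch of one way a priori knowledge of the TP layout could remove a runtime reordering step in an MLP block. It assumes (hypothetically) that the second layer's weight rows are stored in a permuted order `perm`, as a quantization scheme that reorders rows might do, and that the first layer is sharded by columns while the second is sharded by rows. Permuting the first layer's columns offline with the same `perm` keeps each rank's intermediate activations aligned with its local weight rows, so only the final sum across ranks is needed. All names (`tp_mlp`, `perm`, `w1_aligned`, `w2_reordered`) are made up for this example and are not taken from the paper.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Elementwise activation; it commutes with any column permutation."""
    return [[max(0.0, v) for v in row] for row in m]

def take_cols(m, idx):
    return [[row[j] for j in idx] for row in m]

def take_rows(m, idx):
    return [m[i] for i in idx]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def tp_mlp(x, w1, w2, ranks):
    """Simulate a TP MLP: column-sharded w1, row-sharded w2, one final sum.

    Each loop iteration stands in for one rank and touches only its own
    shard; the running sum stands in for the single reduction at the end.
    """
    shard = len(w2) // ranks
    total = None
    for r in range(ranks):
        idx = list(range(r * shard, (r + 1) * shard))
        local_hidden = relu(matmul(x, take_cols(w1, idx)))
        partial = matmul(local_hidden, take_rows(w2, idx))
        total = partial if total is None else add(total, partial)
    return total

random.seed(0)
batch, d_in, hidden, d_out, ranks = 2, 4, 8, 3, 2

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(batch, d_in)
w1 = rand_matrix(d_in, hidden)
w2 = rand_matrix(hidden, d_out)

# Hypothetical row reordering imposed on the second layer's stored weights.
perm = list(range(hidden))
random.shuffle(perm)

# Unsharded, unpermuted reference result.
reference = matmul(relu(matmul(x, w1)), w2)

# Offline step: give w1's columns the same order as w2's stored rows.
w1_aligned = take_cols(w1, perm)
w2_reordered = take_rows(w2, perm)

# No runtime reordering or gathering of the intermediate activations.
aligned = tp_mlp(x, w1_aligned, w2_reordered, ranks)

max_err = max(abs(a - b) for ra, rb in zip(reference, aligned) for a, b in zip(ra, rb))
print("max abs difference vs. reference:", max_err)
```

The sharded result matches the reference up to floating-point rounding because summing over hidden units is order-independent. The reordering therefore costs nothing at inference time once it has been folded into the first layer's weights ahead of time.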

Keywords

large language model inference, key-value cache, quantization

Cite

@article{arxiv.2402.04925,
  title  = {TP-Aware Dequantization},
  author = {Adnan Hoque and Mudhakar Srivatsa and Chih-Chieh Yang and Raghu Ganti},
  journal= {arXiv preprint arXiv:2402.04925},
  year   = {2024}
}
