Optimizing Inference in Transformer-Based Models: A Multi-Method Benchmark

Machine Learning 2025-09-24 v2

Abstract

Efficient inference is a critical challenge in deep generative modeling, particularly as diffusion models grow in capacity and complexity. Increased complexity often improves accuracy, but it also raises compute cost, latency, and memory requirements. This work investigates pruning, quantization, knowledge distillation, and simplified attention as ways to reduce computational overhead without degrading model quality. The study also explores the Mixture of Experts (MoE) approach to further improve efficiency. The experiments provide insights into optimizing inference for the state-of-the-art Fast Diffusion Transformer (fast-DiT) model.

Cite

@article{arxiv.2509.17894,
  title  = {Optimizing Inference in Transformer-Based Models: A Multi-Method Benchmark},
  author = {Siu Hang Ho and Prasad Ganesan and Nguyen Duong and Daniel Schlabig},
  journal= {arXiv preprint arXiv:2509.17894},
  year   = {2025}
}

Comments

6 pages, 4 figures. Technical report