Approximate Caching for Efficiently Serving Diffusion Models

Computer Vision and Pattern Recognition 2023-12-08 v1

Abstract

Text-to-image generation using diffusion models has seen explosive popularity owing to their ability to produce high-quality images that adhere to text prompts. However, production-grade diffusion model serving is a resource-intensive task that not only requires expensive high-end GPUs but also incurs considerable latency. In this paper, we introduce a technique called approximate caching that reduces the number of iterative denoising steps needed to generate an image for a prompt by reusing intermediate noise states created during a prior image generation for a similar prompt. Based on this idea, we present an end-to-end text-to-image system, Nirvana, that uses approximate caching with a novel cache management policy, Least Computationally Beneficial and Frequently Used (LCBFU), to provide % GPU compute savings, 19.8% end-to-end latency reduction, and 19% dollar savings, on average, on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity, and reuse of intermediate states in a large production environment.
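
The core idea of reusing an intermediate noise state from a similar earlier prompt can be illustrated with a minimal Python sketch. This is not Nirvana's implementation: the class `ApproximateCache`, the function `generate`, the similarity thresholds, and the mapping from similarity to skipped steps are all hypothetical, prompt embeddings are assumed to be plain lists of floats, and the LCBFU eviction policy is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ApproximateCache:
    """Toy cache: prompt embedding -> {step index: intermediate noise state}."""

    def __init__(self, thresholds):
        # thresholds: (minimum similarity, steps to skip) pairs.
        # The values are illustrative assumptions, not taken from the paper.
        self.thresholds = sorted(thresholds, reverse=True)
        self.entries = []  # list of (embedding, states_by_step)

    def put(self, embedding, states_by_step):
        self.entries.append((embedding, states_by_step))

    def lookup(self, embedding):
        """Return (steps_skipped, state), or (0, None) on a miss."""
        best_sim, best_states = 0.0, None
        for emb, states in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best_sim, best_states = sim, states
        if best_states is None:
            return 0, None
        # A closer match allows resuming from a later denoising step.
        for min_sim, skip in self.thresholds:
            if best_sim >= min_sim and skip in best_states:
                return skip, best_states[skip]
        return 0, None


def generate(embedding, cache, total_steps, denoise_step, initial_noise):
    """Run only the denoising steps that the cache could not supply."""
    skip, state = cache.lookup(embedding)
    if state is None:
        skip, state = 0, initial_noise
    for t in range(skip, total_steps):
        state = denoise_step(state, t, embedding)
    return state, total_steps - skip
```

In this sketch, a near-identical prompt resumes from a late cached state and runs few remaining steps, a loosely similar prompt resumes from an earlier state, and an unrelated prompt falls back to full generation from fresh noise.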

Cite

@article{arxiv.2312.04429,
  title  = {Approximate Caching for Efficiently Serving Diffusion Models},
  author = {Shubham Agarwal and Subrata Mitra and Sarthak Chakraborty and Srikrishna Karanam and Koyel Mukherjee and Shiv Saini},
  journal= {arXiv preprint arXiv:2312.04429},
  year   = {2023}
}

Comments

Accepted at NSDI'24
