Related papers: FaaSTube: Optimizing GPU-oriented Data Transfer fo…

Towards Fast Setup and High Throughput of GPU Serverless Computing

Integrating GPUs into serverless computing platforms is crucial for improving efficiency. However, existing solutions for GPU-enabled serverless computing platforms face two significant problems due to coarse-grained GPU management: long…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-24 Han Zhao, Weihao Cui, Quan Chen, Shulai Zhang, Zijun Li, Jingwen Leng, Chao Li, Deze Zeng, Minyi Guo

FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference

Serverless computing (FaaS) has been extensively utilized for deep learning (DL) inference due to the ease of deployment and pay-per-use benefits. However, existing FaaS platforms utilize GPUs in a coarse manner for DL inferences, without…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-04 Jianfeng Gu, Yichao Zhu, Puxuan Wang, Mohak Chadha, Michael Gerndt

Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference

Serverless computing offers a compelling cloud model for online inference services. However, existing serverless platforms lack efficient support for GPUs, hindering their ability to deliver high-performance inference. In this paper, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-09 Minchen Yu, Ao Wang, Dong Chen, Haoxuan Yu, Xiaonan Luo, Zhuohao Li, Wei Wang, Ruichuan Chen, Dapeng Nie, Haoran Yang, Yu Ding

A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data

The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leveraging GPU parallelism for this task…

Databases · Computer Science 2025-11-12 Zheng Li, Weiyan Wang, Ruiyuan Li, Chao Chen, Xianlei Long, Linjiang Zheng, Quanqing Xu, Chuanhui Yang

HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences

Serverless Computing (FaaS) has become a popular paradigm for deep learning inference due to the ease of deployment and pay-per-use benefits. However, current serverless inference platforms encounter the coarse-grained and static GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-03 Jianfeng Gu, Puxuan Wang, Isaac David Nunez Araya, Kai Huang, Michael Gerndt

FaaSdom: A Benchmark Suite for Serverless Computing

Serverless computing has become a major trend among cloud providers. With serverless computing, developers fully delegate the task of managing the servers, dynamically allocating the required resources, as well as handling availability and…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-08 Pascal Maissen, Pascal Felber, Peter Kropf, Valerio Schiavoni

Queueing Analysis of GPU-Based Inference Servers with Dynamic Batching: A Closed-Form Characterization

GPU-accelerated computing is a key technology to realize high-speed inference servers using deep neural networks (DNNs). An important characteristic of GPU-based inference is that the computational efficiency, in terms of the processing…

Performance · Computer Science 2021-01-13 Yoshiaki Inoue

FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication

Serverless computing offers attractive scalability, elasticity and cost-effectiveness. However, constraints on memory, CPU and function runtime have hindered its adoption for data-intensive applications and machine learning (ML) workloads.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-25 Joe Oakley, Hakan Ferhatosmanoglu

Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation

Industrial and government organizations increasingly depend on data-driven analytics for workforce, finance, and regulated decision processes, where timeliness, cost efficiency, and compliance are critical. Distributed frameworks such as…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-02 Guilin Zhang, Wulan Guo, Ziqi Tan, Srinivas Vippagunta, Suchitra Raman, Shreeshankar Chatterjee, Ju Lin, Shang Liu, Mary Schladenhauffen, Jeffrey Luo, Hailong Jiang

Kernel-as-a-Service: A Serverless Interface to GPUs

Serverless computing has made it easier than ever to deploy applications over scalable cloud resources, all the while driving higher utilization for cloud providers. While this technique has worked well for easily divisible resources like…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-19 Nathan Pemberton, Anton Zabreyko, Zhoujie Ding, Randy Katz, Joseph Gonzalez

GPU-enabled Function-as-a-Service for Machine Learning Inference

Function-as-a-Service (FaaS) is emerging as an important cloud computing service model as it can improve the scalability and usability of a wide range of applications, especially Machine-Learning (ML) inference tasks that require scalable…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-13 Ming Zhao, Kritshekhar Jha, Sungho Hong

A Server-based Approach for Predictable GPU Access with Improved Analysis

We propose a server-based approach to manage a general-purpose graphics processing unit (GPU) in a predictable and efficient manner. Our proposed approach introduces a GPU server that is a dedicated task to handle GPU requests from other…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-14 Hyoseung Kim, Pratyush Patel, Shige Wang, Ragunathan Rajkumar

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers,…

Machine Learning · Computer Science 2024-07-26 Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai

A GPU Based Memory Optimized Parallel Method For FFT Implementation

FFT (fast Fourier transform) plays a very important role in many fields, such as digital signal processing, digital image processing and so on. However, in application, FFT becomes a factor of affecting the processing efficiency, especially…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-25 Fan Zhang, Chen Hu, Qiang Yin, Wei Hu

Faasm: Lightweight Isolation for Efficient Stateful Serverless Computing

Serverless computing is an excellent fit for big data processing because it can scale quickly and cheaply to thousands of parallel functions. Existing serverless platforms isolate functions in ephemeral, stateless containers, preventing…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-22 Simon Shillaker, Peter Pietzuch

Serverless inferencing on Kubernetes

Organisations are increasingly putting machine learning models into production at scale. The increasing popularity of serverless scale-to-zero paradigms presents an opportunity for deploying machine learning models to help mitigate…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-27 Clive Cox, Dan Sun, Ellis Tarn, Animesh Singh, Rakesh Kelkar, David Goodwin

A readahead prefetcher for GPU file system layer

GPUs are broadly used in I/O-intensive big data applications. Prior works demonstrate the benefits of using GPU-side file system layer, GPUfs, to improve the GPU performance and programmability in such workloads. However, GPUfs fails to…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-14 Vasilis Dimitsas, Mark Silberstein

Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-26 Himel Ghosh

λScale: Enabling Fast Scaling for Serverless Large Language Model Inference

Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-09 Minchen Yu, Rui Yang, Chaobo Jia, Zhaoyuan Su, Sheng Yao, Tingfeng Lan, Yuchen Yang, Zirui Wang, Yue Cheng, Wei Wang, Ao Wang, Ruichuan Chen

A Survey of Serverless Machine Learning Model Inference

Recent developments in Generative AI, Computer Vision, and Natural Language Processing have led to an increased integration of AI models into various products. This widespread adoption of AI requires significant efforts in deploying these…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-23 Kamil Kojs