Related papers: Scalable and Cost-Efficient ML Inference: Parallel…

FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication

Serverless computing offers attractive scalability, elasticity and cost-effectiveness. However, constraints on memory, CPU and function runtime have hindered its adoption for data-intensive applications and machine learning (ML) workloads.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-25 Joe Oakley, Hakan Ferhatosmanoglu

Towards Resource-Efficient Serverless LLM Inference with SLINFER

The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Chuhao Xu, Zijun Li, Quan Chen, Han Zhao, Xueyan Tang, Minyi Guo

Cost-Performance Analysis: A Comparative Study of CPU-Based Serverless and GPU-Based Training Architectures

The field of distributed machine learning (ML) faces increasing demands for scalable and cost-effective training solutions, particularly in the context of large, complex models. Serverless computing has emerged as a promising paradigm to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-19 Amine Barrak, Fabio Petrillo, Fehmi Jaafar

Serverless architecture efficiency: an exploratory study

Cloud service providers offer services to incentivize customers to use their platform. Different services can achieve the same result at different costs. In this paper, we study the efficiency of a serverless architecture for running highly…

Software Engineering · Computer Science 2019-01-15 Samuel Lavoie, Anthony Garant, Fabio Petrillo

λScale: Enabling Fast Scaling for Serverless Large Language Model Inference

Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-09 Minchen Yu, Rui Yang, Chaobo Jia, Zhaoyuan Su, Sheng Yao, Tingfeng Lan, Yuchen Yang, Zirui Wang, Yue Cheng, Wei Wang, Ao Wang, Ruichuan Chen

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers,…

Machine Learning · Computer Science 2024-07-26 Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai

A Serverless Architecture for Efficient and Scalable Monte Carlo Markov Chain Computation

Computer power is a constantly increasing demand in scientific data analyses, in particular when Markov Chain Monte Carlo (MCMC) methods are involved, for example for estimating integral functions or Bayesian posterior probabilities. In…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-09 Fabio Castagna, Alberto Trombetta, Marco Landoni, Stefano Andreon

Distributed Double Machine Learning with a Serverless Architecture

This paper explores serverless cloud computing for double machine learning. Being based on repeated cross-fitting, double machine learning is particularly well suited to exploit the high level of parallelism achievable with serverless…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-21 Malte S. Kurz

Serverless Abstractions for Short-Running, Lightweight Streams

Serverless computing and stream processing represent two dominant paradigms for event-driven data processing, yet both make assumptions that render them inefficient for short-running, lightweight, and unpredictable streams that require…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-04 Natalie Carl, Niklas Kowallik, Constantin Stahl, Trever Schirmer, Tobias Pfandzelter, David Bermbach

Resource Allocation in Serverless Query Processing

Data lakes hold a growing amount of cold data that is infrequently accessed, yet require interactive response times. Serverless functions are seen as a way to address this use case since they offer an appealing alternative to maintaining…

Databases · Computer Science 2022-08-23 Simon Kassing, Ingo Müller, Gustavo Alonso

Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-26 Himel Ghosh

ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs

Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless can effectively serve…

Machine Learning · Computer Science 2025-05-21 Yifan Sui, Hao Wang, Hanfei Yu, Yitao Hu, Jianxun Li, Hao Wang

SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning Design and Training

In today's production machine learning (ML) systems, models are continuously trained, improved, and deployed. ML design and training are becoming a continuous workflow of various tasks that have dynamic resource demands. Serverless…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-05 Ahsan Ali, Syed Zawad, Paarijaat Aditya, Istemi Ekin Akkus, Ruichuan Chen, Feng Yan

Efficient Serverless Cold Start: Reducing Library Loading Overhead by Profile-guided Optimization

Serverless computing abstracts away server management, enabling automatic scaling, efficient resource utilization, and cost-effective pricing models. However, despite these advantages, it faces the significant challenge of cold-start…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-29 Syed Salauddin Mohammad Tariq, Ali Al Zein, Soumya Sripad Vaidya, Arati Khanolkar, Zheng Song, Probir Roy

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari, Yuang Jiang, Leandros Tassiulas

Towards Demystifying Intra-Function Parallelism in Serverless Computing

Serverless computing offers a pay-per-use model with high elasticity and automatic scaling for a wide range of applications. Since cloud providers abstract most of the underlying infrastructure, these services work similarly to black boxes.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-10 Michael Kiener, Mohak Chadha, Michael Gerndt

Towards Serverless Processing of Spatiotemporal Big Data Queries

Spatiotemporal data are being produced in continuously growing volumes by a variety of data sources, and a variety of application fields rely on rapid analysis of such data. Existing systems such as PostGIS or MobilityDB usually build on…

Databases · Computer Science 2026-05-21 Diana Baumann, Tim C. Rese, David Bermbach

MLLess: Achieving Cost Efficiency in Serverless Machine Learning Training

Function-as-a-Service (FaaS) has raised a growing interest in how to "tame" serverless computing to enable domain-specific use cases such as data-intensive applications and machine learning (ML), to name a few. Recently, several systems…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-14 Pablo Gimeno Sarroca, Marc Sánchez-Artigas

Serverless Query Processing with Flexible Performance SLAs and Prices

Serverless query processing has become increasingly popular due to its auto-scaling, high elasticity, and pay-as-you-go pricing. It allows cloud data warehouse (or lakehouse) users to focus on data analysis without the burden of managing…

Databases · Computer Science 2024-12-24 Haoqiong Bian, Dongyang Geng, Yunpeng Chai, Anastasia Ailamaki

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and…

Machine Learning · Computer Science 2024-11-26 Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica