Related papers: Learning Runtime Parameters in Computer Systems wi…

Enhancing Kubernetes Automated Scheduling with Deep Learning and Reinforcement Techniques for Large-Scale Cloud Computing Optimization

With the continuous expansion of the scale of cloud computing applications, artificial intelligence technologies such as Deep Learning and Reinforcement Learning have gradually become the key tools to solve the automated task scheduling of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-03-14 · Zheng Xu, Yulu Gong, Yanlin Zhou, Qiaozhi Bao, Wenpin Qian

A Deep Reinforcement Learning based Algorithm for Time and Cost Optimized Scaling of Serverless Applications

Serverless computing has gained a strong traction in the cloud computing community in recent years. Among the many benefits of this novel computing model, the rapid auto-scaling capability of user applications takes prominence. However, the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-08-23 · Anupama Mampage, Shanika Karunasekera, Rajkumar Buyya

Learning to Code: Coded Caching via Deep Reinforcement Learning

We consider a system comprising a file library and a network with a server and multiple users equipped with cache memories. The system operates in two phases: a prefetching phase, where users load their caches with parts of contents from…

Information Theory · Computer Science · 2019-12-11 · Navid Naderializadeh, Seyed Mohammad Asghari

An Advanced Reinforcement Learning Framework for Online Scheduling of Deferrable Workloads in Cloud Computing

Efficient resource utilization and perfect user experience usually conflict with each other in cloud computing platforms. Great efforts have been invested in increasing resource utilization but trying not to affect users' experience for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-06-04 · Hang Dong, Liwen Zhu, Zhao Shan, Bo Qiao, Fangkai Yang, Si Qin, Chuan Luo, Qingwei Lin, Yuwen Yang, Gurpreet Virdi, Saravan Rajmohan, Dongmei Zhang, Thomas Moscibroda

Time-Based Roofline for Deep Learning Performance Analysis

Deep learning applications are usually very compute-intensive and require a long run time for training and inference. This has been tackled by researchers from both hardware and software sides, and in this paper, we propose a Roofline-based…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-09-24 · Yunsong Wang, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, Samuel Williams

Deep Reinforcement Learning for Multi-Resource Multi-Machine Job Scheduling

Minimizing job scheduling time is a fundamental issue in data center networks that has been extensively studied in recent years. The incoming jobs require different CPU and memory units, and span different number of time slots. The…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-11-21 · Weijia Chen, Yuedong Xu, Xiaofeng Wu

Multi-Objective Adaptive Rate Limiting in Microservices Using Deep Reinforcement Learning

As cloud computing and microservice architectures become increasingly prevalent, API rate limiting has emerged as a critical mechanism for ensuring system stability and service quality. Traditional rate limiting algorithms, such as token…

Machine Learning · Computer Science · 2025-11-06 · Ning Lyu, Yuxi Wang, Ziyu Cheng, Qingyuan Zhang, Feng Chen

Deep Reinforcement Learning for Adaptive Caching in Hierarchical Content Delivery Networks

Caching is envisioned to play a critical role in next-generation content delivery infrastructure, cellular networks, and Internet architectures. By smartly storing the most popular contents at the storage-enabled network entities during…

Information Theory · Computer Science · 2019-07-12 · Alireza Sadeghi, Gang Wang, Georgios B. Giannakis

Queue-Learning: A Reinforcement Learning Approach for Providing Quality of Service

End-to-end delay is a critical attribute of quality of service (QoS) in application domains such as cloud computing and computer networks. This metric is particularly important in tandem service systems, where the end-to-end service is…

Machine Learning · Computer Science · 2021-01-13 · Majid Raeis, Ali Tizghadam, Alberto Leon-Garcia

Deep Reinforcement Learning for Job Scheduling and Resource Management in Cloud Computing: An Algorithm-Level Review

Cloud computing has revolutionized the provisioning of computing resources, offering scalable, flexible, and on-demand services to meet the diverse requirements of modern applications. At the heart of efficient cloud operations are job…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-01-03 · Yan Gu, Zhaoze Liu, Shuhong Dai, Cong Liu, Ying Wang, Shen Wang, Georgios Theodoropoulos, Long Cheng

Efficient Data-Parallel Continual Learning with Asynchronous Distributed Rehearsal Buffers

Deep learning has emerged as a powerful method for extracting valuable information from large volumes of data. However, when new training data arrives continuously (i.e., is not fully available from the beginning), incremental training…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-06-06 · Thomas Bouvier, Bogdan Nicolae, Hugo Chaugier, Alexandru Costan, Ian Foster, Gabriel Antoniu

A Stochastic Approximation Approach for Foresighted Task Scheduling in Cloud Computing

With the increasing and elastic demand for cloud resources, finding an optimal task scheduling mechanism becomes a challenge for cloud service providers. Due to the time-varying nature of resource demands in length and processing over time…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-04-14 · Seyedakbar Mostafavi, Vesal Hakami

Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems

With the rapid growth of Internet services, recommendation systems play a central role in delivering personalized content. Faced with massive user requests and complex model architectures, the key challenge for real-time recommendation…

Information Retrieval · Computer Science · 2025-08-14 · Junli Shao, Jing Dong, Dingzhou Wang, Kowei Shih, Dannier Li, Chengrui Zhou

Automated Cloud Provisioning on AWS using Deep Reinforcement Learning

As the use of cloud computing continues to rise, controlling cost becomes increasingly important. Yet there is evidence that 30%-45% of cloud spend is wasted. Existing tools for cloud provisioning typically rely on highly trained human…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-09-20 · Zhiguang Wang, Chul Gwon, Tim Oates, Adam Iezzi

Reinforcement Learning Testbed for Power-Consumption Optimization

Common approaches to control a data-center cooling system rely on approximated system/environment models that are built upon the knowledge of mechanical cooling and electrical and thermal management. These models are difficult to design and…

Systems and Control · Computer Science · 2018-08-31 · Takao Moriyama, Giovanni De Magistris, Michiaki Tatsubori, Tu-Hoa Pham, Asim Munawar, Ryuki Tachibana

Reinforcement Learning Based Approaches to Adaptive Context Caching in Distributed Context Management Systems

Performance metrics-driven context caching has a profound impact on throughput and response time in distributed context management systems for real-time context queries. This paper proposes a reinforcement learning based approach to…

Systems and Control · Electrical Eng. & Systems · 2023-02-10 · Shakthi Weerasinghe, Arkady Zaslavsky, Seng W. Loke, Amin Abken, Alireza Hassani

Deep-learning enhancement of large scale numerical simulations

Traditional simulations on High-Performance Computing (HPC) systems typically involve modeling very large domains and/or very complex equations. HPC systems allow running large models, but limits in performance increase that have become…

Computational Engineering, Finance, and Science · Computer Science · 2020-04-08 · Caspar van Leeuwen, Damian Podareanu, Valeriu Codreanu, Maxwell X. Cai, Axel Berg, Simon Portegies Zwart, Robin Stoffer, Menno Veerman, Chiel van Heerwaarden, Sydney Otten, Sascha Caron, Cunliang Geng, Francesco Ambrosetti, Alexandre M. J. J. Bonvin

Magpie: Automatically Tuning Static Parameters for Distributed File Systems using Deep Reinforcement Learning

Distributed file systems are widely used nowadays, yet using their default configurations is often not optimal. At the same time, tuning configuration parameters is typically challenging and time-consuming. It demands expertise and tuning…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-07-25 · Houkun Zhu, Dominik Scheinert, Lauritz Thamsen, Kordian Gontarska, Odej Kao

Demonstration of effective UCB-based routing in skill-based queues on real-world data

This paper is about optimally controlling skill-based queueing systems such as data centers, cloud computing networks, and service systems. By means of a case study using a real-world data set, we investigate the practical implementation of…

Machine Learning · Computer Science · 2025-06-26 · Sanne van Kempen, Jaron Sanders, Fiona Sloothaak, Maarten G. Wolf

Reinforcement Learning-based Admission Control in Delay-sensitive Service Systems

Ensuring quality of service (QoS) guarantees in service systems is a challenging task, particularly when the system is composed of more fine-grained services, such as service function chains. An important QoS metric in service systems is…

Performance · Computer Science · 2020-08-24 · Majid Raeis, Ali Tizghadam, Alberto Leon-Garcia