Related papers: Highly Efficient Memory Failure Prediction using M…

DRAM Failure Prediction in AIOps: Empirical Evaluation, Challenges and Opportunities

DRAM failure prediction is a vital task in AIOps, which is crucial to maintain the reliability and sustainable service of large-scale data centers. However, limited work has been done on DRAM failure prediction mainly due to the lack of…

Machine Learning · Computer Science 2021-05-05 Zhiyue Wu, Hongzuo Xu, Guansong Pang, Fengyuan Yu, Yijie Wang, Songlei Jian, Yongjun Wang

Workload Failure Prediction for Data Centers

Failed workloads that consumed significant computational resources in time and space affect the efficiency of data centers significantly and thus limit the amount of scientific work that can be achieved. While the computational power has…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-13 Jie Li, Rui Wang, Ghazanfar Ali, Tommy Dang, Alan Sill, Yong Chen

First CE Matters: On the Importance of Long Term Properties on Memory Failure Prediction

Dynamic random access memory failures are a threat to the reliability of data centres as they lead to data loss and system crashes. Timely predictions of memory failures allow for taking preventive measures such as server migration and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-21 Jasmin Bogatinovski, Qiao Yu, Jorge Cardoso, Odej Kao

Online Memory Leak Detection in the Cloud-based Infrastructures

A memory leak in an application deployed on the cloud can affect the availability and reliability of the application. Therefore, to identify and ultimately resolve it quickly is highly important. However, in the production environment…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-17 Anshul Jindal, Paul Staab, Jorge Cardoso, Michael Gerndt, Vladimir Podolskiy

Memory Leak Detection Algorithms in the Cloud-based Infrastructure

A memory leak in an application deployed on the cloud can affect the availability and reliability of the application. Therefore, identifying and ultimately resolving it quickly is highly important. However, in the production environment…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-17 Anshul Jindal, Paul Staab, Pooja Kulkarni, Jorge Cardoso, Michael Gerndt, Vladimir Podolskiy

The Life and Death of SSDs and HDDs: Similarities, Differences, and Prediction Models

Data center downtime typically centers around IT equipment failure. Storage devices are the most frequently failing components in data centers. We present a comparative study of hard disk drives (HDDs) and solid state drives (SSDs) that…

Machine Learning · Computer Science 2020-12-24 Riccardo Pinciroli, Lishan Yang, Jacob Alter, Evgenia Smirni

Machine Learning for Predictive Analytics of Compute Cluster Jobs

We address the problem of predicting whether sufficient memory and CPU resources have been requested for jobs at submission time. For this purpose, we examine the task of training a supervised machine learning system to predict the outcome…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-05 Dan Andresen, William Hsu, Huichen Yang, Adedolapo Okanlawon

Investigating Memory Failure Prediction Across CPU Architectures

Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) highlight critical malfunction in Dual Inline Memory Modules (DIMMs). Existing approaches primarily utilize Correctable Errors (CEs) to predict UEs,…

Hardware Architecture · Computer Science 2024-12-17 Qiao Yu, Wengui Zhang, Min Zhou, Jialiang Yu, Zhenli Sheng, Jasmin Bogatinovski, Jorge Cardoso, Odej Kao

Significance of Disk Failure Prediction in Datacenters

Modern datacenters assemble a very large number of disk drives under a single roof. Even if economic and technical factors were to make individual drives more reliable (which is not at all clear, given the commoditization of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-10 Jayanta Basak, Randy H. Katz

Helping HPC Users Specify Job Memory Requirements via Machine Learning

Resource allocation in High Performance Computing (HPC) settings is still not easy for end-users due to the wide variety of application and environment configuration options. Users have difficulty estimating the number of processors and…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-10 Eduardo R. Rodrigues, Renato L. F. Cunha, Marco A. S. Netto, Michael Spriggs

Online Job Failure Prediction in an HPC System

Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily rising, representing a critical issue…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-31 Francesco Antici, Andrea Borghesi, Zeynep Kiziltan

M$^2$-MFP: A Multi-Scale and Multi-Level Memory Failure Prediction Framework for Reliable Cloud Infrastructure

As cloud services become increasingly integral to modern IT infrastructure, ensuring hardware reliability is essential to sustain high-quality service. Memory failures pose a significant threat to overall system stability, making accurate…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-11 Hongyi Xie, Min Zhou, Qiao Yu, Jialiang Yu, Zhenli Sheng, Hong Xie, Defu Lian

Predicting Dynamic Memory Requirements for Scientific Workflow Tasks

With the increasing amount of data available to scientists in disciplines as diverse as bioinformatics, physics, and remote sensing, scientific workflow systems are becoming increasingly important for composing and executing scalable data…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-20 Jonathan Bader, Nils Diedrich, Lauritz Thamsen, Odej Kao

Evaluating Memento Service Optimizations

Services and applications based on the Memento Aggregator can suffer from slow response times due to the federated search across web archives performed by the Memento infrastructure. In an effort to decrease the response times, we…

Information Retrieval · Computer Science 2019-06-04 Martin Klein, Lyudmila Balakireva, Harihar Shankar

Predicting Scheduling Failures in the Cloud

Cloud Computing has emerged as a key technology to deliver and manage computing, platform, and software services over the Internet. Task scheduling algorithms play an important role in the efficiency of cloud computing services as they aim…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-07-14 Mbarka Soualhia, Foutse Khomh, Sofiene Tahar

A Comparison Between Data Mining Prediction Algorithms for Fault Detection (Case study: Ahanpishegan co.)

In the current competitive world, industrial companies seek to manufacture products of higher quality which can be achieved by increasing reliability, maintainability and thus the availability of products. On the other hand, improvement in…

Machine Learning · Computer Science 2012-01-31 Golriz Amooee, Behrouz Minaei-Bidgoli, Malihe Bagheri-Dehnavi

Rapid Time Series Prediction with a Hardware-Based Reservoir Computer

Reservoir computing is a neural network approach for processing time-dependent signals that has seen rapid development in recent years. Physical implementations of the technique using optical reservoirs have demonstrated remarkable accuracy…

Machine Learning · Computer Science 2019-01-30 Daniel Canaday, Aaron Griffith, Daniel Gauthier

Prediction of GPU Failures Under Deep Learning Workloads

Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. Meanwhile, GPU failures, which are inevitable, cause severe consequences in DL tasks: they disrupt distributed trainings, crash inference…

Machine Learning · Computer Science 2022-01-31 Heting Liu, Zhichao Li, Cheng Tan, Rongqiu Yang, Guohong Cao, Zherui Liu, Chuanxiong Guo

Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study

In large-scale datacenters, memory failure is a common cause of server crashes, with Uncorrectable Errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using…

Hardware Architecture · Computer Science 2023-12-19 Qiao Yu, Wengui Zhang, Jorge Cardoso, Odej Kao

Large Scale Studies of Memory, Storage, and Network Failures in a Modern Data Center

The workloads running in the modern data centers of large scale Internet service providers (such as Amazon, Baidu, Facebook, Google, and Microsoft) support billions of users and span globally distributed infrastructure. Yet, the devices…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-14 Justin Meza