Related papers: Highly Efficient Memory Failure Prediction using M…

200 papers

DRAM failure prediction is a vital task in AIOps, which is crucial to maintain the reliability and sustainable service of large-scale data centers. However, limited work has been done on DRAM failure prediction mainly due to the lack of…

Machine Learning · Computer Science 2021-05-05 Zhiyue Wu, Hongzuo Xu, Guansong Pang, Fengyuan Yu, Yijie Wang, Songlei Jian, Yongjun Wang

Failed workloads that consumed significant computational resources in time and space affect the efficiency of data centers significantly and thus limit the amount of scientific work that can be achieved. While the computational power has…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-13 Jie Li, Rui Wang, Ghazanfar Ali, Tommy Dang, Alan Sill, Yong Chen

Dynamic random access memory failures are a threat to the reliability of data centres as they lead to data loss and system crashes. Timely predictions of memory failures allow for taking preventive measures such as server migration and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-21 Jasmin Bogatinovski, Qiao Yu, Jorge Cardoso, Odej Kao

A memory leak in an application deployed on the cloud can affect the availability and reliability of the application. Therefore, to identify and ultimately resolve it quickly is highly important. However, in the production environment…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-17 Anshul Jindal, Paul Staab, Jorge Cardoso, Michael Gerndt, Vladimir Podolskiy

A memory leak in an application deployed on the cloud can affect the availability and reliability of the application. Therefore, identifying and ultimately resolving it quickly is highly important. However, in the production environment…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-17 Anshul Jindal, Paul Staab, Pooja Kulkarni, Jorge Cardoso, Michael Gerndt, Vladimir Podolskiy

Data center downtime typically centers around IT equipment failure. Storage devices are the most frequently failing components in data centers. We present a comparative study of hard disk drives (HDDs) and solid state drives (SSDs) that…

Machine Learning · Computer Science 2020-12-24 Riccardo Pinciroli, Lishan Yang, Jacob Alter, Evgenia Smirni

We address the problem of predicting whether sufficient memory and CPU resources have been requested for jobs at submission time. For this purpose, we examine the task of training a supervised machine learning system to predict the outcome…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-05 Dan Andresen, William Hsu, Huichen Yang, Adedolapo Okanlawon

Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) highlight critical malfunction in Dual Inline Memory Modules (DIMMs). Existing approaches primarily utilize Correctable Errors (CEs) to predict UEs,…

Hardware Architecture · Computer Science 2024-12-17 Qiao Yu, Wengui Zhang, Min Zhou, Jialiang Yu, Zhenli Sheng, Jasmin Bogatinovski, Jorge Cardoso, Odej Kao

Modern datacenters assemble a very large number of disk drives under a single roof. Even if economic and technical factors were to make individual drives more reliable (which is not at all clear, given the commoditization of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-10 Jayanta Basak, Randy H. Katz

Resource allocation in High Performance Computing (HPC) settings is still not easy for end-users due to the wide variety of application and environment configuration options. Users have difficulty estimating the number of processors and…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-10 Eduardo R. Rodrigues, Renato L. F. Cunha, Marco A. S. Netto, Michael Spriggs

Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily rising, representing a critical issue…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-31 Francesco Antici, Andrea Borghesi, Zeynep Kiziltan

As cloud services become increasingly integral to modern IT infrastructure, ensuring hardware reliability is essential to sustain high-quality service. Memory failures pose a significant threat to overall system stability, making accurate…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-11 Hongyi Xie, Min Zhou, Qiao Yu, Jialiang Yu, Zhenli Sheng, Hong Xie, Defu Lian

With the increasing amount of data available to scientists in disciplines as diverse as bioinformatics, physics, and remote sensing, scientific workflow systems are becoming increasingly important for composing and executing scalable data…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-20 Jonathan Bader, Nils Diedrich, Lauritz Thamsen, Odej Kao

Services and applications based on the Memento Aggregator can suffer from slow response times due to the federated search across web archives performed by the Memento infrastructure. In an effort to decrease the response times, we…

Information Retrieval · Computer Science 2019-06-04 Martin Klein, Lyudmila Balakireva, Harihar Shankar

Cloud Computing has emerged as a key technology to deliver and manage computing, platform, and software services over the Internet. Task scheduling algorithms play an important role in the efficiency of cloud computing services as they aim…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-07-14 Mbarka Soualhia, Foutse Khomh, Sofiene Tahar

In the current competitive world, industrial companies seek to manufacture products of higher quality which can be achieved by increasing reliability, maintainability and thus the availability of products. On the other hand, improvement in…

Machine Learning · Computer Science 2012-01-31 Golriz Amooee, Behrouz Minaei-Bidgoli, Malihe Bagheri-Dehnavi

Reservoir computing is a neural network approach for processing time-dependent signals that has seen rapid development in recent years. Physical implementations of the technique using optical reservoirs have demonstrated remarkable accuracy…

Machine Learning · Computer Science 2019-01-30 Daniel Canaday, Aaron Griffith, Daniel Gauthier

Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. Meanwhile, GPU failures, which are inevitable, cause severe consequences in DL tasks: they disrupt distributed training, crash inference…

Machine Learning · Computer Science 2022-01-31 Heting Liu, Zhichao Li, Cheng Tan, Rongqiu Yang, Guohong Cao, Zherui Liu, Chuanxiong Guo

In large-scale datacenters, memory failure is a common cause of server crashes, with Uncorrectable Errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using…

Hardware Architecture · Computer Science 2023-12-19 Qiao Yu, Wengui Zhang, Jorge Cardoso, Odej Kao

The workloads running in the modern data centers of large scale Internet service providers (such as Amazon, Baidu, Facebook, Google, and Microsoft) support billions of users and span globally distributed infrastructure. Yet, the devices…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-14 Justin Meza