Highly Efficient Memory Failure Prediction using Mcelog-based Data Mining and Machine Learning

Databases 2021-05-18 v2 Machine Learning Performance Software Engineering

Abstract

In the data center, unexpected downtime caused by memory failures can degrade the stability of a server and even of the entire information technology infrastructure, which harms the business. Whether memory failures can be accurately predicted in advance has therefore become one of the most important problems to study in the data center. Predicting memory failures in a production system, however, requires solving technical problems such as heavy data noise and extreme imbalance between positive and negative samples, while also ensuring the long-term stability of the algorithm. This paper compares and summarizes some commonly used techniques and the improvement each can bring. The single model we propose placed 14th in the 2nd Alibaba Cloud AIOps Competition, held as part of the 25th PAKDD conference. It takes only 30 minutes to pass the online test, while most other contestants' solutions need more than 3 hours. The code is open source at https://www.github.com/ycd2016/acaioc2.

Cite

@article{arxiv.2105.04547,
  title  = {Highly Efficient Memory Failure Prediction using Mcelog-based Data Mining and Machine Learning},
  author = {Chengdong Yao},
  journal= {arXiv preprint arXiv:2105.04547},
  year   = {2021}
}

Comments

11 pages, 2 figures, 1 table. The code is open source at https://www.github.com/ycd2016/acaioc2
