Related papers: CloudRCA: A Root Cause Analysis Framework for Clou…

200 papers

To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection has…

Machine Learning · Computer Science 2024-05-24 Thorsten Wittkopp, Philipp Wiesner, Odej Kao

Effectively localizing root causes of performance anomalies is crucial to enabling the rapid recovery and loss mitigation of microservice applications in the cloud. Depending on the granularity of the causes that can be localized, a service…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Ruyue Xin, Peng Chen, Zhiming Zhao

The dynamics and complexity of cloud-native systems present significant challenges for Root Cause Analysis (RCA). While causality-based RCA methods have shown significant progress in recent years, their practical adoption is fundamentally…

Software Engineering · Computer Science 2026-03-03 Shuai Liang, Pengfei Chen, Bozhe Tian, Gou Tan, Maohong Xu, Youjun Qu, Yahui Zhao, Yiduo Shang, Chongkang Tan

Root Cause Analysis (RCA) is a crucial aspect of incident management in large-scale cloud services. While the term root cause analysis or RCA has been widely used, different studies formulate the task differently. This is because the term…

Software Engineering · Computer Science 2025-10-23 Aoyang Fang, Haowen Yang, Haoze Dong, Qisheng Lu, Junjielong Xu, Pinjia He

Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes, especially for cloud industry leaders like Salesforce. Typically RCA investigation leverages data-sources…

Information Retrieval · Computer Science 2022-04-26 Amrita Saha, Steven C. H. Hoi

With the rapid development of cloud computing and ultra-large-scale data centers, the scale and complexity of systems have increased significantly, leading to frequent faults that often show cascading propagation. How to achieve efficient,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-17 Jian Hou

Kubernetes, a notably complex and distributed system, utilizes an array of controllers to uphold cluster management logic through state reconciliation. Nevertheless, maintaining state consistency presents significant challenges due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-04 Yong Xiang, Charley Peter Chen, Liyi Zeng, Wei Yin, Xin Liu, Hu Li, Wei Xu

The transition to agentic Root Cause Analysis (RCA) necessitates benchmarks that evaluate active reasoning rather than passive classification. However, current frameworks fail to reconcile ecological validity with reproducibility. We…

Software Engineering · Computer Science 2026-03-03 Yilun Wang, Guangba Yu, Haiyu Huang, Zirui Wang, Yujie Huang, Pengfei Chen, Michael R. Lyu

Root Cause Analysis (RCA) aims at identifying the underlying causes of system faults by uncovering and analyzing the causal structure from complex systems. It has been widely used in many application domains. Reliable diagnostic conclusions…

Artificial Intelligence · Computer Science 2024-07-15 Chang Gong, Di Yao, Jin Wang, Wenbin Li, Lanting Fang, Yongtao Xie, Kaiyu Feng, Peng Han, Jingping Bi

Root cause analysis (RCA) is essential for diagnosing failures within complex software systems to ensure system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA efforts,…

Software Engineering · Computer Science 2026-02-02 Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz, Krzysztof Czarnecki

Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently. However, current methods are still reliant on manual workflow settings and do not unleash LLMs' decision-making and environment…

Software Engineering · Computer Science 2024-08-05 Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, Qingsong Wen

To ensure the reliability of cloud systems, their performance is monitored using KPIs (key performance indicators). When issues arise, root cause localization identifies KPIs responsible for service degradation, aiding in quick diagnosis…

Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic losses. In the field of online service systems, operators rely on enormous monitoring data to detect and mitigate failures. Quickly recognizing a…

Software Engineering · Computer Science 2022-06-14 Mingjie Li, Zeyan Li, Kanglin Yin, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, Dan Pei

Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap,…

Artificial Intelligence · Computer Science 2025-05-20 Lecheng Zheng, Zhengzhang Chen, Dongjie Wang, Chengyuan Deng, Reon Matsuoka, Haifeng Chen

Localizing the root cause of network faults is crucial to network operation and maintenance. However, due to the complicated network architectures and wireless environments, as well as limited labeled data, accurately localizing the true…

Machine Learning · Computer Science 2022-03-08 Chaoli Zhang, Zhiqiang Zhou, Yingying Zhang, Linxiao Yang, Kai He, Qingsong Wen, Liang Sun

With the continued migration of storage to cloud database systems, the impact of slow queries in such systems on services and user experience is increasing. Root-cause diagnosis plays an indispensable role in facilitating slow-query…

With the development of cloud-native technologies, microservice-based software systems face challenges in accurately localizing root causes when failures occur. Additionally, the cloud-edge collaborative environment introduces more…

Software Engineering · Computer Science 2024-06-21 Yuhan Zhu, Jian Wang, Bing Li, Xuxian Tang, Hao Li, Neng Zhang, Yuqi Zhao

Runtime failures are commonplace in modern distributed systems. When such issues arise, users often turn to platforms such as GitHub or JIRA to report them and request assistance. Automatically identifying the root cause of these failures…

Software Engineering · Computer Science 2025-04-01 Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu, Michael R. Lyu

The complex dependencies and propagative faults inherent in microservices, characterized by a dense network of interconnected services, pose significant challenges in identifying the underlying causes of issues. Prompt identification and…

Software Engineering · Computer Science 2024-08-05 Tingting Wang, Guilin Qi

Serverless has become popular as a novel computing paradigm for cloud-native services. However, the complexity and dynamic nature of serverless applications present significant challenges in ensuring system availability and performance. There…

Software Engineering · Computer Science 2024-12-04 Jin Huang, Pengfei Chen, Guangba Yu, Yilun Wang, Haiyu Huang, Zilong He