Related papers: CloudRCA: A Root Cause Analysis Framework for Clou…

LogRCA: Log-based Root Cause Analysis for Distributed Services

To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection has…

Machine Learning · Computer Science 2024-05-24 Thorsten Wittkopp, Philipp Wiesner, Odej Kao

CausalRCA: Causal Inference based Precise Fine-grained Root Cause Localization for Microservice Applications

Effectively localizing root causes of performance anomalies is crucial to enabling the rapid recovery and loss mitigation of microservice applications in the cloud. Depending on the granularity of the causes that can be localized, a service…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Ruyue Xin, Peng Chen, Zhiming Zhao

MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge

The dynamics and complexity of cloud-native systems present significant challenges for Root Cause Analysis (RCA). While causality-based RCA methods have shown significant progress in recent years, their practical adoption is fundamentally…

Software Engineering · Computer Science 2026-03-03 Shuai Liang, Pengfei Chen, Bozhe Tian, Gou Tan, Maohong Xu, Youjun Qu, Yahui Zhao, Yiduo Shang, Chongkang Tan

A Goal-Driven Survey on Root Cause Analysis

Root Cause Analysis (RCA) is a crucial aspect of incident management in large-scale cloud services. While the term root cause analysis or RCA has been widely used, different studies formulate the task differently. This is because the term…

Software Engineering · Computer Science 2025-10-23 Aoyang Fang, Haowen Yang, Haoze Dong, Qisheng Lu, Junjielong Xu, Pinjia He

Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps

Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes, especially for cloud industry leaders like Salesforce. Typically RCA investigation leverages data-sources…

Information Retrieval · Computer Science 2022-04-26 Amrita Saha, Steven C. H. Hoi

Research on fault diagnosis and root cause analysis based on full stack observability

With the rapid development of cloud computing and ultra-large-scale data centers, the scale and complexity of systems have increased significantly, leading to frequent faults that often show cascading propagation. How to achieve efficient,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-17 Jian Hou

Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM

Kubernetes, a notably complex and distributed system, utilizes an array of controllers to uphold cluster management logic through state reconciliation. Nevertheless, maintaining state consistency presents significant challenges due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-04 Yong Xiang, Charley Peter Chen, Liyi Zeng, Wei Yin, Xin Liu, Hu Li, Wei Xu

Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems

The transition to agentic Root Cause Analysis (RCA) necessitates benchmarks that evaluate active reasoning rather than passive classification. However, current frameworks fail to reconcile ecological validity with reproducibility. We…

Software Engineering · Computer Science 2026-03-03 Yilun Wang, Guangba Yu, Haiyu Huang, Zirui Wang, Yujie Huang, Pengfei Chen, Michael R. Lyu

PORCA: Root Cause Analysis with Partially Observed Data

Root Cause Analysis (RCA) aims at identifying the underlying causes of system faults by uncovering and analyzing the causal structure from complex systems. It has been widely used in many application domains. Reliable diagnostic conclusions…

Artificial Intelligence · Computer Science 2024-07-15 Chang Gong, Di Yao, Jin Wang, Wenbin Li, Lanting Fang, Yongtao Xie, Kaiyu Feng, Peng Han, Jingping Bi

Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis

Root cause analysis (RCA) is essential for diagnosing failures within complex software systems to ensure system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA efforts,…

Software Engineering · Computer Science 2026-02-02 Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz, Krzysztof Czarnecki

RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models

Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently. However, current methods are still reliant on manual workflow settings and do not unleash LLMs' decision-making and environment…

Software Engineering · Computer Science 2024-08-05 Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, Qingsong Wen

KPIRoot+: An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems

To ensure the reliability of cloud systems, their performance is monitored using KPIs (key performance indicators). When issues arise, root cause localization identifies KPIs responsible for service degradation, aiding in quick diagnosis…

Software Engineering · Computer Science 2025-06-06 Wenwei Gu, Renyi Zhong, Guangba Yu, Xinying Sun, Jinyang Liu, Yintong Huo, Zhuangbin Chen, Jianping Zhang, Jiazhen Gu, Yongqiang Yang, Michael R. Lyu

Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition

Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic losses. In the field of online service systems, operators rely on enormous monitoring data to detect and mitigate failures. Quickly recognizing a…

Software Engineering · Computer Science 2022-06-14 Mingjie Li, Zeyan Li, Kanglin Yin, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, Dan Pei

LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis

Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap,…

Artificial Intelligence · Computer Science 2025-05-20 Lecheng Zheng, Zhengzhang Chen, Dongjie Wang, Chengyuan Deng, Reon Matsuoka, Haifeng Chen

NetRCA: An Effective Network Fault Cause Localization Algorithm

Localizing the root cause of network faults is crucial to network operation and maintenance. However, due to the complicated network architectures and wireless environments, as well as limited labeled data, accurately localizing the true…

Machine Learning · Computer Science 2022-03-08 Chaoli Zhang, Zhiqiang Zhou, Yingying Zhang, Linxiao Yang, Kai He, Qingsong Wen, Liang Sun

RCRank: Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems

With the continued migration of storage to cloud database systems, the impact of slow queries in such systems on services and user experience is increasing. Root-cause diagnosis plays an indispensable role in facilitating slow-query…

Databases · Computer Science 2025-03-07 Biao Ouyang, Yingying Zhang, Hanyin Cheng, Yang Shu, Chenjuan Guo, Bin Yang, Qingsong Wen, Lunting Fan, Christian S. Jensen

Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments

With the development of cloud-native technologies, microservice-based software systems face challenges in accurately localizing root causes when failures occur. Additionally, the cloud-edge collaborative environment introduces more…

Software Engineering · Computer Science 2024-06-21 Yuhan Zhu, Jian Wang, Bing Li, Xuxian Tang, Hao Li, Neng Zhang, Yuqi Zhao

COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge

Runtime failures are commonplace in modern distributed systems. When such issues arise, users often turn to platforms such as GitHub or JIRA to report them and request assistance. Automatically identifying the root cause of these failures…

Software Engineering · Computer Science 2025-04-01 Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu, Michael R. Lyu

A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends

The complex dependencies and propagative faults inherent in microservices, characterized by a dense network of interconnected services, pose significant challenges in identifying the underlying causes of issues. Prompt identification and…

Software Engineering · Computer Science 2024-08-05 Tingting Wang, Guilin Qi

FaaSRCA: Full Lifecycle Root Cause Analysis for Serverless Applications

Serverless has become popular as a novel computing paradigm for cloud-native services. However, the complexity and dynamic nature of serverless applications present significant challenges to ensuring system availability and performance. There…

Software Engineering · Computer Science 2024-12-04 Jin Huang, Pengfei Chen, Guangba Yu, Yilun Wang, Haiyu Huang, Zilong He