Related papers: LogRCA: Log-based Root Cause Analysis for Distribu…

200 papers

Runtime failures are commonplace in modern distributed systems. When such issues arise, users often turn to platforms such as GitHub or JIRA to report them and request assistance. Automatically identifying the root cause of these failures…

Software Engineering · Computer Science 2025-04-01 Yichen Li, Yulun Wu, Jinyang Liu, Zhihan Jiang, Zhuangbin Chen, Guangba Yu, Michael R. Lyu

Root cause analysis (RCA) is essential for diagnosing failures within complex software systems to ensure system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA efforts,…

Software Engineering · Computer Science 2026-02-02 Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz, Krzysztof Czarnecki

Localizing the root cause of network faults is crucial to network operation and maintenance. However, due to the complicated network architectures and wireless environments, as well as limited labeled data, accurately localizing the true…

Machine Learning · Computer Science 2022-03-08 Chaoli Zhang, Zhiqiang Zhou, Yingying Zhang, Linxiao Yang, Kai He, Qingsong Wen, Liang Sun

Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic losses. In the field of online service systems, operators rely on enormous monitoring data to detect and mitigate failures. Quickly recognizing a…

Software Engineering · Computer Science 2022-06-14 Mingjie Li, Zeyan Li, Kanglin Yin, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, Dan Pei

This paper presents MicroRCA-Agent, an innovative solution for microservice root cause analysis based on large language model agents, which constructs an intelligent fault root cause localization system with multimodal data fusion. The…

Artificial Intelligence · Computer Science 2025-09-22 Pan Tang, Shixiang Tang, Huanqi Pu, Zhiqing Miao, Zhixing Wang

System logs are some of the most important information for the maintenance of software systems, which have become larger and more complex in recent years. The goal of log-based anomaly detection is to automatically detect system anomalies…

Machine Learning · Computer Science 2024-02-19 Yuuki Yamanaka, Tomokatsu Takahashi, Takuya Minami, Yoshiaki Nakajima

Effectively localizing root causes of performance anomalies is crucial to enabling the rapid recovery and loss mitigation of microservice applications in the cloud. Depending on the granularity of the causes that can be localized, a service…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Ruyue Xin, Peng Chen, Zhiming Zhao

Root cause analysis (RCA) for incidents in large-scale cloud systems is a complex, knowledge-intensive task that often requires significant manual effort from on-call engineers (OCEs). Improving RCA is vital for accelerating the incident…

Distributed databases, as the core infrastructure software for internet applications, play a critical role in modern cloud services. However, existing distributed databases frequently experience system failures and performance degradation,…

Databases · Computer Science 2025-05-06 Lingzhe Zhang, Tong Jia, Mengxi Jia, Ying Li

Root Cause Analysis (RCA) aims at identifying the underlying causes of system faults by uncovering and analyzing the causal structure from complex systems. It has been widely used in many application domains. Reliable diagnostic conclusions…

Artificial Intelligence · Computer Science 2024-07-15 Chang Gong, Di Yao, Jin Wang, Wenbin Li, Lanting Fang, Yongtao Xie, Kaiyu Feng, Peng Han, Jingping Bi

While cloud-native microservice architectures have revolutionized software development, their inherent operational complexity makes failure Root Cause Analysis (RCA) a critical yet challenging task. Numerous data-driven RCA models have been…

Software Engineering · Computer Science 2025-12-24 Aoyang Fang, Songhan Zhang, Yifan Yang, Haotong Wu, Junjielong Xu, Xuyang Wang, Rui Wang, Manyi Wang, Qisheng Lu, Pinjia He

In the evolving IT landscape, stability and reliability of systems are essential, yet their growing complexity challenges DevOps teams in implementation and maintenance. Log analysis, a core element of AIOps, provides critical insights into…

Machine Learning · Computer Science 2025-09-11 Thorsten Wittkopp

Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap,…

Artificial Intelligence · Computer Science 2025-05-20 Lecheng Zheng, Zhengzhang Chen, Dongjie Wang, Chengyuan Deng, Reon Matsuoka, Haifeng Chen

The momentum gained by microservices and cloud-native software architecture pushed nowadays enterprise IT towards multi-service applications. The proliferation of services and service interactions within applications, often consisting of…

Software Engineering · Computer Science 2021-05-27 Jacopo Soldani, Antonio Brogi

The dynamics and complexity of cloud-native systems present significant challenges for Root Cause Analysis (RCA). While causality-based RCA methods have shown significant progress in recent years, their practical adoption is fundamentally…

Software Engineering · Computer Science 2026-03-03 Shuai Liang, Pengfei Chen, Bozhe Tian, Gou Tan, Maohong Xu, Youjun Qu, Yahui Zhao, Yiduo Shang, Chongkang Tan

As business of Alibaba expands across the world among various industries, higher standards are imposed on the service quality and reliability of big data cloud computing platforms which constitute the infrastructure of Alibaba Cloud.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-09 Yingying Zhang, Zhengxiong Guan, Huajie Qian, Leili Xu, Hengbo Liu, Qingsong Wen, Liang Sun, Junwei Jiang, Lunting Fan, Min Ke

The goal of Root Cause Analysis (RCA) is to explain why an anomaly occurred by identifying where the fault originated. Several recent works model the anomalous event as resulting from a change in the causal mechanism at the root cause,…

Log analysis is one of the main techniques engineers use to troubleshoot faults of large-scale software systems. During the past decades, many log analysis approaches have been proposed to detect system anomalies reflected by logs. They…

Software Engineering · Computer Science 2022-09-19 Yongzheng Xie, Hongyu Zhang, Muhammad Ali Babar

Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as well as complex tasks in IT processes, especially for cloud industry leaders like Salesforce. Typically RCA investigation leverages data-sources…

Information Retrieval · Computer Science 2022-04-26 Amrita Saha, Steven C. H. Hoi

The complex dependencies and propagative faults inherent in microservices, characterized by a dense network of interconnected services, pose significant challenges in identifying the underlying causes of issues. Prompt identification and…

Software Engineering · Computer Science 2024-08-05 Tingting Wang, Guilin Qi