Related papers: Why Do AI Agents Systematically Fail at Cloud Root…

Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis

Root cause analysis (RCA) is essential for diagnosing failures within complex software systems to ensure system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA efforts,…

Software Engineering · Computer Science 2026-02-02 Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz, Krzysztof Czarnecki

Exploring LLM-based Agents for Root Cause Analysis

The growing complexity of cloud-based software systems has resulted in incident management becoming an integral part of the software development lifecycle. Root cause analysis (RCA), a critical part of the incident management process, is a…

Software Engineering · Computer Science 2024-03-08 Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, Saravan Rajmohan

Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks

Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the…

Artificial Intelligence · Computer Science 2025-08-19 Ruofan Lu, Yichen Li, Yintong Huo

GALA: Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis?

Root cause analysis (RCA) in microservice systems is challenging, requiring on-call engineers to rapidly diagnose failures across heterogeneous telemetry such as metrics, logs, and traces. Traditional RCA methods often focus on single…

Artificial Intelligence · Computer Science 2025-08-19 Yifang Tian, Yaming Liu, Zichun Chong, Zihang Huang, Hans-Arno Jacobsen

RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models

Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently. However, current methods are still reliant on manual workflow settings and do not unleash LLMs' decision-making and environment…

Software Engineering · Computer Science 2024-08-05 Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, Qingsong Wen

Reasoning Language Models for Root Cause Analysis in 5G Wireless Networks

Root Cause Analysis (RCA) in mobile networks remains a challenging task due to the need for interpretability, domain expertise, and causal reasoning. In this work, we propose a lightweight framework that leverages Large Language Models…

Artificial Intelligence · Computer Science 2025-07-30 Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Yibin Kang, Haozhe Zhang, Merouane Debbah, Fadhel Ayed

Agentic Diagnostic Reasoning over Telecom and Datacenter Infrastructure

Large-scale telecom and datacenter infrastructures rely on multi-layered service and resource models, where failures propagate across physical and logical components and affect multiple customers. Traditional approaches to root cause…

Artificial Intelligence · Computer Science 2026-01-13 Nicolas Tacheny

Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads

Large language model (LLM) services have become an integral part of search, assistance, and decision-making applications. However, unlike traditional web or microservices, the hardware and software stack enabling LLM inference deployment is…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-03 Dominik Scheinert, Alexander Acker, Thorsten Wittkopp, Soeren Becker, Hamza Yous, Karnakar Reddy, Ibrahim Farhat, Hakim Hacid, Odej Kao

Where LLM Agents Fail and How They can Learn From Failures

Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading…

Artificial Intelligence · Computer Science 2025-10-01 Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, Jiaxuan You

Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems

Organisations are starting to adopt LLM-based AI agents, with their deployments naturally evolving from single agents towards interconnected, multi-agent networks. Yet a collection of safe agents does not guarantee a safe collection of…

Multiagent Systems · Computer Science 2025-08-11 Alistair Reid, Simon O'Callaghan, Liam Carroll, Tiberio Caetano

eARCO: Efficient Automated Root Cause Analysis with Prompt Optimization

Root cause analysis (RCA) for incidents in large-scale cloud systems is a complex, knowledge-intensive task that often requires significant manual effort from on-call engineers (OCEs). Improving RCA is vital for accelerating the incident…

Software Engineering · Computer Science 2025-04-17 Drishti Goel, Raghav Magazine, Supriyo Ghosh, Akshay Nambi, Prathamesh Deshpande, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan

LLM-Augmented Knowledge Base Construction For Root Cause Analysis

Communications networks now form the backbone of our digital world, with fast and reliable connectivity. However, even with appropriate redundancy and failover mechanisms, it is difficult to guarantee "five 9s" (99.999 %) reliability,…

Computation and Language · Computer Science 2026-04-29 Nguyen Phuc Tran, Brigitte Jaumard, Oscar Delgado, Tristan Glatard, Karthikeyan Premkumar, Kun Ni

Automatic Root Cause Analysis via Large Language Models for Cloud Incidents

Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigations of data sources such as logs and traces, are…

Software Engineering · Computer Science 2023-11-14 Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Xuedong Gao, Hao Fan, Ming Wen, Jun Zeng, Supriyo Ghosh, Xuchao Zhang, Chaoyun Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Tianyin Xu

An Expert-grounded benchmark of General Purpose LLMs in LCA

Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), are increasingly being explored as tools to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains,…

Computation and Language · Computer Science 2025-10-24 Artur Donaldson, Bharathan Balaji, Cajetan Oriekezie, Manish Kumar, Laure Patouillard

RCA Copilot: Transforming Network Data into Actionable Insights via Large Language Models

Ensuring the reliability and availability of complex networked services demands effective root cause analysis (RCA) across cloud environments, data centers, and on-premises networks. Traditional RCA methods, which involve manual inspection…

Networking and Internet Architecture · Computer Science 2025-07-08 Alexander Shan, Jasleen Kaur, Rahul Singh, Tarun Banka, Raj Yavatkar, T. Sridhar

Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications

Large language models (LLMs) are being rapidly integrated into decision-support tools, automation workflows, and AI-enabled software systems. However, their behavior in production environments remains poorly understood, and their failure…

Artificial Intelligence · Computer Science 2025-11-27 Vaishali Vinay

Root Cause Analysis Method Based on Large Language Models with Residual Connection Structures

Root cause localization remains challenging in complex and large-scale microservice architectures. The complex fault propagation among microservices and the high dimensionality of telemetry data, including metrics, logs, and traces, limit…

Artificial Intelligence · Computer Science 2026-02-10 Liming Zhou, Ailing Liu, Hongwei Liu, Min He, Heng Zhang

Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4

Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud services, requiring on-call engineers to identify the primary issues and implement corrective actions to prevent future recurrences. Improving the…

Computation and Language · Computer Science 2024-01-26 Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, Saravan Rajmohan

Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure

Coding agents represent a new paradigm in automated software engineering, combining the reasoning capabilities of Large Language Models (LLMs) with tool-augmented interaction loops. However, coding agents still have severe limitations.…

Software Engineering · Computer Science 2026-04-06 Tural Mehtiyev, Wesley Assunção

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Failure attribution in LLM multi-agent systems, that is, identifying the agent and step responsible for task failures, provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate…

Multiagent Systems · Computer Science 2025-06-03 Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, Qingyun Wu