Related papers: Why Do AI Agents Systematically Fail at Cloud Root…

200 papers

Root cause analysis (RCA) is essential for diagnosing failures within complex software systems to ensure system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA efforts,…

Software Engineering · Computer Science 2026-02-02 Evelien Riddell , James Riddell , Gengyi Sun , Michał Antkiewicz , Krzysztof Czarnecki

The growing complexity of cloud-based software systems has resulted in incident management becoming an integral part of the software development lifecycle. Root cause analysis (RCA), a critical part of the incident management process, is a…

Software Engineering · Computer Science 2024-03-08 Devjeet Roy , Xuchao Zhang , Rashi Bhave , Chetan Bansal , Pedro Las-Casas , Rodrigo Fonseca , Saravan Rajmohan

Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the…

Artificial Intelligence · Computer Science 2025-08-19 Ruofan Lu , Yichen Li , Yintong Huo

Root cause analysis (RCA) in microservice systems is challenging, requiring on-call engineers to rapidly diagnose failures across heterogeneous telemetry such as metrics, logs, and traces. Traditional RCA methods often focus on single…

Artificial Intelligence · Computer Science 2025-08-19 Yifang Tian , Yaming Liu , Zichun Chong , Zihang Huang , Hans-Arno Jacobsen

Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently. However, current methods are still reliant on manual workflow settings and do not unleash LLMs' decision-making and environment…

Software Engineering · Computer Science 2024-08-05 Zefan Wang , Zichuan Liu , Yingying Zhang , Aoxiao Zhong , Jihong Wang , Fengbin Yin , Lunting Fan , Lingfei Wu , Qingsong Wen

Root Cause Analysis (RCA) in mobile networks remains a challenging task due to the need for interpretability, domain expertise, and causal reasoning. In this work, we propose a lightweight framework that leverages Large Language Models…

Artificial Intelligence · Computer Science 2025-07-30 Mohamed Sana , Nicola Piovesan , Antonio De Domenico , Yibin Kang , Haozhe Zhang , Merouane Debbah , Fadhel Ayed

Large-scale telecom and datacenter infrastructures rely on multi-layered service and resource models, where failures propagate across physical and logical components and affect multiple customers. Traditional approaches to root cause…

Artificial Intelligence · Computer Science 2026-01-13 Nicolas Tacheny

Large language model (LLM) services have become an integral part of search, assistance, and decision-making applications. However, unlike traditional web or microservices, the hardware and software stack enabling LLM inference deployment is…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-03 Dominik Scheinert , Alexander Acker , Thorsten Wittkopp , Soeren Becker , Hamza Yous , Karnakar Reddy , Ibrahim Farhat , Hakim Hacid , Odej Kao

Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading…

Organisations are starting to adopt LLM-based AI agents, with their deployments naturally evolving from single agents towards interconnected, multi-agent networks. Yet a collection of safe agents does not guarantee a safe collection of…

Multiagent Systems · Computer Science 2025-08-11 Alistair Reid , Simon O'Callaghan , Liam Carroll , Tiberio Caetano

Root cause analysis (RCA) for incidents in large-scale cloud systems is a complex, knowledge-intensive task that often requires significant manual effort from on-call engineers (OCEs). Improving RCA is vital for accelerating the incident…

Communications networks now form the backbone of our digital world, with fast and reliable connectivity. However, even with appropriate redundancy and failover mechanisms, it is difficult to guarantee "five 9s" (99.999%) reliability,…

Computation and Language · Computer Science 2026-04-29 Nguyen Phuc Tran , Brigitte Jaumard , Oscar Delgado , Tristan Glatard , Karthikeyan Premkumar , Kun Ni

Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigations of data sources such as logs and traces, are…

Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), are increasingly being explored as tools to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains,…

Computation and Language · Computer Science 2025-10-24 Artur Donaldson , Bharathan Balaji , Cajetan Oriekezie , Manish Kumar , Laure Patouillard

Ensuring the reliability and availability of complex networked services demands effective root cause analysis (RCA) across cloud environments, data centers, and on-premises networks. Traditional RCA methods, which involve manual inspection…

Networking and Internet Architecture · Computer Science 2025-07-08 Alexander Shan , Jasleen Kaur , Rahul Singh , Tarun Banka , Raj Yavatkar , T. Sridhar

Large language models (LLMs) are being rapidly integrated into decision-support tools, automation workflows, and AI-enabled software systems. However, their behavior in production environments remains poorly understood, and their failure…

Artificial Intelligence · Computer Science 2025-11-27 Vaishali Vinay

Root cause localization remains challenging in complex and large-scale microservice architectures. The complex fault propagation among microservices and the high dimensionality of telemetry data, including metrics, logs, and traces, limit…

Artificial Intelligence · Computer Science 2026-02-10 Liming Zhou , Ailing Liu , Hongwei Liu , Min He , Heng Zhang

Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud services, requiring on-call engineers to identify the primary issues and implement corrective actions to prevent future recurrences. Improving the…

Computation and Language · Computer Science 2024-01-26 Xuchao Zhang , Supriyo Ghosh , Chetan Bansal , Rujia Wang , Minghua Ma , Yu Kang , Saravan Rajmohan

Coding agents represent a new paradigm in automated software engineering, combining the reasoning capabilities of Large Language Models (LLMs) with tool-augmented interaction loops. However, coding agents still have severe limitations.…

Software Engineering · Computer Science 2026-04-06 Tural Mehtiyev , Wesley Assunção

Failure attribution in LLM multi-agent systems (identifying the agent and step responsible for task failures) provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate…

Multiagent Systems · Computer Science 2025-06-03 Shaokun Zhang , Ming Yin , Jieyu Zhang , Jiale Liu , Zhiguang Han , Jingyang Zhang , Beibin Li , Chi Wang , Huazheng Wang , Yiran Chen , Qingyun Wu