Related papers: Enhancing Failure Propagation Analysis in Cloud Co…

Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems

Cloud computing systems fail in complex and unexpected ways due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a…

Software Engineering · Computer Science 2020-10-02 Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella

How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages…

Software Engineering · Computer Science 2019-09-04 Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella, Nematollah Bidokhti

Run-time Failure Detection via Non-intrusive Event Analysis in a Large-Scale Cloud Computing Platform

Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not…

Software Engineering · Computer Science 2023-01-19 Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella

Enhancing the Analysis of Software Failures in Cloud Computing Systems with Deep Learning

Identifying the failure modes of cloud computing systems is a difficult and time-consuming task, due to the growing complexity of such systems, and the large volume and noisiness of failure data. This paper presents a novel approach for…

Artificial Intelligence · Computer Science 2022-03-09 Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella

Towards Runtime Verification via Event Stream Processing in Cloud Computing Infrastructures

Software bugs in cloud management systems often cause erratic behavior, hindering detection, and recovery of failures. As a consequence, the failures are not timely detected and notified, and can silently propagate through the system. To…

Software Engineering · Computer Science 2022-03-09 Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella, Angela Scibelli

Efficient Fault Localization in a Cloud Stack Using End-to-End Application Service Topology

Cloud application services are distributed in nature and have components across the stack working together to deliver the experience to end users. The wide adoption of microservice architecture exacerbates failure management due to…

Performance · Computer Science 2025-09-09 Dhanya R Mathews, Mudit Verma, Pooja Aggarwal, J. Lakshmi

Failure Identification from Unstable Log Data using Deep Learning

The reliability of cloud platforms is of significant relevance because society increasingly relies on complex software systems running on the cloud. To improve it, cloud providers are automating various maintenance tasks, with failure…

Software Engineering · Computer Science 2022-04-07 Jasmin Bogatinovski, Sasho Nedelkoski, Li Wu, Jorge Cardoso, Odej Kao

Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications: A Survey

The momentum gained by microservices and cloud-native software architecture pushed nowadays enterprise IT towards multi-service applications. The proliferation of services and service interactions within applications, often consisting of…

Software Engineering · Computer Science 2021-05-27 Jacopo Soldani, Antonio Brogi

Automatic Failure Explanation in CPS Models

Debugging Cyber-Physical System (CPS) models can be extremely complex. Indeed, only the detection of a failure is insufficient to know how to correct a faulty model. Faults can propagate in time and in space producing observable…

Software Engineering · Computer Science 2020-10-14 Ezio Bartocci, Niveditha Manjunath, Leonardo Mariani, Cristinel Mateis, Dejan Ničković

Fault Localization in Cloud using Centrality Measures

Fault localization is an imperative method in fault tolerance in a distributed environment that designs a blueprint for continuing the ongoing process even when one or many modules are non-functional. Visualizing a distributed environment…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-24 Narayanaa S R, Sivaranjan M, Lekshmi R S

FaaSter Troubleshooting -- Evaluating Distributed Tracing Approaches for Serverless Applications

Serverless applications can be particularly difficult to troubleshoot, as these applications are often composed of various managed and partly managed services. Faults are often unpredictable and can occur at multiple points, even in simple…

Software Engineering · Computer Science 2024-07-16 Maria C. Borges, Sebastian Werner, Ahmet Kilic

Supporting Early-Safety Analysis of IoT Systems by Exploiting Testing Techniques

IoT systems complexity and susceptibility to failures pose significant challenges in ensuring their reliable operation. Failures can be internally generated or caused by external factors impacting both the systems correctness and its…

Software Engineering · Computer Science 2023-09-07 Diego Clerissi, Juri Di Rocco, Davide Di Ruscio, Claudio Di Sipio, Felicien Ihirwe, Leonardo Mariani, Daniela Micucci, Maria Teresa Rossi, Riccardo Rubei

A Machine Learning Approach to Online Fault Classification in HPC Systems

As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-29 Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Anomaly detecting and ranking of the cloud computing platform by multi-view learning

Anomaly detecting as an important technical in cloud computing is applied to support smooth running of the cloud platform. Traditional detecting methods based on statistic, analysis, etc. lead to the high false-alarm rate due to…

Machine Learning · Computer Science 2019-01-29 Jing Zhang

Anomaly Detection in Cloud Components

Cloud platforms, under the hood, consist of a complex inter-connected stack of hardware and software components. Each of these components can fail which may lead to an outage. Our goal is to improve the quality of Cloud services through…

Software Engineering · Computer Science 2021-02-12 Mohammad Saiful Islam, Andriy Miranskyy

Failure Analysis and Quantification for Contemporary and Future Supercomputers

Large-scale computing systems today are assembled by numerous computing units for massive computational capability needed to solve problems at scale, which enables failures common events in supercomputing scenarios. Considering the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-07 Li Tan, Nathan DeBardeleben

Fault Diagnosis for Distributed Systems using Accuracy Technique

Distributed Systems involve two or more computer systems which may be situated at geographically distinct locations and are connected by a communication network. Due to failures in the communication link, faults arise which may make the…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-20 Poorva Kulkarni, Varsha Deshpande, Latika Sarna, Sumedha Shenolikar, Supriya Kelkar

Online Fault Classification in HPC Systems through Machine Learning

As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-12 Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Diffusion-based Time Series Data Imputation for Microsoft 365

Reliability is extremely important for large-scale cloud systems like Microsoft 365. Cloud failures such as disk failure, node failure, etc. threaten service reliability, resulting in online service interruptions and economic loss. Existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-07 Fangkai Yang, Wenjie Yin, Lu Wang, Tianci Li, Pu Zhao, Bo Liu, Paul Wang, Bo Qiao, Yudong Liu, Mårten Björkman, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

AFDI: A Virtualization-based Accelerated Fault Diagnosis Innovation for High Availability Computing

Fault diagnosis has attracted extensive attention for its importance in the exceedingly fault management framework for cloud virtualization, despite the fact that fault diagnosis becomes more difficult due to the increasing scalability and…

Software Engineering · Computer Science 2015-07-30 Ameen Alkasem, Hongwei Liu, Zuo Decheng, Yao Zhao