Related papers: Enhancing Failure Propagation Analysis in Cloud Co…

200 papers

Cloud computing systems fail in complex and unexpected ways due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a…

Software Engineering · Computer Science 2020-10-02 Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages…

Software Engineering · Computer Science 2019-09-04 Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella, Nematollah Bidokhti

Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not…

Software Engineering · Computer Science 2023-01-19 Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella

Identifying the failure modes of cloud computing systems is a difficult and time-consuming task, due to the growing complexity of such systems, and the large volume and noisiness of failure data. This paper presents a novel approach for…

Artificial Intelligence · Computer Science 2022-03-09 Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella

Software bugs in cloud management systems often cause erratic behavior, hindering detection, and recovery of failures. As a consequence, the failures are not timely detected and notified, and can silently propagate through the system. To…

Software Engineering · Computer Science 2022-03-09 Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella, Angela Scibelli

Cloud application services are distributed in nature and have components across the stack working together to deliver the experience to end users. The wide adoption of microservice architecture exacerbates failure management due to…

Performance · Computer Science 2025-09-09 Dhanya R Mathews, Mudit Verma, Pooja Aggarwal, J. Lakshmi

The reliability of cloud platforms is of significant relevance because society increasingly relies on complex software systems running on the cloud. To improve it, cloud providers are automating various maintenance tasks, with failure…

Software Engineering · Computer Science 2022-04-07 Jasmin Bogatinovski, Sasho Nedelkoski, Li Wu, Jorge Cardoso, Odej Kao

The momentum gained by microservices and cloud-native software architecture has pushed today's enterprise IT towards multi-service applications. The proliferation of services and service interactions within applications, often consisting of…

Software Engineering · Computer Science 2021-05-27 Jacopo Soldani, Antonio Brogi

Debugging Cyber-Physical System (CPS) models can be extremely complex. Indeed, detecting a failure alone is insufficient to know how to correct a faulty model. Faults can propagate in time and in space producing observable…

Software Engineering · Computer Science 2020-10-14 Ezio Bartocci, Niveditha Manjunath, Leonardo Mariani, Cristinel Mateis, Dejan Ničković

Fault localization is an imperative method in fault tolerance in a distributed environment that designs a blueprint for continuing the ongoing process even when one or many modules are non-functional. Visualizing a distributed environment…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-24 Narayanaa S R, Sivaranjan M, Lekshmi R S

Serverless applications can be particularly difficult to troubleshoot, as these applications are often composed of various managed and partly managed services. Faults are often unpredictable and can occur at multiple points, even in simple…

Software Engineering · Computer Science 2024-07-16 Maria C. Borges, Sebastian Werner, Ahmet Kilic

The complexity of IoT systems and their susceptibility to failures pose significant challenges in ensuring their reliable operation. Failures can be internally generated or caused by external factors, impacting both the system's correctness and its…

As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-29 Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Anomaly detection, an important technique in cloud computing, is applied to support the smooth running of the cloud platform. Traditional detection methods based on statistics, analysis, etc. lead to a high false-alarm rate due to…

Machine Learning · Computer Science 2019-01-29 Jing Zhang

Cloud platforms, under the hood, consist of a complex inter-connected stack of hardware and software components. Each of these components can fail which may lead to an outage. Our goal is to improve the quality of Cloud services through…

Software Engineering · Computer Science 2021-02-12 Mohammad Saiful Islam, Andriy Miranskyy

Large-scale computing systems today are assembled from numerous computing units to provide the massive computational capability needed to solve problems at scale, which makes failures common events in supercomputing scenarios. Considering the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-07 Li Tan, Nathan DeBardeleben

Distributed Systems involve two or more computer systems which may be situated at geographically distinct locations and are connected by a communication network. Due to failures in the communication link, faults arise which may make the…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-20 Poorva Kulkarni, Varsha Deshpande, Latika Sarna, Sumedha Shenolikar, Supriya Kelkar

As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-12 Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Reliability is extremely important for large-scale cloud systems like Microsoft 365. Cloud failures such as disk failure, node failure, etc. threaten service reliability, resulting in online service interruptions and economic loss. Existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-07 Fangkai Yang, Wenjie Yin, Lu Wang, Tianci Li, Pu Zhao, Bo Liu, Paul Wang, Bo Qiao, Yudong Liu, Mårten Björkman, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

Fault diagnosis has attracted extensive attention for its importance in the fault management framework for cloud virtualization, even though fault diagnosis becomes more difficult due to the increasing scalability and…

Software Engineering · Computer Science 2015-07-30 Ameen Alkasem, Hongwei Liu, Zuo Decheng, Yao Zhao