Related papers: Enhancing Failure Propagation Analysis in Cloud Co…
Cloud computing systems fail in complex and unexpected ways due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a…
Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages…
Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not…
Identifying the failure modes of cloud computing systems is a difficult and time-consuming task, due to the growing complexity of such systems, and the large volume and noisiness of failure data. This paper presents a novel approach for…
Software bugs in cloud management systems often cause erratic behavior, hindering detection, and recovery of failures. As a consequence, the failures are not timely detected and notified, and can silently propagate through the system. To…
Cloud application services are distributed in nature and have components across the stack working together to deliver the experience to end users. The wide adoption of microservice architecture exacerbates failure management due to…
The reliability of cloud platforms is of significant relevance because society increasingly relies on complex software systems running on the cloud. To improve it, cloud providers are automating various maintenance tasks, with failure…
The momentum gained by microservices and cloud-native software architecture has pushed today's enterprise IT towards multi-service applications. The proliferation of services and service interactions within applications, often consisting of…
Debugging Cyber-Physical System (CPS) models can be extremely complex. Indeed, merely detecting a failure is insufficient to know how to correct a faulty model. Faults can propagate in time and in space producing observable…
Fault localization is an imperative method in fault tolerance in a distributed environment that designs a blueprint for continuing the ongoing process even when one or many modules are non-functional. Visualizing a distributed environment…
Serverless applications can be particularly difficult to troubleshoot, as these applications are often composed of various managed and partly managed services. Faults are often unpredictable and can occur at multiple points, even in simple…
The complexity of IoT systems and their susceptibility to failures pose significant challenges in ensuring their reliable operation. Failures can be internally generated or caused by external factors, impacting both the system's correctness and its…
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and…
Anomaly detection, an important technique in cloud computing, is applied to support the smooth running of the cloud platform. Traditional detection methods based on statistics, analysis, etc. lead to a high false-alarm rate due to…
Cloud platforms, under the hood, consist of a complex inter-connected stack of hardware and software components. Each of these components can fail which may lead to an outage. Our goal is to improve the quality of Cloud services through…
Large-scale computing systems today are assembled from numerous computing units to provide the massive computational capability needed to solve problems at scale, which makes failures common events in supercomputing scenarios. Considering the…
Distributed Systems involve two or more computer systems which may be situated at geographically distinct locations and are connected by a communication network. Due to failures in the communication link, faults arise which may make the…
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating…
Reliability is extremely important for large-scale cloud systems like Microsoft 365. Cloud failures such as disk failure, node failure, etc. threaten service reliability, resulting in online service interruptions and economic loss. Existing…
Fault diagnosis has attracted extensive attention for its importance in the fault management framework for cloud virtualization, even though fault diagnosis becomes more difficult due to the increasing scalability and…