Related papers: Tracking System Behaviour from Resource Usage Data
High performance computing (HPC) facilities consist of a large number of interconnected computing units (or nodes) that execute highly complex scientific simulations to support scientific research. Monitoring such facilities, in real-time,…
High-performance computing (HPC) systems are a complex combination of software, processors, memory, networks, and storage systems characterized by frequent disruptive technological advances. Anomalous behavior has to be manually diagnosed…
Detecting and resolving performance anomalies in Cloud services is crucial for maintaining desired performance objectives. Scaling actions triggered by an anomaly detector help achieve target latency at the cost of extra resource…
This paper provides an overview of three notable approaches for detecting anomalies in spatio-temporal data. The three reviewed methods are selected from the framework of multivariate statistical process control (SPC), scan statistics, and…
Software performance modeling plays a crucial role in developing and maintaining software systems. A performance model analytically describes the relationship between the performance of a system and its runtime activities. This process…
As contemporary software-intensive systems reach increasingly large scale, it is imperative that failure detection schemes be developed to help prevent costly system downtimes. A promising direction towards the construction of such schemes…
The ability to understand how a scientific application is executed on a large HPC system is of great importance in allocating resources within the HPC data center. In this paper, we describe how we used system performance data to identify:…
Spatiotemporal traffic time series, such as traffic speed data, collected from sensing systems are often incomplete, with considerable corruption and large amounts of missing values. A vast amount of data conceals implicit data structures,…
Failure detection in telecommunication networks is a vital task. So far, several supervised and unsupervised solutions have been provided for discovering failures in such networks. Among them, unsupervised approaches have attracted more…
Anomaly detection in spatiotemporal data is a challenging problem encountered in a variety of applications, including video surveillance, medical imaging data, and urban traffic monitoring. Existing anomaly detection methods focus mainly on…
Reliability is a cumbersome problem in the evolution of High Performance Computing systems and data centers. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper…
Spatiotemporal traffic data (e.g., link speed/flow) collected from sensor networks can be organized as multivariate time series with additional spatial attributes. A crucial task in analyzing such data is to identify and detect anomalous…
While detailed resource usage monitoring is possible at the low level using proper tools, associating such usage with the higher-level abstractions in the application layer that actually cause the resource usage in the first place presents a…
Event detection is gaining increasing attention in smart cities research. Large-scale mobility data serves as an important tool to uncover the dynamics of urban transportation systems, and more often than not the dataset is incomplete. In…
Energy efficiency is one of the major concerns in designing advanced computing infrastructures. From single nodes to large-scale systems (data centers), monitoring the energy consumption of the computing system when applications run is a…
Understanding the behavior of software in execution is a key step in identifying and fixing performance issues. This is especially important in high performance computing contexts where even minor performance tweaks can translate into large…
Most enterprise applications use logging as a mechanism to diagnose anomalies, which could help with reducing system downtime. Anomaly detection using software execution logs has been explored in several prior studies, using both classical…
Tensor completion is an extension of matrix completion aimed at recovering a multiway data tensor by leveraging a given subset of its entries (observations) and the pattern of observation. The low-rank assumption is key in establishing a…
Complex networks have now become integral parts of modern information infrastructures. This paper proposes a user-centric method for detecting anomalies in heterogeneous information networks, in which nodes and/or edges might be from…
The complexity and ubiquity of modern computing systems are fertile ground for anomalies, including security and privacy breaches. In this paper, we propose a new methodology that addresses the practical challenges to implement anomaly…