Related papers: Debugging OpenStack Problems Using a State Graph A…
Open-source software (OSS) is a critical component of modern software systems, yet supply chain security remains challenging in practice due to unavailable or obfuscated source code. Consequently, security teams often rely on runtime…
Anomaly detection is a crucial task in complex distributed systems. A thorough understanding of the requirements and challenges of anomaly detection is pivotal to the security of such systems, especially for real-world deployment. While…
We present the Open Graph Benchmark (OGB), a diverse set of challenging and realistic benchmark datasets to facilitate scalable, robust, and reproducible graph machine learning (ML) research. OGB datasets are large-scale, encompass multiple…
Revealing the continuous dynamics on the networks is essential for understanding, predicting, and even controlling complex systems, but it is hard to learn and model the continuous network dynamics because of complex and unknown governing…
The rise of graph analytic systems has created a need for ways to measure and compare the capabilities of these systems. Graph analytics present unique scalability difficulties. The machine learning, high performance computing, and visual…
Software bugs in cloud management systems often cause erratic behavior, hindering detection, and recovery of failures. As a consequence, the failures are not timely detected and notified, and can silently propagate through the system. To…
Learning distributions of graphs can be used for automatic drug discovery, molecular design, complex network analysis, and much more. We present an improved framework for learning generative models of graphs based on the idea of deep state…
We propose a novel Graph Neural Network (GNN) model, named DeepStateGNN, for analyzing traffic data, demonstrating its efficacy in two critical tasks: forecasting and reconstruction. Unlike typical GNN methods that treat each traffic sensor…
Designing and debugging distributed systems is notoriously difficult. The correctness of a distributed system is largely determined by its handling of failure scenarios. The sequence of events leading to a bug can be long and complex, and…
Self-Organized Criticality (SOC) is a ubiquitous dynamical phenomenon believed to be responsible for the emergence of universal scale-invariant behavior in many, seemingly unrelated systems, such as forest fires, virus spreading or atomic…
Current applications have produced graphs on the order of hundreds of thousands of nodes and millions of edges. To take advantage of such graphs, one must be able to find patterns, outliers and communities. These tasks are better performed…
Cloud computing systems fail in complex and unexpected ways due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a…
Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not…
The Open Science Grid(OSG) is a world-wide computing system which facilitates distributed computing for scientific research. It can distribute a computationally intensive job to geo-distributed clusters and process job's tasks in parallel.…
With the proliferation of large irregular sparse relational datasets, new storage and analysis platforms have arisen to fill gaps in performance and capability left by conventional approaches built on traditional database technologies and…
Distributed systems in general and cloud systems in particular, are susceptible to failures that can lead to substantial economic and data losses, security breaches, and even potential threats to human safety. Software ageing is an example…
On-stack replacement (OSR) dynamically transfers execution between different code versions. This mechanism is used in mainstream runtime systems to support adaptive and speculative optimizations by running code tailored to provide the best…
Self-stabilizing algorithms are an important because of their robustness and guaranteed convergence. Starting from any arbitrary state, a self-stabilizing algorithm is guaranteed to converge to a legitimate state.Those algorithms are not…
In order to plan for failure recovery, the designers of cloud systems need to understand how their system can potentially fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming, due to…
The graph identification problem consists of discovering the interactions among nodes in a network given their state/feature trajectories. This problem is challenging because the behavior of a node is coupled to all the other nodes by the…