CausIL: Causal Graph for Instance Level Microservice Data

Distributed, Parallel, and Cluster Computing 2023-03-21 v2

Abstract

AI-based monitoring has become crucial for cloud-based services because of their scale. A common approach to AI-based monitoring is to detect causal relationships among service components and build a causal graph. The availability of domain information makes cloud systems even better suited to such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques, as they neither incorporate the system's architectural domain information nor provide a way to model compute distributed across a varying number of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from the system architecture. For application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. A simulation study shows the efficacy of CausIL over baselines, improving graph estimation accuracy by ~25% as measured by Structural Hamming Distance, while a real-world dataset demonstrates CausIL's applicability in deployment settings.
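
For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of one common variant of Structural Hamming Distance, not the authors' evaluation code. It assumes graphs are given as square 0/1 adjacency matrices (entry [i][j] = 1 meaning an edge from node i to node j) and that a reversed edge counts as a single error; other variants count a reversal as two. The function name and the three-node example graph are hypothetical.

```python
def structural_hamming_distance(true_adj, est_adj):
    """Count node pairs whose edge status differs between two directed graphs.

    Assumption: a missing edge, an extra edge, and a reversed edge each
    contribute 1 to the distance. Self-loops (the diagonal) are ignored.
    """
    n = len(true_adj)
    if len(est_adj) != n or any(len(row) != n for row in list(true_adj) + list(est_adj)):
        raise ValueError("both graphs must be square matrices of the same size")

    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Edge status of the unordered pair {i, j}: (i -> j, j -> i)
            true_pair = (bool(true_adj[i][j]), bool(true_adj[j][i]))
            est_pair = (bool(est_adj[i][j]), bool(est_adj[j][i]))
            if true_pair != est_pair:
                distance += 1
    return distance


# Hypothetical ground truth: 0 -> 1 -> 2
true_graph = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
# Hypothetical estimate: 1 -> 0 (reversed), 0 -> 2 (extra), and 1 -> 2 missing
estimated_graph = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]

shd = structural_hamming_distance(true_graph, estimated_graph)
print(shd)
```

A lower value means the estimated graph is structurally closer to the reference graph; a value of 0 means the two graphs have identical edges.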

Cite

@article{arxiv.2303.00554,
  title  = {CausIL: Causal Graph for Instance Level Microservice Data},
  author = {Sarthak Chakraborty and Shaddy Garg and Shubham Agarwal and Ayush Chauhan and Shiv Kumar Saini},
  journal= {arXiv preprint arXiv:2303.00554},
  year   = {2023}
}

Comments

Accepted to the Proceedings of the ACM Web Conference 2023 (WWW '23)