In-database connected component analysis

Data Structures and Algorithms 2019-10-18 v2 Distributed, Parallel, and Cluster Computing

Abstract

We describe a Big Data-practical, SQL-implementable algorithm for efficiently determining connected components for graph data stored in a Massively Parallel Processing (MPP) relational database. The algorithm described is a linear-space, randomised algorithm, always terminating with the correct answer but subject to a stochastic running time, such that for any $\epsilon>0$ and any input graph $G=\langle V, E \rangle$ the algorithm terminates after $\mathop{\text{O}}(\log |V|)$ SQL queries with probability of at least $1-\epsilon$, which we show empirically to translate to a quasi-linear runtime in practice.
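
To make the idea of a randomised, SQL-only connected-components computation concrete, here is a minimal sketch. It is not taken from the paper and should not be read as the authors' algorithm: it is a generic random-contraction scheme, run against an in-memory SQLite database rather than an MPP system. The table names (`edge`, `next_edge`, `pick`, `label`), the function `connected_components`, the prime `P`, and the affine map `(mul * v + add) mod P` used as a stand-in for a random vertex ordering are all assumptions made for illustration. Vertex ids are assumed to be integers in `[0, P)`.

```python
import random
import sqlite3

P = 2_147_483_647  # the prime 2**31 - 1; vertex ids are assumed to lie in [0, P)


def connected_components(edges, vertices=(), seed=None):
    """Return {vertex: representative} using repeated random contraction in SQL.

    Illustrative sketch only: each round issues a fixed number of SQL statements.
    """
    rng = random.Random(seed)
    edges = [(a, b) for a, b in edges]
    all_vertices = set(vertices) | {v for e in edges for v in e}
    if any(not 0 <= v < P for v in all_vertices):
        raise ValueError("vertex ids must be integers in [0, P)")

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE edge (a INTEGER, b INTEGER);
        CREATE TABLE next_edge (a INTEGER, b INTEGER);
        CREATE TABLE pick (v INTEGER PRIMARY KEY, r INTEGER);
        CREATE TABLE label (v INTEGER PRIMARY KEY, comp INTEGER);
    """)
    # Store every edge in both directions and drop self-loops.
    con.executemany(
        "INSERT INTO edge VALUES (?, ?)",
        [(x, y) for a, b in edges if a != b for x, y in ((a, b), (b, a))],
    )
    # Initially every vertex is its own representative.
    con.executemany("INSERT INTO label VALUES (?, ?)", [(v, v) for v in all_vertices])

    while con.execute("SELECT EXISTS (SELECT 1 FROM edge)").fetchone()[0]:
        # Random affine bijection h(v) = (mul * v + add) mod P gives a fresh ordering.
        mul, add = rng.randrange(1, P), rng.randrange(P)
        params = {"mul": mul, "add": add, "inv": pow(mul, P - 2, P), "p": P}

        # Each active vertex picks the vertex of minimal h in its closed
        # neighbourhood; h is inverted to recover that vertex's id.
        con.execute("DELETE FROM pick")
        con.execute("""
            INSERT INTO pick (v, r)
            SELECT v, ((MIN(h) - :add + :p) % :p) * :inv % :p
            FROM (SELECT a AS v, (:mul * b + :add) % :p AS h FROM edge
                  UNION ALL
                  SELECT a AS v, (:mul * a + :add) % :p AS h FROM edge) AS n
            GROUP BY v
        """, params)

        # Redirect existing labels through this round's picks.
        con.execute("""
            UPDATE label
            SET comp = (SELECT r FROM pick WHERE pick.v = label.comp)
            WHERE comp IN (SELECT v FROM pick)
        """)

        # Contract the graph: map both endpoints, drop loops and duplicates.
        con.execute("DELETE FROM next_edge")
        con.execute("""
            INSERT INTO next_edge (a, b)
            SELECT DISTINCT pa.r, pb.r
            FROM edge AS e
            JOIN pick AS pa ON pa.v = e.a
            JOIN pick AS pb ON pb.v = e.b
            WHERE pa.r <> pb.r
        """)
        con.execute("DELETE FROM edge")
        con.execute("INSERT INTO edge SELECT a, b FROM next_edge")

    return dict(con.execute("SELECT v, comp FROM label"))


if __name__ == "__main__":
    comp = connected_components([(1, 2), (2, 3), (4, 5)], vertices=[6], seed=1)
    print(comp)
```

Two vertices end up in the same component exactly when they receive the same representative. The sketch always terminates, because the vertex with the largest `h` in any component that still has edges is never picked, so the set of active vertices shrinks every round. The $\mathop{\text{O}}(\log |V|)$ bound on the number of SQL queries is the paper's result for its own algorithm and is not established here for this simplified version.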

Cite

@article{arxiv.1802.09478,
  title  = {In-database connected component analysis},
  author = {Harald Bögeholz and Michael Brand and Radu-Alexandru Todor},
  journal= {arXiv preprint arXiv:1802.09478},
  year   = {2019}
}

Comments

major revision with new datasets
