A Generic Distributed Clustering Framework for Massive Data

Pingyi Luo; Qiang Huang; Anthony K. H. Tung

Databases 2021-06-22 v1 Distributed, Parallel, and Cluster Computing

Abstract

In this paper, we introduce a novel Generic distributEd clustEring frameworK (GEEK) that goes beyond $k$-means clustering to process massive amounts of data. To deal with different data types, GEEK first converts data in the original feature space into a unified format of buckets; we then design a new Seeding method based on simILar bucKets (SILK) to determine the initial seeds. Unlike state-of-the-art seeding methods such as $k$-means++ and its variants, which require $k$ to be specified in advance, SILK automatically identifies the number of initial seeds based on the closeness of shared data objects in similar buckets. Thus, its time complexity is independent of $k$. With these well-selected initial seeds, GEEK needs only a one-pass data assignment to obtain the final clusters. We implement GEEK on a distributed CPU-GPU platform for large-scale clustering. We evaluate GEEK on five large-scale real-life datasets and show that it can handle massive data of different types and is comparable to (or even better than) many state-of-the-art customized GPU-based methods, especially for large values of $k$.
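The one-pass assignment step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the seeds are already given (how SILK produces them is not shown here), that objects are numeric vectors, and that closeness is measured by Euclidean distance. The function name `one_pass_assign` and the toy data are hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def one_pass_assign(points, seeds):
    """Return, for each point, the index of its nearest seed.

    Assumption: seeds are fixed in advance, so every point is visited
    exactly once and no iterative re-centering (as in Lloyd's k-means
    iterations) takes place.
    """
    labels = []
    for p in points:
        # Pick the seed index with the smallest distance to this point.
        nearest = min(range(len(seeds)), key=lambda j: dist(p, seeds[j]))
        labels.append(nearest)
    return labels


# Hypothetical toy data: two well-separated groups and one seed per group.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
seeds = [(0.0, 0.0), (5.0, 5.0)]
labels = one_pass_assign(points, seeds)
```

Because the seeds never move, the cost of this step is a single scan over the data, which is why the quality of the initial seeds matters so much in a one-pass design.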

Keywords

cluster analysis

Cite

@article{arxiv.2106.10515,
  title  = {A Generic Distributed Clustering Framework for Massive Data},
  author = {Pingyi Luo and Qiang Huang and Anthony K. H. Tung},
  journal= {arXiv preprint arXiv:2106.10515},
  year   = {2021}
}

Comments

11 pages, 7 figures
