Related papers: Optimal Parallel Algorithms for Dendrogram Computa…
Single-linkage clustering is a popular form of hierarchical agglomerative clustering (HAC) where the distance between two clusters is defined as the minimum distance between any pair of points across the two clusters. In single-linkage HAC,…
This paper presents PANDORA, a novel parallel algorithm for efficiently constructing dendrograms for single-linkage hierarchical clustering, including HDBSCAN$^*$. Traditional dendrogram construction methods from a minimum spanning tree…
This paper presents new parallel algorithms for generating Euclidean minimum spanning trees and spatial clustering hierarchies (known as HDBSCAN$^*$). Our approach is based on generating a well-separated pair decomposition followed by using…
We address the problem of computing a single linkage dendrogram. A possible approach is to: (i) Form an edge weighted graph $G$ over the data, with edge weights reflecting dissimilarities. (ii) Calculate the MST $T$ of $G$. (iii) Break the…
This paper presents new deterministic and distributed low-diameter decomposition algorithms for weighted graphs. In particular, we show that if one can efficiently compute approximate distances in a parallel or a distributed setting, one…
We present the design and analysis of a near linear-work parallel algorithm for solving symmetric diagonally dominant (SDD) linear systems. Given an SDD $n$-by-$n$ matrix $A$ with $m$ non-zero entries and a vector $b$, our algorithm…
Convex clustering is a modern clustering framework that guarantees globally optimal solutions and performs comparably to other advanced clustering methods. However, obtaining a complete dendrogram (clusterpath) for large-scale datasets…
One of the main challenges for hierarchical clustering is how to appropriately identify the representative points in the lower level of the cluster tree, which are going to be utilized as the roots in the higher level of the cluster tree…
Hierarchical clustering and community detection are important problems in machine learning and complex network analysis. A common approach to identify clusters is to simply cut dendrograms at some threshold. However, single-level cuts are…
We derive a statistical model for estimation of a dendrogram from single linkage hierarchical clustering (SLHC) that takes account of uncertainty through noise or corruption in the measurements of separation of data. Our focus is on just…
Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as $k$-center, $k$-median, and $k$-means. Such algorithms…
Clustering multidimensional points is a fundamental data mining task, with applications in many fields, such as astronomy, neuroscience, bioinformatics, and computer vision. The goal of clustering algorithms is to group similar objects…
Previously, we proposed a physically inspired method that organizes data points into an effective in-tree (IT) structure, which clearly reveals the underlying cluster structure of the dataset. Although there are some edges in the IT…
The minimum spanning tree clustering algorithm is capable of detecting clusters with irregular boundaries. In this paper we propose two minimum spanning tree-based clustering algorithms. The first algorithm produces $k$ clusters with center…
This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative…
Modern trends in data collection are bringing current mainstream techniques for database query processing to their limits. Consequently, various novel approaches for efficient query processing are being actively studied. One such approach…
Parallelism has become a central concern in modern decoding frameworks aiming to meet stringent throughput and latency requirements. Guessing Random Additive Noise Decoding (GRAND) is a recently proposed decoding paradigm that tests…
We show fast deterministic algorithms for fundamental problems on forests in the challenging low-space regime of the well-known Massively Parallel Computation (MPC) model. A recent breakthrough result by Coy and Czumaj [STOC'22] shows that,…
We present a new way to summarize and select mixture models via the hierarchical clustering tree (dendrogram) constructed from an overfitted latent mixing measure. Our proposed method bridges agglomerative hierarchical clustering and…
Search trees on trees (STTs) generalize the fundamental binary search tree (BST) data structure: in STTs the underlying search space is an arbitrary tree, whereas in BSTs it is a path. An optimal BST of size $n$ can be computed for a given…
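Several of the abstracts above rely on the link between single-linkage clustering and the minimum spanning tree (MST). The following is a minimal sequential sketch of that link, not the method of any listed paper: it assumes Euclidean distances on a complete graph and uses Kruskal's algorithm with a union-find structure, so that each accepted MST edge corresponds to one dendrogram merge. The function name `single_linkage_merges` and the output layout are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def single_linkage_merges(points):
    """Return [(merge_height, size_of_merged_cluster), ...] in merge order.

    Illustrative sketch: Kruskal's MST on the complete Euclidean graph.
    Each accepted edge joins two clusters at height equal to its weight.
    """
    n = len(points)
    parent = list(range(n))  # union-find parent pointers
    size = [1] * n           # cluster size stored at each root

    def find(x):
        # Find the root of x, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise edges, sorted by increasing distance.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    merges = []
    for weight, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # edge would close a cycle, so it is not an MST edge
        if size[ri] < size[rj]:
            ri, rj = rj, ri  # union by size: attach smaller under larger
        parent[rj] = ri
        size[ri] += size[rj]
        merges.append((weight, size[ri]))
        if len(merges) == n - 1:
            break  # the MST is complete
    return merges


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (5, 0), (5, 2)]
    for height, cluster_size in single_linkage_merges(pts):
        print(f"merge at distance {height:.1f} -> cluster of {cluster_size} points")
```

This sketch examines all $O(n^2)$ pairs and processes edges one at a time, so it only illustrates the MST-to-dendrogram correspondence; the parallel algorithms summarized above target exactly the parts of this procedure that do not scale.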