Related papers: Fast Clustering using MapReduce

Parallel K-Medoids++ Spatial Clustering Algorithm Based on MapReduce

Clustering analysis has received considerable attention in spatial data mining for several years. With the rapid development of the geospatial information technologies, the size of spatial information data is growing exponentially which…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-25 Xia Yue , Wang Man , Jun Yue , Guangcao Liu

Faster Balanced Clusterings in High Dimension

The problem of constrained clustering has attracted significant attention in the past decades. In this paper, we study the balanced $k$-center, $k$-median, and $k$-means clustering problems where the size of each cluster is constrained by…

Computational Geometry · Computer Science 2018-09-11 Hu Ding

Near-Optimal Clustering in the $k$-machine model

The clustering problem, in its many variants, has numerous applications in operations research and computer science (e.g., in applications in bioinformatics, image processing, social network analysis, etc.). As sizes of data sets have grown…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-24 Sayan Bandyapadhyay , Tanmay Inamdar , Shreyas Pai , Sriram V. Pemmaraju

Solving $k$-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially

Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular $k$-center variant which, given a set $S$ of points from some metric space and a…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-02 Matteo Ceccarello , Andrea Pietracaprina , Geppino Pucci

Improved Approximation Algorithms for Relational Clustering

Clustering plays a crucial role in computer science, facilitating data analysis and problem-solving across numerous fields. By partitioning large datasets into meaningful groups, clustering reveals hidden structures and relationships within…

Databases · Computer Science 2026-02-19 Aryan Esmailpour , Stavros Sintos

Simple, Scalable and Effective Clustering via One-Dimensional Projections

Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and $k$-means++ can take $\Omega(ndk)$ time when clustering $n$ points in…

Machine Learning · Computer Science 2023-10-26 Moses Charikar , Monika Henzinger , Lunjia Hu , Maxmilian Vötsch , Erik Waingarten

An efficient K -means clustering algorithm for massive data

The analysis of continously larger datasets is a task of major importance in a wide variety of scientific fields. In this sense, cluster analysis algorithms are a key element of exploratory data analysis, due to their easiness in the…

Machine Learning · Statistics 2018-01-10 Marco Capó , Aritz Pérez , Jose A. Lozano

Accurate MapReduce Algorithms for $k$-median and $k$-means in General Metric Spaces

Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular $k$-median and $k$-means variants which, given a set $P$ of points from a metric…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 Alessio Mazzetto , Andrea Pietracaprina , Geppino Pucci

Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search

For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7\sim10^9$ points in high-dimensions ($d\geq100$). All current practical methods for this problem have runtimes at least…

Machine Learning · Computer Science 2025-02-11 Jack Spalding-Jamieson , Eliot Wong Robson , Da Wei Zheng

How to Use K-means for Big Data Clustering?

K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of…

Machine Learning · Computer Science 2023-11-27 Rustam Mussabayev , Nenad Mladenovic , Bassem Jarboui , Ravil Mussabayev

Data-Driven Clustering via Parameterized Lloyd's Families

Algorithms for clustering points in metric spaces is a long-studied area of research. Clustering has seen a multitude of work both theoretically, in understanding the approximation guarantees possible for many objective functions such as…

Data Structures and Algorithms · Computer Science 2019-05-27 Maria-Florina Balcan , Travis Dick , Colin White

Explainable $k$-Means and $k$-Medians Clustering

Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a…

Machine Learning · Computer Science 2020-09-23 Sanjoy Dasgupta , Nave Frost , Michal Moshkovitz , Cyrus Rashtchian

Clustering of Big Data with Mixed Features

Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require a priori the unknown number of clusters. We…

Machine Learning · Statistics 2020-11-13 Joshua Tobin , Mimi Zhang

Universal Algorithms for Clustering Problems

This paper presents universal algorithms for clustering problems, including the widely studied $k$-median, $k$-means, and $k$-center objectives. The input is a metric space containing all potential client locations. The algorithm must…

Data Structures and Algorithms · Computer Science 2021-07-16 Arun Ganesh , Bruce M. Maggs , Debmalya Panigrahi

Fair $k$-Center: a Coreset Approach in Low Dimensions

Center-based clustering techniques are fundamental in some areas of machine learning such as data summarization. Generic $k$-center algorithms can produce biased cluster representatives so there has been a recent interest in fair $k$-center…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-21 Jinxiang Gan , Mordecai Golin , Zonghan Yang , Yuhao Zhang

A parallel sampling based clustering

The problem of automatically clustering data is an age old problem. People have created numerous algorithms to tackle this problem. The execution time of any of this algorithm grows with the number of input points and the number of cluster…

Machine Learning · Computer Science 2014-12-08 Aditya AV Sastry , Kalyan Netti

Balanced $k$-Center Clustering When $k$ Is A Constant

The problem of constrained $k$-center clustering has attracted significant attention in the past decades. In this paper, we study balanced $k$-center cluster where the size of each cluster is constrained by the given lower and upper bounds.…

Computational Geometry · Computer Science 2017-04-11 Hu Ding

Near-Optimal Algorithms for Constrained k-Center Clustering with Instance-level Background Knowledge

Center-based clustering has attracted significant research interest from both theory and practice. In many practical applications, input data often contain background knowledge that can be used to improve clustering results. In this work,…

Machine Learning · Computer Science 2025-06-13 Longkun Guo , Chaoqi Jia , Kewen Liao , Zhigang Lu , Minhui Xue

Probabilistic Partitive Partitioning (PPP)

Clustering is a NP-hard problem. Thus, no optimal algorithm exists, heuristics are applied to cluster the data. Heuristics can be very resource-intensive, if not applied properly. For substantially large data sets computational efficiencies…

Databases · Computer Science 2020-03-11 Mujahid Sultan

Analyzing Large-Scale, Distributed and Uncertain Data

The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively…

Databases · Computer Science 2017-12-06 Yaron Gonen