Related papers: A Weighted K-Center Algorithm for Data Subset Sele…

An efficient K-means algorithm for Massive Data

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. Even though datasets have grown in size, the K-means algorithm…

Machine Learning · Statistics 2016-05-11 Marco Capó, Aritz Pérez, José Antonio Lozano

A sampling-based approach for efficient clustering in large datasets

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high-performance by evaluating distances of datapoints with a subset of the cluster centres. Our…

Machine Learning · Computer Science 2022-03-30 Georgios Exarchakis, Omar Oubari, Gregor Lenz

Fair $k$-Center: a Coreset Approach in Low Dimensions

Center-based clustering techniques are fundamental in some areas of machine learning such as data summarization. Generic $k$-center algorithms can produce biased cluster representatives so there has been a recent interest in fair $k$-center…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-21 Jinxiang Gan, Mordecai Golin, Zonghan Yang, Yuhao Zhang

Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions

In this paper, we investigate the learning-augmented $k$-median clustering problem, which aims to improve the performance of traditional clustering algorithms by preprocessing the point set with a predictor of error rate $\alpha \in [0,1)$.…

Data Structures and Algorithms · Computer Science 2026-03-12 Kangke Cheng, Shihong Song, Guanlin Mo, Hu Ding

On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions

Modern datasets span billions of samples, making training on all available data infeasible. Selecting a high quality subset helps in reducing training costs and enhancing model quality. Submodularity, a discrete analogue of convexity, is…

Machine Learning · Computer Science 2025-04-04 Maximilian Böther, Abraham Sebastian, Pranjal Awasthi, Ana Klimovic, Srikumar Ramalingam

The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection

As machine learning tasks continue to evolve, the trend has been to gather larger datasets and train increasingly larger models. While this has led to advancements in accuracy, it has also escalated computational costs to unsustainable…

Machine Learning · Computer Science 2024-06-03 Mohammad Jafari, Yimeng Zhang, Yihua Zhang, Sijia Liu

Improved constant approximation factor algorithms for $k$-center problem for uncertain data

In real applications, database systems should be able to manage and process data with uncertainty. Any real dataset may have missing or rounded values, also the values of data may change by time. So, it becomes important to handle these…

Computational Geometry · Computer Science 2020-06-12 Sharareh Alipour

Accurate MapReduce Algorithms for $k$-median and $k$-means in General Metric Spaces

Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular $k$-median and $k$-means variants which, given a set $P$ of points from a metric…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 Alessio Mazzetto, Andrea Pietracaprina, Geppino Pucci

Faster Balanced Clusterings in High Dimension

The problem of constrained clustering has attracted significant attention in the past decades. In this paper, we study the balanced $k$-center, $k$-median, and $k$-means clustering problems where the size of each cluster is constrained by…

Computational Geometry · Computer Science 2018-09-11 Hu Ding

Clustering of Big Data with Mixed Features

Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require a priori the unknown number of clusters. We…

Machine Learning · Statistics 2020-11-13 Joshua Tobin, Mimi Zhang

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and…

Machine Learning · Computer Science 2024-02-28 Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

A 2-Approximation Algorithm for Data-Distributed Metric k-Center

In a metric space, a set of point sets of roughly the same size and an integer $k\geq 1$ are given as the input and the goal of data-distributed $k$-center is to find a subset of size $k$ of the input points as the set of centers to…

Computational Geometry · Computer Science 2023-09-11 Sepideh Aghamolaei, Mohammad Ghodsi

Robust Coreset Construction for Distributed Machine Learning

Coreset, which is a summary of the original dataset in the form of a small weighted set in the same sample space, provides a promising approach to enable machine learning over distributed data. Although viewed as a proxy of the original…

Machine Learning · Computer Science 2020-06-24 Hanlin Lu, Ming-Ju Li, Ting He, Shiqiang Wang, Vijaykrishnan Narayanan, Kevin S Chan

Fast k-means algorithm clustering

k-means has recently been recognized as one of the best algorithms for clustering unsupervised data. Since k-means depends mainly on distance calculation between all data points and the centers, the time cost will be high when the size of…

Data Structures and Algorithms · Computer Science 2011-08-08 Raied Salman, Vojislav Kecman, Qi Li, Robert Strack, Erik Test

Approximation Algorithms for Clustering with Dynamic Points

We study two generalizations of classic clustering problems called dynamic ordered $k$-median and dynamic $k$-supplier, where the points that need clustering evolve over time, and we are allowed to move the cluster centers between…

Data Structures and Algorithms · Computer Science 2022-07-26 Shichuan Deng, Jian Li, Yuval Rabani

DISCERN: Diversity-based Selection of Centroids for k-Estimation and Rapid Non-stochastic Clustering

One of the applications of center-based clustering algorithms such as K-Means is partitioning data points into K clusters. In some examples, the feature space relates to the underlying problem we are trying to solve, and sometimes we can…

Machine Learning · Computer Science 2020-09-23 Ali Hassani, Amir Iranmanesh, Mahdi Eftekhari, Abbas Salemi

Near-Optimal Algorithms for Constrained k-Center Clustering with Instance-level Background Knowledge

Center-based clustering has attracted significant research interest from both theory and practice. In many practical applications, input data often contain background knowledge that can be used to improve clustering results. In this work,…

Machine Learning · Computer Science 2025-06-13 Longkun Guo, Chaoqi Jia, Kewen Liao, Zhigang Lu, Minhui Xue

Scalable k-Means Clustering via Lightweight Coresets

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive…

Machine Learning · Statistics 2018-06-08 Olivier Bachem, Mario Lucic, Andreas Krause

An efficient K-means clustering algorithm for massive data

The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. In this sense, cluster analysis algorithms are a key element of exploratory data analysis, due to their easiness in the…

Machine Learning · Statistics 2018-01-10 Marco Capó, Aritz Pérez, Jose A. Lozano

Fast Approximate $K$-Means via Cluster Closures

$K$-means, a simple and effective clustering algorithm, is one of the most widely used algorithms in multimedia and computer vision community. Traditional $k$-means is an iterative algorithm---in each iteration new cluster centers are…

Computer Vision and Pattern Recognition · Computer Science 2013-12-12 Jingdong Wang, Jing Wang, Qifa Ke, Gang Zeng, Shipeng Li