Related papers: Binary Bleed: Fast Distributed and Parallel Method…

Near-perfect Clustering Based on Recursive Binary Splitting Using Max-MMD

We develop novel clustering algorithms for functional data when the number of clusters $K$ is unknown and also when it is prefixed. These algorithms are developed based on the Maximum Mean Discrepancy (MMD) measure between two sets of…

Methodology · Statistics 2025-07-16 Sourav Chakrabarty, Anirvan Chakraborty, Shyamal K. De

Simultaneous Estimation of Number of Clusters and Feature Sparsity in Clustering High-Dimensional Data

Estimating the number of clusters (K) is a critical and often difficult task in cluster analysis. Many methods have been proposed to estimate K, including some top performers using a resampling approach. When performing cluster analysis in…

Methodology · Statistics 2019-09-05 Yujia Li, Xiangrui Zeng, Chien-Wei Lin, George Tseng

Effective Sampling: Fast Segmentation Using Robust Geometric Model Fitting

Identifying the underlying models in a set of data points contaminated by noise and outliers leads to a highly complex multi-model fitting problem. This problem can be posed as a clustering problem by the projection of higher order…

Computer Vision and Pattern Recognition · Computer Science 2018-08-01 Ruwan Tennakoon, Alireza Sadri, Reza Hoseinnezhad, Alireza Bab-Hadiashar

Comparing Cluster-Based Cross-Validation Strategies for Machine Learning Model Evaluation

Cross-validation plays a fundamental role in Machine Learning, enabling robust evaluation of model performance and preventing overestimation on training and validation data. However, one of its drawbacks is the potential to create data…

Machine Learning · Computer Science 2025-08-28 Afonso Martini Spezia, Thomas Fontanari, Mariana Recamonde-Mendoza

CAS Condensed and Accelerated Silhouette: An Efficient Method for Determining the Optimal K in K-Means Clustering

Clustering is a critical component of decision-making in today's data-driven environments. It has been widely used in a variety of fields such as bioinformatics, social network analysis, and image processing. However, clustering accuracy…

Machine Learning · Computer Science 2025-07-14 Krishnendu Das, Sumit Gupta, Awadhesh Kumar

A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data

Recent advancements in Mixed Integer Optimization (MIO) algorithms, paired with hardware enhancements, have led to significant speedups in resolving MIO problems. These strategies have been utilized for optimal subset selection,…

Methodology · Statistics 2024-03-27 Madhav Sankaranarayanan, Intekhab Hossain, Tom Chen

Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming

$K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Recently, semidefinite programming (SDP) relaxations have been proposed for solving the $K$-means optimization problem, which enjoy…

Machine Learning · Statistics 2024-04-16 Yubo Zhuang, Xiaohui Chen, Yun Yang, Richard Y. Zhang

A Binary Optimization Approach for Constrained K-Means Clustering

K-Means clustering still plays an important role in many computer vision problems. While the conventional Lloyd method, which alternates between centroid update and cluster assignment, is primarily used in practice, it may converge to a…

Computer Vision and Pattern Recognition · Computer Science 2018-10-30 Huu Le, Anders Eriksson, Thanh-Toan Do, Michael Milford

Using Multi-Core HW/SW Co-design Architecture for Accelerating K-means Clustering Algorithm

The capability of classifying and clustering a desired set of data is an essential part of building knowledge from data. However, as the size and dimensionality of input data increases, the run-time for such clustering algorithms is…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-25 Hadi Mardani Kamali

Distributed Kernel K-Means for Large Scale Clustering

Clustering samples according to an effective metric and/or vector space representation is a challenging unsupervised learning task with a wide spectrum of applications. Among several clustering algorithms, k-means and its kernelized version…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-10 Marco Jacopo Ferrarotti, Sergio Decherchi, Walter Rocchia

K-Splits: Improved K-Means Clustering Algorithm to Automatically Detect the Number of Clusters

This paper introduces k-splits, an improved hierarchical algorithm based on k-means to cluster data without prior knowledge of the number of clusters. K-splits starts from a small number of clusters and uses the most significant data…

Computer Vision and Pattern Recognition · Computer Science 2022-05-25 Seyed Omid Mohammadi, Ahmad Kalhor, Hossein Bodaghi

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and…

Machine Learning · Computer Science 2024-02-28 Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

Unified Spectral Clustering with Optimal Graph

Spectral clustering has found extensive use in many areas. Most traditional spectral clustering algorithms work in three separate steps: similarity graph construction; continuous labels learning; discretizing the learned labels by k-means…

Machine Learning · Computer Science 2017-11-15 Zhao Kang, Chong Peng, Qiang Cheng, Zenglin Xu

Mine Blood Donors Information through Improved K-Means Clustering

The number of accidents and health diseases which are increasing at an alarming rate are resulting in a huge increase in the demand for blood. There is a necessity for the organized analysis of the blood donor database or blood banks…

Databases · Computer Science 2013-09-11 Bondu Venkateswarlu, Prof G. S. V. Prasad Raju

How to Use K-means for Big Data Clustering?

K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of…

Machine Learning · Computer Science 2023-11-27 Rustam Mussabayev, Nenad Mladenovic, Bassem Jarboui, Ravil Mussabayev

Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions

In this paper, we investigate the learning-augmented $k$-median clustering problem, which aims to improve the performance of traditional clustering algorithms by preprocessing the point set with a predictor of error rate $\alpha \in [0,1)$.…

Data Structures and Algorithms · Computer Science 2026-03-12 Kangke Cheng, Shihong Song, Guanlin Mo, Hu Ding

Randomized Dimensionality Reduction for k-means Clustering

We study the topic of dimensionality reduction for $k$-means clustering. Dimensionality reduction encompasses the union of two approaches: \emph{feature selection} and \emph{feature extraction}. A feature selection based algorithm for…

Data Structures and Algorithms · Computer Science 2015-03-19 Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, Petros Drineas

Symmetric nonnegative matrix factorization (SymNMF) is a powerful tool for clustering, which typically uses the $k$-nearest neighbor ($k$-NN) method to construct similarity matrix. However, $k$-NN may mislead clustering since the neighbors…

Machine Learning · Computer Science 2024-12-06 Wenlong Lyu, Yuheng Jia

Clustering-Based Validation Splits for Model Selection under Domain Shift

This paper considers the problem of model selection under domain shift. Motivated by principles from distributionally robust optimisation and domain adaptation theory, it is proposed that the training-validation split should maximise the…

Machine Learning · Computer Science 2025-08-19 Andrea Napoli, Paul White

Double/Debiased Machine Learning for Treatment and Causal Parameters

Most modern supervised statistical/machine learning (ML) methods are explicitly designed to solve prediction problems very well. Achieving this goal does not imply that these methods automatically deliver good estimators of causal…

Machine Learning · Statistics 2024-11-05 Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, James Robins