Related papers: Clustering with Statistical Error Control

Post-clustering difference testing: valid inference and practical considerations

Clustering is part of unsupervised analysis methods that consist in grouping samples into homogeneous and separate subgroups of observations also called clusters. To interpret the clusters, statistical hypothesis testing is often used to…

Methodology · Statistics 2022-10-25 Benjamin Hivert , Denis Agniel , Rodolphe Thiébaut , Boris P Hejblum

Panel Data with Unknown Clusters

Clustered standard errors and approximate randomization tests are popular inference methods that allow for dependence within observations. However, they require researchers to know the cluster structure ex ante. We propose a procedure to…

Econometrics · Economics 2022-01-14 Yong Cai

When Should You Adjust Standard Errors for Clustering?

In empirical work it is common to estimate parameters of models and report associated standard errors that account for "clustering" of units, where clusters are defined by factors such as geography. Clustering adjustments are typically…

Statistics Theory · Mathematics 2022-09-21 Alberto Abadie , Susan Athey , Guido Imbens , Jeffrey Wooldridge

Reclustering: A New Method to Test the Appropriate Level of Clustering

When scholars suspect units are dependent on each other within clusters but independent of each other across clusters, they employ cluster-robust standard errors (CRSEs). Nevertheless, what to cluster over is sometimes unknown. For…

Methodology · Statistics 2025-11-12 Kentaro Fukumoto

Clustering with Confidence: Finding Clusters with Statistical Guarantees

Clustering is a widely used unsupervised learning method for finding structure in the data. However, the resulting clusters are typically presented without any guarantees on their robustness; slightly changing the used data sample or…

Machine Learning · Statistics 2017-01-02 Andreas Henelius , Kai Puolamäki , Henrik Boström , Panagiotis Papapetrou

A general trimming approach to robust Cluster Analysis

We introduce a new method for performing clustering with the aim of fitting clusters with different scatters and weights. It is designed by allowing to handle a proportion $\alpha$ of contaminating data to guarantee the robustness of the…

Statistics Theory · Mathematics 2008-12-18 Luis A. García-Escudero , Alfonso Gordaliza , Carlos Matrán , Agustin Mayo-Iscar

Approximating Spectral Clustering via Sampling: a Review

Spectral clustering refers to a family of unsupervised learning algorithms that compute a spectral embedding of the original data based on the eigenvectors of a similarity graph. This non-linear transformation of the data is both the key of…

Machine Learning · Computer Science 2019-01-30 Nicolas Tremblay , Andreas Loukas

Clustering in Partially Labeled Stochastic Block Models via Total Variation Minimization

A main task in data analysis is to organize data points into coherent groups or clusters. The stochastic block model is a probabilistic model for the cluster structure. This model prescribes different probabilities for the presence of edges…

Machine Learning · Computer Science 2020-09-24 Alexander Jung

Clustering functional data with measurement errors: a simulation-based approach

Clustering analysis of functional data, which comprises observations that evolve continuously over time or space, has gained increasing attention across various scientific disciplines. Practical applications often involve functional data…

Methodology · Statistics 2024-06-19 Tingyu Zhu , Lan Xue , Carmen Tekwe , Keith Diaz , Mark Benden , Roger Zoh

Inference for Clustering: Conformal Sets for Cluster Labels

While clustering is ubiquitously used across science and industry, uncertainty in cluster assignments is rarely quantified with rigorous guarantees. We propose a novel conformal inference framework for clustering that returns confidence…

Methodology · Statistics 2026-04-13 YoonHaeng Hur , Anirban Nath , Genevera Allen

SACA: Selective Attention-Based Clustering Algorithm

Clustering algorithms are fundamental tools across many fields, with density-based methods offering particular advantages in identifying arbitrarily shaped clusters and handling noise. However, their effectiveness is often limited by the…

Machine Learning · Computer Science 2025-12-01 Meysam Shirdel Bilehsavar , Razieh Ghaedi , Samira Seyed Taheri , Xinqi Fan , Christian O'Reilly

Estimating the number of clusters using cross-validation

Many clustering methods, including k-means, require the user to specify the number of clusters as an input parameter. A variety of methods have been devised to choose the number of clusters automatically, but they often rely on strong…

Methodology · Statistics 2017-02-10 Wei Fu , Patrick O. Perry

Quantifying uncertainty in spectral clusterings: expectations for perturbed and incomplete data

Spectral clustering is a popular unsupervised learning technique which is able to partition unlabelled data into disjoint clusters of distinct shapes. However, the data under consideration are often experimental data, implying that the data…

Machine Learning · Statistics 2025-05-26 Jürgen Dölz , Jolanda Weygandt

A Short Survey on Data Clustering Algorithms

With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial…

Data Structures and Algorithms · Computer Science 2015-12-01 Ka-Chun Wong

Clustering with Prototype Extraction for Census Data Analysis

Not long ago primary census data became available to publicity. It opened qualitatively new perspectives not only for researchers in demography and sociology, but also for those people, who somehow face processes occurring in society. In…

Databases · Computer Science 2011-06-28 Oleg Chertov , Marharyta Aleksandrova

How to Control the Error Rates of Binary Classifiers

The traditional binary classification framework constructs classifiers which may have good accuracy, but whose false positive and false negative error rates are not under users' control. In many cases, one of the errors is more severe and…

Machine Learning · Statistics 2020-10-22 Miloš Simić

Clustering Hierarchies via a Semi-Parametric Generalized Linear Mixed Model: a statistical significance-based approach

We introduce a novel statistical significance-based approach for clustering hierarchical data using semi-parametric linear mixed-effects models designed for responses with laws in the exponential family (e.g., Poisson and Bernoulli). Within…

Methodology · Statistics 2025-02-04 Alessandra Ragni , Chiara Masci , Francesca Ieva , Anna Maria Paganoni

Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach

Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together aiming to the construction of well-established clusters that their elements are classified according to their similarity. The…

Machine Learning · Statistics 2023-10-20 Dimitrios Saligkaras , Vasileios E. Papageorgiou

Inference for Dependent Data with Learned Clusters

This paper presents and analyzes an approach to cluster-based inference for dependent data. The primary setting considered here is with spatially indexed data in which the dependence structure of observed random variables is characterized…

Statistics Theory · Mathematics 2022-11-16 Jianfei Cao , Christian Hansen , Damian Kozbur , Lucciano Villacorta

A Computational Theory and Semi-Supervised Algorithm for Clustering

A computational theory for clustering and a semi-supervised clustering algorithm is presented. Clustering is defined to be the obtainment of groupings of data such that each group contains no anomalies with respect to a chosen grouping…

Machine Learning · Computer Science 2025-07-17 Nassir Mohammad