Related papers: Post-clustering difference testing: valid inferenc…

Two-cluster test

Cluster analysis is a fundamental research issue in statistics and machine learning. In many modern clustering methods, we need to determine whether two subsets of samples come from the same cluster. Since these subsets are usually…

Machine Learning · Computer Science 2025-07-15 Xinying Liu, Lianyu Hu, Mudi Jiang, Simeng Zhang, Jun Lou, Zengyou He

Statistical Testing Framework for Clustering Pipelines by Selective Inference

A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis algorithms. In many practical applications, analytical findings are obtained only after data pass…

Machine Learning · Statistics 2026-05-04 Yugo Miyata, Tomohiro Shiraishi, Shuichi Nishino, Ichiro Takeuchi

Evaluating and Validating Cluster Results

Clustering is the technique to partition data according to their characteristics. Data that are similar in nature belong to the same cluster [1]. There are two types of evaluation methods to evaluate clustering quality. One is an external…

Machine Learning · Computer Science 2024-09-05 Anupriya Vysala, Joseph Gomes

Clustering with Statistical Error Control

This paper presents a clustering approach that allows for rigorous statistical error control similar to a statistical test. We develop estimators for both the unknown number of clusters and the clusters themselves. The estimators depend on…

Statistics Theory · Mathematics 2017-07-13 Michael Vogt, Matthias Schmid

Inference for Dependent Data with Learned Clusters

This paper presents and analyzes an approach to cluster-based inference for dependent data. The primary setting considered here is with spatially indexed data in which the dependence structure of observed random variables is characterized…

Statistics Theory · Mathematics 2022-11-16 Jianfei Cao, Christian Hansen, Damian Kozbur, Lucciano Villacorta

Selective Inference for Hierarchical Clustering

Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I…

Methodology · Statistics 2022-11-01 Lucy L. Gao, Jacob Bien, Daniela Witten

Selective inference for multiple pairs of clusters after K-means clustering

If the same data is used for both clustering and for testing a null hypothesis that is formulated in terms of the estimated clusters, then the traditional hypothesis testing framework often fails to control the Type I error. Gao et al.…

Methodology · Statistics 2024-05-28 Youngjoo Yun, Yinqiu He

Selective inference for clustering with unknown variance

In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test, and to test these hypotheses; that is, both for exploratory and confirmatory data analysis. Reusing the same dataset for…

Methodology · Statistics 2023-07-24 Youngjoo Yun, Rina Foygel Barber

Estimating the number of clusters using cross-validation

Many clustering methods, including k-means, require the user to specify the number of clusters as an input parameter. A variety of methods have been devised to choose the number of clusters automatically, but they often rely on strong…

Methodology · Statistics 2017-02-10 Wei Fu, Patrick O. Perry

Testing for a difference in means of a single feature after clustering

For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common validation approach involves testing differences in feature means between observations in two estimated clusters. In…

Methodology · Statistics 2023-11-29 Yiqun T. Chen, Lucy L. Gao

Reclustering: A New Method to Test the Appropriate Level of Clustering

When scholars suspect units are dependent on each other within clusters but independent of each other across clusters, they employ cluster-robust standard errors (CRSEs). Nevertheless, what to cluster over is sometimes unknown. For…

Methodology · Statistics 2025-11-12 Kentaro Fukumoto

Semi-supervised clustering methods

Cluster analysis methods seek to partition a data set into homogeneous subgroups. It is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning…

Methodology · Statistics 2014-07-11 Eric Bair

Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach

Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together aiming to the construction of well-established clusters that their elements are classified according to their similarity. The…

Machine Learning · Statistics 2023-10-20 Dimitrios Saligkaras, Vasileios E. Papageorgiou

Testing for the appropriate level of clustering in linear regression models

The overwhelming majority of empirical research that uses cluster-robust inference assumes that the clustering structure is known, even though there are often several possible ways in which a dataset could be clustered. We propose two tests…

Econometrics · Economics 2023-03-14 James G. MacKinnon, Morten Ørregaard Nielsen, Matthew D. Webb

Evaluation of the number of clusters in a data set using $p$-values from Multiple Tests of Hypotheses

This paper proposes a novel, nonparametric, interpoint distance-based measure to investigate whether there exist any groups in a set of given data, and if so then, how many groups are prevailing in total. It is a cluster accuracy index…

Methodology · Statistics 2026-05-21 Soumita Modak

Issues, Challenges and Tools of Clustering Algorithms

Clustering is an unsupervised technique of Data Mining. It means grouping similar objects together and separating the dissimilar ones. Each object in the data set is assigned a class label in the clustering process using a distance measure.…

Information Retrieval · Computer Science 2011-10-13 Parul Agarwal, M. Afshar Alam, Ranjit Biswas

Clustering and Classification of Genetic Data Through U-Statistics

Genetic data are frequently categorical and have complex dependence structures that are not always well understood. For this reason, clustering and classification based on genetic data, while highly relevant, are challenging statistical…

Methodology · Statistics 2016-06-13 Gabriela Bettella Cybis, Marcio Valk, Silvia Regina Costa Lopes

When Should You Adjust Standard Errors for Clustering?

In empirical work it is common to estimate parameters of models and report associated standard errors that account for "clustering" of units, where clusters are defined by factors such as geography. Clustering adjustments are typically…

Statistics Theory · Mathematics 2022-09-21 Alberto Abadie, Susan Athey, Guido Imbens, Jeffrey Wooldridge

Clustering validity based on the most similarity

One basic requirement of many studies is the necessity of classifying data. Clustering is a proposed method for summarizing networks. Clustering methods can be divided into two categories named model-based approaches and algorithmic…

Machine Learning · Computer Science 2013-02-19 Raheleh Namayandeh, Farzad Didehvar, Zahra Shojaei

Clustering -- Basic concepts and methods

We review clustering as an analysis tool and the underlying concepts from an introductory perspective. What is clustering and how can clusterings be realised programmatically? How can data be represented and prepared for a clustering task?…

Machine Learning · Computer Science 2022-12-05 Jan-Oliver Felix Kapp-Joswig, Bettina G. Keller