Related papers: Two-cluster test

Post-clustering difference testing: valid inference and practical considerations

Clustering is part of unsupervised analysis methods that consist in grouping samples into homogeneous and separate subgroups of observations also called clusters. To interpret the clusters, statistical hypothesis testing is often used to…

Methodology · Statistics 2022-10-25 Benjamin Hivert , Denis Agniel , Rodolphe Thiébaut , Boris P Hejblum

Selective Inference for Hierarchical Clustering

Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I…

Methodology · Statistics 2022-11-01 Lucy L. Gao , Jacob Bien , Daniela Witten

Testing for a difference in means of a single feature after clustering

For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common validation approach involves testing differences in feature means between observations in two estimated clusters. In…

Methodology · Statistics 2023-11-29 Yiqun T. Chen , Lucy L. Gao

Selective inference for k-means clustering

We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we…

Methodology · Statistics 2022-03-30 Yiqun T. Chen , Daniela M. Witten

Non-Parametric Cluster Significance Testing with Reference to a Unimodal Null Distribution

Cluster analysis is an unsupervised learning strategy that can be employed to identify subgroups of observations in data sets of unknown structure. This strategy is particularly useful for analyzing high-dimensional data such as microarray…

Methodology · Statistics 2016-10-07 Erika S. Helgeson , Eric Bair

Advanced Tutorial: Label-Efficient Two-Sample Tests

Hypothesis testing is a statistical inference approach used to determine whether data supports a specific hypothesis. An important type is the two-sample test, which evaluates whether two sets of data points are from identical…

Machine Learning · Computer Science 2025-01-08 Weizhi Li , Visar Berisha , Gautam Dasarathy

A label-efficient two-sample test

Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). We consider a new setting for this problem where sample features are…

Machine Learning · Computer Science 2022-07-20 Weizhi Li , Gautam Dasarathy , Karthikeyan Natesan Ramamurthy , Visar Berisha

Significance Analysis of High-Dimensional, Low-Sample Size Partially Labeled Data

Classification and clustering are both important topics in statistical learning. A natural question herein is whether predefined classes are really different from one another, or whether clusters are really there. Specifically, we may be…

Machine Learning · Statistics 2015-09-22 Qiyi Lu , Xingye Qiao

Statistical Significance for Hierarchical Clustering

Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other…

Methodology · Statistics 2014-11-20 Patrick K. Kimes , Yufeng Liu , D. Neil Hayes , J. S. Marron

Cluster Analysis of Educational Data: an Example of Quantitative Study on the answers to an Open-Ended Questionnaire

In the last years many studies examined the consistency of students' answers in a variety of contexts. Some of these papers tried to develop more detailed models of the consistency of students' reasoning, or to subdivide a sample of…

Physics Education · Physics 2017-08-17 Onofrio Rosario Battaglia , Benedetto Di Paola , Claudio Fazio

Clustering validity based on the most similarity

One basic requirement of many studies is the necessity of classifying data. Clustering is a proposed method for summarizing networks. Clustering methods can be divided into two categories named model-based approaches and algorithmic…

Machine Learning · Computer Science 2013-02-19 Raheleh Namayandeh , Farzad Didehvar , Zahra Shojaei

Instance-Based Classification through Hypothesis Testing

Classification is a fundamental problem in machine learning and data mining. During the past decades, numerous classification methods have been presented based on different principles. However, most existing classifiers cast the…

Machine Learning · Computer Science 2019-04-23 Zengyou He , Chaohua Sheng , Yan Liu , Quan Zou

Reclustering: A New Method to Test the Appropriate Level of Clustering

When scholars suspect units are dependent on each other within clusters but independent of each other across clusters, they employ cluster-robust standard errors (CRSEs). Nevertheless, what to cluster over is sometimes unknown. For…

Methodology · Statistics 2025-11-12 Kentaro Fukumoto

Significance-Based Categorical Data Clustering

Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to access the statistical significance of a set of categorical clusters remains unaddressed. To fulfill this void, we employ the…

Machine Learning · Computer Science 2022-11-09 Lianyu Hu , Mudi Jiang , Yan Liu , Zengyou He

Cluster validation by measurement of clustering characteristics relevant to the user

There are many cluster analysis methods that can produce quite different clusterings on the same dataset. Cluster validation is about the evaluation of the quality of a clustering; "relative cluster validation" is about using such criteria…

Methodology · Statistics 2020-09-10 Christian Hennig

Statistical Significance of Clustering using Soft Thresholding

Clustering methods have led to a number of important discoveries in bioinformatics and beyond. A major challenge in their use is determining which clusters represent important underlying structure, as opposed to spurious sampling artifacts.…

Methodology · Statistics 2021-10-20 Hanwen Huang , Yufeng Liu , Ming Yuan , J. S. Marron

Two-Sample Testing in High-Dimensional Models

We propose novel methodology for testing equality of model parameters between two high-dimensional populations. The technique is very general and applicable to a wide range of models. The method is based on sample splitting: the data is…

Methodology · Statistics 2013-01-17 Nicolas Städler , Sach Mukherjee

Test for the statistical significance of a treatment effect in the presence of hidden sub-populations

For testing the statistical significance of a treatment effect, we usually compare between two parts of a population, one is exposed to the treatment, and the other is not exposed to it. Standard parametric and nonparametric two-sample…

Computation · Statistics 2012-11-02 Bikram Karmakar , Kumaresh Dhara , Kushal Kumar Dey , Analabha Basu , Anil Ghosh

An Effective and Efficient Approach for Clusterability Evaluation

Clustering is an essential data mining tool that aims to discover inherent cluster structure in data. As such, the study of clusterability, which evaluates whether data possesses such structure, is an integral part of cluster analysis. Yet,…

Machine Learning · Computer Science 2016-02-24 Margareta Ackerman , Andreas Adolfsson , Naomi Brownstein

Estimating the number of clusters using cross-validation

Many clustering methods, including k-means, require the user to specify the number of clusters as an input parameter. A variety of methods have been devised to choose the number of clusters automatically, but they often rely on strong…

Methodology · Statistics 2017-02-10 Wei Fu , Patrick O. Perry