Related papers: Post-clustering difference testing: valid inferenc…
Cluster analysis is a fundamental research issue in statistics and machine learning. In many modern clustering methods, we need to determine whether two subsets of samples come from the same cluster. Since these subsets are usually…
A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis algorithms. In many practical applications, analytical findings are obtained only after data pass…
Clustering is the technique to partition data according to their characteristics. Data that are similar in nature belong to the same cluster [1]. There are two types of evaluation methods to evaluate clustering quality. One is an external…
This paper presents a clustering approach that allows for rigorous statistical error control similar to a statistical test. We develop estimators for both the unknown number of clusters and the clusters themselves. The estimators depend on…
This paper presents and analyzes an approach to cluster-based inference for dependent data. The primary setting considered here is with spatially indexed data in which the dependence structure of observed random variables is characterized…
Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I…
If the same data is used for both clustering and for testing a null hypothesis that is formulated in terms of the estimated clusters, then the traditional hypothesis testing framework often fails to control the Type I error. Gao et al.…
In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test, and to test these hypotheses-that is, both for exploratory and confirmatory data analysis. Reusing the same dataset for…
Many clustering methods, including k-means, require the user to specify the number of clusters as an input parameter. A variety of methods have been devised to choose the number of clusters automatically, but they often rely on strong…
For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common validation approach involves testing differences in feature means between observations in two estimated clusters. In…
When scholars suspect units are dependent on each other within clusters but independent of each other across clusters, they employ cluster-robust standard errors (CRSEs). Nevertheless, what to cluster over is sometimes unknown. For…
Cluster analysis methods seek to partition a data set into homogeneous subgroups. It is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning…
Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together aiming to the construction of well-established clusters that their elements are classified according to their similarity. The…
The overwhelming majority of empirical research that uses cluster-robust inference assumes that the clustering structure is known, even though there are often several possible ways in which a dataset could be clustered. We propose two tests…
This paper proposes a novel, nonparametric, interpoint distance-based measure to investigate whether there exist any groups in a set of given data, and if so then, how many groups are prevailing in total. It is a cluster accuracy index…
Clustering is an unsupervised technique of Data Mining. It means grouping similar objects together and separating the dissimilar ones. Each object in the data set is assigned a class label in the clustering process using a distance measure.…
Genetic data are frequently categorical and have complex dependence structures that are not always well understood. For this reason, clustering and classification based on genetic data, while highly relevant, are challenging statistical…
In empirical work it is common to estimate parameters of models and report associated standard errors that account for "clustering" of units, where clusters are defined by factors such as geography. Clustering adjustments are typically…
One basic requirement of many studies is the necessity of classifying data. Clustering is a proposed method for summarizing networks. Clustering methods can be divided into two categories named model-based approaches and algorithmic…
We review clustering as an analysis tool and the underlying concepts from an introductory perspective. What is clustering and how can clusterings be realised programmatically? How can data be represented and prepared for a clustering task?…