Related papers: Inference for Dependent Data with Learned Clusters
Clustering is part of unsupervised analysis methods that consist in grouping samples into homogeneous and separate subgroups of observations also called clusters. To interpret the clusters, statistical hypothesis testing is often used to…
This paper considers the problem of inference in cluster randomized experiments when cluster sizes are non-ignorable. Here, by a cluster randomized experiment, we mean one in which treatment is assigned at the cluster level. By…
In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test, and to test these hypotheses-that is, both for exploratory and confirmatory data analysis. Reusing the same dataset for…
An extension of the latent class model is presented for clustering categorical data by relaxing the classical "class conditional independence assumption" of variables. This model consists in grouping the variables into inter-independent and…
This paper studies inference in cluster randomized trials where treatment status is determined according to a "matched pairs" design. Here, by a cluster randomized experiment, we mean one in which treatment is assigned at the level of the…
Recent work by Gao et al. (JASA 2022) has laid the foundations for post-clustering inference, establishing a theoretical framework allowing to test for differences between means of estimated clusters. Additionally, they studied the…
This work presents an unsupervised deep discriminant analysis for clustering. The method is based on deep neural networks and aims to minimize the intra-cluster discrepancy and maximize the inter-cluster discrepancy in an unsupervised…
Clustered standard errors and approximate randomization tests are popular inference methods that allow for dependence within observations. However, they require researchers to know the cluster structure ex ante. We propose a procedure to…
Motivated by modern applications in which one constructs graphical models based on a very large number of features, this paper introduces a new class of cluster-based graphical models, in which variable clustering is applied as an initial…
Model-based clustering is a powerful tool that is often used to discover hidden structure in data by grouping observational units that exhibit similar response values. Recently, clustering methods have been developed that permit…
In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial…
While clustering is ubiquitously used across science and industry, uncertainty in cluster assignments is rarely quantified with rigorous guarantees. We propose a novel conformal inference framework for clustering that returns confidence…
We consider the problem of analyzing the heterogeneity of clustering distributions for multiple groups of observed data, each of which is indexed by a covariate value, and inferring global clusters arising from observations aggregated over…
This paper focuses on a setting with observations having a cluster dependence structure and presents two main impossibility results. First, we show that when there is only one large cluster, i.e., the researcher does not have any knowledge…
Identifying dependency in multivariate data is a common inference task that arises in numerous applications. However, existing nonparametric independence tests typically require computation that scales at least quadratically with the sample…
This paper considers inference when there is a single treated cluster and a fixed number of control clusters, a setting that is common in empirical work, especially in difference-in-differences designs. We use the t-statistic and develop…
Clustering multivariate data is a pervasive task in many applied problems, particularly in social studies and life science. Model-based approaches to clustering rely on mixture models, where each mixture component corresponds to the kernel…
The independence clustering problem is considered in the following formulation: given a set $S$ of random variables, it is required to find the finest partitioning $\{U_1,\dots,U_k\}$ of $S$ into clusters such that the clusters…
This paper focuses on a data-rich environment where the data set has a very large cross-sectional dimension, is likely to exhibit local dependence, and yet is hard to determine the dependence ordering. Such a situation arises, for example,…
Cluster analysis methods seek to partition a data set into homogeneous subgroups. It is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning…