Related papers: Statistical Inference for Cluster Trees
We introduce a cluster evaluation technique called Tree Index. Our Tree Index algorithm aims at describing the structural information of the clustering rather than the quantitative format of cluster-quality indexes (where the representation…
Connected acyclic graphs (trees) are data objects that hierarchically organize categories. Collections of trees arise in a diverse variety of fields, including evolutionary biology, public health, machine learning, social sciences and…
For a density $f$ on ${\mathbb R}^d$, a {\it high-density cluster} is any connected component of $\{x: f(x) \geq \lambda\}$, for some $\lambda > 0$. The set of all high-density clusters forms a hierarchy called the {\it cluster tree} of…
Due to their accuracies, methods based on ensembles of regression trees are a popular approach for making predictions. Some common examples include Bayesian additive regression trees, boosting and random forests. This paper focuses on…
Clustering is a widely used unsupervised learning method for finding structure in the data. However, the resulting clusters are typically presented without any guarantees on their robustness; slightly changing the used data sample or…
Nearest neighbor (k-NN) graphs are widely used in machine learning and data mining applications, and our aim is to better understand what they reveal about the cluster structure of the unknown underlying distribution of points. Moreover, is…
Usual formulations of the clustering coefficient can be shown to be insufficient in the task of describing the local topology of very simple networks. Motivated by this, we review some alternatives in order to present an extension, the…
In many applications, data cluster. Failing to take the cluster structure into consideration generally leads to underestimated variances of point estimators and inflated type I errors in hypothesis tests. Many circumstance-dependent…
While clustering is ubiquitously used across science and industry, uncertainty in cluster assignments is rarely quantified with rigorous guarantees. We propose a novel conformal inference framework for clustering that returns confidence…
Clustering is an essential data mining tool that aims to discover inherent cluster structure in data. For most applications, applying clustering is only appropriate when cluster structure is present. As such, the study of clusterability,…
The reconstruction of a central tendency `species tree' from a large number of conflicting gene trees is a central problem in systematic biology. Moreover, it becomes particularly problematic when taxon coverage is patchy, so that not all…
Mixture model-based frameworks are very popular for statistical inference in clustering. While convenient for producing probabilistic estimates of cluster assignments and uncertainty, they are prone to misspecification, which can lead to…
Cluster sampling is common in survey practice, and the corresponding inference has been predominantly design-based. We develop a Bayesian framework for cluster sampling and account for the design effect in the outcome modeling. We consider…
This paper presents and analyzes an approach to cluster-based inference for dependent data. The primary setting considered here is with spatially indexed data in which the dependence structure of observed random variables is characterized…
High density clusters can be characterized by the connected components of a level set $L(\lambda) = \{x:\ p(x)>\lambda\}$ of the underlying probability density function $p$ generating the data, at some appropriate level $\lambda\geq 0$. The…
Clustering is an underspecified task: there are no universal criteria for what makes a good clustering. This is especially true for relational data, where similarity can be based on the features of individuals, the relationships between…
Interpretable machine learning has emerged as central in leveraging artificial intelligence within high-stakes domains such as healthcare, where understanding the rationale behind model predictions is as critical as achieving high…
Estimating phylogenetic trees is an important problem in evolutionary biology, environmental policy and medicine. Although trees are estimated, their uncertainties are discarded by mathematicians working in tree space. Here we explicitly…
A commonly used characteristic of statistical dependence of adjacency relations in real networks, the clustering coefficient, evaluates chances that two neighbours of a given vertex are adjacent. An extension is obtained by considering…
This work develops formal statistical inference procedures for machine learning ensemble methods. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but…