Related papers: Statistical Tests for Large Tree-structured Data
Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test if two sets of protein domain…
Model performance is frequently reported only for the overall population under consideration. However, due to heterogeneity, overall performance measures often do not accurately represent model performance within specific subgroups. We…
Goodness-of-fit tests are often used in data analysis to test the agreement of a distribution to a set of data. These tests can be used to detect an unknown signal against a known background or to set limits on a proposed signal…
The objective of goodness-of-fit testing is to assess whether a dataset of observations is likely to have been drawn from a candidate probability distribution. This paper presents a rank-based family of goodness-of-fit tests that is…
Methods of performing anomaly detection on high-dimensional data sets are needed, since algorithms which are trained on data are only expected to perform well on data that is similar to the training data. There are theoretical results on…
In this paper, we address the question of comparison between populations of trees. We study a statistical test based on the distance between empirical mean trees, as an analog of the two sample z statistic for comparing two means. Despite…
We propose a family of tests to assess the goodness-of-fit of a high-dimensional generalized linear model. Our framework is flexible and may be used to construct an omnibus test or directed against testing specific non-linearities and…
This paper formally derives the asymptotic distribution of a goodness-of-fit test based on the Kernel Stein Discrepancy introduced in (Oscar Key et al., "Composite Goodness-of-fit Tests with Kernels", Journal of Machine Learning Research…
Clustered data, which arise when observations are nested within groups, are incredibly common in clinical, education, and social science research. Traditionally, a linear mixed model, which includes random effects to account for…
We propose a nonparametric statistical test for goodness-of-fit: given a set of samples, the test determines how likely it is that these were generated from a target density function. The measure of goodness-of-fit is a divergence…
As a flexible nonparametric learning tool, the random forests algorithm has been widely applied to various real applications with appealing empirical performance, even in the presence of high-dimensional feature space. Unveiling the…
We consider goodness-of-fit tests for the distribution of the composed error in Stochastic Frontier Models. The proposed test statistic utilizes the characteristic function of the composed error term, and is formulated as a weighted…
Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to the development of a range of robust methods which directly account for this issue. However, whether these more…
We consider the analysis of high dimensional data given in the form of a matrix with columns consisting of observations and rows consisting of features. Often the data is such that the observations do not reside on a regular grid, and the…
Nowadays, data analysis in the world of Big Data is typically connected to data mining, descriptive or exploratory statistics, e.g. cluster analysis, classification or regression analysis. Aside from these techniques there is a huge area of…
This paper introduces chi-square goodness-of-fit tests to check for conditional distribution model specification. The data is cross-classified according to the Rosenblatt transform of the dependent variable and the explanatory variables,…
We consider the problem of goodness-of-fit testing for a model that has at least one unknown parameter that cannot be eliminated by transformation. Examples of such problems can be as simple as testing whether a sample consists of…
The stochastic block model is a popular tool for studying community structures in network data. We develop a goodness-of-fit test for the stochastic block model. The test statistic is based on the largest singular value of a residual matrix…
In this work, the distributional properties of the goodness-of-fit term in likelihood-based information criteria are explored. These properties are then leveraged to construct a novel goodness-of-fit test for normal linear regression models…
We introduce two new tools to assess the validity of statistical distributions. These tools are based on components derived from a new statistical quantity, the comparison curve. The first tool is a graphical representation of these…