Related papers: Bagging multiple comparisons from microarray data
Modern statisticians are often presented with hundreds or thousands of hypothesis testing problems to evaluate at the same time, generated from new scientific technologies such as microarrays, medical and satellite imaging devices, or flow…
Two-sample hypothesis testing for network comparison presents many significant challenges, including: leveraging repeated network observations and known node registration, but without requiring them to operate; relaxing strong structural…
Large-scale data analysis poses both statistical and computational problems which need to be addressed simultaneously. A solution is often straightforward if the data are homogeneous: one can use classical ideas of subsampling and mean…
The problem of finding itemsets that are statistically significantly enriched in a class of transactions is complicated by the need to correct for multiple hypothesis testing. Pruning untestable hypotheses was recently proposed as a…
The problem of multiple hypothesis testing arises when there are more than one hypothesis to be tested simultaneously for statistical significance. This is a very common situation in many data mining applications. For instance, assessing…
Modern scientific technology has provided a new class of large-scale simultaneous inference problems, with thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology, but similar situations…
Comparing two population means of network data is of paramount importance in a wide range of scientific applications. Many existing network inference solutions focus on global testing of entire networks, without comparing individual network…
Bagging, a powerful ensemble method from machine learning, improves the performance of unstable predictors. Although the power of Bagging has been shown mostly in classification problems, we demonstrate the success of employing Bagging in…
Bagging and boosting, two effective machine learning techniques, are applied to natural language parsing. Experiments using these techniques with a trainable statistical parser are described. The best resulting system provides roughly as…
Hypothesis testing of structure in correlation and covariance matrices is of broad interest in many application areas. In high dimensions and/or small to moderate sample sizes, high error rates in testing is a substantial concern. This…
This article introduces a subbagging (subsample aggregating) approach for variable selection in regression within the context of big data. The proposed subbagging approach not only ensures that variable selection is scalable given the…
This version is ***superseded*** by a full version that can be found at http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger theoretical results and fixes a mistake in the reporting of experiments. Abstract:…
Data with multiple functional recordings at each observational unit are increasingly common in various fields including medical imaging and environmental sciences. To conduct inference for such observations, we develop a paired two-sample…
We propose an adaptive sampling approach for multiple testing which aims to maximize statistical power while ensuring anytime false discovery control. We consider $n$ distributions whose means are partitioned by whether they are below or…
We study the effectiveness of subagging, or subsample aggregating, on regression trees, a popular non-parametric method in machine learning. First, we give sufficient conditions for pointwise consistency of trees. We formalize that (i) the…
This work investigates the ``small-vs-large gap'', where repeating on fewer samples can lead to compute saving during training compared to using a larger dataset. This is observed across algorithmic tasks, architectures and optimizers and…
In this work, we address the question of how to enhance signal-agnostic searches by leveraging multiple testing strategies. Specifically, we consider hypothesis tests relying on machine learning, where model selection can introduce a bias…
Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information…
To identify statistically significant conclusions, it is proposed to simultaneously test hypotheses and alternatives. It is shown that, under the condition of free combination of hypotheses and alternatives, the closure method leads to…
Scientific claims gain credibility by replicability, especially if replication under different circumstances and varying designs yields equivalent results. Aggregating results over multiple studies is, however, not straightforward, and when…