Related papers: Computationally efficient univariate filtering for…
We show that the incorporation of any new piece of information allows for improved decision making in the sense that the expected costs of an optimal decision decrease (or, in boundary cases where no or not enough new information is…
Motivated by genome-wide association studies, we consider a standard linear model with one additional random effect in situations where many predictors have been collected on the same subjects and each predictor is analyzed separately.…
Non-parametric two-sample tests based on energy distance or maximum mean discrepancy are widely used statistical tests for comparing multivariate data from two populations. While these tests enjoy desirable statistical properties, their…
Recommender system data presents unique challenges to the data mining, machine learning, and algorithms communities. The high missing data rate, in combination with the large scale and high dimensionality that is typical of recommender…
Permutation tests are among the simplest and most widely used statistical tools. Their p-values can be computed by a straightforward sampling of permutations. However, this way of computing p-values is often so slow that it is replaced by…
A multivariate score-driven filter is developed to extract signals from noisy vector processes. By assuming that the conditional location vector from a multivariate Student's t distribution changes over time, we construct a robust filter…
Score matching is an alternative to maximum likelihood (ML) for estimating a probability distribution parametrized up to a constant of proportionality. By fitting the ''score'' of the distribution, it sidesteps the need to compute this…
Complex scientific models where the likelihood cannot be evaluated present a challenge for statistical inference. Over the past two decades, a wide range of algorithms have been proposed for learning parameters in computationally feasible…
Given a high-dimensional covariate matrix and a response vector, ridge-regularized sparse linear regression selects a subset of features that explains the relationship between covariates and the response in an interpretable manner. To…
Re-sampling based statistical tests are known to be computationally heavy, but reliable when small sample sizes are available. Despite their nice theoretical properties not much effort has been put to make them efficient. In this paper we…
In a standard regression problem, we have a set of explanatory variables whose effect on some response vector is modeled. For wide binary data, such as genetic marker data, we often have two limitations. First, we have more parameters than…
The large-scale multiple testing inherent to high throughput biological data necessitates very high statistical stringency and thus true effects in data are difficult to detect unless they have high effect sizes. One solution to this…
Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to access the statistical significance of a set of categorical clusters remains unaddressed. To fulfill this void, we employ the…
Score matching is an approach to learning probability distributions parametrized up to a constant of proportionality (e.g. Energy-Based Models). The idea is to fit the score of the distribution, rather than the likelihood, thus avoiding the…
Statistical tests that compare classification algorithms are univariate and use a single performance measure, e.g., misclassification error, $F$ measure, AUC, and so on. In multivariate tests, comparison is done using multiple measures…
We revisit the fundamental question of simple-versus-simple hypothesis testing with an eye towards computational complexity, as the statistically optimal likelihood ratio test is often computationally intractable in high-dimensional…
Recent large scale genome wide association analysis involves large scale linear mixed models. Quantifying (co)-variance parameters in the mixed models with a restricted maximum likelihood method results in a score function which is the…
This paper introduces a new data analysis method for big data using a newly defined regression model named multiple model linear regression(MMLR), which separates input datasets into subsets and construct local linear regression models of…
Inferring linear relationships lies at the heart of many empirical investigations. A measure of linear dependence should correctly evaluate the strength of the relationship as well as qualify whether it is meaningful for the population.…
A group of approaches for calculating forensic likelihood ratios first calculates scores which quantify the degree of difference or the degree of similarity between pairs of samples, then converts those scores to likelihood ratios. In order…