Related papers: Closed-Form Beta Distribution Estimation from Spar…
Random forests have become an established tool for classification and regression, in particular in high-dimensional settings and in the presence of complex predictor-response relationships. For bounded outcome variables restricted to the…
Since their introduction by Breiman, Random Forests (RFs) have proven to be useful for both classification and regression tasks. The RF prediction of a previously unseen observation can be represented as a weighted sum of all training…
A compositional tree refers to a tree structure on a set of random variables where each random variable is a node and composition occurs at each non-leaf node of the tree. As a generalization of compositional data, compositional trees…
Many statistical estimators for high-dimensional linear regression are M-estimators, formed through minimizing a data-dependent square loss function plus a regularizer. This work considers a new class of estimators implicitly defined…
Regularization is a common tool in variational inverse problems to impose assumptions on the parameters of the problem. One such assumption is sparsity, which is commonly promoted using lasso and total variation-like regularization.…
Discrete probability laws underpin statistical modeling, yet the catalog of interpretable distributions has expanded only gradually through centuries of case-by-case mathematical derivations. We introduce symbolic density estimation (SDE),…
We investigate implicit regularization schemes for gradient descent methods applied to unpenalized least squares regression to solve the problem of reconstructing a sparse signal from an underdetermined system of linear measurements under…
Irregular errors such as heteroscedasticity and nonnormality remain major challenges in linear modeling. These issues often lead to biased inference and unreliable measures of uncertainty. Classical remedies, such as robust standard errors…
This article introduces a subbagging (subsample aggregating) approach for variable selection in regression within the context of big data. The proposed subbagging approach not only ensures that variable selection is scalable given the…
Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well…
In this paper, we study randomized reduction methods, which reduce high-dimensional features into low-dimensional space by randomized methods (e.g., random projection, random hashing), for large-scale high-dimensional classification.…
Training neural network models with discrete (categorical or structured) latent variables can be computationally challenging, due to the need for marginalization over large or combinatorial sets. To circumvent this issue, one typically…
This work develops formal statistical inference procedures for machine learning ensemble methods. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but…
We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally…
Recovering jointly sparse signals in the multiple measurement vectors (MMV) setting is a fundamental problem in machine learning, but traditional methods often require careful parameter tuning or prior knowledge of the sparsity of the…
This paper considers statistical inference for the explained variance $\beta^{\intercal}\Sigma \beta$ under the high-dimensional linear model $Y=X\beta+\epsilon$ in the semi-supervised setting, where $\beta$ is the regression vector and…
Model explainability is crucial for human users to be able to interpret how a proposed classifier assigns labels to data based on its feature values. We study generalized linear models constructed using sets of feature value rules, which…
The beta distribution is the best-known distribution for modelling doubly-bounded data, \eg percentage data or probabilities. A new generalization of the beta distribution is proposed, which uses a cubic transformation of the beta random…
We consider the problem of estimating the support of a vector $\beta^* \in \mathbb{R}^{p}$ based on observations contaminated by noise. A significant body of work has studied behavior of $\ell_1$-relaxations when applied to measurement…
Many popular statistical models, such as factor and random effects models, give arise a certain type of covariance structures that is a summation of low rank and sparse matrices. This paper introduces a penalized approximation framework to…