Related papers: Closed-Form Beta Distribution Estimation from Spar…

A Random Forest Approach for Modeling Bounded Outcomes

Random forests have become an established tool for classification and regression, in particular in high-dimensional settings and in the presence of complex predictor-response relationships. For bounded outcome variables restricted to the…

Methodology · Statistics 2019-01-21 Leonie Weinhold, Matthias Schmid, Marvin N. Wright, Moritz Berger

Simplifying Random Forests' Probabilistic Forecasts

Since their introduction by Breiman, Random Forests (RFs) have proven to be useful for both classification and regression tasks. The RF prediction of a previously unseen observation can be represented as a weighted sum of all training…

Applications · Statistics 2025-08-21 Nils Koster, Fabian Krüger

Regularized regression on compositional trees with application to MRI analysis

A compositional tree refers to a tree structure on a set of random variables where each random variable is a node and composition occurs at each non-leaf node of the tree. As a generalization of compositional data, compositional trees…

Methodology · Statistics 2021-04-20 Bingkai Wang, Brian S. Caffo, Xi Luo, Chin-Fu Liu, Andreia V. Faria, Michael I. Miller, Yi Zhao

High-Dimensional Linear Regression via Implicit Regularization

Many statistical estimators for high-dimensional linear regression are M-estimators, formed through minimizing a data-dependent square loss function plus a regularizer. This work considers a new class of estimators implicitly defined…

Statistics Theory · Mathematics 2022-02-15 Peng Zhao, Yun Yang, Qiao-Chu He

Sparse Bayesian Inference with Regularized Gaussian Distributions

Regularization is a common tool in variational inverse problems to impose assumptions on the parameters of the problem. One such assumption is sparsity, which is commonly promoted using lasso and total variation-like regularization.…

Statistics Theory · Mathematics 2023-02-15 Jasper Marijn Everink, Yiqiu Dong, Martin Skovgaard Andersen

Symbolic Density Estimation for Discrete Distributions

Discrete probability laws underpin statistical modeling, yet the catalog of interpretable distributions has expanded only gradually through centuries of case-by-case mathematical derivations. We introduce symbolic density estimation (SDE),…

Machine Learning · Computer Science 2026-05-25 Ziwen Liu, Meng Li

Implicit Regularization for Optimal Sparse Recovery

We investigate implicit regularization schemes for gradient descent methods applied to unpenalized least squares regression to solve the problem of reconstructing a sparse signal from an underdetermined system of linear measurements under…

Machine Learning · Statistics 2019-09-12 Tomas Vaškevičius, Varun Kanade, Patrick Rebeschini

A Percentile-Focused Regression Method for Applied Data with Irregular Error Structures

Irregular errors such as heteroscedasticity and nonnormality remain major challenges in linear modeling. These issues often lead to biased inference and unreliable measures of uncertainty. Classical remedies, such as robust standard errors…

Methodology · Statistics 2026-03-05 Elsayed Elamir

Subbagging Variable Selection for Big Data

This article introduces a subbagging (subsample aggregating) approach for variable selection in regression within the context of big data. The proposed subbagging approach not only ensures that variable selection is scalable given the…

Methodology · Statistics 2025-03-10 Xian Li, Xuan Liang, Tao Zou

Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success

Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well…

Machine Learning · Statistics 2020-09-15 Lucas Mentch, Siyu Zhou

Theory of Dual-sparse Regularized Randomized Reduction

In this paper, we study randomized reduction methods, which reduce high-dimensional features into low-dimensional space by randomized methods (e.g., random projection, random hashing), for large-scale high-dimensional classification.…

Machine Learning · Computer Science 2015-07-21 Tianbao Yang, Lijun Zhang, Rong Jin, Shenghuo Zhu

Efficient Marginalization of Discrete and Structured Latent Variables via Sparsity

Training neural network models with discrete (categorical or structured) latent variables can be computationally challenging, due to the need for marginalization over large or combinatorial sets. To circumvent this issue, one typically…

Machine Learning · Computer Science 2020-12-29 Gonçalo M. Correia, Vlad Niculae, Wilker Aziz, André F. T. Martins

Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests

This work develops formal statistical inference procedures for machine learning ensemble methods. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but…

Machine Learning · Statistics 2015-09-11 Lucas Mentch, Giles Hooker

Autoencoding Random Forests

We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally…

Machine Learning · Statistics 2026-01-16 Binh Duc Vu, Jan Kapar, Marvin Wright, David S. Watson

Tuning-Free Structured Sparse Recovery of Multiple Measurement Vectors using Implicit Regularization

Recovering jointly sparse signals in the multiple measurement vectors (MMV) setting is a fundamental problem in machine learning, but traditional methods often require careful parameter tuning or prior knowledge of the sparsity of the…

Machine Learning · Computer Science 2026-02-02 Lakshmi Jayalal, Sheetal Kalyani

Semi-supervised Inference for Explained Variance in High-dimensional Linear Regression and Its Applications

This paper considers statistical inference for the explained variance $\beta^{\intercal}\Sigma \beta$ under the high-dimensional linear model $Y=X\beta+\epsilon$ in the semi-supervised setting, where $\beta$ is the regression vector and…

Methodology · Statistics 2020-12-01 T. Tony Cai, Zijian Guo

Obtaining Explainable Classification Models using Distributionally Robust Optimization

Model explainability is crucial for human users to be able to interpret how a proposed classifier assigns labels to data based on its feature values. We study generalized linear models constructed using sets of feature value rules, which…

Machine Learning · Statistics 2023-11-06 Sanjeeb Dash, Soumyadip Ghosh, Joao Goncalves, Mark S. Squillante

A new generalization of the beta distribution

The beta distribution is the best-known distribution for modelling doubly-bounded data, e.g., percentage data or probabilities. A new generalization of the beta distribution is proposed, which uses a cubic transformation of the beta random…

Methodology · Statistics 2016-12-19 Rose Baker

High-dimensional subset recovery in noise: Sparsified measurements without loss of statistical efficiency

We consider the problem of estimating the support of a vector $\beta^* \in \mathbb{R}^{p}$ based on observations contaminated by noise. A significant body of work has studied behavior of $\ell_1$-relaxations when applied to measurement…

Machine Learning · Statistics 2008-05-21 Dapo Omidiran, Martin J. Wainwright

Recovering Model Structures from Large Low Rank and Sparse Covariance Matrix Estimation

Many popular statistical models, such as factor and random effects models, give rise to a certain type of covariance structure that is a summation of low-rank and sparse matrices. This paper introduces a penalized approximation framework to…

Methodology · Statistics 2015-03-19 Xi Luo