Machine Learning
We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error. The method applies to architectures whose first layer is fully connected.…
Modern deep learning techniques focus on extracting intricate information from data to achieve accurate predictions. However, the training datasets may be crowdsourced and include sensitive information, such as personal contact details,…
Machine learning models have achieved widespread success but often inherit and amplify historical biases, resulting in unfair outcomes. Traditional fairness methods typically impose constraints at the prediction level, without addressing…
Decomposing prediction uncertainty into aleatoric (irreducible) and epistemic (reducible) components is critical for the reliable deployment of machine learning systems. While the mutual information between the response variable and model…
There has been a surge of recent interest in automatically learning policies to target treatment decisions based on rich individual covariates. In addition, practitioners want confidence that the learned policy has better performance than…
Cross-validation (CV) is known to provide asymptotically exact tests and confidence intervals for model improvement but only when the model comparison is relatively stable. Surprisingly, we prove that even simple, individually stable models…
This paper explores the relationship between the condition number of a neural network's weight tensor and the extent of information encoded by the associated processing unit, viewed through the lens of information theory. It argues that a…
Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function is known to the users, an…
Deep learning has recently revealed the existence of scaling laws, demonstrating that model performance follows predictable trends based on dataset and model sizes. Inspired by these findings and fascinating phenomena emerging in the…
Score-based generative models (SGMs) have emerged as one of the most popular classes of generative models. A substantial body of work now exists on the analysis of SGMs, focusing either on discretization aspects or on their statistical…
In statistical applications it has become increasingly common to encounter data structures that live on non-linear spaces such as manifolds. Classical linear regression, one of the most fundamental methodologies of statistical learning,…
Sparse learning is ubiquitous in many machine learning tasks. It aims to regularize the goodness-of-fit objective by adding a penalty term to encode structural constraints on the model parameters. In this paper, we develop a flexible sparse…
We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with…
This note addresses the computational difficulty of the Gromov-Wasserstein distance, which is frequently mentioned in the literature. We provide details on the structure of the Gromov-Wasserstein distance optimization problem that show its non-convex…
Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by well-developed estimation theory,…
Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would…
We prove a large deviation principle for deep neural networks with Gaussian weights and at most linearly growing activation functions, such as ReLU. This generalises earlier work, in which bounded and continuous activation functions were…
The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to…
In an effort to develop topic modeling methods that can be quickly applied to large data sets, we revisit the problem of maximum-likelihood estimation in topic models. It is known, at least informally, that maximum-likelihood estimation in…
We present simple differentially private estimators for the mean and covariance of multivariate sub-Gaussian data that are accurate at small sample sizes. We demonstrate the effectiveness of our algorithms both theoretically and empirically…