Machine Learning - Scifaro

Communication-Efficient l_0 Penalized Least Square

In this paper, we propose a communication-efficient penalized regression algorithm for high-dimensional sparse linear regression models with massive data. This approach incorporates an optimized distributed system communication algorithm,…

Machine Learning · Statistics 2025-04-02 Chenqi Gong, Hu Yang

A stochastic gradient descent algorithm with random search directions

Stochastic coordinate descent algorithms are efficient methods in which each iterate is obtained by fixing most coordinates at their values from the current iteration, and approximately minimizing the objective with respect to the remaining…

Machine Learning · Statistics 2025-04-02 Eméric Gbaguidi

Addressing pitfalls in implicit unobserved confounding synthesis using explicit block hierarchical ancestral sampling

Unbiased data synthesis is crucial for evaluating causal discovery algorithms in the presence of unobserved confounding, given the scarcity of real-world datasets. A common approach, implicit parameterization, encodes unobserved confounding…

Machine Learning · Statistics 2025-04-02 Xudong Sun, Alex Markham, Pratik Misra, Carsten Marr

Optimal generalisation and learning transition in extensive-width shallow neural networks near interpolation

We consider a teacher-student model of supervised learning with a fully-trained two-layer neural network whose width $k$ and input dimension $d$ are large and proportional. We provide an effective theory for approximating the Bayes-optimal…

Machine Learning · Statistics 2025-04-02 Jean Barbier, Francesco Camilli, Minh-Toan Nguyen, Mauro Pastore, Rudy Skerk

Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment

Algorithmic recommendations and decisions have become ubiquitous in today's society. Many of these data-driven policies, especially in the realm of public policy, are based on known, deterministic rules to ensure their transparency and…

Machine Learning · Statistics 2025-04-02 Eli Ben-Michael, D. James Greiner, Kosuke Imai, Zhichao Jiang

Solving the Best Subset Selection Problem via Suboptimal Algorithms

Best subset selection in linear regression is well known to be nonconvex and computationally challenging to solve, as the number of possible subsets grows rapidly with increasing dimensionality of the problem. As a result, finding the…

Machine Learning · Statistics 2025-04-01 Vikram Singh, Min Sun

AutoML Algorithms for Online Generalized Additive Model Selection: Application to Electricity Demand Forecasting

Electricity demand forecasting is key to ensuring that supply meets demand, lest the grid black out. Reliable short-term forecasts may be obtained by combining a Generalized Additive Model (GAM) with a State-Space model (Obst et al.,…

Machine Learning · Statistics 2025-04-01 Keshav Das, Julie Keisler, Margaux Brégère, Amaury Durand

The more the merrier: logical and multistage processors in credit scoring

Machine Learning algorithms are ubiquitous in key decision-making contexts such as organizational justice or healthcare, which has spawned a great demand for fairness in these procedures. In this paper we focus on the application of fair ML…

Machine Learning · Statistics 2025-04-01 Arturo Pérez-Peralta, Sandra Benítez-Peña, Rosa E. Lillo

Learning a Single Index Model from Anisotropic Data with vanilla Stochastic Gradient Descent

We investigate the problem of learning a Single Index Model (SIM), a popular model for studying the ability of neural networks to learn features, from anisotropic Gaussian inputs by training a neuron using vanilla Stochastic Gradient…

Machine Learning · Statistics 2025-04-01 Guillaume Braun, Minh Ha Quang, Masaaki Imaizumi

Fréchet regression with implicit denoising and multicollinearity reduction

Fréchet regression extends linear regression to model complex responses in metric spaces, making it particularly relevant for multi-label regression, where each instance can have multiple associated labels. However, addressing noise and…

Machine Learning · Statistics 2025-04-01 Dou El Kefel Mansouri, Seif-Eddine Benkabou, Khalid Benabdeslem

Coupled Input-Output Dimension Reduction: Application to Goal-oriented Bayesian Experimental Design and Global Sensitivity Analysis

We introduce a new method to jointly reduce the dimension of the input and output space of a function between high-dimensional spaces. Choosing a reduced input subspace influences which output subspace is relevant and vice versa.…

Machine Learning · Statistics 2025-04-01 Qiao Chen, Elise Arnaud, Ricardo Baptista, Olivier Zahm

Is Algorithmic Stability Testable? A Unified Framework under Computational Constraints

Algorithmic stability is a central notion in learning theory that quantifies the sensitivity of an algorithm to small changes in the training data. If a learning algorithm satisfies certain stability properties, this leads to many important…

Machine Learning · Statistics 2025-04-01 Yuetian Luo, Rina Foygel Barber

Simulation-based Bayesian Inference from Privacy Protected Data

Many modern statistical analysis and machine learning applications require training models on sensitive user data. Under a formal definition of privacy protection, differentially private algorithms inject calibrated noise into the…

Machine Learning · Statistics 2025-04-01 Yifei Xiong, Nianqiao Phyllis Ju, Sanguo Zhang

Optimal vintage factor analysis with deflation varimax

Vintage factor analysis is one important type of factor analysis that aims to first find a low-dimensional representation of the original data, and then to seek a rotation such that the rotated low-dimensional representation is…

Machine Learning · Statistics 2025-04-01 Xin Bing, Xin He, Dian Jin, Yuqian Zhang

Cross-Cluster Weighted Forests

Adapting machine learning algorithms to better handle the presence of clusters or batch effects within training datasets is important across a wide variety of biological applications. This article considers the effect of ensembling Random…

Machine Learning · Statistics 2025-04-01 Maya Ramchandran, Rajarshi Mukherjee, Giovanni Parmigiani

Debiasing Kernel-Based Generative Models

We propose a novel two-stage framework of generative models named Debiasing Kernel-Based Generative Models (DKGM) with the insights from kernel density estimation (KDE) and stochastic approximation. In the first stage of DKGM, we employ KDE…

Machine Learning · Statistics 2025-03-31 Tian Qin, Wei-Min Huang

Spectral-factorized Positive-definite Curvature Learning for NN Training

Many training methods, such as Adam(W) and Shampoo, learn a positive-definite curvature matrix and apply an inverse root before preconditioning. Recently, non-diagonal training methods, such as Shampoo, have gained significant attention;…

Machine Learning · Statistics 2025-03-31 Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Roger B. Grosse

Controlled Learning of Pointwise Nonlinearities in Neural-Network-Like Architectures

We present a general variational framework for the training of freeform nonlinearities in layered computational architectures subject to some slope constraints. The regularization that we add to the traditional training loss penalizes the…

Machine Learning · Statistics 2025-03-31 Michael Unser, Alexis Goujon, Stanislas Ducotterd

Manifold learning in Wasserstein space

This paper aims at building the theoretical foundations for manifold learning algorithms in the space of absolutely continuous probability measures $\mathcal{P}_{\mathrm{a.c.}}(\Omega)$ with $\Omega$ a compact and convex subset of…

Machine Learning · Statistics 2025-03-31 Keaton Hamm, Caroline Moosmüller, Bernhard Schmitzer, Matthew Thorpe

Compress Then Test: Powerful Kernel Testing in Near-linear Time

Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address…

Machine Learning · Statistics 2025-03-31 Carles Domingo-Enrich, Raaz Dwivedi, Lester Mackey