Machine Learning - Scifaro

Provable Separations between Memorization and Generalization in Diffusion Models

Diffusion models have achieved remarkable success across diverse domains, but they remain vulnerable to memorization -- reproducing training data rather than generating novel outputs. This not only limits their creative potential but also…

Machine Learning · Statistics 2025-11-10 Zeqi Ye, Qijie Zhu, Molei Tao, Minshuo Chen

Phase Transition for Stochastic Block Model with more than $\sqrt{n}$ Communities

Predictions from statistical physics postulate that recovery of the communities in Stochastic Block Model (SBM) is possible in polynomial time above, and only above, the Kesten-Stigum (KS) threshold. This conjecture has given rise to a rich…

Machine Learning · Statistics 2025-11-10 Alexandra Carpentier, Christophe Giraud, Nicolas Verzelen

Closed-Form Beta Distribution Estimation from Sparse Statistics with Random Forest Implicit Regularization

This work advances distribution recovery from sparse data and ensemble classification through three main contributions. First, we introduce a closed-form estimator that reconstructs scaled beta distributions from limited statistics…

Machine Learning · Statistics 2025-11-10 Jonathan R. Landers

It's Hard to Be Normal: The Impact of Noise on Structure-agnostic Estimation

Structure-agnostic causal inference studies how well one can estimate a treatment effect given black-box machine learning estimates of nuisance functions (like the impact of confounders on treatment and outcomes). Here, we find that the…

Machine Learning · Statistics 2025-11-10 Jikai Jin, Lester Mackey, Vasilis Syrgkanis

Parsimonious Gaussian mixture models with piecewise-constant eigenvalue profiles

Gaussian mixture models (GMMs) are ubiquitous in statistical learning, particularly for unsupervised problems. While full GMMs suffer from the overparameterization of their covariance matrices in high-dimensional spaces, spherical GMMs…

Machine Learning · Statistics 2025-11-10 Tom Szwagier, Pierre-Alexandre Mattei, Charles Bouveyron, Xavier Pennec

Performative Validity of Recourse Explanations

When applicants get rejected by an algorithmic decision system, recourse explanations provide actionable suggestions for how to change their input features to get a positive evaluation. A crucial yet overlooked phenomenon is that recourse…

Machine Learning · Statistics 2025-11-10 Gunnar König, Hidde Fokkema, Timo Freiesleben, Celestine Mendler-Dünner, Ulrike von Luxburg

Know What You Don't Know: Uncertainty Calibration of Process Reward Models

Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs). However, we observe that even state-of-the-art PRMs can be poorly calibrated. Specifically, they tend to…

Machine Learning · Statistics 2025-11-10 Young-Jin Park, Kristjan Greenewald, Kaveh Alim, Hao Wang, Navid Azizan

Prediction-Powered Adaptive Shrinkage Estimation

Prediction-Powered Inference (PPI) is a powerful framework for enhancing statistical estimates by combining limited gold-standard data with machine learning (ML) predictions. While prior work has demonstrated PPI's benefits for individual…

Machine Learning · Statistics 2025-11-10 Sida Li, Nikolaos Ignatiadis

Analyzing limits for in-context learning

Our paper challenges claims from prior research that transformer-based models, when learning in context, implicitly implement standard learning algorithms. We present empirical evidence inconsistent with this view and provide a mathematical…

Machine Learning · Statistics 2025-11-10 Omar Naim, Jerome Bolte, Nicholas Asher

Linear combinations of latents in generative models: subspaces and beyond

Sampling from generative models has become a crucial tool for applications like data synthesis and augmentation. Diffusion, Flow Matching and Continuous Normalising Flows have shown effectiveness across various modalities, and rely on…

Machine Learning · Statistics 2025-11-10 Erik Bodin, Alexandru Stere, Dragos D. Margineantu, Carl Henrik Ek, Henry Moss

Simultaneous Optimization of Geodesics and Fréchet Means

A central part of geometric statistics is to compute the Fréchet mean. This is a well-known intrinsic mean on a Riemannian manifold that minimizes the sum of squared Riemannian distances from the mean point to all other data points. The…

Machine Learning · Statistics 2025-11-07 Frederik Möbius Rygaard, Søren Hauberg, Steen Markvorsen

Robustness of Minimum-Volume Nonnegative Matrix Factorization under an Expanded Sufficiently Scattered Condition

Minimum-volume nonnegative matrix factorization (min-vol NMF) has been used successfully in many applications, such as hyperspectral imaging, chemical kinetics, spectroscopy, topic modeling, and audio source separation. However, its…

Machine Learning · Statistics 2025-11-07 Giovanni Barbarino, Nicolas Gillis, Subhayan Saha

Online Conformal Inference with Retrospective Adjustment for Faster Adaptation to Distribution Shift

Conformal prediction has emerged as a powerful framework for constructing distribution-free prediction sets with guaranteed coverage under only the exchangeability assumption. However, this assumption is often violated in online…

Machine Learning · Statistics 2025-11-07 Jungbin Jun, Ilsang Ohn

A general technique for approximating high-dimensional empirical kernel matrices

We present simple, user-friendly bounds for the expected operator norm of a random kernel matrix under general conditions on the kernel function $k(\cdot,\cdot)$. Our approach uses decoupling results for U-statistics and the non-commutative…

Machine Learning · Statistics 2025-11-07 Chiraag Kaushik, Justin Romberg, Vidya Muthukumar

Learning Paths for Dynamic Measure Transport: A Control Perspective

We bring a control perspective to the problem of identifying paths of measures for sampling via dynamic measure transport (DMT). We highlight the fact that commonly used paths may be poor choices for DMT and connect existing methods for…

Machine Learning · Statistics 2025-11-07 Aimee Maurais, Bamdad Hosseini, Youssef Marzouk

Bifidelity Karhunen-Loève Expansion Surrogate with Active Learning for Random Fields

We present a bifidelity Karhunen-Loève expansion (KLE) surrogate model for field-valued quantities of interest (QoIs) under uncertain inputs. The approach combines the spectral efficiency of the KLE with polynomial chaos expansions (PCEs)…

Machine Learning · Statistics 2025-11-07 Aniket Jivani, Cosmin Safta, Beckett Y. Zhou, Xun Huan

Friction on Demand: A Generative Framework for the Inverse Design of Metainterfaces

Designing frictional interfaces to exhibit prescribed macroscopic behavior is a challenging inverse problem, made difficult by the non-uniqueness of solutions and the computational cost of contact simulations. Traditional approaches rely on…

Machine Learning · Statistics 2025-11-07 Valentin Mouton, Adrien Mélot

Information-theoretic Generalization Analysis for VQ-VAEs: A Role of Latent Variables

Latent variables (LVs) play a crucial role in encoder-decoder models by enabling effective data compression, prediction, and generation. Although their theoretical properties, such as generalization, have been extensively studied in…

Machine Learning · Statistics 2025-11-07 Futoshi Futami, Masahiro Fujisawa

Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression

We study nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) in this paper. We show that, if the neural network is trained by GD with early stopping, then the trained network renders a…

Machine Learning · Statistics 2025-11-07 Yingzhen Yang, Ping Li

Beyond State Space Representation: A General Theory for Kernel Packets

Gaussian process (GP) regression provides a flexible, nonparametric framework for probabilistic modeling, yet remains computationally demanding in large-scale applications. For one-dimensional data, state space (SS) models achieve…

Machine Learning · Statistics 2025-11-07 Liang Ding, Rui Tuo, Lu Zhou