Machine Learning - Scifaro

The Theory and Practice of Highly Scalable Gaussian Process Regression with Nearest Neighbours

Gaussian process ($GP$) regression is a widely used non-parametric modeling tool, but its cubic complexity in the training size limits its use on massive data sets. A practical remedy is to predict using only the nearest neighbours of each…

Machine Learning · Statistics 2026-04-09 Robert Allison, Tomasz Maciazek, Anthony Stephenson

A Data-Informed Variational Clustering Framework for Noisy High-Dimensional Data

Clustering in high-dimensional settings with severe feature noise remains challenging, especially when only a small subset of dimensions is informative and the final number of clusters is not specified in advance. In such regimes, partition…

Machine Learning · Statistics 2026-04-09 Wan Ping Chen

Tight Convergence Rates for Online Distributed Linear Estimation with Adversarial Measurements

We study mean estimation of a random vector $X$ in a distributed parameter-server-worker setup. Worker $i$ observes samples of $a_i^\top X$, where $a_i^\top$ is the $i$th row of a known sensing matrix $A$. The key challenges are adversarial…

Machine Learning · Statistics 2026-04-09 Nibedita Roy, Vishal Halder, Gugan Thoppe, Alexandre Reiffers-Masson, Mihir Dhanakshirur, Naman, Alexandre Azor

Generalization error bounds for two-layer neural networks with Lipschitz loss function

We derive generalization error bounds for the training of two-layer neural networks without assuming boundedness of the loss function, using Wasserstein distance estimates on the discrepancy between a probability distribution and its…

Machine Learning · Statistics 2026-04-09 Jiang Yu Nguwi, Nicolas Privault

Conditional flow matching for physics-constrained inverse problems with finite training data

This study presents a conditional flow matching framework for solving physics-constrained Bayesian inverse problems. In this setting, samples from the joint distribution of inferred variables and measurements are assumed available, while…

Machine Learning · Statistics 2026-04-09 Agnimitra Dasgupta, Ali Fardisi, Mehrnegar Aminy, Brianna Binder, Bryan Shaddy, Saeed Moazami, Assad Oberai

Robust support vector model based on bounded asymmetric elastic net loss for binary classification

In this paper, we propose a novel bounded asymmetric elastic net ($L_{baen}$) loss function and combine it with the support vector machine (SVM), resulting in the BAEN-SVM. The $L_{baen}$ is bounded and asymmetric and can degrade to the…

Machine Learning · Statistics 2026-04-09 Haiyan Du, Hu Yang

PAC-Bayesian Bounds on Constrained f-Entropic Risk Measures

PAC generalization bounds on the risk, when expressed in terms of the expected loss, are often insufficient to capture imbalances between subgroups in the data. To overcome this limitation, we introduce a new family of risk measures, called…

Machine Learning · Statistics 2026-04-09 Hind Atbir, Farah Cherfaoui, Guillaume Metzler, Emilie Morvant, Paul Viallard

MF-GLaM: A multifidelity stochastic emulator using generalized lambda models

Stochastic simulators exhibit intrinsic stochasticity due to unobservable, uncontrollable, or unmodeled input variables, resulting in random outputs even at fixed input conditions. Such simulators are common across various scientific…

Machine Learning · Statistics 2026-04-09 K. Giannoukou, X. Zhu, S. Marelli, B. Sudret

Computational bottlenecks for denoising diffusions

Denoising diffusions sample from a probability distribution $\mu$ in $\mathbb{R}^d$ by constructing a stochastic process $({\hat{\boldsymbol x}}_t:t\ge 0)$ in $\mathbb{R}^d$ such that ${\hat{\boldsymbol x}}_0$ is easy to sample, but the…

Machine Learning · Statistics 2026-04-09 Andrea Montanari, Viet Vu

Nonparametric Instrumental Regression via Kernel Methods is Minimax Optimal

We study the kernel instrumental variable (KIV) algorithm, a kernel-based two-stage least-squares method for nonparametric instrumental variable regression. We provide a convergence analysis covering both identified and non-identified…

Machine Learning · Statistics 2026-04-09 Dimitri Meunier, Zhu Li, Tim Christensen, Arthur Gretton

Differentially Private Best-Arm Identification

Best Arm Identification (BAI) problems are progressively used for data-sensitive applications, such as designing adaptive clinical trials, tuning hyper-parameters, and conducting user studies. Motivated by the data privacy concerns invoked…

Machine Learning · Statistics 2026-04-09 Achraf Azize, Marc Jourdan, Aymen Al Marjani, Debabrota Basu

Thompson Sampling for Infinite-Horizon Discounted Decision Processes

This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model a discounted infinite-horizon MDP with Borel state and action spaces,…

Machine Learning · Statistics 2026-04-09 Daniel Adelman, Cagla Keceli, Alba V. Olivares-Nadal

A Generative Approach to Quasi-Random Sampling from Copulas via Space-Filling Designs

Exploring the dependence between covariates across distributions is crucial for many applications. Copulas serve as a powerful tool for modeling joint variable dependencies and have been effectively applied in various practical contexts due…

Machine Learning · Statistics 2026-04-09 Sumin Wang, Chenxian Huang, Yongdao Zhou, Min-Qian Liu

Active Statistical Inference

Inspired by the concept of active learning, we propose active inference, a methodology for statistical inference with machine-learning-assisted data collection. Assuming a budget on the number of labels that can be collected,…

Machine Learning · Statistics 2026-04-09 Tijana Zrnic, Emmanuel J. Candès

Sobolev Norm Learning Rates for Conditional Mean Embeddings

We develop novel learning rates for conditional mean embeddings by applying the theory of interpolation for reproducing kernel Hilbert spaces (RKHS). We derive explicit, adaptive convergence rates for the sample estimator under the…

Machine Learning · Statistics 2026-04-09 Prem Talwai, Ali Shameli, David Simchi-Levi

Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification

Neural network classifiers trained with cross-entropy loss achieve strong predictive accuracy but lack the capability to provide inherent predictive uncertainty estimates, thus requiring external techniques to obtain these estimates. In…

Machine Learning · Statistics 2026-04-08 Courtney Franzen, Farhad Pourkamali-Anaraki

Efficient machine unlearning with minimax optimality

There is a growing demand for efficient data removal to comply with regulations like the GDPR and to mitigate the influence of biased or corrupted data. This has motivated the field of machine unlearning, which aims to eliminate the…

Machine Learning · Statistics 2026-04-08 Jingyi Xie, Linjun Zhang, Sai Li

Hierarchical Contrastive Learning for Multimodal Data

Multimodal representation learning is commonly built on a shared-private decomposition, treating latent information as either common to all modalities or specific to one. This binary view is often inadequate: many factors are shared by only…

Machine Learning · Statistics 2026-04-08 Huichao Li, Junhan Yu, Doudou Zhou

Individual-heterogeneous sub-Gaussian Mixture Models

The classical Gaussian mixture model assumes homogeneity within clusters, an assumption that often fails in real-world data where observations naturally exhibit varying scales or intensities. To address this, we introduce the…

Machine Learning · Statistics 2026-04-08 Huan Qing

Generative Path-Law Jump-Diffusion: Sequential MMD-Gradient Flows and Generalisation Bounds in Marcus-Signature RKHS

This paper introduces a novel generative framework for synthesising forward-looking, càdlàg stochastic trajectories that are sequentially consistent with time-evolving path-law proxies, thereby incorporating anticipated structural…

Machine Learning · Statistics 2026-04-08 Daniel Bloch