Machine Learning - Scifaro

Regional Explanations: Bridging Local and Global Variable Importance

We analyze two widely used local attribution methods, Local Shapley Values and LIME, which aim to quantify the contribution of a feature value $x_i$ to a specific prediction $f(x_1, \dots, x_p)$. Despite their widespread use, we identify…

Machine Learning · Statistics 2026-04-14 Salim I. Amoukou, Nicolas J-B. Brunel

Neural Generalized Mixed-Effects Models

Generalized linear mixed-effects models (GLMMs) are widely used to analyze grouped and hierarchical data. In a GLMM, each response is assumed to follow an exponential-family distribution where the natural parameter is given by a linear…

Machine Learning · Statistics 2026-04-14 Yuli Slavutsky, Sebastian Salazar, David M. Blei

Tail-Aware Information-Theoretic Generalization for RLHF and SGLD

Classical information-theoretic generalization bounds typically control the generalization gap through KL-based mutual information and therefore rely on boundedness or sub-Gaussian tails via the moment generating function (MGF). In many…

Machine Learning · Statistics 2026-04-14 Huiming Zhang, Binghan Li, Wan Tian, Qiang Sun

One-Step Score-Based Density Ratio Estimation

Density ratio estimation (DRE) is a useful tool for quantifying discrepancies between probability distributions, but existing approaches often involve a trade-off between estimation quality and computational efficiency. Classical direct DRE…

Machine Learning · Statistics 2026-04-14 Wei Chen, Qibin Zhao, John Paisley, Junmei Yang, Delu Zeng

A Deep Generative Approach to Stratified Learning

While the manifold hypothesis is widely adopted in modern machine learning, complex data is often better modeled as stratified spaces -- unions of manifolds (strata) of varying dimensions. Stratified learning is challenging due to varying…

Machine Learning · Statistics 2026-04-14 Randy Martinez, Rong Tang, Lizhen Lin

Orthogonal machine learning for conditional odds and risk ratios

Conditional effects are commonly used measures for understanding how treatment effects vary across different groups, and are often used to target treatments/interventions to groups who benefit most. In this work we review existing methods…

Machine Learning · Statistics 2026-04-14 Jiacheng Ge, Iván Díaz

Sparse clustering via the Deterministic Information Bottleneck algorithm

Cluster analysis relates to the task of assigning objects into groups which ideally present some desirable characteristics. When a cluster structure is confined to a subset of the feature space, traditional clustering techniques face…

Machine Learning · Statistics 2026-04-14 Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Unsupervised Domain Adaptation for Binary Classification with an Unobservable Source Subpopulation

We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one such…

Machine Learning · Statistics 2026-04-14 Chao Ying, Jun Jin, Haotian Zhang, Qinglong Tian, Yanyuan Ma, Sharon Li, Jiwei Zhao

BLADE: Bayesian Langevin Active Discovery with Replica Exchange for Identification of Complex Systems

Traditional methods for system discovery frequently struggle with efficient data usage and uncertainty quantification. Identifying the governing equations of complex dynamical systems from data presents a significant challenge in scientific…

Machine Learning · Statistics 2026-04-14 Cindy Xiangrui Kong, Haoyang Zheng, Guang Lin

Online Covariance Matrix Estimation in Sketched Newton Methods

Given the ubiquity of streaming data, online algorithms have been widely used for parameter estimation, with second-order methods particularly standing out for their efficiency and robustness. In this paper, we study an online sketched…

Machine Learning · Statistics 2026-04-14 Wei Kuang, Mihai Anitescu, Sen Na

Score-matching-based Structure Learning for Temporal Data on Networks

Causal discovery is a crucial initial step in establishing causality from empirical data and background knowledge. Numerous algorithms have been developed for this purpose. Among them, the score-matching method has demonstrated superior…

Machine Learning · Statistics 2026-04-14 Hao Chen, Kai Yi

Compactly-supported nonstationary kernels for computing exact Gaussian processes on big data

The Gaussian process (GP) is a widely used probabilistic machine learning method with implicit uncertainty characterization for stochastic function approximation, stochastic modeling, and analyzing real-world measurements of nonlinear…

Machine Learning · Statistics 2026-04-14 Mark D. Risser, Marcus M. Noack, Hengrui Luo, Ronald Pandolfi

Improved identification of breakpoints in piecewise regression and its applications

Identifying breakpoints in piecewise regression is critical in enhancing the reliability and interpretability of data fitting. In this paper, we propose novel algorithms based on the greedy algorithm to accurately and efficiently identify…

Machine Learning · Statistics 2026-04-14 Taehyeong Kim, Hyungu Lee, Myungjin Kim, Hayoung Choi

Sharp description of local minima in the loss landscape of high-dimensional two-layer ReLU neural networks

We study the population loss landscape of two-layer ReLU networks of the form $\sum_{k=1}^K \mathrm{ReLU}(w_k^\top x)$ in a realisable teacher-student setting with Gaussian covariates. We show that local minima admit an exact…

Machine Learning · Statistics 2026-04-13 Jie Huang, Bruno Loureiro, Stefano Sarao Mannelli

Iterative Identification Closure: Amplifying Causal Identifiability in Linear SEMs

The Half-Trek Criterion (HTC) is the primary graphical tool for determining generic identifiability of causal effect coefficients in linear structural equation models (SEMs) with latent confounders. However, HTC is inherently node-wise: it…

Machine Learning · Statistics 2026-04-13 Ziyi Ding, Xiao-Ping Zhang

A Predictive View on Streaming Hidden Markov Models

We develop a predictive-first optimisation framework for streaming hidden Markov models. Unlike classical approaches that prioritise full posterior recovery under a fully specified generative model, we assume access to regime-specific…

Machine Learning · Statistics 2026-04-13 Gerardo Duran-Martin

Identifying Causal Effects Using a Single Proxy Variable

Unobserved confounding is a key challenge when estimating causal effects of a treatment on an outcome in scientific applications. In this work, we assume that we observe a single, potentially multi-dimensional proxy variable of the…

Machine Learning · Statistics 2026-04-13 Silvan Vollmer, Niklas Pfister, Sebastian Weichwald

Online Quantile Regression for Nonparametric Additive Models

This paper introduces a projected functional gradient descent algorithm (P-FGD) for training nonparametric additive quantile regression models in online settings. This algorithm extends the functional stochastic gradient descent framework…

Machine Learning · Statistics 2026-04-13 Haoran Zhan

A novel hybrid approach for positive-valued DAG learning

Causal discovery from observational data remains a fundamental challenge in machine learning and statistics, particularly when variables represent inherently positive quantities such as gene expression levels, asset prices, company…

Machine Learning · Statistics 2026-04-13 Yao Zhao

Policy-Aware Design of Large-Scale Factorial Experiments

Digital firms routinely run many online experiments on shared user populations. When product decisions are compositional, such as combinations of interface elements, flows, messages, or incentives, the number of feasible interventions grows…

Machine Learning · Statistics 2026-04-13 Xin Wen, Xi Chen, Will Wei Sun, Yichen Zhang