Machine Learning - Scifaro

The Strong, Weak and Benign Goodhart's law. An independence-free and paradigm-agnostic formalisation

Goodhart's law is a famous adage in policy-making that states that "When a measure becomes a target, it ceases to be a good measure". As machine learning models and the optimisation capacity to train them grow, growing empirical evidence…

Machine Learning · Statistics 2025-09-05 Adrien Majka, El-Mahdi El-Mhamdi

Convergence of Unadjusted Langevin in High Dimensions: Delocalization of Bias

The unadjusted Langevin algorithm is commonly used to sample probability distributions in extremely high-dimensional settings. However, existing analyses of the algorithm for strongly log-concave distributions suggest that, as the dimension…

Machine Learning · Statistics 2025-09-05 Yifan Chen, Xiaoou Cheng, Jonathan Niles-Weed, Jonathan Weare

Reverse Ordering Techniques for Attention-Based Channel Prediction

This work aims to predict channels in wireless communication systems based on noisy observations, utilizing sequence-to-sequence models with attention (Seq2Seq-attn) and transformer models. Both models are adapted from natural language…

Machine Learning · Statistics 2025-09-05 Valentina Rizzello, Benedikt Böck, Michael Joham, Wolfgang Utschick

Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation

Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that…

Machine Learning · Statistics 2025-09-04 Imad Aouali, Otmane Sakhi

Non-Linear Counterfactual Aggregate Optimization

We consider the problem of directly optimizing a non-linear function of an outcome, where this outcome itself is the sum of many small contributions. The non-linearity of the function means that the problem is not equivalent to the…

Machine Learning · Statistics 2025-09-04 Benjamin Heymann, Otmane Sakhi

Scale-Adaptive Generative Flows for Multiscale Scientific Data

Flow-based generative models can face significant challenges when modeling scientific data with multiscale Fourier spectra, often producing large errors in fine-scale features. We address this problem within the framework of stochastic…

Machine Learning · Statistics 2025-09-04 Yifan Chen, Eric Vanden-Eijnden

Fast kernel methods: Sobolev, physics-informed, and additive models

Kernel methods are powerful tools in statistical learning, but their cubic complexity in the sample size n limits their use on large-scale datasets. In this work, we introduce a scalable framework for kernel regression with O(n log n)…

Machine Learning · Statistics 2025-09-04 Nathan Doumèche, Francis Bach, Gérard Biau, Claire Boyer

Debiased maximum-likelihood estimators for hazard ratios under kernel-based machine-learning adjustment

Previous studies have shown that hazard ratios between treatment groups estimated with the Cox model are uninterpretable because the unspecified baseline hazard of the model fails to identify temporal change in the risk set composition due…

Machine Learning · Statistics 2025-09-04 Takashi Hayakawa, Satoshi Asai

Statistical Test for Saliency Maps of Graph Neural Networks via Selective Inference

Graph Neural Networks (GNNs) have gained prominence for their ability to process graph-structured data across various domains. However, interpreting GNN decisions remains a significant challenge, leading to the adoption of saliency maps for…

Machine Learning · Statistics 2025-09-04 Shuichi Nishino, Tomohiro Shiraishi, Teruyuki Katsuoka, Ichiro Takeuchi

A Novel Characterization of the Population Area Under the Risk Coverage Curve (AURC) and Rates of Finite Sample Estimators

The selective classifier (SC) has been proposed for rank-based uncertainty thresholding, which could have applications in safety-critical areas such as medical diagnostics, autonomous driving, and the justice system. The Area Under the…

Machine Learning · Statistics 2025-09-04 Han Zhou, Jordy Van Landeghem, Teodora Popordanoska, Matthew B. Blaschko

Probabilities of Causation and Root Cause Analysis with Quasi-Markovian Models

Probabilities of causation provide principled ways to assess causal relationships but face computational challenges due to partial identifiability and latent confounding. This paper introduces both algorithmic simplifications, significantly…

Machine Learning · Statistics 2025-09-03 Eduardo Rocha Laurentino, Fabio Gagliardi Cozman, Denis Deratani Maua, Daniel Angelo Esteves Lawand, Davi Goncalves Bezerra Coelho, Lucas Martins Marques

Design of Experiment for Discovering Directed Mixed Graph

We study the problem of experimental design for accurately identifying the causal graph structure of a simple structural causal model (SCM), where the underlying graph may include both cycles and bidirected edges induced by latent…

Machine Learning · Statistics 2025-09-03 Haijie Xu, Chen Zhang

The Price of Sparsity: Sufficient Conditions for Sparse Recovery using Sparse and Sparsified Measurements

We consider the problem of recovering the support of a sparse signal using noisy projections. While extensive work has been done on the dense measurement matrix setting, the sparse setting remains less explored. In this work, we establish…

Machine Learning · Statistics 2025-09-03 Youssef Chaabouni, David Gamarnik

Hybrid Topic-Semantic Labeling and Graph Embeddings for Unsupervised Legal Document Clustering

Legal documents pose unique challenges for text classification due to their domain-specific language and often limited labeled data. This paper proposes a hybrid approach for classifying legal texts by combining unsupervised topic and graph…

Machine Learning · Statistics 2025-09-03 Deepak Bastola, Woohyeok Choi

Beyond Universal Approximation Theorems: Algorithmic Uniform Approximation by Neural Networks Trained with Noisy Data

At its core, machine learning seeks to train models that reliably generalize beyond noisy observations; however, the theoretical vacuum in which state-of-the-art universal approximation theorems (UATs) operate isolates them from this goal,…

Machine Learning · Statistics 2025-09-03 Anastasis Kratsios, Tin Sum Cheng, Daniel Roy

Identifying Causal Direction via Dense Functional Classes

We address the problem of determining the causal direction between two univariate, continuous-valued variables, X and Y, under the assumption of no hidden confounders. In general, it is not possible to make definitive statements about…

Machine Learning · Statistics 2025-09-03 Katerina Hlavackova-Schindler, Suzana Marsela

Probit Monotone BART

Bayesian Additive Regression Trees (BART) of Chipman et al. (2010) has proven to be a powerful tool for nonparametric modeling and prediction. Monotone BART (Chipman et al., 2022) is a recent development that allows BART to be more precise…

Machine Learning · Statistics 2025-09-03 Jared D. Fisher

Assessing One-Dimensional Cluster Stability by Extreme-Point Trimming

We develop a probabilistic method for assessing the tail behavior and geometric stability of one-dimensional n i.i.d. samples by tracking how their span contracts when the most extreme points are trimmed. Central to our approach is the…

Machine Learning · Statistics 2025-09-03 Erwan Dereure, Emmanuel Akame Mfoumou, David Holcman

Mo' Memory, Mo' Problems: Stream-Native Machine Unlearning

Machine unlearning work assumes a static, i.i.d. training environment that doesn't truly exist. Modern ML pipelines need to learn, unlearn, and predict continuously on production streams of data. We translate batch unlearning to the online…

Machine Learning · Statistics 2025-09-03 Kennon Stewart

In-Context Learning as Nonparametric Conditional Probability Estimation: Risk Bounds and Optimality

This paper investigates the expected excess risk of in-context learning (ICL) for multiclass classification. We formalize each task as a sequence of labeled examples followed by a query input; a pretrained model then estimates the query's…

Machine Learning · Statistics 2025-09-03 Chenrui Liu, Falong Tan, Chuanlong Xie, Yicheng Zeng, Lixing Zhu