Machine Learning - Scifaro

Functional relevance based on the continuous Shapley value

The presence of artificial intelligence (AI) in our society is increasing, which brings with it the need to understand the behavior of AI mechanisms, including machine learning predictive algorithms fed with tabular data, text or images,…

Machine Learning · Statistics 2025-06-06 Pedro Delicado, Cristian Pachón-García

Hierarchical mixtures of Unigram models for short text clustering: The role of Beta-Liouville priors

This paper presents a variant of the Multinomial mixture model tailored to the unsupervised classification of short text data. While the Multinomial probability vector is traditionally assigned a Dirichlet prior distribution, this work…

Machine Learning · Statistics 2025-06-06 Massimo Bilancia, Samuele Magro

One Wave To Explain Them All: A Unifying Perspective On Feature Attribution

Feature attribution methods aim to improve the transparency of deep neural networks by identifying the input features that influence a model's decision. Pixel-based heatmaps have become the standard for attributing features to…

Machine Learning · Statistics 2025-06-06 Gabriel Kasmi, Amandine Brunetto, Thomas Fel, Jayneel Parekh

Entropy-based Training Methods for Scalable Neural Implicit Sampler

Efficiently sampling from un-normalized target distributions is a fundamental problem in scientific computing and machine learning. Traditional approaches such as Markov Chain Monte Carlo (MCMC) guarantee asymptotically unbiased samples…

Machine Learning · Statistics 2025-06-06 Weijian Luo, Boya Zhang, Zhihua Zhang

Spatially Resolved Meteorological and Ancillary Data in Central Europe for Rainfall Streamflow Modeling

We present a dataset for rainfall streamflow modeling that is fully spatially resolved with the aim of taking neural network-driven hydrological modeling beyond lumped catchments. To this end, we compiled data covering five river basins in…

Machine Learning · Statistics 2025-06-05 Marc Aurel Vischer, Noelia Otero, Jackie Ma

Latent Guided Sampling for Combinatorial Optimization

Combinatorial Optimization problems are widespread in domains such as logistics, manufacturing, and drug discovery, yet their NP-hard nature makes them computationally challenging. Recent Neural Combinatorial Optimization methods leverage…

Machine Learning · Statistics 2025-06-05 Sobihan Surendran, Adeline Fermanian, Sylvain Le Corff

Position: There Is No Free Bayesian Uncertainty Quantification

Due to their intuitive appeal, Bayesian methods of modeling and uncertainty quantification have become popular in modern machine and deep learning. When providing a prior distribution over the parameter space, it is straightforward to…

Machine Learning · Statistics 2025-06-05 Ivan Melev, Goeran Kauermann

SubSearch: Robust Estimation and Outlier Detection for Stochastic Block Models via Subgraph Search

Community detection is a fundamental task in graph analysis, with methods often relying on fitting models like the Stochastic Block Model (SBM) to observed networks. While many algorithms can accurately estimate SBM parameters when the…

Machine Learning · Statistics 2025-06-05 Leonardo Martins Bianco, Christine Keribin, Zacharie Naulet

Models of Heavy-Tailed Mechanistic Universality

Recent theoretical and empirical successes in deep learning, including the celebrated neural scaling laws, are punctuated by the observation that many objects of interest tend to exhibit some form of heavy-tailed or power law behavior. In…

Machine Learning · Statistics 2025-06-05 Liam Hodgkinson, Zhichao Wang, Michael W. Mahoney

Generalization in Federated Learning: A Conditional Mutual Information Framework

Federated learning (FL) is a widely adopted privacy-preserving distributed learning framework, yet its generalization performance remains less explored compared to centralized learning. In FL, the generalization error consists of two…

Machine Learning · Statistics 2025-06-05 Ziqiao Wang, Cheng Long, Yongyi Mao

Nested Expectations with Kernel Quadrature

This paper considers the challenging computational task of estimating nested expectations. Existing algorithms, such as nested Monte Carlo or multilevel Monte Carlo, are known to be consistent but require a large number of samples at both…

Machine Learning · Statistics 2025-06-05 Zonghao Chen, Masha Naslidnyk, François-Xavier Briol

How Compositional Generalization and Creativity Improve as Diffusion Models are Trained

Natural data is often organized as a hierarchical composition of features. How many samples do generative models need in order to learn the composition rules, so as to produce a combinatorially large number of novel data? What signal in the…

Machine Learning · Statistics 2025-06-05 Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, Matthieu Wyart

Development of an offline and online hybrid model for the Integrated Forecasting System

In recent years, there has been significant progress in the development of fully data-driven global numerical weather prediction models. These machine learning weather prediction models have their strength, notably accuracy and low…

Machine Learning · Statistics 2025-06-05 Alban Farchi, Marcin Chrust, Marc Bocquet, Massimo Bonavita

How Two-Layer Neural Networks Learn, One (Giant) Step at a Time

For high-dimensional Gaussian data, we investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to an improvement in the…

Machine Learning · Statistics 2025-06-05 Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan

Robust and Agnostic Learning of Conditional Distributional Treatment Effects

The conditional average treatment effect (CATE) is the best measure of individual causal effects given baseline covariates. However, the CATE only captures the (conditional) average, and can overlook risks and tail events, which are…

Machine Learning · Statistics 2025-06-05 Nathan Kallus, Miruna Oprescu

Causal Explainability of Machine Learning in Heart Failure Prediction from Electronic Health Records

The importance of clinical variables in the prognosis of the disease is explained using statistical correlation or machine learning (ML). However, the predictive importance of these variables may not represent their causal relationships…

Machine Learning · Statistics 2025-06-04 Yina Hou, Shourav B. Rabbani, Liang Hong, Norou Diawara, Manar D. Samad

Computational Thresholds in Multi-Modal Learning via the Spiked Matrix-Tensor Model

We study the recovery of multiple high-dimensional signals from two noisy, correlated modalities: a spiked matrix and a spiked tensor sharing a common low-rank structure. This setting generalizes classical spiked matrix and tensor models,…

Machine Learning · Statistics 2025-06-04 Hugo Tabanelli, Pierre Mergny, Lenka Zdeborova, Florent Krzakala

Assumption-free stability for ranking problems

In this work, we consider ranking problems among a finite set of candidates: for instance, selecting the top-$k$ items among a larger list of candidates or obtaining the full ranking of all items in the set. These problems are often…

Machine Learning · Statistics 2025-06-04 Ruiting Liang, Jake A. Soloff, Rina Foygel Barber, Rebecca Willett

Enabling Probabilistic Learning on Manifolds through Double Diffusion Maps

We present a generative learning framework for probabilistic sampling based on an extension of the Probabilistic Learning on Manifolds (PLoM) approach, which is designed to generate statistically consistent realizations of a random vector…

Machine Learning · Statistics 2025-06-04 Dimitris G Giovanis, Nikolaos Evangelou, Ioannis G Kevrekidis, Roger G Ghanem

Poisoning Bayesian Inference via Data Deletion and Replication

Research in adversarial machine learning (AML) has shown that statistical models are vulnerable to maliciously altered data. However, despite advances in Bayesian machine learning models, most AML research remains concentrated on classical…

Machine Learning · Statistics 2025-06-04 Matthieu Carreau, Roi Naveiro, William N. Caballero