机器学习
We analyze two widely used local attribution methods, Local Shapley Values and LIME, which aim to quantify the contribution of a feature value $x_i$ to a specific prediction $f(x_1, \dots, x_p)$. Despite their widespread use, we identify…
Generalized linear mixed-effects models (GLMMs) are widely used to analyze grouped and hierarchical data. In a GLMM, each response is assumed to follow an exponential-family distribution where the natural parameter is given by a linear…
Classical information-theoretic generalization bounds typically control the generalization gap through KL-based mutual information and therefore rely on boundedness or sub-Gaussian tails via the moment generating function (MGF). In many…
Density ratio estimation (DRE) is a useful tool for quantifying discrepancies between probability distributions, but existing approaches often involve a trade-off between estimation quality and computational efficiency. Classical direct DRE…
While the manifold hypothesis is widely adopted in modern machine learning, complex data is often better modeled as stratified spaces -- unions of manifolds (strata) of varying dimensions. Stratified learning is challenging due to varying…
Conditional effects are commonly used measures for understanding how treatment effects vary across different groups, and are often used to target treatments/interventions to groups who benefit most. In this work we review existing methods…
Cluster analysis relates to the task of assigning objects into groups which ideally present some desirable characteristics. When a cluster structure is confined to a subset of the feature space, traditional clustering techniques face…
We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one such…
Traditional methods for system discovery frequently struggle with efficient data usage and uncertainty quantification. Identifying the governing equations of complex dynamical systems from data presents a significant challenge in scientific…
Given the ubiquity of streaming data, online algorithms have been widely used for parameter estimation, with second-order methods particularly standing out for their efficiency and robustness. In this paper, we study an online sketched…
Causal discovery is a crucial initial step in establishing causality from empirical data and background knowledge. Numerous algorithms have been developed for this purpose. Among them, the score-matching method has demonstrated superior…
The Gaussian process (GP) is a widely used probabilistic machine learning method with implicit uncertainty characterization for stochastic function approximation, stochastic modeling, and analyzing real-world measurements of nonlinear…
Identifying breakpoints in piecewise regression is critical in enhancing the reliability and interpretability of data fitting. In this paper, we propose novel algorithms based on the greedy algorithm to accurately and efficiently identify…
We study the population loss landscape of two-layer ReLU networks of the form $\sum_{k=1}^K \mathrm{ReLU}(w_k^\top x)$ in a realisable teacher-student setting with Gaussian covariates. We show that local minima admit an exact…
The Half-Trek Criterion (HTC) is the primary graphical tool for determining generic identifiability of causal effect coefficients in linear structural equation models (SEMs) with latent confounders. However, HTC is inherently node-wise: it…
We develop a predictive-first optimisation framework for streaming hidden Markov models. Unlike classical approaches that prioritise full posterior recovery under a fully specified generative model, we assume access to regime-specific…
Unobserved confounding is a key challenge when estimating causal effects from a treatment on an outcome in scientific applications. In this work, we assume that we observe a single, potentially multi-dimensional proxy variable of the…
This paper introduces a projected functional gradient descent algorithm (P-FGD) for training nonparametric additive quantile regression models in online settings. This algorithm extends the functional stochastic gradient descent framework…
Causal discovery from observational data remains a fundamental challenge in machine learning and statistics, particularly when variables represent inherently positive quantities such as gene expression levels, asset prices, company…
Digital firms routinely run many online experiments on shared user populations. When product decisions are compositional, such as combinations of interface elements, flows, messages, or incentives, the number of feasible interventions grows…