Related papers: Variable Selection in Maximum Mean Discrepancy for…

Variable Selection for Kernel Two-Sample Tests

We consider the variable selection problem for two-sample tests, aiming to select the most informative variables to determine whether two collections of samples follow the same distribution. To address this, we propose a novel framework…

Machine Learning · Statistics 2024-12-23 Jie Wang , Santanu S. Dey , Yao Xie

Variable Selection for Comparing High-dimensional Time-Series Data

Given a pair of multivariate time-series data of the same length and dimensions, an approach is proposed to select variables and time intervals where the two series are significantly different. In applications where one time series is an…

Methodology · Statistics 2024-12-11 Kensuke Mitsuzawa , Margherita Grossi , Stefano Bortoli , Motonobu Kanagawa

Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications

Food authenticity studies are concerned with determining if food samples have been correctly labeled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity…

Methodology · Statistics 2010-10-08 Thomas Brendan Murphy , Nema Dean , Adrian E. Raftery

Subset Selection for Multiple Linear Regression via Optimization

Subset selection in multiple linear regression aims to choose a subset of candidate explanatory variables that tradeoff fitting error (explanatory power) and model complexity (number of variables selected). We build mathematical programming…

Machine Learning · Statistics 2020-09-04 Young Woong Park , Diego Klabjan

Variable selection in discriminant analysis for mixed variables and several groups

We propose a method for variable selection in discriminant analysis with mixed categorical and continuous variables. This method is based on a criterion that permits to reduce the variable selection problem to a problem of estimating…

Statistics Theory · Mathematics 2017-03-14 Alban Mbina Mbina , Guy Martial Nkiet , Fulgence Eyi Obiang

A Scalable Nystrom-Based Kernel Two-Sample Test with Permutations

Two-sample hypothesis testing-determining whether two sets of data are drawn from the same distribution-is a fundamental problem in statistics and machine learning with broad scientific applications. In the context of nonparametric testing,…

Machine Learning · Statistics 2026-04-21 Antoine Chatalic , Marco Letizia , Nicolas Schreuder , Lorenzo Rosasco

Diverse Rule Sets

While machine-learning models are flourishing and transforming many aspects of everyday life, the inability of humans to understand complex models poses difficulties for these models to be fully trusted and embraced. Thus, interpretability…

Artificial Intelligence · Computer Science 2020-06-18 Guangyi Zhang , Aristides Gionis

"What is Different Between These Datasets?" A Framework for Explaining Data Distribution Shifts

The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two…

Machine Learning · Computer Science 2025-09-24 Varun Babbar , Zhicheng Guo , Cynthia Rudin

Bayesian subset selection and variable importance for interpretable prediction and classification

Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with…

Machine Learning · Statistics 2022-02-17 Daniel R. Kowal

On the variance of subset sum estimation

For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries…

Data Structures and Algorithms · Computer Science 2007-05-23 Mario Szegedy , Mikkel Thorup

Recognizing Variables from their Data via Deep Embeddings of Distributions

A key obstacle in automated analytics and meta-learning is the inability to recognize when different datasets contain measurements of the same variable. Because provided attribute labels are often uninformative in practice, this task may be…

Machine Learning · Computer Science 2019-09-12 Jonas Mueller , Alex Smola

Training generative neural networks via Maximum Mean Discrepancy optimization

We consider training a deep neural network to generate samples from an unknown distribution given i.i.d. data. We frame learning as an optimization minimizing a two-sample test statistic---informally speaking, a good generator network…

Machine Learning · Statistics 2015-05-18 Gintare Karolina Dziugaite , Daniel M. Roy , Zoubin Ghahramani

Maximum Mean Discrepancy with Unequal Sample Sizes via Generalized U-Statistics

Existing two-sample testing techniques, particularly those based on choosing a kernel for the Maximum Mean Discrepancy (MMD), often assume equal sample sizes from the two distributions. Applying these methods in practice can require…

Machine Learning · Statistics 2025-12-17 Aaron Wei , Milad Jalali , Danica J. Sutherland

Two-sample inference for high-dimensional Markov networks

Markov networks are frequently used in sciences to represent conditional independence relationships underlying observed variables arising from a complex system. It is often of interest to understand how an underlying network differs between…

Methodology · Statistics 2021-04-26 Byol Kim , Song Liu , Mladen Kolar

Robust variable selection for model-based learning in presence of adulteration

The problem of identifying the most discriminating features when performing supervised learning has been extensively investigated. In particular, several methods for variable selection in model-based classification have been proposed.…

Applications · Statistics 2020-12-16 Andrea Cappozzo , Francesca Greselin , Thomas Brendan Murphy

Scalable variable selection for two-view learning tasks with projection operators

In this paper we propose a novel variable selection method for two-view settings, or for vector-valued supervised learning problems. Our framework is able to handle extremely large scale selection tasks, where number of data samples could…

Machine Learning · Computer Science 2023-07-06 Sandor Szedmak , Riikka Huusari , Tat Hong Duong Le , Juho Rousu

Optimal Feature Selection in High-Dimensional Discriminant Analysis

We consider the high-dimensional discriminant analysis problem. For this problem, different methods have been proposed and justified by establishing exact convergence rates for the classification risk, as well as the l2 convergence results…

Machine Learning · Statistics 2013-06-28 Mladen Kolar , Han Liu

Variable selection in semiparametric regression modeling

In this paper, we are concerned with how to select significant variables in semiparametric modeling. Variable selection for semiparametric regression models consists of two components: model selection for nonparametric components and…

Statistics Theory · Mathematics 2008-12-18 Runze Li , Hua Liang

Post Selection Inference with Incomplete Maximum Mean Discrepancy Estimator

Measuring divergence between two distributions is essential in machine learning and statistics and has various applications including binary classification, change point detection, and two-sample test. Furthermore, in the era of big data,…

Machine Learning · Statistics 2018-02-20 Makoto Yamada , Denny Wu , Yao-Hung Hubert Tsai , Ichiro Takeuchi , Ruslan Salakhutdinov , Kenji Fukumizu

Output-weighted optimal sampling for Bayesian regression and rare event statistics using few samples

For many important problems the quantity of interest is an unknown function of the parameters, which is a random vector with known statistics. Since the dependence of the output on this random vector is unknown, the challenge is to identify…

Machine Learning · Statistics 2021-04-28 Themistoklis P. Sapsis