Fair Bayesian Data Selection via Generalized Discrepancy Measures

Machine Learning, 2025-11-11, v1

Abstract

Fairness concerns are increasingly critical as machine learning models are deployed in high-stakes applications. While existing fairness-aware methods typically intervene at the model level, they often suffer from high computational costs, limited scalability, and poor generalization. To address these challenges, we propose a Bayesian data selection framework that ensures fairness by aligning group-specific posterior distributions of model parameters and sample weights with a shared central distribution. Our framework supports flexible alignment via various distributional discrepancy measures, including Wasserstein distance, maximum mean discrepancy, and f-divergence, allowing geometry-aware control without imposing explicit fairness constraints. This data-centric approach mitigates group-specific biases in training data and improves fairness in downstream tasks, with theoretical guarantees. Experiments on benchmark datasets show that our method consistently outperforms existing data selection and model-based fairness methods in both fairness and accuracy.
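
To make the alignment idea concrete, the following is a minimal sketch (not the authors' implementation) of how the discrepancy between group-specific posterior samples and a shared central distribution might be measured. The helper names `rbf_mmd2` and `wasserstein1_1d`, the synthetic Gaussian "posterior samples", and the choice of the pooled samples as the central distribution are all illustrative assumptions; the abstract does not specify how the central distribution is constructed.

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x (n, d) and y (m, d), RBF kernel."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, clipped at 0 for numerical safety.
        sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def wasserstein1_1d(x, y):
    """1-Wasserstein distance between two equal-size 1-D samples (sorted-sample formula)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

# Hypothetical posterior samples for two demographic groups (assumption: Gaussians).
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(500, 2))
group_b = rng.normal(1.0, 1.0, size=(500, 2))

# Assumption: use the pooled samples as a stand-in for the shared central distribution.
central = np.concatenate([group_a, group_b])

# Discrepancy of each group from the central distribution; smaller means better aligned.
mmd_a = rbf_mmd2(group_a, central)
mmd_b = rbf_mmd2(group_b, central)
w1_first_coord = wasserstein1_1d(group_a[:, 0], group_b[:, 0])
print(mmd_a, mmd_b, w1_first_coord)
```

In a selection framework of the kind described, quantities like `mmd_a` and `mmd_b` would act as alignment penalties to be driven toward zero; swapping in a different discrepancy changes which geometric differences between the group posteriors are penalized.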

Cite

@article{arxiv.2511.07032,
  title  = {Fair Bayesian Data Selection via Generalized Discrepancy Measures},
  author = {Yixuan Zhang and Jiabin Luo and Zhenggang Wang and Feng Zhou and Quyu Kong},
  journal= {arXiv preprint arXiv:2511.07032},
  year   = {2025}
}