
Variable Selection for Multi-Source Count Data with Controlled False Discovery Rate

Applications 2025-11-11 v2 Methodology

Abstract

The rapid generation of complex, highly skewed, and zero-inflated multi-source count data poses significant challenges for variable selection, particularly in biomedical domains such as tumor development and metabolic dysregulation. To address this, we propose a new variable selection method, Zero-Inflated Poisson-Gamma Simultaneous Knockoff (ZIPG-SK), specifically designed for multi-source count data. Our method uses a Gaussian copula based on the Zero-Inflated Poisson-Gamma (ZIPG) distribution to construct knockoffs that account for the properties of count data, including high skewness and zero inflation, while incorporating covariate information. This framework enables the detection of common features across multi-source datasets with guaranteed false discovery rate (FDR) control. Furthermore, we enhance the power of the method by incorporating e-value aggregation, which mitigates the inherent randomness in knockoff generation. Through extensive simulations, we demonstrate that ZIPG-SK significantly outperforms existing methods, achieving superior power across various scenarios. We validate the utility of our method on real-world colorectal cancer (CRC) and type 2 diabetes (T2D) datasets, identifying key variables whose characteristics align with established findings and that also provide new mechanistic insights.
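To illustrate the e-value aggregation step mentioned above, the following is a minimal sketch of the generic recipe for derandomizing knockoffs: each knockoff run is converted into per-feature e-values, the e-values are averaged across runs, and the e-BH procedure is applied to the averages. It is not the ZIPG-SK procedure itself. The function names `knockoff_evalues` and `ebh`, the per-run level `alpha_kn`, and the assumption that the knockoff statistics `W` have already been computed are all hypothetical choices made for illustration.

```python
import numpy as np

def knockoff_evalues(W, alpha_kn):
    """Convert one run's knockoff statistics W into per-feature e-values.

    Assumes large positive W[j] is evidence that feature j is relevant.
    """
    W = np.asarray(W, dtype=float)
    p = W.size
    # Candidate thresholds: the distinct nonzero magnitudes of W, ascending.
    candidates = np.unique(np.abs(W[W != 0]))
    T = np.inf
    for t in candidates:
        # Knockoff+ estimate of the false discovery proportion at threshold t.
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= alpha_kn:
            T = t
            break
    if not np.isfinite(T):
        return np.zeros(p)  # no valid threshold: every e-value is zero
    return p * (W >= T) / (1 + np.sum(W <= -T))

def ebh(e, alpha):
    """e-BH: select the k features with the largest e-values, where k is the
    largest index such that the k-th largest e-value is >= p / (alpha * k)."""
    e = np.asarray(e, dtype=float)
    p = e.size
    order = np.argsort(-e)                 # indices by decreasing e-value
    ks = np.arange(1, p + 1)
    ok = np.nonzero(e[order] >= p / (alpha * ks))[0]
    if ok.size == 0:
        return np.array([], dtype=int)
    return np.sort(order[: ok[-1] + 1])

# Usage: average e-values over several knockoff runs, then apply e-BH.
# The rows of W_runs are made-up statistics standing in for real runs.
W_runs = np.array([
    [5.0, 4.0, 3.0, 2.5, -0.1, 0.2, -0.3, 0.1, 0.05, -0.05],
    [4.5, 3.5, 2.0, 3.0, 0.1, -0.2, -0.1, 0.3, -0.05, 0.05],
])
e_avg = np.mean([knockoff_evalues(W, alpha_kn=0.5) for W in W_runs], axis=0)
selected = ebh(e_avg, alpha=0.6)
```

Averaging is valid here because the mean of e-values is again an e-value, so the selection no longer depends on a single random knockoff draw.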

Cite

@article{arxiv.2411.18986,
  title  = {Variable Selection for Multi-Source Count Data with Controlled False Discovery Rate},
  author = {Shan Tang and Shanjun Mao and Shourong Ma and Falong Tan},
  journal= {arXiv preprint arXiv:2411.18986},
  year   = {2025}
}