Feature selection for high-dimensional integrated data

Charles Zheng; Scott Schwartz; Robert Chapkin; Raymond Carroll; Ivan Ivanov

Applications 2011-11-29 v1

Abstract

Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of \emph{feature selection} in which only a subset of the predictors, $X_t$, depends on the multidimensional variate $Y$, and the remaining predictors constitute a "noise set" $X_u$ independent of $Y$. Using Monte Carlo simulations, we investigate the relative performance of two methods, thresholding and singular-value decomposition (SVD), in combination with stochastic optimization to determine "empirical bounds" on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods on a recent infant intestinal gene expression and metagenomics dataset.
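
The setup described above can be illustrated with a small simulation. The following is only a sketch under assumptions that the abstract does not state: the scoring statistics (maximum absolute cross-correlation for thresholding, projection onto the leading singular direction of the cross-correlation matrix for SVD), the linear form of the dependence of $X_t$ on $Y$, and all function names and parameter values are hypothetical choices made for illustration. The stochastic optimization step and the asymptotic approximation are not covered.

```python
import numpy as np


def simulate(n=200, p_signal=5, p_noise=45, q=3, seed=0):
    """Draw Y, a signal block X_t that depends on Y, and a noise block X_u.

    Assumption: X_t is linear in Y with positive loadings plus Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n, q))
    B = rng.uniform(0.5, 1.5, size=(q, p_signal))  # hypothetical loadings
    X_t = Y @ B + 0.5 * rng.standard_normal((n, p_signal))
    X_u = rng.standard_normal((n, p_noise))  # independent of Y
    return np.hstack([X_t, X_u]), Y


def cross_correlation(X, Y):
    """Sample correlation between every column of X and every column of Y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / X.shape[0]


def threshold_scores(X, Y):
    """Score each predictor by its largest absolute correlation with Y."""
    return np.abs(cross_correlation(X, Y)).max(axis=1)


def svd_scores(X, Y, rank=1):
    """Score each predictor by its weight in the leading singular directions."""
    U, s, _ = np.linalg.svd(cross_correlation(X, Y), full_matrices=False)
    return np.sqrt(((U[:, :rank] * s[:rank]) ** 2).sum(axis=1))


def top_k(scores, k):
    """Indices of the k highest-scoring predictors."""
    return set(np.argsort(scores)[::-1][:k].tolist())


X, Y = simulate()
selected_by_threshold = top_k(threshold_scores(X, Y), 5)
selected_by_svd = top_k(svd_scores(X, Y), 5)
```

In this toy setting the first `p_signal` columns of `X` play the role of $X_t$ and the rest play the role of $X_u$, so the selected index sets can be compared directly against `set(range(5))`. Repeating the comparison over many seeds gives a simple Monte Carlo estimate of how often each scoring rule recovers the signal set.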

Keywords

sufficient dimension reduction; high-dimensional regression; genomics; statistics

Cite

@article{arxiv.1111.6283,
  title  = {Feature selection for high-dimensional integrated data},
  author = {Charles Zheng and Scott Schwartz and Robert Chapkin and Raymond Carroll and Ivan Ivanov},
  journal= {arXiv preprint arXiv:1111.6283},
  year   = {2011}
}
