Feature selection in high-dimensional dataset using MapReduce

Distributed, Parallel, and Cluster Computing 2017-09-08 v1 Machine Learning

Abstract

This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance (mRMR) algorithm, a popular feature selection method in bioinformatics and network inference. The proposed approach handles both tall/narrow datasets (many observations, few features) and wide/short datasets (many features, few observations). We also provide an open-source implementation based on Hadoop/Spark and illustrate its scalability on datasets involving millions of observations or features.
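To make the idea concrete, the following is a minimal single-machine sketch of greedy mRMR selection over a tall/narrow layout, where each row is mapped to per-feature contingency counts and the counts are summed in a reduce step. It is not the authors' Hadoop/Spark implementation: the function names (`map_row`, `reduce_counts`, `mi_against`, `mrmr`) are hypothetical, the map and reduce phases are emulated with Python built-ins, features are assumed to be discrete, and the score is assumed to be the common "relevance minus mean redundancy" form of mRMR.

```python
from collections import Counter
from functools import reduce
from math import log


def mutual_information(pair_counts, n):
    """Mutual information (in nats) from a Counter of (x, y) -> count over n rows."""
    px, py = Counter(), Counter()
    for (x, y), c in pair_counts.items():
        px[x] += c
        py[y] += c
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in pair_counts.items())


def map_row(row, target_value):
    """Map step (illustrative): one row emits one (feature value, target value) count per feature."""
    return {j: Counter({(value, target_value): 1}) for j, value in enumerate(row)}


def reduce_counts(acc, emitted):
    """Reduce step (illustrative): sum the contingency counts feature by feature."""
    for j, counts in emitted.items():
        acc.setdefault(j, Counter()).update(counts)
    return acc


def mi_against(rows, target):
    """Mutual information of every feature column with the `target` column."""
    n, d = len(rows), len(rows[0])
    mapped = (map_row(r, t) for r, t in zip(rows, target))
    counts = reduce(reduce_counts, mapped, {})
    return [mutual_information(counts[j], n) for j in range(d)]


def mrmr(rows, labels, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    d = len(rows[0])
    relevance = mi_against(rows, labels)
    redundancy_sum = [0.0] * d  # running sum of MI with already selected features
    selected = []
    for _ in range(k):
        def score(j):
            penalty = redundancy_sum[j] / len(selected) if selected else 0.0
            return relevance[j] - penalty
        best = max((j for j in range(d) if j not in selected), key=score)
        selected.append(best)
        # One more map/reduce pass: MI of every feature with the newly selected one.
        new_mi = mi_against(rows, [r[best] for r in rows])
        redundancy_sum = [a + b for a, b in zip(redundancy_sum, new_mi)]
    return selected


# Toy data (made up for illustration): column 1 duplicates column 0,
# column 2 is independent of column 0, and the class is 2 * col0 + col2.
rows = [(a, a, b) for a in (0, 1) for b in (0, 1)] * 2
labels = [2 * r[0] + r[2] for r in rows]
selected = mrmr(rows, labels, k=2)
print(selected)
```

On this toy input, the redundancy penalty cancels the relevance of whichever duplicated column was not picked first, so the selection contains column 2 together with only one of columns 0 and 1. In a wide/short layout, the same counts would more naturally be computed per feature column rather than per row.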

Cite

@article{arxiv.1709.02327,
  title  = {Feature selection in high-dimensional dataset using MapReduce},
  author = {Claudio Reggiani and Yann-Aël Le Borgne and Gianluca Bontempi},
  journal= {arXiv preprint arXiv:1709.02327},
  year   = {2017}
}