Feature selection in high-dimensional dataset using MapReduce

Claudio Reggiani; Yann-Aël Le Borgne; Gianluca Bontempi

Distributed, Parallel, and Cluster Computing; Machine Learning; 2017-09-08 (v1)

Abstract

This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance (mRMR) algorithm, a popular feature selection method in bioinformatics and network inference. The proposed approach handles both tall/narrow datasets (many observations, few features) and wide/short datasets (many features, few observations). We also provide an open-source implementation based on Hadoop/Spark and illustrate its scalability on datasets with millions of observations or features.
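To make the idea concrete, the following is a minimal single-machine sketch of greedy mRMR written in a map/reduce style. It is not the authors' Hadoop/Spark implementation. It assumes discrete features, a row-wise partitioning of the data (the tall/narrow layout), in-memory lists standing in for distributed partitions, and the common "relevance minus mean redundancy" scoring rule. All function names (`map_counts`, `reduce_counts`, `distributed_mi`, `mrmr`) are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import log


def map_counts(rows, pairs):
    """Map step: joint value counts for each column pair (a, b) in one chunk."""
    counts = Counter()
    for row in rows:
        for a, b in pairs:
            counts[(a, b, row[a], row[b])] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge the partial counts of two chunks."""
    return left + right


def mi_from_counts(counts, a, b, n):
    """Empirical mutual information (nats) of columns a and b from joint counts."""
    joint = {(u, v): c for (p, q, u, v), c in counts.items() if (p, q) == (a, b)}
    pa, pb = Counter(), Counter()
    for (u, v), c in joint.items():
        pa[u] += c
        pb[v] += c
    return sum((c / n) * log(c * n / (pa[u] * pb[v])) for (u, v), c in joint.items())


def distributed_mi(chunks, pairs):
    """Mutual information for each column pair, computed from merged chunk counts."""
    n = sum(len(chunk) for chunk in chunks)
    partials = (map_counts(chunk, pairs) for chunk in chunks)
    total = reduce(reduce_counts, partials, Counter())
    return {(a, b): mi_from_counts(total, a, b, n) for a, b in pairs}


def mrmr(chunks, feature_cols, target_col, k):
    """Greedily select k features by relevance minus mean redundancy (assumed rule)."""
    relevance = distributed_mi(chunks, [(f, target_col) for f in feature_cols])
    redundancy = {f: 0.0 for f in feature_cols}  # running sum of I(f; s), s selected
    candidates = list(feature_cols)
    selected = []
    while candidates and len(selected) < k:
        def score(f):
            penalty = redundancy[f] / len(selected) if selected else 0.0
            return relevance[(f, target_col)] - penalty

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
        if candidates:
            # One extra map/reduce pass per selected feature updates the redundancy.
            new_mi = distributed_mi(chunks, [(f, best) for f in candidates])
            for f in candidates:
                redundancy[f] += new_mi[(f, best)]
    return selected


# Toy data: columns 0-2 are features, column 3 is the target y = 2*a + b.
# Column 0 is a, column 1 is a noisy copy of a, column 2 is a noisy copy of b.
rows = [(a, a ^ (r == 0), b ^ (r == 1), 2 * a + b)
        for a in (0, 1) for b in (0, 1) for r in range(4)]
chunks = [rows[:5], rows[5:11], rows[11:]]  # three simulated partitions
print(mrmr(chunks, feature_cols=[0, 1, 2], target_col=3, k=2))
```

In this toy example, column 1 is about as relevant to the target as column 2, but it is largely redundant with column 0, so the redundancy penalty steers the second pick toward column 2. A wide/short dataset would call for a different partitioning (for example by feature rather than by row), which this sketch does not cover.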
