Related papers: Feature selection in high-dimensional dataset usin…

Analyzing Large-Scale, Distributed and Uncertain Data

The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively…

Databases · Computer Science 2017-12-06 Yaron Gonen

Scalable mRMR feature selection to handle high dimensional datasets: Vertical partitioning based Iterative MapReduce framework

While building machine learning models, Feature selection (FS) stands out as an essential preprocessing step used to handle the uncertainty and vagueness in the data. Recently, the minimum Redundancy and Maximum Relevance (mRMR) approach…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-25 Yelleti Vivek, P. S. V. S. Sai Prasad

Feature selection using nearest attributes

Feature selection is an important problem in high-dimensional data analysis and classification. Conventional feature selection approaches focus on detecting the features based on a redundancy criterion using learning and feature searching…

Computer Vision and Pattern Recognition · Computer Science 2012-01-31 Alex Pappachen James, Sima Dimitrijev

Distributed ReliefF based Feature Selection in Spark

Feature selection (FS) is a key research area in the machine learning and data mining fields; removing irrelevant and redundant features usually helps to reduce the effort required to process a dataset while maintaining or even improving…

Machine Learning · Computer Science 2018-11-02 Raul-Jose Palma-Mendoza, Daniel Rodriguez, Luis de-Marcos

N$^3$LARS: Minimum Redundancy Maximum Relevance Feature Selection for Large and High-dimensional Data

We propose a feature selection method that finds non-redundant features from a large and high-dimensional data in nonlinear way. Specifically, we propose a nonlinear extension of the non-negative least-angle regression (LARS) called…

Machine Learning · Statistics 2014-11-11 Makoto Yamada, Avishek Saha, Hua Ouyang, Dawei Yin, Yi Chang

Optimizing MapReduce for Highly Distributed Environments

MapReduce, the popular programming paradigm for large-scale data processing, has traditionally been deployed over tightly-coupled clusters where the data is already locally available. The assumption that the data and compute resources are…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-07-31 Benjamin Heintz, Abhishek Chandra, Ramesh K. Sitaraman

Max-Margin Feature Selection

Many machine learning applications such as in vision, biology and social networking deal with data in high dimensions. Feature selection is typically employed to select a subset of features which improves generalization accuracy as well…

Machine Learning · Computer Science 2016-06-15 Yamuna Prasad, Dinesh Khandelwal, K. K. Biswas

Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform

In machine learning applications for online product offerings and marketing strategies, there are often hundreds or thousands of features available to build such models. Feature selection is one essential method in such applications for…

Machine Learning · Statistics 2019-08-16 Zhenyu Zhao, Radhika Anand, Mallory Wang

A Distributed Deep Representation Learning Model for Big Image Data Classification

This paper describes an effective and efficient image classification framework nominated distributed deep representation learning model (DDRL). The aim is to strike the balance between the computational intensive deep learning approaches…

Computer Vision and Pattern Recognition · Computer Science 2016-07-05 Le Dong, Na Lv, Qianni Zhang, Shanshan Xie, Ling He, Mengdie Mao

Distributed Parameter Map-Reduce

This paper describes how to convert a machine learning problem into a series of map-reduce tasks. We study logistic regression algorithm. In logistic regression algorithm, it is assumed that samples are independent and each sample is…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-06 Qi Li

DimReduction - Interactive Graphic Environment for Dimensionality Reduction

Feature selection is a pattern recognition approach to choose important variables according to some criteria to distinguish or explain certain phenomena. There are many genomic and proteomic applications which rely on feature selection to…

Computer Vision and Pattern Recognition · Computer Science 2011-06-13 Fabricio Martins Lopes, David Correa Martins-Jr, Roberto M. Cesar-Jr

Permutation-based multi-objective evolutionary feature selection for high-dimensional data

Feature selection is a critical step in the analysis of high-dimensional data, where the number of features often vastly exceeds the number of samples. Effective feature selection not only improves model performance and interpretability but…

Machine Learning · Computer Science 2025-01-27 Raquel Espinosa, Gracia Sánchez, José Palma, Fernando Jiménez

Feature Selection in High-dimensional Spaces Using Graph-Based Methods

High-dimensional feature selection is a central problem in a variety of application domains such as machine learning, image analysis, and genomics. In this paper, we propose graph-based tests as a useful basis for feature selection. We…

Methodology · Statistics 2024-08-13 Swarnadip Ghosh, Somabha Mukherjee, Divyansh Agarwal, Yichen He, Mingzhi Song, Xuejiao Pei

High Dimensional Low Rank plus Sparse Matrix Decomposition

This paper is concerned with the problem of low rank plus sparse matrix decomposition for big data. Conventional algorithms for matrix decomposition use the entire data to extract the low-rank and sparse components, and are based on…

Numerical Analysis · Computer Science 2017-03-17 Mostafa Rahmani, George Atia

On feature selection in double-imbalanced data settings: a Random Forest approach

Feature selection is a critical step in high-dimensional classification tasks, particularly under challenging conditions of double imbalance, namely settings characterized by both class imbalance in the response variable and dimensional…

Methodology · Statistics 2025-06-13 Fabio Demaria

Optimization and analysis of large scale data sorting algorithm based on Hadoop

When dealing with massive data sorting, we usually use Hadoop which is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. A common approach in implement of…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-02 Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang

Enumerating Maximal Bicliques from a Large Graph using MapReduce

We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many practical data mining problems in social network analysis and bioinformatics. We present novel parallel algorithms for the…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-22 Arko Provo Mukherjee, Srikanta Tirthapura

An Alternative C++ based HPC system for Hadoop MapReduce

MapReduce is a technique used to vastly improve distributed processing of data and can massively speed up computation. Hadoop and its MapReduce rely on the JVM and Java, which are expensive on memory. High Performance Computing based MapReduce…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-29 Vignesh S., Muthumanikandan V., Siddarth S., Sainath G

Scalable and Efficient Statistical Inference with Estimating Functions in the MapReduce Paradigm for Big Data

The theory of statistical inference along with the strategy of divide-and-conquer for large-scale data analysis has recently attracted considerable interest due to great popularity of the MapReduce programming paradigm in the Apache Hadoop…

Methodology · Statistics 2017-09-14 Ling Zhou, Peter X.-K. Song

A High-Dimensional Feature Selection Algorithm Based on Multiobjective Differential Evolution

Multiobjective feature selection seeks to determine the most discriminative feature subset by simultaneously optimizing two conflicting objectives: minimizing the number of selected features and the classification error rate. The goal is to…

Neural and Evolutionary Computing · Computer Science 2025-05-12 Zhenxing Zhang, Qianxiang An, Yilei Wang, Chenfeng Wu, Baoling Dong, Chunjie Zhou