Related papers: Distributed Correlation-Based Feature Selection in…

Distributed ReliefF based Feature Selection in Spark

Feature selection (FS) is a key research area in the machine learning and data mining fields; removing irrelevant and redundant features usually helps to reduce the effort required to process a dataset while maintaining or even improving…

Machine Learning · Computer Science 2018-11-02 Raul-Jose Palma-Mendoza, Daniel Rodriguez, Luis de-Marcos

CFS: A Distributed File System for Large Scale Container Platforms

We propose CFS, a distributed file system for large scale container platforms. CFS supports both sequential and random file accesses with optimized storage for both large files and small files, and adopts different replication protocols for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-11 Haifeng Liu, Wei Ding, Yuan Chen, Weilong Guo, Shuoran Liu, Tianpeng Li, Mofei Zhang, Jianxing Zhao, Hongyin Zhu, Zhengyi Zhu

An Information Theoretic Feature Selection Framework for Big Data under Apache Spark

With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among many techniques, feature selection has been growing in interest as an important tool to identify relevant features on…

Artificial Intelligence · Computer Science 2016-10-20 Sergio Ramírez-Gallego, Héctor Mouriño-Talín, David Martínez-Rego, Verónica Bolón-Canedo, José Manuel Benítez, Amparo Alonso-Betanzos, Francisco Herrera

Feature subset selection for Big Data via Chaotic Binary Differential Evolution under Apache Spark

Feature subset selection (FSS) using a wrapper approach is essentially a combinatorial optimization problem having two objective functions namely cardinality of the selected-feature-subset, which should be minimized and the corresponding…

Neural and Evolutionary Computing · Computer Science 2022-02-09 Yelleti Vivek, Vadlamani Ravi, P. Radhakrishna

Clustering High-dimensional Data via Feature Selection

High-dimensional clustering analysis is a challenging problem in statistics and machine learning, with broad applications such as the analysis of microarray data and RNA-seq data. In this paper, we propose a new clustering procedure called…

Methodology · Statistics 2022-10-31 Tianqi Liu, Yu Lu, Biqing Zhu, Hongyu Zhao

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé, Hubert Naacke, Mohamed-Amine Baazizi, Bernd Amann

A Novel Scalable Apache Spark Based Feature Extraction Approaches for Huge Protein Sequence and their Clustering Performance Analysis

Genome sequencing projects are rapidly increasing the number of high-dimensional protein sequence datasets. Clustering a high-dimensional protein sequence dataset using traditional machine learning approaches poses many challenges. Many…

Quantitative Methods · Quantitative Biology 2022-04-27 Preeti Jha, Aruna Tiwari, Neha Bharill, Milind Ratnaparkhe, Om Prakash Patel, Nilagiri Harshith, Mukkamalla Mounika, Neha Nagendra

Scaling associative classification for very large datasets

Supervised learning algorithms are nowadays successfully scaling up to datasets that are very large in volume, leveraging the potential of in-memory cluster-computing Big Data frameworks. Still, massive datasets with a number of…

Machine Learning · Computer Science 2018-05-11 Luca Venturini, Elena Baralis, Paolo Garza

A Distributed Collaborative Filtering Algorithm Using Multiple Data Sources

Collaborative Filtering (CF) is one of the most commonly used recommendation methods. CF consists in predicting whether, or how much, a user will like (or dislike) an item by leveraging the knowledge of the user's preferences as well as…

Information Retrieval · Computer Science 2018-07-17 Mohamed Reda Bouadjenek, Esther Pacitti, Maximilien Servajean, Florent Masseglia, Amr El Abbadi

Causality-based Feature Selection: Methods and Evaluations

Feature selection is a crucial preprocessing step in data analytics and machine learning. Classical feature selection algorithms select features based on the correlations between predictive features and the class variable and do not attempt…

Machine Learning · Computer Science 2019-11-19 Kui Yu, Xianjie Guo, Lin Liu, Jiuyong Li, Hao Wang, Zhaolong Ling, Xindong Wu

Sparse Decentralized Federated Learning

Decentralized Federated Learning (DFL) enables collaborative model training without a central server but faces challenges in efficiency, stability, and trustworthiness due to communication and computational limitations among distributed…

Machine Learning · Computer Science 2025-03-18 Shan Sha, Shenglong Zhou, Lingchen Kong, Geoffrey Ye Li

A Stochastic Large-scale Machine Learning Algorithm for Distributed Features and Observations

As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine…

Machine Learning · Statistics 2019-12-10 Biyi Fang, Diego Klabjan

DFCA: Decentralized Federated Clustering Algorithm

Clustered Federated Learning has emerged as an effective approach for handling heterogeneous data across clients by partitioning them into clusters with similar or identical data distributions. However, most existing methods, including the…

Machine Learning · Computer Science 2026-03-03 Jonas Kirch, Sebastian Becker, Tiago Koketsu Rodrigues, Stefan Harmeling

Coordinated Replay Sample Selection for Continual Federated Learning

Continual Federated Learning (CFL) combines Federated Learning (FL), the decentralized learning of a central model on a number of client devices that may not communicate their data, and Continual Learning (CL), the learning of a model from…

Machine Learning · Computer Science 2023-10-24 Jack Good, Jimit Majmudar, Christophe Dupuy, Jixuan Wang, Charith Peris, Clement Chung, Richard Zemel, Rahul Gupta

Distributed Community Detection for Large Scale Networks Using Stochastic Block Model

With rapid developments of information and technology, large scale network data are ubiquitous. In this work we develop a distributed spectral clustering algorithm for community detection in large scale networks. To handle the problem, we…

Methodology · Statistics 2021-06-01 Shihao Wu, Zhe Li, Xuening Zhu

Gradient Boosted Feature Selection

A feature selection algorithm should ideally satisfy four conditions: reliably extract relevant features; be able to identify non-linear feature interactions; scale linearly with the number of features and dimensions; allow the…

Machine Learning · Computer Science 2019-01-15 Zhixiang Eddie Xu, Gao Huang, Kilian Q. Weinberger, Alice X. Zheng

Causally-Guided Diffusion for Stable Feature Selection

Feature selection is fundamental to robust data-centric AI, but most existing methods optimize predictive performance under a single data distribution. This often selects spurious features that fail under distribution shifts. Motivated by…

Machine Learning · Computer Science 2026-03-24 Arun Vignesh Malarkkan, Xinyuan Wang, Kunpeng Liu, Denghui Zhang, Yanjie Fu

Scalable Feature Subset Selection for Big Data using Parallel Hybrid Evolutionary Algorithm based Wrapper in Apache Spark

Owing to the emergence of large datasets, applying current sequential wrapper-based feature subset selection (FSS) algorithms increases the complexity. This limitation motivated us to propose a wrapper for feature subset selection (FSS)…

Neural and Evolutionary Computing · Computer Science 2022-10-28 Yelleti Vivek, Vadlamani Ravi, Pisipati Radhakrishna

Compactness Score: A Fast Filter Method for Unsupervised Feature Selection

Along with the flourish of the information age, massive amounts of data are generated day by day. Due to the large-scale and high-dimensional characteristics of these data, it is often difficult to achieve better decision-making in…

Machine Learning · Computer Science 2023-04-04 Peican Zhu, Xin Hou, Keke Tang, Zhen Wang, Feiping Nie

Feature Selection in the Contrastive Analysis Setting

Contrastive analysis (CA) refers to the exploration of variations uniquely enriched in a target dataset as compared to a corresponding background dataset generated from sources of variation that are irrelevant to a given task. For example,…

Machine Learning · Computer Science 2023-10-31 Ethan Weinberger, Ian Covert, Su-In Lee