Related papers: Data Nuggets: A Method for Reducing Big Data While…

Clustering of Big Data with Mixed Features

Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require a priori the unknown number of clusters. We…

Machine Learning · Statistics 2020-11-13 Joshua Tobin , Mimi Zhang

An efficient K -means clustering algorithm for massive data

The analysis of continously larger datasets is a task of major importance in a wide variety of scientific fields. In this sense, cluster analysis algorithms are a key element of exploratory data analysis, due to their easiness in the…

Machine Learning · Statistics 2018-01-10 Marco Capó , Aritz Pérez , Jose A. Lozano

Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means

This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology. The proposed method efficiently integrates parallel processing, stochastic sampling, and competitive optimization to…

Machine Learning · Computer Science 2024-03-28 Rustam Mussabayev , Ravil Mussabayev

A sampling-based approach for efficient clustering in large datasets

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high-performance by evaluating distances of datapoints with a subset of the cluster centres. Our…

Machine Learning · Computer Science 2022-03-30 Georgios Exarchakis , Omar Oubari , Gregor Lenz

An efficient K-means algorithm for Massive Data

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to ma- nipulate and analyze such information. Even though datasets have grown in size, the K-means algorithm…

Machine Learning · Statistics 2016-05-11 Marco Capó , Aritz Pérez , José Antonio Lozano

How to Use K-means for Big Data Clustering?

K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of…

Machine Learning · Computer Science 2023-11-27 Rustam Mussabayev , Nenad Mladenovic , Bassem Jarboui , Ravil Mussabayev

Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

We develop and analyze a method to reduce the size of a very large set of data points in a high dimensional Euclidean space R d to a small set of weighted points such that the result of a predetermined data analysis task on the reduced set…

Data Structures and Algorithms · Computer Science 2018-07-13 Dan Feldman , Melanie Schmidt , Christian Sohler

Probabilistic Partitive Partitioning (PPP)

Clustering is a NP-hard problem. Thus, no optimal algorithm exists, heuristics are applied to cluster the data. Heuristics can be very resource-intensive, if not applied properly. For substantially large data sets computational efficiencies…

Databases · Computer Science 2020-03-11 Mujahid Sultan

Comparative Analysis of Optimization Strategies for K-means Clustering in Big Data Contexts: A Review

This paper presents a comparative analysis of different optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with…

Machine Learning · Computer Science 2024-05-21 Ravil Mussabayev , Rustam Mussabayev

Practical Introduction to Clustering Data

Data clustering is an approach to seek for structure in sets of complex data, i.e., sets of "objects". The main objective is to identify groups of objects which are similar to each other, e.g., for classification. Here, an introduction to…

Data Analysis, Statistics and Probability · Physics 2016-02-17 Alexander K. Hartmann

The importance of the clustering model to detect new types of intrusion in data traffic

In the current digital age, the volume of data generated by various cyber activities has become enormous and is constantly increasing. The data may contain valuable insights that can be harnessed to improve cyber security measures. However,…

Cryptography and Security · Computer Science 2025-03-27 Noor Saud Abd , Noor Walid Khalid , Basim Hussein Ali

A Short Survey on Data Clustering Algorithms

With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial…

Data Structures and Algorithms · Computer Science 2015-12-01 Ka-Chun Wong

Cluster-based pruning techniques for audio data

Deep learning models have become widely adopted in various domains, but their performance heavily relies on a vast amount of data. Datasets often contain a large number of irrelevant or redundant samples, which can lead to computational…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-22 Boris Bergsma , Marta Brzezinska , Oleg V. Yazyev , Milos Cernak

Too Much Information Kills Information: A Clustering Perspective

Clustering is one of the most fundamental tools in the artificial intelligence area, particularly in the pattern recognition and learning theory. In this paper, we propose a simple, but novel approach for variance-based k-clustering tasks,…

Machine Learning · Computer Science 2020-09-17 Yicheng Xu , Vincent Chau , Chenchen Wu , Yong Zhang , Vassilis Zissimopoulos , Yifei Zou

A Mathematical Theory for Clustering in Metric Spaces

Clustering is one of the most fundamental problems in data analysis and it has been studied extensively in the literature. Though many clustering algorithms have been proposed, clustering theories that justify the use of these clustering…

Machine Learning · Computer Science 2016-02-22 Cheng-Shang Chang , Wanjiun Liao , Yu-Sheng Chen , Li-Heng Liou

Settling Time vs. Accuracy Tradeoffs for Clustering Big Data

We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly…

Machine Learning · Computer Science 2024-04-03 Andrew Draganov , David Saulpic , Chris Schwiegelshohn

Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization

K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets. To address this limitation, we propose a novel heuristic algorithm that leverages the Variable Neighborhood Search…

Machine Learning · Computer Science 2024-10-21 Ravil Mussabayev , Rustam Mussabayev

Mixed Data Clustering Survey and Challenges

The advent of the big data paradigm has transformed how industries manage and analyze information, ushering in an era of unprecedented data volume, velocity, and variety. Within this landscape, mixed-data clustering has become a critical…

Machine Learning · Computer Science 2025-12-04 Guillaume Guerard , Sonia Djebali

Data clustering: a fundamental method in data science and management

This paper explores the critical role of data clustering in data science, emphasizing its methodologies, tools, and diverse applications. Traditional techniques, such as partitional and hierarchical clustering, are analyzed alongside…

Artificial Intelligence · Computer Science 2025-10-07 Tai Dinh , Wong Hauchi , Daniil Lisik , Michal Koren , Dat Tran , Philip S. Yu , Joaquín Torres-Sospedra

A clustering algorithm for multivariate data streams with correlated components

Common clustering algorithms require multiple scans of all the data to achieve convergence, and this is prohibitive when large databases, with data arriving in streams, must be processed. Some algorithms to extend the popular K-means method…

Applications · Statistics 2017-12-22 Giacomo Aletti , Alessandra Micheletti