Related papers: Clustering of Big Data with Mixed Features

How to Use K-means for Big Data Clustering?

K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of…

Machine Learning · Computer Science 2023-11-27 Rustam Mussabayev , Nenad Mladenovic , Bassem Jarboui , Ravil Mussabayev

Clustering Mixed Numeric and Categorical Data: A Cluster Ensemble Approach

Clustering is a widely used technique in data mining applications for discovering patterns in underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either numeric or categorical attributes.…

Artificial Intelligence · Computer Science 2007-05-23 Zengyou He , Xiaofei Xu , Shengchun Deng

A sampling-based approach for efficient clustering in large datasets

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high-performance by evaluating distances of datapoints with a subset of the cluster centres. Our…

Machine Learning · Computer Science 2022-03-30 Georgios Exarchakis , Omar Oubari , Gregor Lenz

Survey of state-of-the-art mixed data clustering algorithms

Mixed data comprises both numeric and categorical features, and mixed datasets occur frequently in many domains, such as health, finance, and marketing. Clustering is often applied to mixed datasets to find structures and to group similar…

Machine Learning · Computer Science 2019-03-20 Amir Ahmad , Shehroz S. Khan

Probabilistic Partitive Partitioning (PPP)

Clustering is a NP-hard problem. Thus, no optimal algorithm exists, heuristics are applied to cluster the data. Heuristics can be very resource-intensive, if not applied properly. For substantially large data sets computational efficiencies…

Databases · Computer Science 2020-03-11 Mujahid Sultan

Comparative Analysis of Optimization Strategies for K-means Clustering in Big Data Contexts: A Review

This paper presents a comparative analysis of different optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with…

Machine Learning · Computer Science 2024-05-21 Ravil Mussabayev , Rustam Mussabayev

Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture Models with Outliers

Clustering is a widely used technique with a long and rich history in a variety of areas. However, most existing algorithms do not scale well to large datasets, or are missing theoretical guarantees of convergence. This paper introduces a…

Machine Learning · Statistics 2024-10-16 Yijia Zhou , Kyle A. Gallivan , Adrian Barbu

Practical Introduction to Clustering Data

Data clustering is an approach to seek for structure in sets of complex data, i.e., sets of "objects". The main objective is to identify groups of objects which are similar to each other, e.g., for classification. Here, an introduction to…

Data Analysis, Statistics and Probability · Physics 2016-02-17 Alexander K. Hartmann

A Computational Approach to Improving Fairness in K-means Clustering

The popular K-means clustering algorithm potentially suffers from a major weakness for further analysis or interpretation. Some cluster may have disproportionately more (or fewer) points from one of the subpopulations in terms of some…

Machine Learning · Computer Science 2026-02-10 Guancheng Zhou , Haiping Xu , Hongkang Xu , Chenyu Li , Donghui Yan

A Short Survey on Data Clustering Algorithms

With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial…

Data Structures and Algorithms · Computer Science 2015-12-01 Ka-Chun Wong

Explainable $k$-Means and $k$-Medians Clustering

Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a…

Machine Learning · Computer Science 2020-09-23 Sanjoy Dasgupta , Nave Frost , Michal Moshkovitz , Cyrus Rashtchian

Clustering Approaches for Mixed-Type Data: A Comparative Study

Clustering is widely used in unsupervised learning to find homogeneous groups of observations within a dataset. However, clustering mixed-type data remains a challenge, as few existing approaches are suited for this task. This study…

Machine Learning · Statistics 2025-11-26 Badih Ghattas , Alvaro Sanchez San-Benito

Quartile Clustering: A quartile based technique for Generating Meaningful Clusters

Clustering is one of the main tasks in exploratory data analysis and descriptive statistics where the main objective is partitioning observations in groups. Clustering has a broad range of application in varied domains like climate,…

Databases · Computer Science 2012-03-20 Saptarsi Goswami , Amlan Chakrabarti

Balanced k-Means and Min-Cut Clustering

Clustering is an effective technique in data mining to generate groups that are the matter of interest. Among various clustering approaches, the family of k-means algorithms and min-cut algorithms gain most popularity due to their…

Machine Learning · Computer Science 2014-11-25 Xiaojun Chang , Feiping Nie , Zhigang Ma , Yi Yang

Influence of Swarm Intelligence in Data Clustering Mechanisms

Data mining focuses on discovering interesting, non-trivial and meaningful information from large datasets. Data clustering is one of the unsupervised and descriptive data mining task which group data based on similarity features and…

Neural and Evolutionary Computing · Computer Science 2023-05-09 Pitawelayalage Dasun Dileepa Pitawela , Gamage Upeksha Ganegoda

Micro-Clustering: Finding Small Clusters in Large Diversity

We address the problem of un-supervised soft-clustering called micro-clustering. The aim of the problem is to enumerate all groups composed of records strongly related to each other, while standard clustering methods separate records at…

Data Structures and Algorithms · Computer Science 2016-06-07 Takeaki Uno , Hiroki Maegawa , Takanobu Nakahara , Yukinobu Hamuro , Ryo Yoshinaka , Makoto Tatsuta

A clustering algorithm for multivariate data streams with correlated components

Common clustering algorithms require multiple scans of all the data to achieve convergence, and this is prohibitive when large databases, with data arriving in streams, must be processed. Some algorithms to extend the popular K-means method…

Applications · Statistics 2017-12-22 Giacomo Aletti , Alessandra Micheletti

A hybrid clustering algorithm for data mining

Data clustering is a process of arranging similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is better than among groups. In this paper a hybrid clustering…

Databases · Computer Science 2012-05-25 Ravindra Jain

Mixed Data Clustering Survey and Challenges

The advent of the big data paradigm has transformed how industries manage and analyze information, ushering in an era of unprecedented data volume, velocity, and variety. Within this landscape, mixed-data clustering has become a critical…

Machine Learning · Computer Science 2025-12-04 Guillaume Guerard , Sonia Djebali

Clust-Splitter - an Efficient Nonsmooth Optimization-Based Algorithm for Clustering Large Datasets

Clustering is a fundamental task in data mining and machine learning, particularly for analyzing large-scale data. In this paper, we introduce Clust-Splitter, an efficient algorithm based on nonsmooth optimization, designed to solve the…

Machine Learning · Computer Science 2026-03-19 Jenni Lampainen , Kaisa Joki , Napsu Karmitsa , Marko M. Mäkelä