Related papers: Distributed Silhouette Algorithm: Evaluating Clust…

Scalable Distributed Approximation of Internal Measures for Clustering Evaluation

The most widely used internal measure for clustering evaluation is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, which is clearly unfeasible for massive datasets. Surprisingly,…

Data Structures and Algorithms · Computer Science 2021-01-21 Federico Altieri, Andrea Pietracaprina, Geppino Pucci, Fabio Vandin

When Does the Silhouette Score Work? A Comprehensive Study in Network Clustering

Selecting the number of communities is a fundamental challenge in network clustering. The silhouette score offers an intuitive, model-free criterion that balances within-cluster cohesion and between-cluster separation. Despite its widespread…

Social and Information Networks · Computer Science 2026-01-01 Zongyue Teng, Jun Yan, Dandan Liu, Panpan Zhang

Parallel D2-Clustering: Large-Scale Clustering of Discrete Distributions

The discrete distribution clustering algorithm, namely D2-clustering, has demonstrated its usefulness in image classification and annotation where each object is represented by a bag of weighted vectors. The high computational complexity of…

Machine Learning · Computer Science 2013-02-07 Yu Zhang, James Z. Wang, Jia Li

Distributed Spatial Data Clustering as a New Approach for Big Data Analysis

In this paper we propose a new approach for Big Data mining and analysis. This new approach works well on distributed datasets and deals with data clustering task of the analysis. The approach consists of two main phases, the first phase…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-05 Malika Bendechache, Nhien-An Le-Khac, M-Tahar Kechadi

Efficient Large Scale Clustering based on Data Partitioning

Clustering techniques are very attractive for extracting and identifying patterns in datasets. However, their application to very large spatial datasets presents numerous challenges such as high-dimensionality data, heterogeneity, and high…

Databases · Computer Science 2018-02-27 Malika Bendechache, Nhien-An Le-Khac, M-Tahar Kechadi

Modeling Scalability of Distributed Machine Learning

Present day machine learning is computationally intensive and processes large amounts of data. It is implemented in a distributed fashion in order to address these scalability issues. The work is parallelized across a number of computing…

Machine Learning · Computer Science 2017-03-28 Alexander Ulanov, Andrey Simanovsky, Manish Marwah

Writing summary for the state-of-the-art methods for big data clustering in distributed environment

Big Data processing systems handle huge unstructured and structured data to store, process, and analyze through cluster analysis which helps in identifying unseen patterns to find the relationships between them. Clustering analysis over the…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-11 Dipesh Gyawali

Analysis of Different Approaches of Parallel Block Processing for K-Means Clustering Algorithm

Distributed Computation has been a recent trend in engineering research. Parallel Computation is widely used in different areas of Data Mining, Image Processing, Simulating Models, Aerodynamics and so forth. One of the major uses of…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-28 C Rashmi

Fast communication-efficient spectral clustering over distributed data

The last decades have seen a surge of interests in distributed computing thanks to advances in clustered computing and big data technology. Existing distributed algorithms typically assume all the data are already in one place, and…

Machine Learning · Computer Science 2019-05-07 Donghui Yan, Yingjie Wang, Jin Wang, Guodong Wu, Honggang Wang

A parallel sampling based clustering

The problem of automatically clustering data is an age-old problem. People have created numerous algorithms to tackle this problem. The execution time of any of these algorithms grows with the number of input points and the number of cluster…

Machine Learning · Computer Science 2014-12-08 Aditya AV Sastry, Kalyan Netti

Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation

Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-01 Subhadip Mitra

Clustering of Big Data with Mixed Features

Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require a priori the unknown number of clusters. We…

Machine Learning · Statistics 2020-11-13 Joshua Tobin, Mimi Zhang

A Stochastic Large-scale Machine Learning Algorithm for Distributed Features and Observations

As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine…

Machine Learning · Statistics 2019-12-10 Biyi Fang, Diego Klabjan

Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means

This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology. The proposed method efficiently integrates parallel processing, stochastic sampling, and competitive optimization to…

Machine Learning · Computer Science 2024-03-28 Rustam Mussabayev, Ravil Mussabayev

On a Distributed Approach for Density-based Clustering

Efficient extraction of useful knowledge from these data is still a challenge, mainly when the data is distributed, heterogeneous and of different quality depending on its corresponding local infrastructure. To reduce the overhead cost,…

Databases · Computer Science 2017-04-17 Nhien-An Le-Khac, M-Tahar Kechadi

A Short Survey on Data Clustering Algorithms

With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial…

Data Structures and Algorithms · Computer Science 2015-12-01 Ka-Chun Wong

Revisiting Silhouette Aggregation

Silhouette coefficient is an established internal clustering evaluation measure that produces a score per data point, assessing the quality of its clustering assignment. To assess the quality of the clustering of the whole dataset, the…

Machine Learning · Computer Science 2024-06-25 John Pavlopoulos, Georgios Vardakas, Aristidis Likas

Efficient techniques for mining spatial databases

Clustering is one of the major tasks in data mining. In the last few years, Clustering of spatial data has received a lot of research attention. Spatial databases are components of many advanced information systems like geographic…

Databases · Computer Science 2012-06-04 Mohamed A. El-Zawawy

Distributed Kernel K-Means for Large Scale Clustering

Clustering samples according to an effective metric and/or vector space representation is a challenging unsupervised learning task with a wide spectrum of applications. Among several clustering algorithms, k-means and its kernelized version…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-10 Marco Jacopo Ferrarotti, Sergio Decherchi, Walter Rocchia

Accuracy Evaluation of Overlapping and Multi-resolution Clustering Algorithms on Large Datasets

Performance of clustering algorithms is evaluated with the help of accuracy metrics. There is a great diversity of clustering algorithms, which are key components of many data analysis and exploration systems. However, there exist only few…

Data Structures and Algorithms · Computer Science 2019-02-18 Artem Lutov, Mourad Khayati, Philippe Cudré-Mauroux