Related papers: Multiple Sample Clustering

Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture Models with Outliers

Clustering is a widely used technique with a long and rich history in a variety of areas. However, most existing algorithms do not scale well to large datasets, or are missing theoretical guarantees of convergence. This paper introduces a…

Machine Learning · Statistics 2024-10-16 Yijia Zhou , Kyle A. Gallivan , Adrian Barbu

A simulation study of cluster search algorithms in data set generated by Gaussian mixture models

Determining the number of clusters is a fundamental issue in data clustering. Several algorithms have been proposed, including centroid-based algorithms using the Euclidean distance and model-based algorithms using a mixture of probability…

Machine Learning · Computer Science 2024-07-30 Ryosuke Motegi , Yoichi Seki

An Agglomerative Clustering of Simulation Output Distributions Using Regularized Wasserstein Distance

Using statistical learning methods to analyze stochastic simulation outputs can significantly enhance decision-making by uncovering relationships between different simulated systems and between a system's inputs and outputs. We focus on…

Methodology · Statistics 2026-05-28 Mohammadmahdi Ghasemloo , David J. Eckman

Bayesian Consensus Clustering

The task of clustering a set of objects based on multiple sources of data arises in several modern applications. We propose an integrative statistical model that permits a separate clustering of the objects for each data source. These…

Machine Learning · Statistics 2015-12-01 Eric F. Lock , David B. Dunson

Wasserstein $K$-means for clustering probability distributions

Clustering is an important exploratory data analysis technique to group objects based on their similarity. The widely used $K$-means clustering method relies on some notion of distance to partition data into a fewer number of groups. In the…

Machine Learning · Statistics 2022-10-14 Yubo Zhuang , Xiaohui Chen , Yun Yang

The Exploitation of Distance Distributions for Clustering

Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited in the sense that prior knowledge is used. In cluster analysis,…

Machine Learning · Computer Science 2021-08-24 Michael C. Thrun

Linear Time Clustering for High Dimensional Mixtures of Gaussian Clouds

Clustering mixtures of Gaussian distributions is a fundamental and challenging problem that is ubiquitous in various high-dimensional data processing tasks. While state-of-the-art work on learning Gaussian mixture models has focused…

Machine Learning · Computer Science 2018-03-05 Dan Kushnir , Shirin Jalali , Iraj Saniee

Differentially Private Algorithms for Clustering with Stability Assumptions

We study the problem of differentially private clustering under input-stability assumptions. Despite the ever-growing volume of works on differential privacy in general and differentially private clustering in particular, only three works…

Machine Learning · Computer Science 2021-12-20 Moshe Shechner

Learning Mixtures of Gaussians using the k-means Algorithm

One of the most popular algorithms for clustering in Euclidean space is the $k$-means algorithm; $k$-means is difficult to analyze mathematically, and few theoretical guarantees are known about it, particularly when the data is…

Machine Learning · Computer Science 2009-12-02 Kamalika Chaudhuri , Sanjoy Dasgupta , Andrea Vattani

A Short Survey on Data Clustering Algorithms

With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial…

Data Structures and Algorithms · Computer Science 2015-12-01 Ka-Chun Wong

Schema matching using Gaussian mixture models with Wasserstein distance

Gaussian mixture models find their place as a powerful tool, mostly in the clustering problem, but with proper preparation also in feature extraction, pattern recognition, image segmentation and in general machine learning. When faced with…

Machine Learning · Computer Science 2022-04-01 Mateusz Przyborowski , Mateusz Pabiś , Andrzej Janusz , Dominik Ślęzak

Differentially-Private Clustering of Easy Instances

Clustering is a fundamental problem in data analysis. In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points. Despite significant research progress, the…

Machine Learning · Computer Science 2021-12-30 Edith Cohen , Haim Kaplan , Yishay Mansour , Uri Stemmer , Eliad Tsfadia

Multilevel Clustering via Wasserstein Means

We propose a novel approach to the problem of multilevel clustering, which aims to simultaneously partition data in each group and discover grouping patterns among groups in a potentially large hierarchically structured corpus of data. Our…

Machine Learning · Statistics 2017-06-14 Nhat Ho , XuanLong Nguyen , Mikhail Yurochkin , Hung Hai Bui , Viet Huynh , Dinh Phung

Wasserstein $k$-Centers Clustering for Distributional Data

We develop a novel clustering method for distributional data, where each data point is regarded as a probability distribution on the real line. For distributional data, it has been challenging to develop a clustering method that utilizes…

Methodology · Statistics 2025-06-24 Ryo Okano , Masaaki Imaizumi

Predictive K-means with local models

Supervised classification can be effective for prediction but sometimes weak on interpretability or explainability (XAI). Clustering, on the other hand, tends to isolate categories or profiles that can be meaningful but there is no…

Machine Learning · Computer Science 2021-04-27 Vincent Lemaire , Oumaima Alaoui Ismaili , Antoine Cornuéjols , Dominique Gay

Bayesian Distance Clustering

Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density.…

Machine Learning · Statistics 2019-06-27 Leo L Duan , David B Dunson

On Efficient Multilevel Clustering via Wasserstein Distances

We propose a novel approach to the problem of multilevel clustering, which aims to simultaneously partition data in each group and discover grouping patterns among groups in a potentially large hierarchically structured corpus of data. Our…

Machine Learning · Statistics 2021-05-26 Viet Huynh , Nhat Ho , Nhan Dam , XuanLong Nguyen , Mikhail Yurochkin , Hung Bui , Dinh Phung

Clustering -- Basic concepts and methods

We review clustering as an analysis tool and the underlying concepts from an introductory perspective. What is clustering and how can clusterings be realised programmatically? How can data be represented and prepared for a clustering task?…

Machine Learning · Computer Science 2022-12-05 Jan-Oliver Felix Kapp-Joswig , Bettina G. Keller

Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures

We consider the problem of clustering data points in high dimensions, i.e. when the number of data points may be much smaller than the number of dimensions. Specifically, we consider a Gaussian mixture model (GMM) with non-spherical…

Statistics Theory · Mathematics 2014-06-10 Martin Azizyan , Aarti Singh , Larry Wasserman

Spectral Clustering for Discrete Distributions

The discrete distribution is often used to describe complex instances in machine learning, such as images, sequences, and documents. Traditionally, clustering of discrete distributions (D2C) has been approached using Wasserstein barycenter…

Machine Learning · Computer Science 2024-08-19 Zixiao Wang , Dong Qiao , Jicong Fan