Related papers: Learning Balanced Mixtures of Discrete Distributio…

Separating populations with wide data: A spectral analysis

In this paper, we consider the problem of partitioning a small data sample drawn from a mixture of $k$ product distributions. We are interested in the case that individual features are of low average quality $\gamma$, and we want to use as…

Machine Learning · Statistics 2017-11-17 Avrim Blum, Amin Coja-Oghlan, Alan Frieze, Shuheng Zhou

Distributed Balanced Partitioning via Linear Embedding

Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems, e.g., in some cases, a big graph can be chopped into pieces that fit on one machine to be processed independently before stitching the…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-12-10 Kevin Aydin, MohammadHossein Bateni, Vahab Mirrokni

Semidefinite programming on population clustering: a global analysis

In this paper, we consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions. Our work is motivated by the application of clustering individuals according to their population…

Statistics Theory · Mathematics 2023-01-05 Shuheng Zhou

Distributed estimation from relative measurements of heterogeneous and uncertain quality

This paper studies the problem of estimation from relative measurements in a graph, in which a vector indexed over the nodes has to be reconstructed from pairwise measurements of differences between its components associated to nodes…

Systems and Control · Computer Science 2018-07-27 Chiara Ravazzi, Nelson P. K. Chan, Paolo Frasca

Learning Mixtures of Arbitrary Distributions over Large Discrete Domains

We give an algorithm for learning a mixture of {\em unstructured} distributions. This problem arises in various unsupervised learning scenarios, for example in learning {\em topic models} from a corpus of documents spanning several topics.…

Machine Learning · Computer Science 2013-09-19 Yuval Rabani, Leonard Schulman, Chaitanya Swamy

Semidefinite programming relaxations and debiasing for MAXCUT-based clustering

In this paper, we consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions in $\mathbb{R}^p$. We consider semidefinite programming relaxations of an integer quadratic program that is…

Machine Learning · Statistics 2025-03-19 Shuheng Zhou

Distributed Minimum Cut Approximation

We study the problem of computing approximate minimum edge cuts by distributed algorithms. We use a standard synchronous message passing model where in each round, $O(\log n)$ bits can be transmitted over each edge (a.k.a. the CONGEST…

Data Structures and Algorithms · Computer Science 2013-11-21 Mohsen Ghaffari, Fabian Kuhn

The Informativeness of K-Means for Learning Mixture Models

The learning of mixture models can be viewed as a clustering problem. Indeed, given data samples independently generated from a mixture of distributions, we often would like to find the {\it correct target clustering} of the samples…

Machine Learning · Statistics 2022-08-26 Zhaoqiang Liu, Vincent Y. F. Tan

Top-k data selection via distributed sample quantile inference

We consider the problem of determining the top-$k$ largest measurements from a dataset distributed among a network of $n$ agents with noisy communication links. We show that this scenario can be cast as a distributed convex optimization…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-02 Xu Zhang, Marcos Vasconcelos

On Learning Mixtures of Well-Separated Gaussians

We consider the problem of efficiently learning mixtures of a large number of spherical Gaussians, when the components of the mixture are well separated. In the most basic form of this problem, we are given samples from a uniform mixture of…

Data Structures and Algorithms · Computer Science 2017-11-01 Oded Regev, Aravindan Vijayaraghavan

On Computing Total Variation Distance Between Mixtures of Product Distributions

We study the problem of approximating the total variation distance between two mixtures of product distributions over an $n$-dimensional discrete domain. Given two mixtures $\mathbb{P}$ and $\mathbb{Q}$ with $k_1$ and $k_2$ product…

Data Structures and Algorithms · Computer Science 2026-05-06 Weiming Feng, Yucheng Fu, Minji Yang, Anqi Zhang

A Distribution Testing Approach to Clustering Distributions

We study the following distribution clustering problem: Given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are the same, and the two distributions associated with the two clusters…

Data Structures and Algorithms · Computer Science 2025-12-10 Gunjan Kumar, Yash Pote, Jonathan Scarlett

Near-optimal edge partitioning via intersecting families

We study the problem of edge partitioning, where the goal is to partition the edge set of a graph into several parts. The replication factor of a vertex $v$ is the number of parts that contain edges incident to $v$. The goal is to minimize…

Discrete Mathematics · Computer Science 2026-05-08 Alexander Yakunin, Andrey Kupavskii, Alexander Sushin, Stanislav Moiseev

Sampling Large Data on Graphs

We consider the problem of sampling from data defined on the nodes of a weighted graph, where the edge weights capture the data correlation structure. As shown recently, using spectral graph theory one can define a cut-off frequency for the…

Information Theory · Computer Science 2014-11-13 Ilan Shomorony, A. Salman Avestimehr

Learning Arbitrary Statistical Mixtures of Discrete Distributions

We study the problem of learning from unlabeled samples very general statistical mixture models on large finite sets. Specifically, the model to be learned, $\vartheta$, is a probability distribution over probability distributions $p$,…

Machine Learning · Computer Science 2015-04-13 Jian Li, Yuval Rabani, Leonard J. Schulman, Chaitanya Swamy

Multi-Dimensional Balanced Graph Partitioning via Projected Gradient Descent

Motivated by performance optimization of large-scale graph processing systems that distribute the graph across multiple machines, we consider the balanced graph partitioning problem. Compared to the previous work, we study the…

Data Structures and Algorithms · Computer Science 2019-02-19 Dmitrii Avdiukhin, Sergey Pupyrev, Grigory Yaroslavtsev

Efficient Distributed Algorithms for the $K$-Nearest Neighbors Problem

The $K$-nearest neighbors is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of $n$ data points with labels and a query point $p$, we want to assign a label to $p$ based on the labels…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-25 Reza Fathi, Anisur Rahaman Molla, Gopal Pandurangan

Learning Mixtures of Gaussians Using Diffusion Models

We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly\,log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample…

Machine Learning · Computer Science 2025-03-05 Khashayar Gatmiry, Jonathan Kelner, Holden Lee

The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians

We consider the problem of spherical Gaussian Mixture models with $k \geq 3$ components when the components are well separated. A fundamental previous result established that separation of $\Omega(\sqrt{\log k})$ is necessary and sufficient…

Machine Learning · Computer Science 2020-06-22 Jeongyeol Kwon, Constantine Caramanis

Semidefinite programming on population clustering: a local analysis

In this paper, we consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions. In particular, we design and analyze two computationally efficient algorithms to partition data…

Statistics Theory · Mathematics 2024-03-20 Shuheng Zhou