Related papers: Data-Efficient Learning via Clustering-Based Sensi…

Too Much Information Kills Information: A Clustering Perspective

Clustering is one of the most fundamental tools in artificial intelligence, particularly in pattern recognition and learning theory. In this paper, we propose a simple but novel approach for variance-based k-clustering tasks,…

Machine Learning · Computer Science 2020-09-17 Yicheng Xu, Vincent Chau, Chenchen Wu, Yong Zhang, Vassilis Zissimopoulos, Yifei Zou

Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions

In this paper, we investigate the learning-augmented $k$-median clustering problem, which aims to improve the performance of traditional clustering algorithms by preprocessing the point set with a predictor of error rate $\alpha \in [0,1)$.…

Data Structures and Algorithms · Computer Science 2026-03-12 Kangke Cheng, Shihong Song, Guanlin Mo, Hu Ding

Meta-Learning to Cluster

Clustering is one of the most fundamental and widespread techniques in exploratory data analysis. Yet, the basic approach to clustering has not really changed: a practitioner hand-picks a task-specific clustering loss to optimize and fit…

Machine Learning · Computer Science 2019-11-01 Yibo Jiang, Nakul Verma

Differentiable Deep Clustering with Cluster Size Constraints

Clustering is a fundamental unsupervised learning approach. Many clustering algorithms, such as $k$-means, rely on the Euclidean distance as a similarity measure, which is often not the most relevant metric for high-dimensional data…

Machine Learning · Computer Science 2019-10-22 Aude Genevay, Gabriel Dulac-Arnold, Jean-Philippe Vert

Selective Embedding for Deep Learning

Deep learning has revolutionized many industries by enabling models to automatically learn complex patterns from raw data, reducing dependence on manual feature engineering. However, deep learning algorithms are sensitive to input data, and…

Machine Learning · Computer Science 2025-07-21 Mert Sehri, Zehui Hua, Francisco de Assis Boldt, Patrick Dumond

Foundation Model Makes Clustering A Better Initialization For Cold-Start Active Learning

Active learning selects the most informative samples from the unlabelled dataset to annotate in the context of a limited annotation budget. While numerous methods have been proposed for subsequent sample selection based on an initialized…

Machine Learning · Computer Science 2024-03-28 Han Yuan, Chuan Hong

A sampling-based approach for efficient clustering in large datasets

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high performance by evaluating distances of datapoints with a subset of the cluster centres. Our…

Machine Learning · Computer Science 2022-03-30 Georgios Exarchakis, Omar Oubari, Gregor Lenz

Average Sensitivity of Hierarchical $k$-Median Clustering

Hierarchical clustering is a widely used method for unsupervised learning with numerous applications. However, in the application of modern algorithms, the datasets studied are usually large and dynamic. If the hierarchical clustering is…

Machine Learning · Computer Science 2025-07-15 Shijie Li, Weiqiang He, Ruobing Bai, Pan Peng

Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes…

Computation and Language · Computer Science 2024-09-18 Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee

A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation

Model performance evaluation is a critical and expensive task in machine learning and computer vision. Without clear guidelines, practitioners often estimate model accuracy using a one-time completely random selection of the data. However,…

Computer Vision and Pattern Recognition · Computer Science 2024-07-19 Riccardo Fogliato, Pratik Patil, Mathew Monfort, Pietro Perona

LASER: Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy

Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this…

Computation and Language · Computer Science 2025-09-25 Paramita Mirza, Lucas Weber, Fabian Küch

A Weighted K-Center Algorithm for Data Subset Selection

The success of deep learning hinges on enormous data and large models, which require labor-intensive annotations and heavy computation costs. Subset selection is a fundamental problem that can play a key role in identifying smaller portions…

Machine Learning · Computer Science 2023-12-19 Srikumar Ramalingam, Pranjal Awasthi, Sanjiv Kumar

Parameterized Complexity of Feature Selection for Categorical Data Clustering

We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known…

Data Structures and Algorithms · Computer Science 2021-08-20 Sayan Bandyapadhyay, Fedor V. Fomin, Petr A. Golovach, Kirill Simonov

Deep Kernel Learning for Clustering

We propose a deep learning approach for discovering kernels tailored to identifying clusters over sample data. Our neural network produces sample embeddings that are motivated by (and are at least as expressive as) spectral clustering. Our…

Machine Learning · Computer Science 2020-01-03 Chieh Wu, Zulqarnain Khan, Yale Chang, Stratis Ioannidis, Jennifer Dy

Clustering evolving data using kernel-based methods

In this thesis, we propose several modelling strategies to tackle evolving data in different contexts. In the framework of static clustering, we start by introducing a soft kernel spectral clustering (SKSC) algorithm, which can better deal…

Social and Information Networks · Computer Science 2014-11-24 Rocco Langone

Effective Sampling: Fast Segmentation Using Robust Geometric Model Fitting

Identifying the underlying models in a set of data points contaminated by noise and outliers leads to a highly complex multi-model fitting problem. This problem can be posed as a clustering problem by the projection of higher order…

Computer Vision and Pattern Recognition · Computer Science 2018-08-01 Ruwan Tennakoon, Alireza Sadri, Reza Hoseinnezhad, Alireza Bab-Hadiashar

TBDFiltering: Sample-Efficient Tree-Based Data Filtering

The quality of machine learning models depends heavily on their training data. Selecting high-quality, diverse training sets for large language models (LLMs) is a difficult task, due to the lack of cheap and reliable quality metrics. While…

Machine Learning · Computer Science 2026-01-30 Robert Istvan Busa-Fekete, Julian Zimmert, Anne Xiangyi Zheng, Claudio Gentile, Andras Gyorgy

Explainable $k$-Means and $k$-Medians Clustering

Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a…

Machine Learning · Computer Science 2020-09-23 Sanjoy Dasgupta, Nave Frost, Michal Moshkovitz, Cyrus Rashtchian

Learning to Select Pivotal Samples for Meta Re-weighting

Sample re-weighting strategies provide a promising mechanism to deal with imperfect training data in machine learning, such as noisily labeled or class-imbalanced data. One such strategy involves formulating a bi-level optimization problem…

Machine Learning · Computer Science 2023-02-10 Yinjun Wu, Adam Stein, Jacob Gardner, Mayur Naik

Clustering by Attention: Leveraging Prior Fitted Transformers for Data Partitioning

Clustering is a core task in machine learning with wide-ranging applications in data mining and pattern recognition. However, its unsupervised nature makes it inherently challenging. Many existing clustering algorithms suffer from critical…

Machine Learning · Computer Science 2025-07-29 Ahmed Shokry, Ayman Khalafallah