Related papers: TBDFiltering: Sample-Efficient Tree-Based Data Fil…

Large Language Model-guided Document Selection

Large Language Model (LLM) pre-training exhausts an ever growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts…

Computation and Language · Computer Science · 2024-06-10 · Xiang Kong, Tom Gunter, Ruoming Pang

Information-Theoretic Generative Clustering of Documents

We present generative clustering (GC) for clustering a set of documents, $\mathrm{X}$, by using texts $\mathrm{Y}$ generated by large language models (LLMs) instead of by clustering the original documents $\mathrm{X}$. Because LLMs…

Machine Learning · Computer Science · 2024-12-19 · Xin Du, Kumiko Tanaka-Ishii

Text Clustering as Classification with LLMs

Text clustering serves as a fundamental technique for organizing and interpreting unstructured textual data, particularly in contexts where manual annotation is prohibitively costly. With the rapid advancement of Large Language Models…

Computation and Language · Computer Science · 2025-10-08 · Chen Huang, Guoxiu He

Experimental Estimation of Number of Clusters Based on Cluster Quality

Text Clustering is a text mining technique which divides the given set of text documents into significant clusters. It is used for organizing a huge number of text documents into a well-organized form. In the majority of the clustering…

Information Retrieval · Computer Science · 2015-03-12 · G. Hannah Grace, Kalyani Desikan

Model-Based Hierarchical Clustering

We present an approach to model-based hierarchical clustering by formulating an objective function based on a Bayesian analysis. This model organizes the data into a cluster hierarchy while specifying a complex feature-set partitioning that…

Machine Learning · Computer Science · 2013-01-18 · Shivakumar Vaithyanathan, Byron E Dom

Text Clustering with Large Language Model Embeddings

Text clustering is an important method for organising the increasing volume of digital content, aiding in the structuring and discovery of hidden patterns in uncategorised data. The effectiveness of text clustering largely depends on the…

Computation and Language · Computer Science · 2024-12-06 · Alina Petukhova, João P. Matos-Carvalho, Nuno Fachada

Automated Document Indexing via Intelligent Hierarchical Clustering: A Novel Approach

With the rising quantity of textual data available in electronic format, organizing it has become a highly challenging task. In the present paper, we explore a document organization framework that exploits an intelligent hierarchical…

Information Retrieval · Computer Science · 2015-04-02 · Rajendra Kumar Roul, Shubham Rohan Asthana, Sanjay Kumar Sahay

A comparison of two suffix tree-based document clustering algorithms

Document clustering is an unsupervised approach extensively used to navigate, filter, summarize and manage large document repositories such as the World Wide Web (WWW). Recently, the focus in this domain has shifted from traditional…

Information Retrieval · Computer Science · 2012-01-11 · Muhammad Rafi, M. Maujood, M. M. Fazal, S. M. Ali

Document Clustering with K-tree

This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document…

Information Retrieval · Computer Science · 2010-01-07 · Christopher M. De Vries, Shlomo Geva

Balanced Data Sampling for Language Model Training with Clustering

Data plays a fundamental role in the training of Large Language Models (LLMs). While attention has been paid to the collection and composition of datasets, determining the data sampling strategy in training remains an open question. Most…

Computation and Language · Computer Science · 2024-06-04 · Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, Xipeng Qiu

Optimized Algorithms for Text Clustering with LLM-Generated Constraints

Clustering is a fundamental tool that has garnered significant interest across a wide range of applications including text analysis. To improve clustering accuracy, many researchers have incorporated background knowledge, typically in the…

Machine Learning · Computer Science · 2026-01-19 · Chaoqi Jia, Weihong Wu, Longkun Guo, Zhigang Lu, Chao Chen, Kok-Leong Ong

Interpretable Structure-aware Document Encoders with Hierarchical Attention

We propose a method to create document representations that reflect their internal structure. We modify Tree-LSTMs to hierarchically merge basic elements such as words and sentences into blocks of increasing complexity. Our Structure…

Computation and Language · Computer Science · 2019-10-08 · Khalil Mrini, Claudiu Musat, Michael Baeriswyl, Martin Jaggi

From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning

Effective and controllable data selection is critical for LLM instruction tuning, especially with massive open-source datasets. Existing approaches primarily rely on instance-level quality scores, or diversity metrics based on embedding…

Computation and Language · Computer Science · 2026-01-21 · Zihan Niu, Wenping Hu, Junmin Chen, Xiyue Wang, Tong Xu, Ruiming Tang

LASER: Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy

Recent work shows that post-training datasets for LLMs can be substantially downsampled without noticeably deteriorating performance. However, data selection often incurs high computational costs or is limited to narrow domains. In this…

Computation and Language · Computer Science · 2025-09-25 · Paramita Mirza, Lucas Weber, Fabian Küch

Explainable $k$-Means and $k$-Medians Clustering

Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a…

Machine Learning · Computer Science · 2020-09-23 · Sanjoy Dasgupta, Nave Frost, Michal Moshkovitz, Cyrus Rashtchian

ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation

Text clustering is a fundamental task in natural language processing, yet traditional clustering algorithms with pre-trained embeddings often struggle in domain-specific contexts without costly fine-tuning. Large language models (LLMs)…

Computation and Language · Computer Science · 2025-12-05 · Yiming Xu, Yuan Yuan, Vijay Viswanathan, Graham Neubig

Document clustering with evolved multiword search queries

Text clustering holds significant value across various domains due to its ability to identify patterns and group related information. Current approaches which rely heavily on a computed similarity measure between documents are often limited…

Information Retrieval · Computer Science · 2025-04-09 · Laurence Hirsch, Robin Hirsch, Bayode Ogunleye

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and…

Machine Learning · Computer Science · 2024-02-28 · Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

Tree Index: A New Cluster Evaluation Technique

We introduce a cluster evaluation technique called Tree Index. Our Tree Index algorithm aims at describing the structural information of the clustering rather than the quantitative format of cluster-quality indexes (where the representation…

Machine Learning · Computer Science · 2020-03-25 · A. H. Beg, Md Zahidul Islam, Vladimir Estivill-Castro

An Analytical Approach to Document Clustering Based on Internal Criterion Function

Fast, high-quality document clustering is an important task in organizing information and search engine results obtained from user queries, and in enhancing web crawling and information retrieval. With the large amount of data available and with a…

Information Retrieval · Computer Science · 2010-03-11 · Alok Ranjan, Harish Verma, Eatesh Kandpal, Joydip Dhar