Related papers: Text Mining Through Label Induction Grouping Algor…

LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels

In this paper, we propose an intuitive, training-free and label-free method for intent clustering in conversational search. Current approaches to short text clustering use LLM-generated pseudo-labels to enrich text representations or to…

Computation and Language · Computer Science 2026-02-26 I-Fan Lin, Faegheh Hasibi, Suzan Verberne

Text Clustering as Classification with LLMs

Text clustering serves as a fundamental technique for organizing and interpreting unstructured textual data, particularly in contexts where manual annotation is prohibitively costly. With the rapid advancement of Large Language Models…

Computation and Language · Computer Science 2025-10-08 Chen Huang, Guoxiu He

Tango: Taming Visual Signals for Efficient Video Large Language Models

Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen

Optimized Algorithms for Text Clustering with LLM-Generated Constraints

Clustering is a fundamental tool that has garnered significant interest across a wide range of applications including text analysis. To improve clustering accuracy, many researchers have incorporated background knowledge, typically in the…

Machine Learning · Computer Science 2026-01-19 Chaoqi Jia, Weihong Wu, Longkun Guo, Zhigang Lu, Chao Chen, Kok-Leong Ong

Vocabulary-Defined Semantics: Latent Space Clustering for Improving In-Context Learning

In-context learning enables language models (LM) to adapt to downstream data or tasks by incorporating few samples as demonstrations within the prompts. It offers strong performance without the expense of fine-tuning. However, the…

Computation and Language · Computer Science 2024-10-15 Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

Indexing by Latent Dirichlet Allocation and Ensemble Model

The contribution of this paper is two-fold. First, we present Indexing by Latent Dirichlet Allocation (LDI), an automatic document indexing method. The probability distributions in LDI utilize those in Latent Dirichlet Allocation (LDA), a…

Information Retrieval · Computer Science 2014-12-12 Yanshan Wang, Jae-Sung Lee, In-Chan Choi

Text Mining using Nonnegative Matrix Factorization and Latent Semantic Analysis

Text clustering is arguably one of the most important topics in modern data mining. Nevertheless, text data require tokenization which usually yields a very large and highly sparse term-document matrix, which is usually difficult to process…

Machine Learning · Computer Science 2020-02-25 Ali Hassani, Amir Iranmanesh, Najme Mansouri

Text Clustering with Large Language Model Embeddings

Text clustering is an important method for organising the increasing volume of digital content, aiding in the structuring and discovery of hidden patterns in uncategorised data. The effectiveness of text clustering largely depends on the…

Computation and Language · Computer Science 2024-12-06 Alina Petukhova, João P. Matos-Carvalho, Nuno Fachada

DINGO: Constrained Inference for Diffusion LLMs

Diffusion LLMs have emerged as a promising alternative to conventional autoregressive LLMs, offering significant potential for improved runtime efficiency. However, existing diffusion models lack the ability to provably enforce…

Machine Learning · Computer Science 2025-05-30 Tarun Suresh, Debangshu Banerjee, Shubham Ugare, Sasa Misailovic, Gagandeep Singh

Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs

This paper proposes a Clustering, Labeling, then Augmenting framework that significantly enhances performance in Semi-Supervised Text Classification (SSTC) tasks, effectively addressing the challenge of vast datasets with limited labeled…

Computation and Language · Computer Science 2024-12-30 Shan Zhong, Jiahao Zeng, Yongxin Yu, Bohong Lin

Information retrieval for label noise document ranking by bag sampling and group-wise loss

Long Document retrieval (DR) has always been a tremendous challenge for reading comprehension and information retrieval. The pre-training model has achieved good results in the retrieval stage and Ranking for long documents in recent years.…

Information Theory · Computer Science 2022-03-15 Chunyu Li, Jiajia Ding, Xing hu, Fan Wang

Experimental Estimation of Number of Clusters Based on Cluster Quality

Text Clustering is a text mining technique which divides the given set of text documents into significant clusters. It is used for organizing a huge number of text documents into a well-organized form. In the majority of the clustering…

Information Retrieval · Computer Science 2015-03-12 G. Hannah Grace, Kalyani Desikan

Towards Easier and Faster Sequence Labeling for Natural Language Processing: A Search-based Probabilistic Online Learning Framework (SAPO)

There are two major approaches for sequence labeling. One is the probabilistic gradient-based methods such as conditional random fields (CRF) and neural networks (e.g., RNN), which have high accuracy but drawbacks: slow training, and no…

Machine Learning · Computer Science 2018-11-20 Xu Sun, Shuming Ma, Yi Zhang, Xuancheng Ren

Human-interpretable clustering of short-text using large language models

Clustering short text is a difficult problem, due to the low word co-occurrence between short text documents. This work shows that large language models (LLMs) can overcome the limitations of traditional clustering approaches by generating…

Computation and Language · Computer Science 2025-04-08 Justin K. Miller, Tristram J. Alexander

Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics

Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix…

Databases · Computer Science 2025-08-27 Jungwoo Kim, Minsang Kim, Jaeheon Lee, Chanwoo Moon, Heejin Kim, Taeho Hwang, Woosuk Chung, Yeseong Kim, Sungjin Lee

A Learned Index for Exact Similarity Search in Metric Spaces

Indexing is an effective way to support efficient query processing in large databases. Recently the concept of learned index, which replaces or complements traditional index structures with machine learning models, has been actively…

Databases · Computer Science 2022-08-01 Yao Tian, Tingyun Yan, Xi Zhao, Kai Huang, Xiaofang Zhou

Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLMs

Scaling test-time computation (generating and analyzing multiple or sequential outputs for a single input) has become a promising strategy for improving the reliability and quality of large language models (LLMs), as evidenced by advances…

Computation and Language · Computer Science 2025-06-03 Sungjae Lee, Hoyoung Kim, Jeongyeon Hwang, Eunhyeok Park, Jungseul Ok

Information-Theoretic Generative Clustering of Documents

We present generative clustering (GC) for clustering a set of documents, $\mathrm{X}$, by using texts $\mathrm{Y}$ generated by large language models (LLMs) instead of by clustering the original documents $\mathrm{X}$. Because LLMs…

Machine Learning · Computer Science 2024-12-19 Xin Du, Kumiko Tanaka-Ishii

Enhancing Lexicon-Based Text Embeddings with Large Language Models

Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first lexicon-based embeddings (LENS) leveraging…

Computation and Language · Computer Science 2026-03-20 Yibin Lei, Tao Shen, Yu Cao, Andrew Yates

Divide, Cache, Conquer: Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification

We introduce a method for efficient multi-label text classification with large language models (LLMs), built on reformulating classification tasks as sequences of dichotomic (yes/no) decisions. Instead of generating all labels in a single…

Computation and Language · Computer Science 2025-11-07 Mikołaj Langner, Jan Eliasz, Ewa Rudnicka, Jan Kocoń