Related papers: Exclusive Topic Modeling

Concentrated Document Topic Model

We propose a Concentrated Document Topic Model (CDTM) for unsupervised text classification, which is able to produce a concentrated and sparse document topic distribution. In particular, an exponential entropy penalty is imposed on the…

Machine Learning · Statistics 2021-02-10 Hao Lei , Ying Chen

Topic Modeling in Embedding Spaces

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the Embedded Topic…

Information Retrieval · Computer Science 2019-07-12 Adji B. Dieng , Francisco J. R. Ruiz , David M. Blei

Topic Modeling over Short Texts by Incorporating Word Embeddings

Inferring topics from the overwhelming amount of short texts becomes a critical but challenging task for many content analysis tasks, such as content characterization, user interest profiling, and emerging topic detection. Existing methods such…

Computation and Language · Computer Science 2016-09-28 Jipeng Qiang , Ping Chen , Tong Wang , Xindong Wu

Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural…

Computation and Language · Computer Science 2025-08-01 Carolina Zheng , Nicolas Beltran-Velez , Sweta Karlekar , Claudia Shi , Achille Nazaret , Asif Mallik , Amir Feder , David M. Blei

Keyword Assisted Embedded Topic Model

By illuminating latent structures in a corpus of text, topic models are an essential tool for categorizing, summarizing, and exploring large collections of documents. Probabilistic topic models, such as latent Dirichlet allocation (LDA),…

Information Retrieval · Computer Science 2021-12-07 Bahareh Harandizadeh , J. Hunter Priniski , Fred Morstatter

LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization

We introduce LLM-Lasso, a novel framework that leverages large language models (LLMs) to guide feature selection in Lasso $\ell_1$ regression. Unlike traditional methods that rely solely on numerical data, LLM-Lasso incorporates…

Machine Learning · Computer Science 2025-08-13 Erica Zhang , Ryunosuke Goto , Naomi Sagan , Jurik Mutter , Nick Phillips , Ash Alizadeh , Kangwook Lee , Jose Blanchet , Mert Pilanci , Robert Tibshirani

Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs

Topic modeling is a powerful technique for uncovering hidden themes within a collection of documents. However, the effectiveness of traditional topic models often relies on sufficient word co-occurrence, which is lacking in short texts.…

Computation and Language · Computer Science 2024-10-22 Pritom Saha Akash , Kevin Chen-Chuan Chang

A Joint Learning Approach for Semi-supervised Neural Topic Modeling

Topic models are some of the most popular ways to represent textual data in an interpretable manner. Recently, advances in deep generative models, specifically auto-encoding variational Bayes (AEVB), have led to the introduction of…

Information Retrieval · Computer Science 2022-04-08 Jeffrey Chiu , Rajat Mittal , Neehal Tumma , Abhishek Sharma , Finale Doshi-Velez

Supervised topic models for clinical interpretability

Supervised topic models can help clinical researchers find interpretable co-occurrence patterns in count data that are relevant for diagnostics. However, standard formulations of supervised Latent Dirichlet Allocation have two problems.…

Machine Learning · Statistics 2016-12-07 Michael C. Hughes , Huseyin Melih Elibol , Thomas McCoy , Roy Perlis , Finale Doshi-Velez

Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence

Extracting and identifying latent topics in large text corpora has gained increasing importance in Natural Language Processing (NLP). Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic…

Computation and Language · Computer Science 2023-03-31 Anton Thielmann , Quentin Seifert , Arik Reuter , Elisabeth Bergherr , Benjamin Säfken

Text classification based on ensemble extreme learning machine

In this paper, we propose a novel approach based on cost-sensitive ensemble weighted extreme learning machine; we call this approach AE1-WELM. We apply this approach to text classification. AE1-WELM is an algorithm including balanced and…

Information Retrieval · Computer Science 2018-05-18 Ming Li , Peilun Xiao , Ju Zhang

LTSG: Latent Topical Skip-Gram for Mutually Learning Topic Model and Vector Representations

Topic models have been widely used in discovering latent topics which are shared across documents in text mining. Vector representations, word embeddings and topic embeddings, map words and topics into a low-dimensional and dense real-value…

Computation and Language · Computer Science 2017-02-24 Jarvan Law , Hankz Hankui Zhuo , Junhua He , Erhu Rong

Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence

An important aspect of text mining involves information retrieval in the form of discovery of semantic themes (topics) from documents using topic modelling. While generative topic models like Latent Dirichlet Allocation (LDA) or Latent Semantic…

Machine Learning · Computer Science 2025-11-04 Satyajeet Sahoo , Jhareswar Maiti

Stochastic Divergence Minimization for Biterm Topic Model

With the emergence and thriving development of social networks, a huge number of short texts have accumulated and need to be processed. Inferring latent topics of collected short texts is useful for understanding their hidden structure and…

Machine Learning · Statistics 2018-04-04 Zhenghang Cui , Issei Sato , Masashi Sugiyama

Topic Modeling based on Keywords and Context

Current topic models often suffer from discovering topics not matching human intuition, unnatural switching of topics within documents and high computational demands. We address these concerns by proposing a topic model and an inference…

Computation and Language · Computer Science 2018-02-06 Johannes Schneider

Learning Topic Models: Identifiability and Finite-Sample Analysis

Topic models provide a useful text-mining tool for learning, extracting, and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, lacking in the literature is a formal…

Machine Learning · Statistics 2022-08-12 Yinyin Chen , Shishuang He , Yun Yang , Feng Liang

Poisson-Process Topic Model for Integrating Knowledge from Pre-trained Language Models

Topic modeling is traditionally applied to word counts without accounting for the context in which words appear. Recent advancements in large language models (LLMs) offer contextualized word embeddings, which capture deeper meaning and…

Machine Learning · Statistics 2025-12-30 Morgane Austern , Yuanchuan Guo , Zheng Tracy Ke , Tianle Liu

Time-to-event prediction for grouped variables using Exclusive Lasso

The integration of high-dimensional genomic data and clinical data into time-to-event prediction models has gained significant attention due to the growing availability of these datasets. Traditionally, a Cox regression model is employed,…

Methodology · Statistics 2025-04-03 Dayasri Ravi , Andreas Groll

Interactive Topic Models with Optimal Transport

Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics in a corpus when analysts are unfamiliar with the corpus, analysts also commonly start with an understanding of the content…

Computation and Language · Computer Science 2024-07-01 Garima Dhanania , Sheshera Mysore , Chau Minh Pham , Mohit Iyyer , Hamed Zamani , Andrew McCallum

Prior matters: simple and general methods for evaluating and improving topic quality in topic modeling

Latent Dirichlet Allocation (LDA) models trained without stopword removal often produce topics with high posterior probabilities on uninformative words, obscuring the underlying corpus content. Even when canonical stopwords are manually…

Computation and Language · Computer Science 2017-10-17 Angela Fan , Finale Doshi-Velez , Luke Miratrix