Related papers: Multilingual Topic Models for Unaligned Text

Legal document retrieval across languages: topic hierarchies based on synsets

Cross-lingual annotations of legislative texts enable us to explore major themes covered in multilingual legal data and are a key facilitator of semantic similarity when searching for similar documents. Multilingual probabilistic topic…

Information Retrieval · Computer Science 2019-12-02 Carlos Badenes-Olmedo , Jose-Luis Redondo-Garcia , Oscar Corcho

Bilingual Topic Models for Comparable Corpora

Probabilistic topic models like Latent Dirichlet Allocation (LDA) have been previously extended to the bilingual setting. A fundamental modeling assumption in several of these extensions is that the input corpora are in the form of document…

Computation and Language · Computer Science 2021-12-01 Georgios Balikas , Massih-Reza Amini , Marianne Clausel

Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence

An important aspect of text mining involves information retrieval in form of discovery of semantic themes (topics) from documents using topic modelling. While generative topic models like Latent Dirichlet Allocation (LDA) or Latent Semantic…

Machine Learning · Computer Science 2025-11-04 Satyajeet Sahoo , Jhareswar Maiti

Learning Multilingual Topics from Incomparable Corpus

Multilingual topic models enable crosslingual tasks by extracting consistent topics from multilingual corpora. Most models require parallel or comparable training corpora, which limits their ability to generalize. In this paper, we first…

Computation and Language · Computer Science 2018-06-13 Shudong Hao , Michael J. Paul

Multi-environment Topic Models

Probabilistic topic models are a powerful tool for extracting latent themes from large text datasets. In many text datasets, we also observe per-document covariates (e.g., source, style, political affiliation) that act as environments that…

Computation and Language · Computer Science 2024-11-04 Dominic Sobhani , Amir Feder , David Blei

Multilingual and Multimodal Topic Modelling with Pretrained Embeddings

This paper presents M3L-Contrast -- a novel multimodal multilingual (M3L) neural topic model for comparable data that maps texts from multiple languages and images into a shared topic space. Our model is trained jointly on texts and images…

Computation and Language · Computer Science 2022-11-16 Elaine Zosa , Lidia Pivovarova

A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery

In this paper, we present a Bayesian multilingual document model for learning language-independent document embeddings. The model is an extension of BaySMM [Kesiraju et al 2020] to the multilingual scenario. It learns to represent the…

Computation and Language · Computer Science 2024-03-26 Santosh Kesiraju , Sangeet Sagar , Ondřej Glembek , Lukáš Burget , Ján Černocký , Suryakanth V Gangashetty

TMT: A Simple Way to Translate Topic Models Using Dictionaries

The training of topic models for a multilingual environment is a challenging task, requiring the use of sophisticated algorithms, topic-aligned corpora, and manual evaluation. These difficulties are further exacerbated when the developer…

Computation and Language · Computer Science 2025-09-03 Felix Engl , Andreas Henrich

Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations

Topic models have been the prominent tools for automatic topic discovery from text corpora. Despite their effectiveness, topic models suffer from several limitations including the inability of modeling word ordering information in…

Computation and Language · Computer Science 2022-02-10 Yu Meng , Yunyi Zhang , Jiaxin Huang , Yu Zhang , Jiawei Han

Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies

With the ongoing growth in number of digital articles in a wider set of languages and the expanding use of different languages, we need annotation methods that enable browsing multi-lingual corpora. Multilingual probabilistic topic models…

Computation and Language · Computer Science 2021-01-11 Carlos Badenes-Olmedo , Jose-Luis Redondo García , Oscar Corcho

A network approach to topic models

One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach which infers the latent topical structure of a…

Machine Learning · Statistics 2018-07-20 Martin Gerlach , Tiago P. Peixoto , Eduardo G. Altmann

Topic Modeling in Embedding Spaces

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the Embedded Topic…

Information Retrieval · Computer Science 2019-07-12 Adji B. Dieng , Francisco J. R. Ruiz , David M. Blei

An Empirical Study on Crosslingual Transfer in Probabilistic Topic Models

Probabilistic topic modeling is a popular choice as the first step of crosslingual tasks to enable knowledge transfer and extract multilingual features. While many multilingual topic models have been developed, their assumptions on the…

Computation and Language · Computer Science 2019-06-11 Shudong Hao , Michael J. Paul

Nonparametric Relational Topic Models through Dependent Gamma Processes

Traditional Relational Topic Models provide a way to discover the hidden topics from a document network. Many theoretical and practical tasks, such as dimensional reduction, document clustering, link prediction, benefit from this revealed…

Machine Learning · Statistics 2015-03-31 Junyu Xuan , Jie Lu , Guangquan Zhang , Richard Yi Da Xu , Xiangfeng Luo

Cross-lingual Contextualized Topic Models with Zero-shot Learning

Many data sets (e.g., reviews, forums, news, etc.) exist parallelly in multiple languages. They all cover the same content, but the linguistic differences make it impossible to use traditional, bag-of-word-based topic models. Models have to…

Computation and Language · Computer Science 2021-02-05 Federico Bianchi , Silvia Terragni , Dirk Hovy , Debora Nozza , Elisabetta Fersini

A Probabilistic Model of Machine Translation

A probabilistic model for computer-based generation of a machine translation system on the basis of English-Russian parallel text corpora is suggested. The model is trained using parallel text corpora with pre-aligned source and target…

Computation and Language · Computer Science 2007-05-23 G. E. Miram , V. K. Petrov

Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling

Topic modelling, as a well-established unsupervised technique, has found extensive use in automatically detecting significant topics within a corpus of documents. However, classic topic modelling approaches (e.g., LDA) have certain…

Computation and Language · Computer Science 2024-03-27 Yida Mu , Chun Dong , Kalina Bontcheva , Xingyi Song

Text Modeling using Unsupervised Topic Models and Concept Hierarchies

Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. While topic models can potentially discover a broad range of themes in a data set,…

Artificial Intelligence · Computer Science 2008-08-08 Chaitanya Chemudugunta , Padhraic Smyth , Mark Steyvers

A new evaluation framework for topic modeling algorithms based on synthetic corpora

Topic models are in widespread use in natural language processing and beyond. Here, we propose a new framework for the evaluation of probabilistic topic modeling algorithms based on synthetic corpora containing an unambiguously defined…

Computation and Language · Computer Science 2019-01-29 Hanyu Shi , Martin Gerlach , Isabel Diersen , Doug Downey , Luis A. N. Amaral

Discovering topics with neural topic models built from PLSA assumptions

In this paper we present a model for unsupervised topic discovery in texts corpora. The proposed model uses documents, words, and topics lookup table embedding as neural network model parameters to build probabilities of words given topics,…

Computation and Language · Computer Science 2019-11-26 Sileye 0. Ba