Multilingual Topic Models for Unaligned Text

Jordan Boyd-Graber; David Blei

Subjects: Computation and Language, Information Retrieval, Machine Learning
Submitted: 2012-05-14 (v1)

Abstract

We develop the multilingual topic model for unaligned text (MuTo), a probabilistic model of text that is designed to analyze corpora composed of documents in two languages. From these documents, MuTo uses stochastic EM to simultaneously discover both a matching between the languages and multilingual latent topics. We demonstrate that MuTo is able to find shared topics on real-world multilingual corpora, successfully pairing related documents across languages. MuTo provides a new framework for creating multilingual topic models without needing carefully curated parallel corpora and allows applications built using the topic model formalism to be applied to a much wider class of corpora.

Keywords

topic modeling, corpus, machine translation

Cite

@article{arxiv.1205.2657,
  title  = {Multilingual Topic Models for Unaligned Text},
  author = {Jordan Boyd-Graber and David Blei},
  journal= {arXiv preprint arXiv:1205.2657},
  year   = {2012}
}

Comments

Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI 2009).
