A Deep Generative Model for Code-Switched Text

Computation and Language 2019-06-24 v1

Abstract

Code-switching, the interleaving of two or more languages within a sentence or discourse, is pervasive in multilingual societies. Accurate language models for code-switched text are critical for NLP tasks. State-of-the-art neural language models are data-intensive and difficult to train well from scarce language-labeled code-switched text. A potential solution is to use deep generative models to synthesize large volumes of realistic code-switched text. Although generative adversarial networks and variational autoencoders can synthesize plausible monolingual text from a continuous latent space, they cannot adequately address code-switched text, owing to its informal style and the complex interplay between the constituent languages. We introduce VACS, a novel variational autoencoder architecture specifically tailored to code-switching phenomena. VACS encodes to and decodes from a two-level hierarchical representation, which models syntactic contextual signals in the lower level and language-switching signals in the upper level. Sampling representations from the prior and decoding them produces well-formed, diverse code-switched sentences. Extensive experiments show that using synthetic code-switched text together with natural monolingual data results in a significant (33.06%) drop in perplexity.
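
As a reading aid for the perplexity figure, the sketch below shows how a language model's perplexity and a percentage drop between two models are conventionally computed. It assumes the reported 33.06% is a relative reduction against a baseline, which the abstract does not state explicitly. The function names and the per-token probabilities are hypothetical illustrations, not values or code from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_drop(baseline_ppl, new_ppl):
    """Percentage reduction of new_ppl relative to baseline_ppl."""
    return 100.0 * (baseline_ppl - new_ppl) / baseline_ppl


# Hypothetical per-token log-probabilities on the same held-out text
# (made-up numbers for illustration only, not results from the paper).
baseline_log_probs = [math.log(0.01)] * 4   # baseline language model
augmented_log_probs = [math.log(0.02)] * 4  # model trained with extra synthetic text

baseline_ppl = perplexity(baseline_log_probs)
augmented_ppl = perplexity(augmented_log_probs)
drop_percent = relative_drop(baseline_ppl, augmented_ppl)
```

With these made-up inputs the baseline perplexity is about 100, the second model's is about 50, and the relative drop is about 50%. Lower perplexity means the model assigns higher probability to the held-out text.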

Cite

@article{arxiv.1906.08972,
  title  = {A Deep Generative Model for Code-Switched Text},
  author = {Bidisha Samanta and Sharmila Reddy and Hussain Jagirdar and Niloy Ganguly and Soumen Chakrabarti},
  journal= {arXiv preprint arXiv:1906.08972},
  year   = {2019}
}