English

Multi-Modal Transformers Utterance-Level Code-Switching Detection

Audio and Speech Processing 2020-11-05 v1

Abstract

An utterance that contains speech from more than one language is known as a code-switched sentence. In this work, we propose a novel technique to predict whether a given audio utterance is monolingual or code-switched. We propose a multi-modal learning approach that uses phoneme information along with audio features for code-switch detection. Our model consists of a Phoneme Network, which processes the phoneme sequence, and an Audio Network (AN), which processes the MFCC features. We fuse the representations learned by both networks to predict whether the utterance is code-switched. Both the Audio Network and the Phoneme Network consist of initial convolution, Bi-LSTM, and transformer encoder layers. The transformer encoder layer uses self-attention to select important and relevant features for better classification. We show that using the phoneme sequence of the utterance along with the MFCC features significantly improves code-switch detection performance. We train and evaluate our model on the Microsoft code-switching challenge datasets for Telugu, Tamil, and Gujarati. Our experiments show that the multi-modal learning approach significantly improves accuracy over the uni-modal approaches on the Telugu-English, Gujarati-English, and Tamil-English datasets. We also study system performance with different neural layers and show that transformers help obtain better performance.
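The self-attention and fusion steps described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation. It assumes identity query/key/value projections in place of learned weights, mean pooling over time, and concatenation as the fusion operator, none of which the abstract specifies. It also omits the convolution, Bi-LSTM, and classifier layers. The names `self_attention`, `mean_pool`, and `fuse`, and the toy inputs, are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Sequence = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq: Sequence) -> Sequence:
    """Scaled dot-product self-attention.

    Assumption: queries, keys, and values are the inputs themselves
    (identity projections); a real encoder layer learns these projections.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def mean_pool(seq: Sequence) -> Vector:
    """Average over the time axis to get one utterance-level vector."""
    n = len(seq)
    return [sum(frame[j] for frame in seq) / n for j in range(len(seq[0]))]


def fuse(audio_seq: Sequence, phoneme_seq: Sequence) -> Vector:
    """Assumed fusion: concatenate the pooled outputs of the two branches."""
    return mean_pool(self_attention(audio_seq)) + mean_pool(self_attention(phoneme_seq))


# Toy stand-ins for MFCC frames (4 frames x 3 dims) and
# phoneme embeddings (3 phonemes x 2 dims); values are made up.
audio_frames = [[0.1, 0.3, -0.2], [0.0, 0.5, 0.1], [0.4, -0.1, 0.2], [0.2, 0.2, 0.0]]
phoneme_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused = fuse(audio_frames, phoneme_embeddings)
print(len(fused))  # 5: three audio dims followed by two phoneme dims
```

In a full model, the fused vector would feed a binary classifier that labels the utterance as monolingual or code-switched.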

Cite

@article{arxiv.2011.02132,
  title  = {Multi-Modal Transformers Utterance-Level Code-Switching Detection},
  author = {Krishna D N},
  journal= {arXiv preprint arXiv:2011.02132},
  year   = {2020}
}

Comments

8 pages, 2 figures
