Automatic Textual Normalization for Hate Speech Detection

Anh Thi-Hoang Nguyen; Dung Ha Nguyen; Nguyet Thi Nguyen; Khanh Thanh-Duy Ho; Kiet Van Nguyen

doi:10.1007/978-3-031-64779-6_1

Computation and Language 2024-07-26 v4

Abstract

Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for Vietnamese treat this issue as a lexical normalization problem, relying on manually created rules or multi-stage deep learning frameworks, which demand extensive effort to craft intricate rules. In contrast, our approach is straightforward and employs only a sequence-to-sequence (Seq2Seq) model. In this research, we provide a dataset for textual normalization comprising 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. Using the Seq2Seq model for textual normalization, we achieve an accuracy slightly below 70%. Nevertheless, textual normalization improves the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, demonstrating its potential to improve the performance of complex NLP tasks. Our dataset is accessible for research purposes.

Keywords

hate speech detection speech recognition shared task evaluation

Cite

@article{arxiv.2311.06851,
  title  = {Automatic Textual Normalization for Hate Speech Detection},
  author = {Anh Thi-Hoang Nguyen and Dung Ha Nguyen and Nguyet Thi Nguyen and Khanh Thanh-Duy Ho and Kiet Van Nguyen},
  journal= {arXiv preprint arXiv:2311.06851},
  year   = {2024}
}

Comments

2023 International Conference on Intelligent Systems Design and Applications (ISDA2023)
