Related papers: Exploiting Dialect Identification in Automatic Dia…

Automatic Arabic Dialect Identification Systems for Written Texts: A Survey

Arabic dialect identification is a specific task of natural language processing, aiming to automatically predict the Arabic dialect of a given text. Arabic dialect identification is the first step in various natural language processing…

Computation and Language · Computer Science 2020-09-29 Maha J. Althobaiti

MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written in Latin Script

Social media user-generated text is actually the main resource for many NLP tasks. This text, however, does not follow the standard rules of writing. Moreover, the use of dialect such as Moroccan Arabic in written communications increases…

Computation and Language · Computer Science 2022-06-22 Randa Zarnoufi, Walid Bachri, Hamid Jaafar, Mounia Abik

Multi-Dialect Arabic Speech Recognition

This paper presents the design and development of multi-dialect automatic speech recognition for Arabic. Deep neural networks are becoming an effective tool to solve sequential data problems, particularly, adopting an end-to-end training of…

Audio and Speech Processing · Electrical Eng. & Systems 2021-12-30 Abbas Raza Ali

Dotless Representation of Arabic Text: Analysis and Modeling

This paper presents a novel dotless representation of Arabic text as an alternative to the standard Arabic text representation. We delve into its implications through comprehensive analysis across five diverse corpora and four different…

Computation and Language · Computer Science 2023-12-27 Maged S. Al-Shaibani, Irfan Ahmad

ALDi: Quantifying the Arabic Level of Dialectness of Text

Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this…

Computation and Language · Computer Science 2023-10-24 Amr Keleg, Sharon Goldwater, Walid Magdy

An open access NLP dataset for Arabic dialects: Data collection, labeling, and model construction

Natural Language Processing (NLP) is today a very active field of research and innovation. Many applications, however, need large sets of data for supervised learning, suitably labelled for training purposes. This includes applications for…

Computation and Language · Computer Science 2021-02-23 ElMehdi Boujou, Hamza Chataoui, Abdellah El Mekki, Saad Benjelloun, Ikram Chairi, Ismail Berrada

Automatic Dialect Detection in Arabic Broadcast Speech

We investigate different approaches for dialect identification in Arabic broadcast speech, using phonetic, lexical features obtained from a speech recognition system, and acoustic features using the i-vector framework. We studied both…

Computation and Language · Computer Science 2016-08-12 Ahmed Ali, Najim Dehak, Patrick Cardinal, Sameer Khurana, Sree Harsha Yella, James Glass, Peter Bell, Steve Renals

Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization

The widespread absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP). This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild,"…

Computation and Language · Computer Science 2024-06-11 Salman Elgamal, Ossama Obeid, Tameem Kabbani, Go Inoue, Nizar Habash

Computational Approaches to Arabic-English Code-Switching

Natural Language Processing (NLP) is a vital computational method for addressing language processing, analysis, and generation. NLP tasks form the core of many daily applications, from automatic text correction to speech recognition. While…

Computation and Language · Computer Science 2024-10-18 Caroline Sabty

Beyond Orthography: Automatic Recovery of Short Vowels and Dialectal Sounds in Arabic

This paper presents a novel Dialectal Sound and Vowelization Recovery framework, designed to recognize borrowed and dialectal sounds within phonologically diverse and dialect-rich languages, that extends beyond its standard orthographic…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-06 Yassine El Kheir, Hamdy Mubarak, Ahmed Ali, Shammur Absar Chowdhury

Automatic Standardization of Arabic Dialects for Machine Translation

Based on an annotated multimedia corpus, television series Marāyā 2013, we dig into the question of "automatic standardization" of Arabic dialects for machine translation. Here we distinguish between rule-based machine translation…

Computation and Language · Computer Science 2023-01-10 Abidrabbo Alnassan

Sadeed: Advancing Arabic Diacritization Through Small Language Model

Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model…

Computation and Language · Computer Science 2025-08-22 Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan

Supporting Undotted Arabic with Pre-trained Language Models

We observe a recent behaviour on social media, in which users intentionally remove consonantal dots from Arabic letters, in order to bypass content-classification algorithms. Content classification is typically done by fine-tuning…

Computation and Language · Computer Science 2021-11-19 Aviad Rom, Kfir Bar

Diacritization of Maghrebi Arabic Sub-Dialects

The diacritization process attempts to restore the short vowels in written Arabic text, which are typically omitted. This process is essential for applications such as Text-to-Speech (TTS). While diacritization of Modern Standard Arabic (MSA)…

Computation and Language · Computer Science 2019-06-03 Ahmed Abdelali, Mohammed Attia, Younes Samih, Kareem Darwish, Hamdy Mubarak

Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

Automatic Arabic Dialect Identification (ADI) of text has gained great popularity since it was introduced in the early 2010s. Multiple datasets were developed, and yearly shared tasks have been running since 2018. However, ADI systems are…

Computation and Language · Computer Science 2023-10-23 Amr Keleg, Walid Magdy

Evaluating Various Tokenizers for Arabic Text Classification

The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is not efficient in…

Computation and Language · Computer Science 2021-09-30 Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Irfan Ahmad

A Multitask Learning Approach for Diacritic Restoration

In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings. Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word. This results in a…

Computation and Language · Computer Science 2020-06-09 Sawsan Alqahtani, Ajay Mishra, Mona Diab

Arab Voices: Mapping Standard and Dialectal Arabic Speech Technology

Dialectal Arabic (DA) speech data vary widely in domain coverage, dialect labeling practices, and recording conditions, complicating cross-dataset comparison and model evaluation. To characterize this landscape, we conduct a computational…

Computation and Language · Computer Science 2026-01-30 Peter Sullivan, AbdelRahim Elmadany, Alcides Alcoba Inciarte, Muhammad Abdul-Mageed

A Survey of Code-switched Arabic NLP: Progress, Challenges, and Future Directions

Language in the Arab world presents a complex diglossic and multilingual setting, involving the use of Modern Standard Arabic, various dialects and sub-dialects, as well as multiple European languages. This diverse linguistic landscape has…

Computation and Language · Computer Science 2025-01-24 Injy Hamed, Caroline Sabty, Slim Abdennadher, Ngoc Thang Vu, Thamar Solorio, Nizar Habash

Arabic Text Diacritization Using Deep Neural Networks

Diacritization of Arabic text is both an interesting and a challenging problem, with various applications ranging from speech synthesis to helping students learn the Arabic language. Like many other tasks or problems in…

Computation and Language · Computer Science 2019-05-07 Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, Mahmoud Al-Ayyoub