Related papers: Romanized to Native Malayalam Script Transliterati…

Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration

The development of robust transliteration techniques to enhance the effectiveness of transforming Romanized scripts into native scripts is crucial for Natural Language Processing tasks, including sentiment analysis, speech recognition,…

Computation and Language · Computer Science 2025-12-01 Kanchon Gharami, Quazi Sarwar Muhtaseem, Deepti Gupta, Lavanya Elluri, Shafika Showkat Moni

A two-stage transliteration approach to improve performance of a multilingual ASR

End-to-end Automatic Speech Recognition (ASR) systems are rapidly claiming to become state-of-art over other modeling methods. Several techniques have been introduced to improve their ability to handle multiple languages. However, due to…

Computation and Language · Computer Science 2024-10-22 Rohit Kumar

IndoNLP 2025: Shared Task on Real-Time Reverse Transliteration for Romanized Indo-Aryan languages

The paper overviews the shared task on Real-Time Reverse Transliteration for Romanized Indo-Aryan languages. It focuses on the reverse transliteration of low-resourced languages in the Indo-Aryan family to their native scripts. Typing…

Computation and Language · Computer Science 2025-02-25 Deshan Sumanathilaka, Isuri Anuradha, Ruvan Weerasinghe, Nicholas Micallef, Julian Hough

A Dual-Decoder Conformer for Multilingual Speech Recognition

Transformer-based models have recently become very popular for sequence-to-sequence applications such as machine translation and speech recognition. This work proposes a dual-decoder transformer model for low-resource multilingual speech…

Computation and Language · Computer Science 2021-09-09 Krishna D N

Improving Speech Recognition for Indic Languages using Language Model

We study the effect of applying a language model (LM) on the output of Automatic Speech Recognition (ASR) systems for Indic languages. We fine-tune wav2vec 2.0 models for 18 Indic languages and adjust the results with language models…

Computation and Language · Computer Science 2022-06-16 Ankur Dhuriya, Harveen Singh Chadha, Anirudh Gupta, Priyanshi Shah, Neeraj Chhimwal, Rishabh Gaur, Vivek Raghavan

Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches

Due to reasons of convenience and lack of tech literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is eminently prevalent in the context of low-resource languages such as Sinhala, which have their…

Computation and Language · Computer Science 2025-03-05 Yomal De Mel, Kasun Wickramasinghe, Nisansa de Silva, Surangika Ranathunga

MATra: A Multilingual Attentive Transliteration System for Indian Scripts

Transliteration is a task in the domain of NLP where the output word is a similar-sounding word written using the letters of any foreign language. Today this system has been developed for several language pairs that involve English as…

Computation and Language · Computer Science 2022-08-24 Yash Raj, Bhavesh Laddagiri

Romanization Encoding For Multilingual ASR

We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a…

Computation and Language · Computer Science 2024-12-18 Wen Ding, Fei Jia, Hainan Xu, Yu Xi, Junjie Lai, Boris Ginsburg

Evaluating Cross-lingual Knowledge Consistency in Code-Mixed vis-a-vis Indian Languages using IndicKLAR

Large language models recall knowledge reliably in English but often fail on the same query posed in a lower-resourced language -- a crosslingual consistency gap that remains underexplored for Indian languages and their code-mixed…

Computation and Language · Computer Science 2026-05-29 Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Aditya Joshi, Akshay Agarwal, Jasabanta Patro

One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models

Exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, focused on setups that favor romanization…

Computation and Language · Computer Science 2026-01-12 Benedikt Ebing, Lennart Keller, Goran Glavaš

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources our approach uses OCR models trained for other languages written in Roman. Currently, there exists no dataset…

Computation and Language · Computer Science 2018-09-10 Amrith Krishna, Bodhisattwa Prasad Majumder, Rajesh Shreedhar Bhat, Pawan Goyal

Handwriting Trajectory Recovery using End-to-End Deep Encoder-Decoder Network

In this paper, we introduce a novel technique to recover the pen trajectory of offline characters which is a crucial step for handwritten character recognition. Generally, online acquisition approach has more advantage than its offline…

Computer Vision and Pattern Recognition · Computer Science 2018-06-05 Ayan Kumar Bhunia, Abir Bhowmick, Ankan Kumar Bhunia, Aishik Konwer, Prithaj Banerjee, Partha Pratim Roy, Umapada Pal

Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Romanized Scripts in a Real World Setting

Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. Speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely…

Computation and Language · Computer Science 2026-04-01 Manurag Khullar, Utkarsh Desai, Poorva Malviya, Aman Dalmia, Zheyuan Ryan Shi

Spectral Analysis of Projection Histogram for Enhancing Close matching character Recognition in Malayalam

The success rates of Optical Character Recognition (OCR) systems for printed Malayalam documents is quite impressive with the state of the art accuracy levels in the range of 85-95% for various. However for real applications, further…

Computation and Language · Computer Science 2012-05-09 Sajilal Divakaran

Neural Machine Transliteration: Preliminary Results

Machine transliteration is the process of automatically transforming the script of a word from a source language to a target language, while preserving pronunciation. Sequence to sequence learning has recently emerged as a new paradigm in…

Computation and Language · Computer Science 2016-09-15 Amir H. Jadidinejad

Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma…

Computation and Language · Computer Science 2025-11-27 Adity Khisa, Nusrat Jahan Lia, Tasnim Mahfuz Nafis, Zarif Masud, Tanzir Pial, Shebuti Rayana, Ahmedul Kabir

Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari

In medieval India, the Marathi language was written using the Modi script. The texts written in Modi script include extensive knowledge about medieval sciences, medicines, land records and authentic evidence about Indian history. Around 40…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Harshal Kausadikar, Tanvi Kale, Onkar Susladkar, Sparsh Mittal

DuDe: Dual-Decoder Multilingual ASR for Indian Languages using Common Label Set

In a multilingual country like India, multilingual Automatic Speech Recognition (ASR) systems have much scope. Multilingual ASR systems exhibit many advantages like scalability, maintainability, and improved performance over the monolingual…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-01 Arunkumar A, Mudit Batra, Umesh S

Word level Script Identification from Bangla and Devanagri Handwritten Texts mixed with Roman Script

India is a multi-lingual country where Roman script is often used alongside different Indic scripts in a text document. To develop a script specific handwritten Optical Character Recognition (OCR) system, it is therefore necessary to…

Machine Learning · Computer Science 2010-03-25 Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu, Mita Nasipuri, Dipak Kumar Basu

Multilingual Speech Recognition for Low-Resource Indian Languages using Multi-Task conformer

Transformers have recently become very popular for sequence-to-sequence applications such as machine translation and speech recognition. In this work, we propose a multi-task learning-based transformer model for low-resource multilingual…

Computation and Language · Computer Science 2021-09-13 Krishna D N