Related papers: Script-Agnostic Language Identification

Optical Script Identification for multi-lingual Indic-script

Script identification and text recognition are some of the major domains in the application of Artificial Intelligence. In this era of digitalization, the use of digital note-taking has become a common practice. Still, conventional methods…

Artificial Intelligence · Computer Science 2023-08-14 Sidhantha Poddar, Rohan Gupta

Discrimination of English to other Indian languages (Kannada and Hindi) for OCR system

India is a multilingual multi-script country. In every state of India there are two languages one is state local language and the other is English. For example in Andhra Pradesh, a state in India, the document may contain text words in…

Computer Vision and Pattern Recognition · Computer Science 2012-05-11 Ankit Kumar, Tushar Patnaik, Vivek Kr Verma

Automatic Script Identification in the Wild

With the rapid increase of transnational communication and cooperation, people frequently encounter multilingual scenarios in various situations. In this paper, we are concerned with a relatively new problem: script identification at word…

Computer Vision and Pattern Recognition · Computer Science 2015-05-13 Baoguang Shi, Cong Yao, Chengquan Zhang, Xiaowei Guo, Feiyue Huang, Xiang Bai

MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification

Script identification plays a vital role in applications that involve handwriting and document analysis within a multi-script and multi-lingual environment. Moreover, it exhibits a profound connection with human cognition. This paper…

Computer Vision and Pattern Recognition · Computer Science 2024-05-30 Miguel A. Ferrer, Abhijit Das, Moises Diaz, Aythami Morales, Cristina Carmona-Duarte, Umapada Pal

ILID: Native Script Language Identification for Indian Languages

The language identification task is a crucial fundamental step in NLP. Often it serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question and answering, and…

Computation and Language · Computer Science 2026-01-08 Yash Ingle, Pruthwik Mishra

Cross-language Framework for Word Recognition and Spotting of Indic Scripts

Handwritten word recognition and spotting of low-resource scripts are difficult as sufficient training data is not available and it is often expensive for collecting data of such scripts. This paper presents a novel cross language platform…

Computer Vision and Pattern Recognition · Computer Science 2018-02-06 Ayan Kumar Bhunia, Partha Pratim Roy, Akash Mohta, Umapada Pal

Handwritten Script Identification from Text Lines

In a multilingual country like India where 12 different official scripts are in use, automatic identification of handwritten script facilitates many important applications such as automatic transcription of multilingual documents, searching…

Computer Vision and Pattern Recognition · Computer Science 2020-09-17 Pawan Kumar Singh, Iman Chatterjee, Ram Sarkar, Mita Nasipuri

Word level Script Identification from Bangla and Devanagri Handwritten Texts mixed with Roman Script

India is a multi-lingual country where Roman script is often used alongside different Indic scripts in a text document. To develop a script specific handwritten Optical Character Recognition (OCR) system, it is therefore necessary to…

Machine Learning · Computer Science 2010-03-25 Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu, Mita Nasipuri, Dipak Kumar Basu

Language Lexicons for Hindi-English Multilingual Text Processing

Language Identification in textual documents is the process of automatically detecting the language contained in a document based on its content. The present Language Identification techniques presume that a document contains text in one of…

Computation and Language · Computer Science 2021-06-30 Mohd Zeeshan Ansari, Tanvir Ahmad, Noaima Bari

Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages

Language Identification (LI) is crucial for various natural language processing tasks, serving as a foundational step in applications such as sentiment analysis, machine translation, and information retrieval. In multilingual societies like…

Computation and Language · Computer Science 2025-03-13 Aniket Deroy, Subhankar Maity

Improving Informally Romanized Language Identification

The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such…

Computation and Language · Computer Science 2025-11-19 Adrian Benton, Alexander Gutkin, Christo Kirov, Brian Roark

Benchmarking Scene Text Recognition in Devanagari, Telugu and Malayalam

Inspired by the success of Deep Learning based approaches to English scene text recognition, we pose and benchmark scene text recognition for three Indic scripts - Devanagari, Telugu and Malayalam. Synthetic word images rendered from…

Computer Vision and Pattern Recognition · Computer Science 2021-04-12 Minesh Mathew, Mohit Jain, CV Jawahar

LanideNN: Multilingual Language Identification on Character Window

In language identification, a common first step in natural language processing, we want to automatically determine the language of some input text. Monolingual language identification assumes that the given document is written in one…

Computation and Language · Computer Science 2017-08-01 Tom Kocmi, Ondřej Bojar

Optical Character Recognition (OCR) for Telugu: Database, Algorithm and Application

Telugu is a Dravidian language spoken by more than 80 million people worldwide. The optical character recognition (OCR) of the Telugu script has wide ranging applications including education, health-care, administration etc. The beautiful…

Computer Vision and Pattern Recognition · Computer Science 2018-12-27 Chandra Prakash Konkimalla, Manikanta Srikar Yellapragada, Trishal Gayam, Souraj Mandal, Sumohana S. Channappayya

Deep Learning for Hindi Text Classification: A Comparison

Natural Language Processing (NLP) and especially natural language text analysis have seen great advances in recent times. Usage of deep learning in text processing has revolutionized the techniques for text processing and achieved…

Information Retrieval · Computer Science 2020-07-07 Ramchandra Joshi, Purvi Goel, Raviraj Joshi

Stress Detection on Code-Mixed Texts in Dravidian Languages using Machine Learning

Stress is a common feeling in daily life, but it can affect mental well-being in some situations, the development of robust detection models is imperative. This study introduces a methodical approach to the stress identification in…

Computation and Language · Computer Science 2024-10-10 L. Ramos, M. Shahiki-Tash, Z. Ahani, A. Eponon, O. Kolesnikova, H. Calvo

Detecting Everyday Scenarios in Narrative Texts

Script knowledge consists of detailed information on everyday activities. Such information is often taken for granted in text and needs to be inferred by readers. Therefore, script knowledge is a central component to language comprehension.…

Computation and Language · Computer Science 2019-06-11 Lilian D. A. Wanzare, Michael Roth, Manfred Pinkal

Survey of Pseudonymization, Abstractive Summarization & Spell Checker for Hindi and Marathi

India's vast linguistic diversity presents unique challenges and opportunities for technological advancement, especially in the realm of Natural Language Processing (NLP). While there has been significant progress in NLP applications for…

Computation and Language · Computer Science 2024-12-25 Rasika Ransing, Mohammed Amaan Dhamaskar, Ayush Rajpurohit, Amey Dhoke, Sanket Dalvi

Devnagari document segmentation using histogram approach

Document segmentation is one of the critical phases in machine recognition of any language. Correct segmentation of individual symbols decides the accuracy of character recognition technique. It is used to decompose image of a sequence of…

Computer Vision and Pattern Recognition · Computer Science 2011-09-07 Vikas J Dongre, Vijay H Mankar

Language-agnostic Multilingual Modeling

Multilingual Automated Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model. This enables data and parameter sharing across languages, which is especially beneficial for the…

Audio and Speech Processing · Electrical Eng. & Systems 2020-04-22 Arindrima Datta, Bhuvana Ramabhadran, Jesse Emond, Anjuli Kannan, Brian Roark