Related papers: Automatic Language Identification for Celtic Texts

Language Segmentation

Language segmentation consists in finding the boundaries where one language ends and another language begins in a text written in more than one language. This is important for all natural language processing tasks. The problem can be solved…

Computation and Language · Computer Science 2015-10-07 David Alfter

A Semisupervised Approach for Language Identification based on Ladder Networks

In this study we address the problem of training a neural network for language identification using both labeled and unlabeled speech samples in the form of i-vectors. We propose a neural network architecture that can also handle out-of-set…

Computation and Language · Computer Science 2016-04-04 Ehud Ben-Reuven, Jacob Goldberger

Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements over various cross-lingual and low-resource tasks. Through training on one hundred languages…

Computation and Language · Computer Science 2020-11-24 Juntao Li, Ruidan He, Hai Ye, Hwee Tou Ng, Lidong Bing, Rui Yan

Unsupervised Automatic Speech Recognition: A Review

Automatic Speech Recognition (ASR) systems can be trained to achieve remarkable performance given large amounts of manually transcribed speech, but large labeled data sets can be difficult or expensive to acquire for all languages of…

Computation and Language · Computer Science 2022-03-22 Hanan Aldarmaki, Asad Ullah, Nazar Zaki

Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets

In this research, we advanced a spoken language recognition system, moving beyond traditional feature vector-based models. Our improvements focused on effectively capturing language characteristics over extended periods using a specialized…

Sound · Computer Science 2025-01-22 Or Haim Anidjar, Roi Yozevitch

Low-rank Dictionary Learning for Unsupervised Feature Selection

There exist many high-dimensional data in real-world applications such as biology, computer vision, and social networks. Feature selection approaches are devised to confront with high-dimensional data challenges with the aim of efficient…

Machine Learning · Computer Science 2021-06-22 Mohsen Ghassemi Parsa, Hadi Zare, Mehdi Ghatee

Universal Cross-Lingual Text Classification

Text classification, an integral task in natural language processing, involves the automatic categorization of text into predefined classes. Creating supervised labeled datasets for low-resource languages poses a considerable challenge.…

Computation and Language · Computer Science 2024-06-18 Riya Savant, Anushka Shelke, Sakshi Todmal, Sanskruti Kanphade, Ananya Joshi, Raviraj Joshi

Efficient Spoken Language Recognition via Multilabel Classification

Spoken language recognition (SLR) is the task of automatically identifying the language present in a speech signal. Existing SLR models are either too computationally expensive or too large to run effectively on devices with limited…

Computation and Language · Computer Science 2023-06-06 Oriol Nieto, Zeyu Jin, Franck Dernoncourt, Justin Salamon

Robust Multilingual Named Entity Recognition with Shallow Semi-Supervised Features

We present a multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information with clustering semi-supervised features induced on large…

Computation and Language · Computer Science 2017-02-03 Rodrigo Agerri, German Rigau

Improved Language Identification Through Cross-Lingual Self-Supervised Learning

Language identification greatly impacts the success of downstream tasks such as automatic speech recognition. Recently, self-supervised speech representations learned by wav2vec 2.0 have been shown to be very effective for a range of speech…

Computation and Language · Computer Science 2021-10-19 Andros Tjandra, Diptanu Gon Choudhury, Frank Zhang, Kritika Singh, Alexis Conneau, Alexei Baevski, Assaf Sela, Yatharth Saraf, Michael Auli

Unsupervised Data Validation Methods for Efficient Model Training

This paper investigates the challenges and potential solutions for improving machine learning systems for low-resource languages. State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT), and…

Computation and Language · Computer Science 2024-10-11 Yurii Paniv

Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages

Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised…

Computation and Language · Computer Science 2025-06-13 Ali Almutairi, Abdullah Alsuhaibani, Shoaib Jameel, Usman Naseem, Gelareh Mohammadi, Imran Razzak

Representation Learning for Weakly Supervised Relation Extraction

Recent years have seen rapid development in Information Extraction, as well as its subtask, Relation Extraction. Relation Extraction is able to detect semantic relations between entities in sentences. Currently, many efficient approaches…

Computation and Language · Computer Science 2024-03-19 Zhuang Li

Weakly Supervised Scene Text Generation for Low-resource Languages

A large number of annotated training images is crucial for training successful scene text recognition models. However, collecting sufficient datasets can be a labor-intensive and costly process, particularly for low-resource languages. To…

Computer Vision and Pattern Recognition · Computer Science 2023-06-28 Yangchen Xie, Xinyuan Chen, Hongjian Zhan, Palaiahankote Shivakum, Bing Yin, Cong Liu, Yue Lu

Semi-supervised Classification for Natural Language Processing

Semi-supervised classification is an interesting idea where classification models are learned from both labeled and unlabeled data. It has several advantages over supervised classification in natural language processing domain. For…

Computation and Language · Computer Science 2014-09-29 Rushdi Shams

State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has…

Computation and Language · Computer Science 2026-02-20 Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić

DocLangID: Improving Few-Shot Training to Identify the Language of Historical Documents

Language identification describes the task of recognizing the language of written text in documents. This information is crucial because it can be used to support the analysis of a document's vocabulary and context. Supervised learning…

Computer Vision and Pattern Recognition · Computer Science 2023-09-19 Furkan Simsek, Brian Pfitzmann, Hendrik Raetz, Jona Otholt, Haojin Yang, Christoph Meinel

Improving Language Identification of Accented Speech

Language identification from speech is a common preprocessing step in many spoken language processing systems. In recent years, this field has seen fast progress, mostly due to the use of self-supervised models pretrained on multilingual…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-04 Kunnar Kukk, Tanel Alumäe

Unsupervised Machine Translation On Dravidian Languages

Unsupervised neural machine translation (UNMT) is beneficial especially for low resource languages such as those from the Dravidian family. However, UNMT systems tend to fail in realistic scenarios involving actual low resource languages.…

Computation and Language · Computer Science 2021-03-31 Sai Koneru, Danni Liu, Jan Niehues

Joint unsupervised and supervised learning for context-aware language identification

Language identification (LID) recognizes the language of a spoken utterance automatically. According to recent studies, LID models trained with an automatic speech recognition (ASR) task perform better than those trained with a LID task…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-17 Jinseok Park, Hyung Yong Kim, Jihwan Park, Byeong-Yeol Kim, Shukjae Choi, Yunkyu Lim