Related papers: Linguistic Classification using Instance-Based Lea…

From Isolates to Families: Using Neural Networks for Automated Language Affiliation

In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of…

Computation and Language · Computer Science 2025-12-09 Frederic Blum, Steffen Herbold, Johann-Mattis List

Language discrimination and clustering via a neural network approach

We classify twenty-one Indo-European languages starting from written text. We use neural networks in order to define a distance among different languages, construct a dendrogram and analyze the ultrametric structure that emerges. Four or…

Disordered Systems and Neural Networks · Physics 2015-07-16 Angelo Mariano, Giorgio Parisi, Saverio Pascazio

Semantically Cohesive Word Grouping in Indian Languages

Indian languages are inflectional and agglutinative and typically follow clause-free word order. The structure of sentences across most major Indian languages are similar when their dependency parse trees are considered. While some…

Computation and Language · Computer Science 2025-01-08 N J Karthika, Adyasha Patra, Nagasai Saketh Naidu, Arnab Bhattacharya, Ganesh Ramakrishnan, Chaitali Dangarikar

Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns

It is reasonable to hypothesize that the divergence patterns formulated by historical linguists and typologists reflect constraints on human languages, and are thus consistent with Second Language Acquisition (SLA) in a certain way. In this…

Computation and Language · Computer Science 2020-07-20 Yuanyuan Zhao, Weiwei Sun, Xiaojun Wan

Sampling the Swadesh List to Identify Similar Languages with Tree Spaces

Communication plays a vital role in human interaction. Studying language is a worthwhile task and more recently has become quantitative in nature with developments of fields like quantitative comparative linguistics and lexicostatistics.…

Applications · Statistics 2024-05-13 Garett Ordway, Vic Patrangenaru

Text Classification Using Label Names Only: A Language Model Self-Training Approach

Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled…

Computation and Language · Computer Science 2020-10-15 Yu Meng, Yunyi Zhang, Jiaxin Huang, Chenyan Xiong, Heng Ji, Chao Zhang, Jiawei Han

Vocabulary-Defined Semantics: Latent Space Clustering for Improving In-Context Learning

In-context learning enables language models (LM) to adapt to downstream data or tasks by incorporating few samples as demonstrations within the prompts. It offers strong performance without the expense of fine-tuning. However, the…

Computation and Language · Computer Science 2024-10-15 Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

Utilizing Language Relatedness to improve Machine Translation: A Case Study on Languages of the Indian Subcontinent

In this work, we present an extensive study of statistical machine translation involving languages of the Indian subcontinent. These languages are related by genetic and contact relationships. We describe the similarities between Indic…

Computation and Language · Computer Science 2020-03-20 Anoop Kunchukuttan, Pushpak Bhattacharyya

Sort by Structure: Language Model Ranking as Dependency Probing

Making an informed choice of pre-trained language model (LM) is critical for performance, yet environmentally costly, and as such widely underexplored. The field of Computer Vision has begun to tackle encoder ranking, with promising forays…

Computation and Language · Computer Science 2022-06-13 Max Müller-Eberstein, Rob van der Goot, Barbara Plank

Grouping Words Using Statistical Context

This paper (cmp-lg/yymmnnn) has been accepted for publication in the student session of EACL-95. It outlines ongoing work using statistical and unsupervised neural network methods for clustering words in untagged corpora. Such approaches…

cmp-lg · Computer Science 2008-02-03 Christopher C. Huckle

IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context

The pervasive influence of social biases in language data has sparked the need for benchmark datasets that capture and evaluate these biases in Large Language Models (LLMs). Existing efforts predominantly focus on English language and the…

Computation and Language · Computer Science 2024-04-04 Nihar Ranjan Sahoo, Pranamya Prashant Kulkarni, Narjis Asad, Arif Ahmad, Tanu Goyal, Aparna Garimella, Pushpak Bhattacharyya

Improving Language Models by Clustering Training Sentences

Many of the kinds of language model used in speech understanding suffer from imperfect modeling of intra-sentential contextual influences. I argue that this problem can be addressed by clustering the sentences in a training corpus…

cmp-lg · Computer Science 2008-02-03 David Carter

On the accuracy of language trees

Historical linguistics aims at inferring the most likely language phylogenetic tree starting from information concerning the evolutionary relatedness of languages. The available information are typically lists of homologous (lexical,…

Physics and Society · Physics 2015-05-27 Simone Pompei, Vittorio Loreto, Francesca Tria

Cross-Lingual Induction and Transfer of Verb Classes Based on Word Vector Space Specialisation

Existing approaches to automatic VerbNet-style verb classification are heavily dependent on feature engineering and therefore limited to languages with mature NLP pipelines. In this work, we propose a novel cross-lingual transfer method for…

Computation and Language · Computer Science 2017-07-24 Ivan Vulić, Nikola Mrkšić, Anna Korhonen

Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification

In cross-lingual text classification, one seeks to exploit labeled data from one language to train a text classification model that can then be applied to a completely different language. Recent multilingual representation models have made…

Computation and Language · Computer Science 2020-07-31 Xin Dong, Yaxin Zhu, Yupeng Zhang, Zuohui Fu, Dongkuan Xu, Sen Yang, Gerard de Melo

Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics

Recent advances in pre-trained language modeling have facilitated significant progress across various natural language processing (NLP) tasks. Word masking during model training constitutes a pivotal component of language modeling in…

Computation and Language · Computer Science 2024-02-27 Anas Belfathi, Ygor Gallina, Nicolas Hernandez, Richard Dufour, Laura Monceaux

The Tag-Team Approach: Leveraging CLS and Language Tagging for Enhancing Multilingual ASR

Building a multilingual Automated Speech Recognition (ASR) system in a linguistically diverse country like India can be a challenging task due to the differences in scripts and the limited availability of speech data. This problem can be…

Computation and Language · Computer Science 2023-06-01 Kaousheik Jayakumar, Vrunda N. Sukhadia, A Arunkumar, S. Umesh

Classifying Syntactic Regularities for Hundreds of Languages

This paper presents a comparison of classification methods for linguistic typology for the purpose of expanding an extensive, but sparse language resource: the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013). We…

Computation and Language · Computer Science 2016-04-28 Reed Coke, Ben King, Dragomir Radev

Topics in Contextualised Attention Embeddings

Contextualised word vectors obtained via pre-trained language models encode a variety of knowledge that has already been exploited in applications. Complementary to these language models are probabilistic topic models that learn thematic…

Computation and Language · Computer Science 2023-01-12 Mozhgan Talebpour, Alba Garcia Seco de Herrera, Shoaib Jameel

Bangla Word Clustering Based on Tri-gram, 4-gram and 5-gram Language Model

In this paper, we describe a research method that generates Bangla word clusters on the basis of relating to meaning in language and contextual similarity. The importance of word clustering is in parts of speech (POS) tagging, word sense…

Computation and Language · Computer Science 2017-01-31 Dipaloke Saha, Md Saddam Hossain, MD. Saiful Islam, Sabir Ismail