Related papers: Linguistic Classification using Instance-Based Lea…
In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of…
We classify twenty-one Indo-European languages starting from written text. We use neural networks in order to define a distance among different languages, construct a dendrogram and analyze the ultrametric structure that emerges. Four or…
Indian languages are inflectional and agglutinative and typically follow clause-free word order. The structure of sentences across most major Indian languages is similar when their dependency parse trees are considered. While some…
It is reasonable to hypothesize that the divergence patterns formulated by historical linguists and typologists reflect constraints on human languages, and are thus consistent with Second Language Acquisition (SLA) in a certain way. In this…
Communication plays a vital role in human interaction. Studying language is a worthwhile task that has recently become quantitative in nature with the development of fields such as quantitative comparative linguistics and lexicostatistics.…
Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled…
In-context learning enables language models (LM) to adapt to downstream data or tasks by incorporating few samples as demonstrations within the prompts. It offers strong performance without the expense of fine-tuning. However, the…
In this work, we present an extensive study of statistical machine translation involving languages of the Indian subcontinent. These languages are related by genetic and contact relationships. We describe the similarities between Indic…
Making an informed choice of pre-trained language model (LM) is critical for performance, yet environmentally costly, and as such widely underexplored. The field of Computer Vision has begun to tackle encoder ranking, with promising forays…
This paper (cmp-lg/yymmnnn) has been accepted for publication in the student session of EACL-95. It outlines ongoing work using statistical and unsupervised neural network methods for clustering words in untagged corpora. Such approaches…
The pervasive influence of social biases in language data has sparked the need for benchmark datasets that capture and evaluate these biases in Large Language Models (LLMs). Existing efforts predominantly focus on the English language and the…
Many of the kinds of language model used in speech understanding suffer from imperfect modeling of intra-sentential contextual influences. I argue that this problem can be addressed by clustering the sentences in a training corpus…
Historical linguistics aims at inferring the most likely language phylogenetic tree starting from information concerning the evolutionary relatedness of languages. The available information typically consists of lists of homologous (lexical,…
Existing approaches to automatic VerbNet-style verb classification are heavily dependent on feature engineering and therefore limited to languages with mature NLP pipelines. In this work, we propose a novel cross-lingual transfer method for…
In cross-lingual text classification, one seeks to exploit labeled data from one language to train a text classification model that can then be applied to a completely different language. Recent multilingual representation models have made…
Recent advances in pre-trained language modeling have facilitated significant progress across various natural language processing (NLP) tasks. Word masking during model training constitutes a pivotal component of language modeling in…
Building a multilingual Automated Speech Recognition (ASR) system in a linguistically diverse country like India can be a challenging task due to the differences in scripts and the limited availability of speech data. This problem can be…
This paper presents a comparison of classification methods for linguistic typology for the purpose of expanding an extensive, but sparse language resource: the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013). We…
Contextualised word vectors obtained via pre-trained language models encode a variety of knowledge that has already been exploited in applications. Complementary to these language models are probabilistic topic models that learn thematic…
In this paper, we describe a research method that generates Bangla word clusters based on semantic and contextual similarity. Word clustering is important in part-of-speech (POS) tagging, word sense…