Related papers: data2lang2vec: Data Driven Typological Features Co…

Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

Linguistic typology aims to capture structural and semantic variation across the world's languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that…

Computation and Language · Computer Science 2020-10-28 Edoardo Maria Ponti, Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, Anna Korhonen

Untangling the Influence of Typology, Data and Model Architecture on Ranking Transfer Languages for Cross-Lingual POS Tagging

Cross-lingual transfer learning is an invaluable tool for overcoming data scarcity, yet selecting a suitable transfer language remains a challenge. The precise roles of linguistic typology, training data, and model architecture in transfer…

Computation and Language · Computer Science 2025-03-27 Enora Rice, Ali Marashian, Hannah Haynie, Katharina von der Wense, Alexis Palmer

Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures

The use of linguistic typological resources in natural language processing has been steadily gaining more popularity. It has been observed that the use of typological information, often combined with distributed language representations,…

Computation and Language · Computer Science 2020-05-06 Alexander Gutkin, Tatiana Merkulova, Martin Jansche

Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models

Despite major advances in multilingual modeling, large quality disparities persist across languages. Besides the obvious impact of uneven training resources, typological properties have also been proposed to determine the intrinsic…

Computation and Language · Computer Science 2026-02-04 Vitalii Hirak, Jaap Jumelet, Arianna Bisazza

Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction

Syntactic annotation of corpora in the form of part-of-speech (POS) tags is a key requirement for both linguistic research and subsequent automated natural language processing (NLP) tasks. This problem is commonly tackled using machine…

Computation and Language · Computer Science 2024-10-30 Stefan Heid, Marcel Wever, Eyke Hüllermeier

Learning Language Representations for Typology Prediction

One central mystery of neural NLP is what neural models "know" about their subject matter. When a neural machine translation system learns to translate from one language to another, does it learn the syntax or semantics of the languages?…

Computation and Language · Computer Science 2017-08-01 Chaitanya Malaviya, Graham Neubig, Patrick Littell

Multilingual Gradient Word-Order Typology from Universal Dependencies

While information from the field of linguistic typology has the potential to improve performance on NLP tasks, reliable typological data is a prerequisite. Existing typological databases, including WALS and Grambank, suffer from…

Computation and Language · Computer Science 2024-02-05 Emi Baylor, Esther Ploeger, Johannes Bjerva

Machine Learning Approaches for Amharic Parts-of-speech Tagging

Part-of-speech (POS) tagging is considered as one of the basic but necessary tools which are required for many Natural Language Processing (NLP) applications such as word sense disambiguation, information retrieval, information processing,…

Computation and Language · Computer Science 2020-01-13 Ibrahim Gashaw, H L. Shashirekha

Exploiting Linguistic Resources for Neural Machine Translation Using Multi-task Learning

Linguistic resources such as part-of-speech (POS) tags have been extensively used in statistical machine translation (SMT) frameworks and have yielded better performances. However, usage of such linguistic annotations in neural machine…

Computation and Language · Computer Science 2017-08-04 Jan Niehues, Eunah Cho

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised…

Machine Learning · Computer Science 2022-10-27 Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets

Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP. Linguistic diversity of these data sets is typically measured as the number of languages or language families included in the…

Computation and Language · Computer Science 2024-04-17 Tanja Samardzic, Ximena Gutierrez, Christian Bentz, Steven Moran, Olga Pelloni

Neural Factor Graph Models for Cross-lingual Morphological Tagging

Morphological analysis involves predicting the syntactic traits of a word (e.g. {POS: Noun, Case: Acc, Gender: Fem}). Previous work in morphological tagging improves performance for low-resource languages (LRLs) through cross-lingual…

Computation and Language · Computer Science 2018-07-12 Chaitanya Malaviya, Matthew R. Gormley, Graham Neubig

External Lexical Information for Multilingual Part-of-Speech Tagging

Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performances of four systems on datasets covering 16 languages, two of…

Computation and Language · Computer Science 2016-08-10 Benoît Sagot

A syntax-based part-of-speech analyser

There are two main methodologies for constructing the knowledge base of a natural language analyser: the linguistic and the data-driven. Recent state-of-the-art part-of-speech taggers are based on the data-driven approach. Because of the…

cmp-lg · Computer Science 2016-08-31 Atro Voutilainen

An Experimental Investigation of Part-Of-Speech Taggers for Vietnamese

Part-of-speech (POS) tagging plays an important role in Natural Language Processing (NLP). Its applications can be found in many NLP tasks such as named entity recognition, syntactic parsing, dependency parsing and text chunking. In the…

Computation and Language · Computer Science 2022-06-15 Tuan-Phong Nguyen, Quoc-Tuan Truong, Xuan-Nam Nguyen, Anh-Cuong Le

Fine-Grained Prediction of Syntactic Typology: Discovering Latent Structure with Supervised Learning

We show how to predict the basic word-order facts of a novel language given only a corpus of part-of-speech (POS) sequences. We predict how often direct objects follow their verbs, how often adjectives follow their nouns, and in general the…

Computation and Language · Computer Science 2017-10-12 Dingquan Wang, Jason Eisner

Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation

The performance of multilingual pretrained models is highly dependent on the availability of monolingual or parallel text present in a target language. Thus, the majority of the world's languages cannot benefit from recent progress in NLP…

Computation and Language · Computer Science 2022-04-07 Xinyi Wang, Sebastian Ruder, Graham Neubig

A Review on Part-of-Speech Technologies

Developing an automatic part-of-speech (POS) tagging for any new language is considered a necessary step for further computational linguistics methodology beyond tagging, like chunking and parsing, to be fully applied to the language. Many…

Computation and Language · Computer Science 2021-10-12 Onyenwe Ikechukwu, Onyedikachukwu Ikechukwu-Onyenwe, Onyedinma Ebele

A Reproducibility Study on Quantifying Language Similarity: The Impact of Missing Values in the URIEL Knowledge Base

In the pursuit of supporting more languages around the world, tools that characterize properties of languages play a key role in expanding the existing multilingual NLP research. In this study, we focus on a widely used typological…

Computation and Language · Computer Science 2024-05-21 Hasti Toossi, Guo Qing Huai, Jinyu Liu, Eric Khiu, A. Seza Doğruöz, En-Shiun Annie Lee

Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data

Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings,…

Computation and Language · Computer Science 2025-08-22 Annika Tjuka, Robert Forkel, Christoph Rzymski, Johann-Mattis List