Related papers: Subword Pooling Makes a Difference

Impact of Subword Pooling Strategy on Cross-lingual Event Detection

Pre-trained multilingual language models (e.g., mBERT, XLM-RoBERTa) have significantly advanced the state-of-the-art for zero-shot cross-lingual information extraction. These language models ubiquitously rely on word segmentation techniques…

Computation and Language · Computer Science 2023-02-24 Shantanu Agarwal, Steven Fincke, Chris Jenkins, Scott Miller, Elizabeth Boschee

Evaluating Multilingual BERT for Estonian

Recently, large pre-trained language models, such as BERT, have reached state-of-the-art performance in many natural language processing tasks, but for many languages, including Estonian, BERT models are not yet available. However, there…

Computation and Language · Computer Science 2021-01-11 Claudia Kittask, Kirill Milintsevich, Kairit Sirts

Probing Multilingual Language Models for Discourse

Pre-trained multilingual language models have become an important building block in multilingual natural language processing. In the present paper, we investigate a range of such models to find out how well they transfer discourse-level…

Computation and Language · Computer Science 2021-06-10 Murathan Kurfalı, Robert Östling

Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation

Pretrained contextual and non-contextual subword embeddings have become available in over 250 languages, allowing massively multilingual NLP. However, while there is no dearth of pretrained embeddings, the distinct lack of systematic…

Computation and Language · Computer Science 2019-06-05 Benjamin Heinzerling, Michael Strube

Breaking Character: Are Subwords Good Enough for MRLs After All?

Large pretrained language models (PLMs) typically tokenize the input string into contiguous subwords before any pretraining or inference. However, previous studies have claimed that this form of subword tokenization is inadequate for…

Computation and Language · Computer Science 2022-04-12 Omri Keren, Tal Avinari, Reut Tsarfaty, Omer Levy

Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis Perspective

Large Language Models (LLMs) have revolutionized natural language processing (NLP) by delivering state-of-the-art performance across a variety of tasks. Among these, Transformer-based models like BERT and GPT rely on pooling layers to…

Computation and Language · Computer Science 2025-02-04 Jinming Xing, Dongwen Luo, Chang Xue, Ruilin Xing

Are All Languages Created Equal in Multilingual BERT?

Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer…

Computation and Language · Computer Science 2020-10-02 Shijie Wu, Mark Dredze

Evaluating Contextualized Language Models for Hungarian

We present an extended comparison of contextualized language models for Hungarian. We compare huBERT, a Hungarian model against 4 multilingual models including the multilingual BERT model. We evaluate these models through three tasks,…

Computation and Language · Computer Science 2021-02-23 Judit Ács, Dániel Lévai, Dávid Márk Nemeskey, András Kornai

Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages

Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and…

Computation and Language · Computer Science 2023-05-30 Tomasz Limisiewicz, Jiří Balhar, David Mareček

Analyzing Cognitive Plausibility of Subword Tokenization

Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm…

Computation and Language · Computer Science 2023-10-23 Lisa Beinborn, Yuval Pinter

Effects of sub-word segmentation on performance of transformer language models

Language modeling is a fundamental task in natural language processing, which has been thoroughly explored with various architectures and hyperparameters. However, few studies focus on the effect of sub-word segmentation on the performance…

Computation and Language · Computer Science 2023-10-30 Jue Hou, Anisia Katinskaia, Anh-Duc Vu, Roman Yangarber

Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages

One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for…

Computation and Language · Computer Science 2023-04-21 Verena Blaschke, Hinrich Schütze, Barbara Plank

Exploring the Maze of Multilingual Modeling

Multilingual language models have gained significant attention in recent years, enabling the development of applications that meet diverse linguistic contexts. In this paper, we present a comprehensive evaluation of three popular…

Computation and Language · Computer Science 2024-02-14 Sina Bagheri Nezhad, Ameeta Agrawal

Evaluation of Morphological Embeddings for the Russian Language

A number of morphology-based word embedding models were introduced in recent years. However, their evaluation was mostly limited to English, which is known to be a morphologically simple language. In this paper, we explore whether and to…

Computation and Language · Computer Science 2021-03-12 Vitaly Romanov, Albina Khusainova

Exploiting Linguistic Resources for Neural Machine Translation Using Multi-task Learning

Linguistic resources such as part-of-speech (POS) tags have been extensively used in statistical machine translation (SMT) frameworks and have yielded better performances. However, usage of such linguistic annotations in neural machine…

Computation and Language · Computer Science 2017-08-04 Jan Niehues, Eunah Cho

FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding

Large-scale cross-lingual language models (LM), such as mBERT, Unicoder and XLM, have achieved great success in cross-lingual representation learning. However, when applied to zero-shot cross-lingual transfer tasks, most existing methods…

Computation and Language · Computer Science 2020-12-16 Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, Jingjing Liu

To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer

Choosing an appropriate tokenization scheme is often a bottleneck in low-resource cross-lingual transfer. To understand the downstream implications of text representation choices, we perform a comparative analysis on language models having…

Computation and Language · Computer Science 2023-10-13 Md Mushfiqur Rahman, Fardin Ahsan Sakib, Fahim Faisal, Antonios Anastasopoulos

The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer

Large multilingual language models such as mBERT or XLM-R enable zero-shot cross-lingual transfer in various IR and NLP tasks. Cao et al. (2020) proposed a data- and compute-efficient method for cross-lingual adjustment of mBERT that uses a…

Computation and Language · Computer Science 2023-11-01 Pavel Efimov, Leonid Boytsov, Elena Arslanova, Pavel Braslavski

Morphological evaluation of subwords vocabulary used by BETO language model

Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always…

Computation and Language · Computer Science 2024-10-04 Óscar García-Sierra, Ana Fernández-Pampillón Cesteros, Miguel Ortega-Martín

Xu at SemEval-2022 Task 4: Pre-BERT Neural Network Methods vs Post-BERT RoBERTa Approach for Patronizing and Condescending Language Detection

This paper describes my participation in the SemEval-2022 Task 4: Patronizing and Condescending Language Detection. I participate in both subtasks: Patronizing and Condescending Language (PCL) Identification and Patronizing and…

Computation and Language · Computer Science 2022-11-15 Jinghua Xu