Related papers: Subword Pooling Makes a Difference

200 papers

Pre-trained multilingual language models (e.g., mBERT, XLM-RoBERTa) have significantly advanced the state-of-the-art for zero-shot cross-lingual information extraction. These language models ubiquitously rely on word segmentation techniques…

Computation and Language · Computer Science 2023-02-24 Shantanu Agarwal, Steven Fincke, Chris Jenkins, Scott Miller, Elizabeth Boschee

Recently, large pre-trained language models, such as BERT, have reached state-of-the-art performance in many natural language processing tasks, but for many languages, including Estonian, BERT models are not yet available. However, there…

Computation and Language · Computer Science 2021-01-11 Claudia Kittask, Kirill Milintsevich, Kairit Sirts

Pre-trained multilingual language models have become an important building block in multilingual natural language processing. In the present paper, we investigate a range of such models to find out how well they transfer discourse-level…

Computation and Language · Computer Science 2021-06-10 Murathan Kurfalı, Robert Östling

Pretrained contextual and non-contextual subword embeddings have become available in over 250 languages, allowing massively multilingual NLP. However, while there is no dearth of pretrained embeddings, the distinct lack of systematic…

Computation and Language · Computer Science 2019-06-05 Benjamin Heinzerling, Michael Strube

Large pretrained language models (PLMs) typically tokenize the input string into contiguous subwords before any pretraining or inference. However, previous studies have claimed that this form of subword tokenization is inadequate for…

Computation and Language · Computer Science 2022-04-12 Omri Keren, Tal Avinari, Reut Tsarfaty, Omer Levy

Large Language Models (LLMs) have revolutionized natural language processing (NLP) by delivering state-of-the-art performance across a variety of tasks. Among these, Transformer-based models like BERT and GPT rely on pooling layers to…

Computation and Language · Computer Science 2025-02-04 Jinming Xing, Dongwen Luo, Chang Xue, Ruilin Xing

Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer…

Computation and Language · Computer Science 2020-10-02 Shijie Wu, Mark Dredze

We present an extended comparison of contextualized language models for Hungarian. We compare huBERT, a Hungarian model against 4 multilingual models including the multilingual BERT model. We evaluate these models through three tasks,…

Computation and Language · Computer Science 2021-02-23 Judit Ács, Dániel Lévai, Dávid Márk Nemeskey, András Kornai

Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and…

Computation and Language · Computer Science 2023-05-30 Tomasz Limisiewicz, Jiří Balhar, David Mareček

Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm…

Computation and Language · Computer Science 2023-10-23 Lisa Beinborn, Yuval Pinter

Language modeling is a fundamental task in natural language processing, which has been thoroughly explored with various architectures and hyperparameters. However, few studies focus on the effect of sub-word segmentation on the performance…

Computation and Language · Computer Science 2023-10-30 Jue Hou, Anisia Katinskaia, Anh-Duc Vu, Roman Yangarber

One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for…

Computation and Language · Computer Science 2023-04-21 Verena Blaschke, Hinrich Schütze, Barbara Plank

Multilingual language models have gained significant attention in recent years, enabling the development of applications that meet diverse linguistic contexts. In this paper, we present a comprehensive evaluation of three popular…

Computation and Language · Computer Science 2024-02-14 Sina Bagheri Nezhad, Ameeta Agrawal

A number of morphology-based word embedding models were introduced in recent years. However, their evaluation was mostly limited to English, which is known to be a morphologically simple language. In this paper, we explore whether and to…

Computation and Language · Computer Science 2021-03-12 Vitaly Romanov, Albina Khusainova

Linguistic resources such as part-of-speech (POS) tags have been extensively used in statistical machine translation (SMT) frameworks and have yielded better performances. However, usage of such linguistic annotations in neural machine…

Computation and Language · Computer Science 2017-08-04 Jan Niehues, Eunah Cho

Large-scale cross-lingual language models (LM), such as mBERT, Unicoder and XLM, have achieved great success in cross-lingual representation learning. However, when applied to zero-shot cross-lingual transfer tasks, most existing methods…

Computation and Language · Computer Science 2020-12-16 Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, Jingjing Liu

Choosing an appropriate tokenization scheme is often a bottleneck in low-resource cross-lingual transfer. To understand the downstream implications of text representation choices, we perform a comparative analysis on language models having…

Computation and Language · Computer Science 2023-10-13 Md Mushfiqur Rahman, Fardin Ahsan Sakib, Fahim Faisal, Antonios Anastasopoulos

Large multilingual language models such as mBERT or XLM-R enable zero-shot cross-lingual transfer in various IR and NLP tasks. Cao et al. (2020) proposed a data- and compute-efficient method for cross-lingual adjustment of mBERT that uses a…

Computation and Language · Computer Science 2023-11-01 Pavel Efimov, Leonid Boytsov, Elena Arslanova, Pavel Braslavski

Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always…

Computation and Language · Computer Science 2024-10-04 Óscar García-Sierra, Ana Fernández-Pampillón Cesteros, Miguel Ortega-Martín

This paper describes my participation in the SemEval-2022 Task 4: Patronizing and Condescending Language Detection. I participate in both subtasks: Patronizing and Condescending Language (PCL) Identification and Patronizing and…

Computation and Language · Computer Science 2022-11-15 Jinghua Xu