Related papers: MULTEXT-East

Towards a resource for multilingual lexicons: an MT assisted and human-in-the-loop multilingual parallel corpus with multi-word expression annotation

In this work, we introduce the construction of a machine translation (MT) assisted and human-in-the-loop multilingual parallel corpus with annotations of multi-word expressions (MWEs), named AlphaMWE. The MWEs include verbal MWEs (vMWEs)…

Computation and Language · Computer Science 2025-12-23 Lifeng Han, Najet Hadj Mohamed, Malak Rassem, Gareth Jones, Alan Smeaton, Goran Nenadic

CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages

Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the…

Computation and Language · Computer Science 2025-09-23 Wenhao Zhuang, Yuan Sun

The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the…

Computation and Language · Computer Science 2024-10-03 Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa

EMMT: A simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios

We present the Eyetracked Multi-Modal Translation (EMMT) corpus, a dataset containing monocular eye movement recordings, audio and 4-electrode electroencephalogram (EEG) data of 43 participants. The objective was to collect cognitive…

Computation and Language · Computer Science 2022-04-07 Sunit Bhattacharya, Věra Kloudová, Vilém Zouhar, Ondřej Bojar

Cross-lingual Named Entity Corpus for Slavic Languages

This paper presents a corpus manually annotated with named entities for six Slavic languages - Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017-2023 as a…

Computation and Language · Computer Science 2024-04-09 Jakub Piskorski, Michał Marcińczuk, Roman Yangarber

MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic…

Machine Learning · Computer Science 2020-10-27 Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh, Swati, Maria Maleshkova, Ralph Ewerth, Jens Lehmann

Multilingual Culture-Independent Word Analogy Datasets

In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different…

Computation and Language · Computer Science 2022-06-01 Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, Marko Robnik-Šikonja

MuTSE: A Human-in-the-Loop Multi-use Text Simplification Evaluator

As Large Language Models (LLMs) become increasingly prevalent in text simplification, systematically evaluating their outputs across diverse prompting strategies and architectures remains a critical methodological challenge in both NLP…

Computation and Language · Computer Science 2026-04-13 Rares-Alexandru Roscan, Gabriel Petre, Adrian-Marius Dumitran, Angela-Liliana Dumitran

MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC…

Computation and Language · Computer Science 2021-09-08 Ilias Chalkidis, Manos Fergadiotis, Ion Androutsopoulos

The First Parallel Multilingual Corpus of Persian: Toward a Persian BLARK

In this article, we have introduced the first parallel corpus of Persian with more than 10 other European languages. This article describes primary steps toward preparing a Basic Language Resources Kit (BLARK) for Persian. Up to now, we…

Computation and Language · Computer Science 2014-04-18 Behrang Qasemizadeh, Saeed Rahimi, Behrooz Mahmoodi Bakhtiari

A Morphologically-Aware Dictionary-based Data Augmentation Technique for Machine Translation of Under-Represented Languages

The availability of parallel texts is crucial to the performance of machine translation models. However, most of the world's languages face the predominant challenge of data scarcity. In this paper, we propose strategies to synthesize…

Computation and Language · Computer Science 2024-02-06 Md Mahfuz Ibn Alam, Sina Ahmadi, Antonios Anastasopoulos

Dialectal and Low-Resource Machine Translation for Aromanian

This paper presents the process of building a neural machine translation system with support for English, Romanian, and Aromanian - an endangered Eastern Romance language. The primary contribution of this research is twofold: (1) the…

Computation and Language · Computer Science 2025-01-08 Alexandru-Iulius Jerpelea, Alina Rădoi, Sergiu Nisioi

GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text

Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, there are few…

Computation and Language · Computer Science 2024-11-14 Michael Ginn, Lindia Tjuatja, Taiqi He, Enora Rice, Graham Neubig, Alexis Palmer, Lori Levin

Beyond English-Only Reading Comprehension: Experiments in Zero-Shot Multilingual Transfer for Bulgarian

Recently, reading comprehension models achieved near-human performance on large-scale datasets such as SQuAD, CoQA, MS MARCO, RACE, etc. This is largely due to the release of pre-trained contextualized representations such as BERT and ELMo,…

Computation and Language · Computer Science 2019-09-09 Momchil Hardalov, Ivan Koychev, Preslav Nakov

MELO: An Evaluation Benchmark for Multilingual Entity Linking of Occupations

We present the Multilingual Entity Linking of Occupations (MELO) Benchmark, a new collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations multilingual taxonomy. MELO was built using…

Computation and Language · Computer Science 2024-10-14 Federico Retyk, Luis Gasco, Casimiro Pio Carrino, Daniel Deniz, Rabih Zbib

Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction

In this paper, we introduce OmniGEC, a collection of multilingual silver-standard datasets for the task of Grammatical Error Correction (GEC), covering eleven languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian,…

Computation and Language · Computer Science 2025-09-19 Roman Kovalchuk, Mariana Romanyshyn, Petro Ivaniuk

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias…

Computation and Language · Computer Science 2024-11-12 Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer

Evaluating Multilingual BERT for Estonian

Recently, large pre-trained language models, such as BERT, have reached state-of-the-art performance in many natural language processing tasks, but for many languages, including Estonian, BERT models are not yet available. However, there…

Computation and Language · Computer Science 2021-01-11 Claudia Kittask, Kirill Milintsevich, Kairit Sirts

AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts

This article presents two corpora of English and Czech texts generated with large language models (LLMs). The motivation is to create a resource for comparing human-written texts with LLM-generated text linguistically. Emphasis was placed…

Computation and Language · Computer Science 2025-11-11 Jiří Milička, Anna Marklová, Václav Cvrček

The Mediomatix Corpus: Parallel Data for Romansh Language Varieties via Comparable Schoolbooks

The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus…

Computation and Language · Computer Science 2026-02-16 Zachary Hopton, Jannis Vamvas, Andrin Büchler, Anna Rutkiewicz, Rico Cathomas, Rico Sennrich