Related papers: MULTEXT-East

200 papers

In this work, we introduce the construction of a machine translation (MT) assisted and human-in-the-loop multilingual parallel corpus with annotations of multi-word expressions (MWEs), named AlphaMWE. The MWEs include verbal MWEs (vMWEs)…

Computation and Language · Computer Science 2025-12-23 Lifeng Han , Najet Hadj Mohamed , Malak Rassem , Gareth Jones , Alan Smeaton , Goran Nenadic

Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the…

Computation and Language · Computer Science 2025-09-23 Wenhao Zhuang , Yuan Sun

We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the…

We present the Eyetracked Multi-Modal Translation (EMMT) corpus, a dataset containing monocular eye movement recordings, audio and 4-electrode electroencephalogram (EEG) data of 43 participants. The objective was to collect cognitive…

Computation and Language · Computer Science 2022-04-07 Sunit Bhattacharya , Věra Kloudová , Vilém Zouhar , Ondřej Bojar

This paper presents a corpus manually annotated with named entities for six Slavic languages - Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017-2023 as a…

Computation and Language · Computer Science 2024-04-09 Jakub Piskorski , Michał Marcińczuk , Roman Yangarber

In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset - a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic…

Machine Learning · Computer Science 2020-10-27 Jason Armitage , Endri Kacupaj , Golsa Tahmasebzadeh , Swati , Maria Maleshkova , Ralph Ewerth , Jens Lehmann

In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different…

Computation and Language · Computer Science 2022-06-01 Matej Ulčar , Kristiina Vaik , Jessica Lindström , Milda Dailidėnaitė , Marko Robnik-Šikonja

As Large Language Models (LLMs) become increasingly prevalent in text simplification, systematically evaluating their outputs across diverse prompting strategies and architectures remains a critical methodological challenge in both NLP…

Computation and Language · Computer Science 2026-04-13 Rares-Alexandru Roscan , Gabriel Petre , Adrian-Marius Dumitran , Angela-Liliana Dumitran

We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC…

Computation and Language · Computer Science 2021-09-08 Ilias Chalkidis , Manos Fergadiotis , Ion Androutsopoulos

In this article, we have introduced the first parallel corpus of Persian with more than 10 other European languages. This article describes primary steps toward preparing a Basic Language Resources Kit (BLARK) for Persian. Up to now, we…

Computation and Language · Computer Science 2014-04-18 Behrang Qasemizadeh , Saeed Rahimi , Behrooz Mahmoodi Bakhtiari

The availability of parallel texts is crucial to the performance of machine translation models. However, most of the world's languages face the predominant challenge of data scarcity. In this paper, we propose strategies to synthesize…

Computation and Language · Computer Science 2024-02-06 Md Mahfuz Ibn Alam , Sina Ahmadi , Antonios Anastasopoulos

This paper presents the process of building a neural machine translation system with support for English, Romanian, and Aromanian - an endangered Eastern Romance language. The primary contribution of this research is twofold: (1) the…

Computation and Language · Computer Science 2025-01-08 Alexandru-Iulius Jerpelea , Alina Rădoi , Sergiu Nisioi

Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, there are few…

Computation and Language · Computer Science 2024-11-14 Michael Ginn , Lindia Tjuatja , Taiqi He , Enora Rice , Graham Neubig , Alexis Palmer , Lori Levin

Recently, reading comprehension models achieved near-human performance on large-scale datasets such as SQuAD, CoQA, MS MARCO, RACE, etc. This is largely due to the release of pre-trained contextualized representations such as BERT and ELMo,…

Computation and Language · Computer Science 2019-09-09 Momchil Hardalov , Ivan Koychev , Preslav Nakov

We present the Multilingual Entity Linking of Occupations (MELO) Benchmark, a new collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations multilingual taxonomy. MELO was built using…

Computation and Language · Computer Science 2024-10-14 Federico Retyk , Luis Gasco , Casimiro Pio Carrino , Daniel Deniz , Rabih Zbib

In this paper, we introduce OmniGEC, a collection of multilingual silver-standard datasets for the task of Grammatical Error Correction (GEC), covering eleven languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian,…

Computation and Language · Computer Science 2025-09-19 Roman Kovalchuk , Mariana Romanyshyn , Petro Ivaniuk

A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias…

Computation and Language · Computer Science 2024-11-12 Tomasz Limisiewicz , Terra Blevins , Hila Gonen , Orevaoghene Ahia , Luke Zettlemoyer

Recently, large pre-trained language models, such as BERT, have reached state-of-the-art performance in many natural language processing tasks, but for many languages, including Estonian, BERT models are not yet available. However, there…

Computation and Language · Computer Science 2021-01-11 Claudia Kittask , Kirill Milintsevich , Kairit Sirts

This article presents two corpora of English and Czech texts generated with large language models (LLMs). The motivation is to create a resource for comparing human-written texts with LLM-generated text linguistically. Emphasis was placed…

Computation and Language · Computer Science 2025-11-11 Jiří Milička , Anna Marklová , Václav Cvrček

The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus…

Computation and Language · Computer Science 2026-02-16 Zachary Hopton , Jannis Vamvas , Andrin Büchler , Anna Rutkiewicz , Rico Cathomas , Rico Sennrich