Related papers: A Parallel Evaluation Data Set of Software Documen…

Sinhala-English Parallel Word Dictionary Dataset

Parallel datasets are vital for performing and evaluating any kind of multilingual task. However, in the cases where one of the considered language pairs is a low-resource language, the existing top-down parallel data such as corpora are…

Computation and Language · Computer Science 2023-09-26 Kasun Wickramasinghe, Nisansa de Silva

A High-Quality Multilingual Dataset for Structured Documentation Translation

This paper presents a high-quality multilingual dataset for the documentation domain to advance research on localization of structured text. Unlike widely-used datasets for translation of plain text, we collect XML-structured parallel text…

Computation and Language · Computer Science 2020-06-25 Kazuma Hashimoto, Raffaella Buschiazzo, James Bradbury, Teresa Marshall, Richard Socher, Caiming Xiong

Phrase Pair Mappings for Hindi-English Statistical Machine Translation

In this paper, we present our work on the creation of lexical resources for Machine Translation between English and Hindi. We describe the development of phrase pair mappings for our experiments and the comparative performance…

Computation and Language · Computer Science 2017-11-13 Sreelekha S, Pushpak Bhattacharyya

A Comparison of Approaches to Document-level Machine Translation

Document-level machine translation conditions on surrounding sentences to produce coherent translations. There has been much recent work in this area with the introduction of custom model architectures and decoding algorithms. This paper…

Computation and Language · Computer Science 2021-01-28 Zhiyi Ma, Sergey Edunov, Michael Auli

Does Summary Evaluation Survive Translation to Other Languages?

The creation of a quality summarization dataset is an expensive, time-consuming effort, requiring the production and evaluation of summaries by both trained humans and machines. If such effort is made in one language, it would be beneficial…

Computation and Language · Computer Science 2021-12-09 Spencer Braun, Oleg Vasilyev, Neslihan Iskender, John Bohannon

scb-mt-en-th-2020: A Large English-Thai Parallel Corpus

The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news,…

Computation and Language · Computer Science 2021-08-10 Lalita Lowphansirikul, Charin Polpanumas, Attapol T. Rutherford, Sarana Nutanong

Document-aligned Japanese-English Conversation Parallel Corpus

Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resourced languages, but not document-level (DL) MT, which is difficult to 1) train with a small amount of DL data; and 2) evaluate, as the main…

Computation and Language · Computer Science 2020-12-14 Matīss Rikters, Ryokan Ri, Tong Li, Toshiaki Nakazawa

Evaluating Machine Translation Models for English-Hindi Language Pairs: A Comparative Analysis

Machine translation has become a critical tool in bridging linguistic gaps, especially between languages as diverse as English and Hindi. This paper comprehensively evaluates various machine translation models for translating between…

Computation and Language · Computer Science 2025-05-27 Ahan Prasannakumar Shetty

Improving Indonesian Text Classification Using Multilingual Language Model

Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown their ability to create multilingual representations effectively. This paper…

Computation and Language · Computer Science 2020-09-15 Ilham Firdausi Putra, Ayu Purwarianti

Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation

Data curation is a critical yet under-researched step in the machine translation training paradigm. To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree,…

Computation and Language · Computer Science 2026-03-12 Saumitra Yadav, Manish Shrivastava

MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages

Most legal text in the Indian judiciary is written in complex English due to historical reasons. However, only a small fraction of the Indian population is comfortable in reading English. Hence legal text needs to be made available in…

Computation and Language · Computer Science 2024-11-08 Sayan Mahapatra, Debtanu Datta, Shubham Soni, Adrijit Goswami, Saptarshi Ghosh

Exploiting Parallel Corpora to Improve Multilingual Embedding based Document and Sentence Alignment

Multilingual sentence representations pose a great advantage for low-resource languages that do not have enough data to build monolingual models on their own. These multilingual sentence representations have been separately exploited by few…

Computation and Language · Computer Science 2021-06-15 Dilan Sachintha, Lakmali Piyarathna, Charith Rajitha, Surangika Ranathunga

Document-Level Language Models for Machine Translation

Despite the known limitations, most machine translation systems today still operate on the sentence level. One reason for this is that most parallel training data is only sentence-level aligned, without document-level meta information…

Computation and Language · Computer Science 2023-10-20 Frithjof Petrick, Christian Herold, Pavel Petrushkov, Shahram Khadivi, Hermann Ney

The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English

For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to…

Computation and Language · Computer Science 2019-09-17 Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, Marc'Aurelio Ranzato

Learning Semantic Correspondences in Technical Documentation

We consider the problem of translating high-level textual descriptions to formal representations in technical documentation as part of an effort to model the meaning of such documentation. We focus specifically on the problem of learning…

Computation and Language · Computer Science 2017-09-18 Kyle Richardson, Jonas Kuhn

Enabling Medical Translation for Low-Resource Languages

We present research towards bridging the language gap between migrant workers in Qatar and medical staff. In particular, we present the first steps towards the development of a real-world Hindi-English machine translation system for…

Computation and Language · Computer Science 2016-10-11 Ahmad Musleh, Nadir Durrani, Irina Temnikova, Preslav Nakov, Stephan Vogel, Osama Alsaad

Using Document Similarity Methods to create Parallel Datasets for Code Translation

Translating source code from one programming language to another is a critical, time-consuming task in modernizing legacy applications and codebases. Recent work in this space has drawn inspiration from the software naturalness hypothesis…

Computation and Language · Computer Science 2021-10-12 Mayank Agarwal, Kartik Talamadupula, Fernando Martinez, Stephanie Houde, Michael Muller, John Richards, Steven I Ross, Justin D. Weisz

Escaping the sentence-level paradigm in machine translation

It is well-known that document context is vital for resolving a range of translation ambiguities, and in fact the document setting is the most natural setting for nearly all translation. It is therefore unfortunate that machine translation…

Computation and Language · Computer Science 2024-05-17 Matt Post, Marcin Junczys-Dowmunt

Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review

Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel…

Computation and Language · Computer Science 2025-04-23 Rahul Raja, Arpita Vats

Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus

Several recent papers claim human parity at sentence-level Machine Translation (MT), especially in high-resource languages. Thus, in response, the MT community has, in part, shifted its focus to document-level translation. Translating…

Computation and Language · Computer Science 2023-05-19 Yuchen Eleanor Jiang, Tianyu Liu, Shuming Ma, Dongdong Zhang, Mrinmaya Sachan, Ryan Cotterell