Related papers: Unsupervised Parallel Corpus Mining on Web Data

Parallel Corpus Filtering via Pre-trained Language Models

Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained, but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to…

Computation and Language · Computer Science 2020-05-14 Boliang Zhang, Ajay Nagesh, Kevin Knight

Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model…

Computation and Language · Computer Science 2020-10-16 Phillip Keung, Julian Salazar, Yichao Lu, Noah A. Smith

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on…

Computation and Language · Computer Science 2021-05-24 Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar

Unsupervised Neural Machine Translation

In spite of the recent success of neural machine translation (NMT) in standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue…

Computation and Language · Computer Science 2018-02-27 Mikel Artetxe, Gorka Labaka, Eneko Agirre, Kyunghyun Cho

Machine Translation Model based on Non-parallel Corpus and Semi-supervised Transductive Learning

Although the parallel corpus has an irreplaceable role in machine translation, its scale and coverage still fall short of actual needs. Non-parallel corpus resources on the web have an inestimable potential value in machine translation and…

Computation and Language · Computer Science 2014-05-23 Lijiang Chen

Multi-domain machine translation enhancements by parallel data extraction from comparable corpora

Parallel texts are a relatively rare language resource; however, they constitute very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from…

Computation and Language · Computer Science 2016-03-23 Krzysztof Wołk, Emilia Rejmund, Krzysztof Marasek

Phrase-Based & Neural Unsupervised Machine Translation

Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of…

Computation and Language · Computer Science 2018-08-15 Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, Marc'Aurelio Ranzato

Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

Lecture transcript translation helps learners understand online courses; however, building a high-quality lecture machine translation system is hindered by the lack of publicly available parallel corpora. To address this, we examine a framework for parallel…

Computation and Language · Computer Science 2023-11-08 Haiyue Song, Raj Dabre, Chenhui Chu, Atsushi Fujita, Sadao Kurohashi

A Multilingual View of Unsupervised Machine Translation

We present a probabilistic framework for multilingual neural machine translation that encompasses supervised and unsupervised setups, focusing on unsupervised translation. In addition to studying the vanilla case where there is only…

Computation and Language · Computer Science 2020-10-20 Xavier Garcia, Pierre Foret, Thibault Sellam, Ankur P. Parikh

Harvesting comparable corpora and mining them for equivalent bilingual sentences using statistical classification and analogy-based heuristics

Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from…

Computation and Language · Computer Science 2015-11-20 Krzysztof Wołk, Emilia Rejmund, Krzysztof Marasek

Unsupervised Machine Translation Using Monolingual Corpora Only

Machine translation has recently achieved impressive performance thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource…

Computation and Language · Computer Science 2018-04-16 Guillaume Lample, Alexis Conneau, Ludovic Denoyer, Marc'Aurelio Ranzato

Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text

Parallel corpora are a valuable resource for machine translation, but at present their availability and utility are limited by genre- and domain-specificity, licensing restrictions, and the basic difficulty of locating parallel texts in all…

cmp-lg · Computer Science 2007-05-23 Philip Resnik

USCORE: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation

The vast majority of evaluation metrics for machine translation are supervised, i.e., (i) are trained on human scores, (ii) assume the existence of reference translations, or (iii) leverage parallel data. This hinders their applicability to…

Computation and Language · Computer Science 2024-03-05 Jonas Belouadi, Steffen Eger

An Effective Approach to Unsupervised Machine Translation

While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual…

Computation and Language · Computer Science 2021-12-28 Mikel Artetxe, Gorka Labaka, Eneko Agirre

Original or Translated? On the Use of Parallel Data for Translation Quality Estimation

Machine Translation Quality Estimation (QE) is the task of evaluating translation output in the absence of human-written references. Due to the scarcity of human-labeled QE data, previous works attempted to utilize the abundant unlabeled…

Computation and Language · Computer Science 2022-12-21 Baopu Qiu, Liang Ding, Di Wu, Lin Shang, Yibing Zhan, Dacheng Tao

Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from…

Computation and Language · Computer Science 2015-09-30 Krzysztof Wołk, Krzysztof Marasek

Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs

Large Language Models (LLMs) have demonstrated impressive performance on a wide range of natural language processing (NLP) tasks, primarily through in-context learning (ICL). In ICL, the LLM is provided with examples that represent a given…

Computation and Language · Computer Science 2025-02-19 Abdellah El Mekki, Muhammad Abdul-Mageed

Boosting Unsupervised Machine Translation with Pseudo-Parallel Data

Even with the latest developments in deep learning and large-scale language modeling, the task of machine translation (MT) of low-resource languages remains a challenge. Neural MT systems can be trained in an unsupervised way without any…

Computation and Language · Computer Science 2023-10-24 Ivana Kvapilíková, Ondřej Bojar

Filtering and Mining Parallel Data in a Joint Multilingual Space

We learn a joint multilingual sentence embedding and use the distance between sentences in different languages to filter noisy parallel data and to mine for parallel data in large news collections. We are able to improve a competitive…

Computation and Language · Computer Science 2018-05-28 Holger Schwenk

Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents

The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality…

Computation and Language · Computer Science 2015-12-08 Krzysztof Wołk, Krzysztof Marasek