Related papers: Unsupervised Parallel Corpus Mining on Web Data

200 papers

Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained, but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to…

Computation and Language · Computer Science 2020-05-14 Boliang Zhang , Ajay Nagesh , Kevin Knight

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model…

Computation and Language · Computer Science 2020-10-16 Phillip Keung , Julian Salazar , Yichao Lu , Noah A. Smith

Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on…

Computation and Language · Computer Science 2021-05-24 Ivana Kvapilíková , Mikel Artetxe , Gorka Labaka , Eneko Agirre , Ondřej Bojar

In spite of the recent success of neural machine translation (NMT) in standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue…

Computation and Language · Computer Science 2018-02-27 Mikel Artetxe , Gorka Labaka , Eneko Agirre , Kyunghyun Cho

Although the parallel corpus has an irreplaceable role in machine translation, its scale and coverage is still beyond the actual needs. Non-parallel corpus resources on the web have an inestimable potential value in machine translation and…

Computation and Language · Computer Science 2014-05-23 Lijiang Chen

Parallel texts are a relatively rare language resource, however, they constitute a very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from…

Computation and Language · Computer Science 2016-03-23 Krzysztof Wołk , Emilia Rejmund , Krzysztof Marasek

Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of…

Computation and Language · Computer Science 2018-08-15 Guillaume Lample , Myle Ott , Alexis Conneau , Ludovic Denoyer , Marc'Aurelio Ranzato

Lecture transcript translation helps learners understand online courses, however, building a high-quality lecture machine translation system lacks publicly available parallel corpora. To address this, we examine a framework for parallel…

Computation and Language · Computer Science 2023-11-08 Haiyue Song , Raj Dabre , Chenhui Chu , Atsushi Fujita , Sadao Kurohashi

We present a probabilistic framework for multilingual neural machine translation that encompasses supervised and unsupervised setups, focusing on unsupervised translation. In addition to studying the vanilla case where there is only…

Computation and Language · Computer Science 2020-10-20 Xavier Garcia , Pierre Foret , Thibault Sellam , Ankur P. Parikh

Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from…

Computation and Language · Computer Science 2015-11-20 Krzysztof Wołk , Emilia Rejmund , Krzysztof Marasek

Machine translation has recently achieved impressive performance thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource…

Computation and Language · Computer Science 2018-04-16 Guillaume Lample , Alexis Conneau , Ludovic Denoyer , Marc'Aurelio Ranzato

Parallel corpora are a valuable resource for machine translation, but at present their availability and utility is limited by genre- and domain-specificity, licensing restrictions, and the basic difficulty of locating parallel texts in all…

cmp-lg · Computer Science 2007-05-23 Philip Resnik

The vast majority of evaluation metrics for machine translation are supervised, i.e., (i) are trained on human scores, (ii) assume the existence of reference translations, or (iii) leverage parallel data. This hinders their applicability to…

Computation and Language · Computer Science 2024-03-05 Jonas Belouadi , Steffen Eger

While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual…

Computation and Language · Computer Science 2021-12-28 Mikel Artetxe , Gorka Labaka , Eneko Agirre

Machine Translation Quality Estimation (QE) is the task of evaluating translation output in the absence of human-written references. Due to the scarcity of human-labeled QE data, previous works attempted to utilize the abundant unlabeled…

Computation and Language · Computer Science 2022-12-21 Baopu Qiu , Liang Ding , Di Wu , Lin Shang , Yibing Zhan , Dacheng Tao

Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from…

Computation and Language · Computer Science 2015-09-30 Krzysztof Wołk , Krzysztof Marasek

Large Language Models (LLMs) have demonstrated impressive performance on a wide range of natural language processing (NLP) tasks, primarily through in-context learning (ICL). In ICL, the LLM is provided with examples that represent a given…

Computation and Language · Computer Science 2025-02-19 Abdellah El Mekki , Muhammad Abdul-Mageed

Even with the latest developments in deep learning and large-scale language modeling, the task of machine translation (MT) of low-resource languages remains a challenge. Neural MT systems can be trained in an unsupervised way without any…

Computation and Language · Computer Science 2023-10-24 Ivana Kvapilíková , Ondřej Bojar

We learn a joint multilingual sentence embedding and use the distance between sentences in different languages to filter noisy parallel data and to mine for parallel data in large news collections. We are able to improve a competitive…

Computation and Language · Computer Science 2018-05-28 Holger Schwenk

The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality…

Computation and Language · Computer Science 2015-12-08 Krzysztof Wołk , Krzysztof Marasek