Exploiting Sentence Order in Document Alignment

Computation and Language 2020-10-29 v2

Abstract

We present a simple document alignment method that incorporates sentence order information in both candidate generation and candidate re-scoring. Our method yields a 61% relative reduction in error compared to the best previously published result on the WMT16 document alignment shared task. It also improves downstream MT performance on web-scraped Sinhala–English documents from ParaCrawl, outperforming the document alignment method used in the most recent ParaCrawl release. Finally, it outperforms a comparable-corpora method that uses the same multilingual embeddings, demonstrating that exploiting sentence order is beneficial even if the end goal is sentence-level bitext.

Cite

@article{arxiv.2004.14523,
  title  = {Exploiting Sentence Order in Document Alignment},
  author = {Brian Thompson and Philipp Koehn},
  journal= {arXiv preprint arXiv:2004.14523},
  year   = {2020}
}

Comments

EMNLP 2020
