WikiCREM: A Large Unsupervised Corpus for Coreference Resolution

Computation and Language 2019-10-15 v3

Abstract

Pronoun resolution is a major area of natural language understanding. However, large-scale training sets are still scarce, since manually labelling data is costly. In this work, we introduce WikiCREM (Wikipedia CoREferences Masked), a large-scale yet accurate dataset of pronoun disambiguation instances. We use a language-model-based approach for pronoun resolution in combination with our WikiCREM dataset. We compare a series of models on a collection of diverse and challenging coreference resolution problems, where we match or outperform previous state-of-the-art approaches on 6 out of 7 datasets, such as GAP, DPR, WNLI, PDP, WinoBias, and WinoGender. We release our model to be used off-the-shelf for solving pronoun disambiguation.

Cite

@article{arxiv.1908.08025,
  title   = {WikiCREM: A Large Unsupervised Corpus for Coreference Resolution},
  author  = {Vid Kocijan and Oana-Maria Camburu and Ana-Maria Cretu and Yordan Yordanov and Phil Blunsom and Thomas Lukasiewicz},
  journal = {arXiv preprint arXiv:1908.08025},
  year    = {2019}
}

Comments

Accepted to the EMNLP 2019 conference