Data-Efficient Information Extraction from Form-Like Documents

Beliz Gunel; Navneet Potti; Sandeep Tata; James B. Wendt; Marc Najork; Jing Xie

Machine Learning 2022-01-14 v1 Information Retrieval

Abstract

Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data efficiency and (2) the ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a larger labeled corpus with a considerably different structure yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, which is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document types, and that learning good representations is critical to accomplishing this.

Keywords

information extraction; text classification; information retrieval

Cite

@article{arxiv.2201.02647,
  title  = {Data-Efficient Information Extraction from Form-Like Documents},
  author = {Beliz Gunel and Navneet Potti and Sandeep Tata and James B. Wendt and Marc Najork and Jing Xie},
  journal= {arXiv preprint arXiv:2201.02647},
  year   = {2022}
}

Comments

Published at the 2nd Document Intelligence Workshop @ KDD 2021 (https://document-intelligence.github.io/DI-2021/)
