Optimized Table Tokenization for Table Structure Recognition

Maksym Lysak; Ahmed Nassar; Nikolaos Livathinos; Christoph Auer; Peter Staar

Computer Vision and Pattern Recognition 2023-05-08 v1

Abstract

Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs.
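The claim that predicted structures are always syntactically correct can be illustrated with a small checker over a five-token grid language. This is a minimal sketch, not the authors' implementation: the abstract does not list the five tokens, so the names used here (`C`, `L`, `U`, `X`, `NL`) and the simplified rules are assumptions, and `split_rows` and `is_valid` are hypothetical helpers introduced only for illustration.

```python
# Assumed five-token vocabulary (the abstract does not name the tokens):
#   C  = cell with its own content
#   L  = cell merged with its left neighbour (horizontal span)
#   U  = cell merged with its upper neighbour (vertical span)
#   X  = cell merged with both left and upper neighbours
#   NL = end of the current table row
OTSL_TOKENS = {"C", "L", "U", "X", "NL"}


def split_rows(tokens):
    """Split a flat token sequence into rows, using NL as the row terminator."""
    rows, row = [], []
    for tok in tokens:
        if tok not in OTSL_TOKENS:
            raise ValueError(f"unknown token: {tok!r}")
        if tok == "NL":
            rows.append(row)
            row = []
        else:
            row.append(tok)
    if row:
        raise ValueError("sequence must end with NL")
    return rows


def is_valid(tokens):
    """Apply simplified grid rules: rows must be non-empty and of equal
    length, and a merge token must not point outside the grid."""
    try:
        rows = split_rows(tokens)
    except ValueError:
        return False
    if not rows or not rows[0]:
        return False
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        return False
    for i, r in enumerate(rows):
        for j, tok in enumerate(r):
            if tok in ("L", "X") and j == 0:
                return False  # nothing to the left of the first column
            if tok in ("U", "X") and i == 0:
                return False  # nothing above the first row
    return True


# A 2x2 table whose first row is a single cell spanning two columns.
example = ["C", "L", "NL", "C", "C", "NL"]
print(is_valid(example))
```

Because every row is a fixed-width run of single-token cells, checks of this kind reduce to simple position tests, whereas HTML requires matching open and close tags and consistent span attributes.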

Keywords

tokenization; handwritten character recognition

Cite

@article{arxiv.2305.03393,
  title  = {Optimized Table Tokenization for Table Structure Recognition},
  author = {Maksym Lysak and Ahmed Nassar and Nikolaos Livathinos and Christoph Auer and Peter Staar},
  journal= {arXiv preprint arXiv:2305.03393},
  year   = {2023}
}

Comments

Accepted to ICDAR 2023, 12 pages, 6 figures
