Related papers: OCR Post Correction for Endangered Language Texts

OCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion

With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for…

Computation and Language · Computer Science 2012-04-03 Youssef Bassil, Mohammad Alwani

User-Centric Evaluation of OCR Systems for Kwak'wala

There has been recent interest in improving optical character recognition (OCR) for endangered languages, particularly because a large number of documents and books in these languages are not in machine-readable formats. The performance of…

Computation and Language · Computer Science 2023-02-28 Shruti Rijhwani, Daisy Rosenblum, Michayla King, Antonios Anastasopoulos, Graham Neubig

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources our approach uses OCR models trained for other languages written in Roman. Currently, there exists no dataset…

Computation and Language · Computer Science 2018-09-10 Amrith Krishna, Bodhisattwa Prasad Majumder, Rajesh Shreedhar Bhat, Pawan Goyal

Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the…

Computation and Language · Computer Science 2021-11-05 Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos, Graham Neubig

OCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set

Since the dawn of the computing era, information has been represented digitally so that it can be processed by electronic computers. Paper books and documents were abundant and widely being published at that time; and hence, there was a…

Computation and Language · Computer Science 2012-04-03 Youssef Bassil, Mohammad Alwani

Quality of OCR for Degraded Text Images

Commercial OCR packages work best with high-quality scanned images. They often produce poor results when the image is degraded, either because the original itself was poor quality, or because of excessive photocopying. The ability to…

Digital Libraries · Computer Science 2007-05-23 Roger T. Hartley, Kathleen Crumpton

Neural OCR Post-Hoc Correction of Historical Corpora

Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of…

Computation and Language · Computer Science 2021-02-02 Lijun Lyu, Maria Koutraki, Martin Krickl, Besnik Fetahu

Text Detection Forgot About Document OCR

Detection and recognition of text from scans and other images, commonly denoted as Optical Character Recognition (OCR), is a widely used form of automated document processing with a number of methods available. Yet OCR systems still do not…

Computer Vision and Pattern Recognition · Computer Science 2023-01-24 Krzysztof Olejniczak, Milan Šulc

OCR Improves Machine Translation for Low-Resource Languages

We aim to investigate the performance of current OCR systems on low resource languages and low resource scripts. We introduce and make publicly available a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise,…

Computation and Language · Computer Science 2022-03-15 Oana Ignat, Jean Maillard, Vishrav Chaudhary, Francisco Guzmán

A Cost Efficient Approach to Correct OCR Errors in Large Document Collections

Word error rate of an OCR is often higher than its character error rate. This is especially true when OCRs are designed by recognizing characters. High word accuracies are critical to tasks like the creation of content in digital libraries…

Computer Vision and Pattern Recognition · Computer Science 2019-05-29 Deepayan Das, Jerin Philip, Minesh Mathew, C. V. Jawahar

Statistical Learning for OCR Text Correction

The accuracy of Optical Character Recognition (OCR) is crucial to the success of subsequent applications used in text analyzing pipeline. Recent models of OCR post-processing significantly improve the quality of OCR-generated text, but are…

Computer Vision and Pattern Recognition · Computer Science 2016-11-22 Jie Mei, Aminul Islam, Yajing Wu, Abidalrahman Moh'd, Evangelos E. Milios

CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to…

Computation and Language · Computer Science 2025-01-23 Jonathan Bourne

Scrambled text: training Language Models to correct OCR errors using synthetic data

OCR errors are common in digitised historical archives, significantly affecting their usability and value. Generative Language Models (LMs) have shown potential for correcting these errors using the context provided by the corrupted text and…

Computation and Language · Computer Science 2024-10-01 Jonathan Bourne

Post-OCR Document Correction with large Ensembles of Character Sequence-to-Sequence Models

In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to…

Computation and Language · Computer Science 2022-01-26 Juan Ramirez-Orta, Eduardo Xamena, Ana Maguitman, Evangelos Milios, Axel J. Soto

Noisy Parallel Data Alignment

An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of…

Computation and Language · Computer Science 2023-02-13 Ruoyu Xie, Antonios Anastasopoulos

OCR accuracy improvement on document images through a novel pre-processing approach

Digital camera and mobile document image acquisition are new trends arising in the world of Optical Character Recognition and text detection. In some cases, such process integrates many distortions and produces poorly scanned text or…

Computer Vision and Pattern Recognition · Computer Science 2015-09-14 Abdeslam El Harraj, Naoufal Raissouni

Estimating Post-OCR Denoising Complexity on Numerical Texts

Post-OCR processing has significantly improved over the past few years. However, these have been primarily beneficial for texts consisting of natural, alphabetical words, as opposed to documents of numerical nature such as invoices,…

Computation and Language · Computer Science 2023-07-04 Arthur Hemmer, Jérôme Brachat, Mickaël Coustaty, Jean-Marc Ogier

Detection Masking for Improved OCR on Noisy Documents

Optical Character Recognition (OCR), the task of extracting textual information from scanned documents, is a vital and broadly used technology for digitizing and indexing physical documents. Existing technologies perform well for clean…

Computer Vision and Pattern Recognition · Computer Science 2022-05-18 Daniel Rotman, Ophir Azulai, Inbar Shapira, Yevgeny Burshtein, Udi Barzelay

Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival…

Computer Vision and Pattern Recognition · Computer Science 2023-11-28 Mariana Dias, Carla Teixeira Lopes

A Novel Pipeline for Improving Optical Character Recognition through Post-processing Using Natural Language Processing

Optical Character Recognition (OCR) technology finds applications in digitizing books and unstructured documents, along with applications in other domains such as mobility statistics, law enforcement, traffic, security systems, etc. The…

Computer Vision and Pattern Recognition · Computer Science 2023-07-11 Aishik Rakshit, Samyak Mehta, Anirban Dasgupta