Related papers: Statistical Learning for OCR Text Correction

OCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion

With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for…

Computation and Language · Computer Science 2012-04-03 Youssef Bassil, Mohammad Alwani

Post-OCR Document Correction with large Ensembles of Character Sequence-to-Sequence Models

In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to…

Computation and Language · Computer Science 2022-01-26 Juan Ramirez-Orta, Eduardo Xamena, Ana Maguitman, Evangelos Milios, Axel J. Soto

Neural OCR Post-Hoc Correction of Historical Corpora

Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of…

Computation and Language · Computer Science 2021-02-02 Lijun Lyu, Maria Koutraki, Martin Krickl, Besnik Fetahu

OCR Error Correction Using Character Correction and Feature-Based Word Classification

This paper explores the use of a learned classifier for post-OCR text correction. Experiments with the Arabic language show that this approach, which integrates a weighted confusion matrix and a shallow language model, improves the vast…

Information Retrieval · Computer Science 2020-06-11 Ido Kissos, Nachum Dershowitz

3D Rendering Framework for Data Augmentation in Optical Character Recognition

In this paper, we propose a data augmentation framework for Optical Character Recognition (OCR). The proposed framework is able to synthesize new viewing angles and illumination scenarios, effectively enriching any available OCR dataset.…

Computer Vision and Pattern Recognition · Computer Science 2022-09-30 Andreas Spruck, Maximiliane Hawesch, Anatol Maier, Christian Riess, Jürgen Seiler, André Kaup

OCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set

Since the dawn of the computing era, information has been represented digitally so that it can be processed by electronic computers. Paper books and documents were abundant and widely being published at that time; and hence, there was a…

Computation and Language · Computer Science 2012-04-03 Youssef Bassil, Mohammad Alwani

Leveraging Text Repetitions and Denoising Autoencoders in OCR Post-correction

A common approach for improving OCR quality is a post-processing step based on models correcting misdetected characters and tokens. These models are typically trained on aligned pairs of OCR read text and their manually corrected…

Computation and Language · Computer Science 2019-06-27 Kai Hakala, Aleksi Vesanto, Niko Miekka, Tapio Salakoski, Filip Ginter

Unknown-box Approximation to Improve Optical Character Recognition Performance

Optical character recognition (OCR) is a widely used pattern recognition application in numerous domains. There are several feature-rich, general-purpose OCR solutions available for consumers, which can provide moderate to excellent…

Computer Vision and Pattern Recognition · Computer Science 2021-05-18 Ayantha Randika, Nilanjan Ray, Xiao Xiao, Allegra Latimer

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources our approach uses OCR models trained for other languages written in Roman. Currently, there exists no dataset…

Computation and Language · Computer Science 2018-09-10 Amrith Krishna, Bodhisattwa Prasad Majumder, Rajesh Shreedhar Bhat, Pawan Goyal

A Cost Efficient Approach to Correct OCR Errors in Large Document Collections

Word error rate of an OCR is often higher than its character error rate. This is especially true when OCRs are designed by recognizing characters. High word accuracies are critical to tasks like the creation of content in digital libraries…

Computer Vision and Pattern Recognition · Computer Science 2019-05-29 Deepayan Das, Jerin Philip, Minesh Mathew, C. V. Jawahar

CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to…

Computation and Language · Computer Science 2025-01-23 Jonathan Bourne

OCR Post Correction for Endangered Language Texts

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned…

Computation and Language · Computer Science 2020-11-12 Shruti Rijhwani, Antonios Anastasopoulos, Graham Neubig

Bounding the Probability of Error for High Precision Recognition

We consider models for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low rates of recall. If some variables can be identified with near certainty, then they can be…

Computer Vision and Pattern Recognition · Computer Science 2009-07-03 Andrew Kae, Gary B. Huang, Erik Learned-Miller

Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR

Conventional optical character recognition (OCR) techniques segmented each character and then recognized it. This made them prone to error in character segmentation, and devoid of context to exploit language models. Advances in sequence to…

Computer Vision and Pattern Recognition · Computer Science 2025-09-01 Shashank Vempati, Nishit Anand, Gaurav Talebailkar, Arpan Garai, Chetan Arora

Boosting Optical Character Recognition: A Super-Resolution Approach

Text image super-resolution is a challenging yet open research problem in the computer vision community. In particular, low-resolution images hamper the performance of typical optical character recognition (OCR) systems. In this article, we…

Computer Vision and Pattern Recognition · Computer Science 2015-06-09 Chao Dong, Ximei Zhu, Yubin Deng, Chen Change Loy, Yu Qiao

On the Accuracy of CRNNs for Line-Based OCR: A Multi-Parameter Evaluation

We investigate how to train a high quality optical character recognition (OCR) model for difficult historical typefaces on degraded paper. Through extensive grid searches, we obtain a neural network architecture and a set of optimal data…

Computer Vision and Pattern Recognition · Computer Science 2020-08-07 Bernhard Liebl, Manuel Burghardt

Enhancing OCR Performance through Post-OCR Models: Adopting Glyph Embedding for Improved Correction

The study investigates the potential of post-OCR models to overcome limitations in OCR models and explores the impact of incorporating glyph embedding on post-OCR correction performance. In this study, we have developed our own post-OCR…

Computer Vision and Pattern Recognition · Computer Science 2023-08-30 Yung-Hsin Chen, Yuli Zhou

Efficient Multi-domain Text Recognition Deep Neural Network Parameterization with Residual Adapters

Recent advancements in deep neural networks have markedly enhanced the performance of computer vision tasks, yet the specialized nature of these networks often necessitates extensive data and high computational power. Addressing these…

Computer Vision and Pattern Recognition · Computer Science 2024-01-03 Jiayou Chao, Wei Zhu

A Novel Pipeline for Improving Optical Character Recognition through Post-processing Using Natural Language Processing

Optical Character Recognition (OCR) technology finds applications in digitizing books and unstructured documents, along with applications in other domains such as mobility statistics, law enforcement, traffic, security systems, etc. The…

Computer Vision and Pattern Recognition · Computer Science 2023-07-11 Aishik Rakshit, Samyak Mehta, Anirban Dasgupta

Profiling of OCR'ed Historical Texts Revisited

In the absence of ground truth it is not possible to automatically determine the exact spectrum and occurrences of OCR errors in an OCR'ed text. Yet, for interactive postcorrection of OCR'ed historical printings it is extremely useful to…

Computer Vision and Pattern Recognition · Computer Science 2017-01-20 Florian Fink, Klaus-U. Schulz, Uwe Springmann