Related papers: Lexically Aware Semi-Supervised Learning for OCR P…

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources our approach uses OCR models trained for other languages written in Roman. Currently, there exists no dataset…

Computation and Language · Computer Science 2018-09-10 Amrith Krishna, Bodhisattwa Prasad Majumder, Rajesh Shreedhar Bhat, Pawan Goyal

OCR Post Correction for Endangered Language Texts

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned…

Computation and Language · Computer Science 2020-11-12 Shruti Rijhwani, Antonios Anastasopoulos, Graham Neubig

A Novel Pipeline for Improving Optical Character Recognition through Post-processing Using Natural Language Processing

Optical Character Recognition (OCR) technology finds applications in digitizing books and unstructured documents, along with applications in other domains such as mobility statistics, law enforcement, traffic, security systems, etc. The…

Computer Vision and Pattern Recognition · Computer Science 2023-07-11 Aishik Rakshit, Samyak Mehta, Anirban Dasgupta

An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a…

Computation and Language · Computer Science 2020-11-20 Quan Duong, Mika Hämäläinen, Simon Hengchen

Leveraging Text Repetitions and Denoising Autoencoders in OCR Post-correction

A common approach for improving OCR quality is a post-processing step based on models correcting misdetected characters and tokens. These models are typically trained on aligned pairs of OCR read text and their manually corrected…

Computation and Language · Computer Science 2019-06-27 Kai Hakala, Aleksi Vesanto, Niko Miekka, Tapio Salakoski, Filip Ginter

Unknown-box Approximation to Improve Optical Character Recognition Performance

Optical character recognition (OCR) is a widely used pattern recognition application in numerous domains. There are several feature-rich, general-purpose OCR solutions available for consumers, which can provide moderate to excellent…

Computer Vision and Pattern Recognition · Computer Science 2021-05-18 Ayantha Randika, Nilanjan Ray, Xiao Xiao, Allegra Latimer

OCR accuracy improvement on document images through a novel pre-processing approach

Digital camera and mobile document image acquisition are new trends arising in the world of Optical Character Recognition and text detection. In some cases, such process integrates many distortions and produces poorly scanned text or…

Computer Vision and Pattern Recognition · Computer Science 2015-09-14 Abdeslam El Harraj, Naoufal Raissouni

Neural OCR Post-Hoc Correction of Historical Corpora

Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of…

Computation and Language · Computer Science 2021-02-02 Lijun Lyu, Maria Koutraki, Martin Krickl, Besnik Fetahu

Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts

Over the past few decades, large archives of paper-based documents such as books and newspapers have been digitized using Optical Character Recognition. This technology is error-prone, especially for historical documents. To correct OCR…

Computation and Language · Computer Science 2023-08-01 Omri Suissa, Avshalom Elmalech, Maayan Zhitomirsky-Geffet

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion,…

Computer Vision and Pattern Recognition · Computer Science 2026-05-26 Qinwu Xu, Yifan Jiang, Haoyu Ren

Efficient, Lexicon-Free OCR using Deep Learning

Contrary to popular belief, Optical Character Recognition (OCR) remains a challenging problem when text occurs in unconstrained environments, like natural scenes, due to geometrical distortions, complex backgrounds, and diverse fonts. In…

Computer Vision and Pattern Recognition · Computer Science 2019-06-06 Marcin Namysl, Iuliu Konya

TransDocs: Optical Character Recognition with word to word translation

While OCR has been used in various applications, its output is not always accurate, leading to misfit words. This research work focuses on improving the optical character recognition (OCR) with ML techniques with integration of OCR with…

Computer Vision and Pattern Recognition · Computer Science 2023-04-18 Abhishek Bamotra, Phani Krishna Uppala

Semi-supervised dictionary learning with graph regularization and active points

Supervised Dictionary Learning has gained much interest in the recent decade and has shown significant performance improvements in image classification. However, in general, supervised learning needs a large number of labelled samples per…

Computer Vision and Pattern Recognition · Computer Science 2020-09-15 Khanh-Hung Tran, Fred-Maurice Ngole-Mboula, Jean-Luc Starck, Vincent Prost

Advancing Post-OCR Correction: A Comparative Study of Synthetic Data

This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance.…

Computation and Language · Computer Science 2024-08-14 Shuhao Guan, Derek Greene

OCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion

With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for…

Computation and Language · Computer Science 2012-04-03 Youssef Bassil, Mohammad Alwani

Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

Many real-world applications involve the use of Optical Character Recognition (OCR) engines to transform handwritten images into transcripts on which downstream Natural Language Processing (NLP) models are applied. In this process, OCR…

Computation and Language · Computer Science 2021-07-16 Guowei Xu, Wenbiao Ding, Weiping Fu, Zhongqin Wu, Zitao Liu

Discovery of Visual Semantics by Unsupervised and Self-Supervised Representation Learning

The success of deep learning in computer vision is rooted in the ability of deep networks to scale up model complexity as demanded by challenging visual tasks. As complexity is increased, so is the need for large amounts of labeled data to…

Computer Vision and Pattern Recognition · Computer Science 2017-08-22 Gustav Larsson

Combining Unsupervised and Text Augmented Semi-Supervised Learning for Low Resourced Autoregressive Speech Recognition

Recent advances in unsupervised representation learning have demonstrated the impact of pretraining on large amounts of read speech. We adapt these techniques for domain adaptation in low-resource -- both in terms of data and compute --…

Computation and Language · Computer Science 2022-02-14 Chak-Fai Li, Francis Keith, William Hartmann, Matthew Snover

Noisy Parallel Data Alignment

An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of…

Computation and Language · Computer Science 2023-02-13 Ruoyu Xie, Antonios Anastasopoulos

Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival…

Computer Vision and Pattern Recognition · Computer Science 2023-11-28 Mariana Dias, Carla Teixeira Lopes