Related papers: Categorizing ancient documents

Handwriting Classification for the Analysis of Art-Historical Documents

Digitized archives contain and preserve the knowledge of generations of scholars in millions of documents. The size of these archives calls for automatic analysis since a manual analysis by specialists is often too expensive. In this paper,…

Computer Vision and Pattern Recognition · Computer Science 2020-11-05 Christian Bartz, Hendrik Rätz, Christoph Meinel

Page image classification for content-specific data processing

Digitization projects in humanities often generate vast quantities of page images from historical documents, presenting significant challenges for manual sorting and analysis. These archives contain diverse content, including various text…

Information Retrieval · Computer Science 2026-05-29 Kateryna Lutsai

Line and Word Matching in Old Documents

This paper is concerned with the problem of establishing an index based on word matching. It is assumed that the book was digitised as well as possible and some pre-processing techniques were already applied, such as line orientation correction…

Artificial Intelligence · Computer Science 2007-05-23 A. Marcolino, Vitorino Ramos, Mario Ramalho, J. R. Caldas Pinto

Image-based material analysis of ancient historical documents

Researchers continually perform corroborative tests to classify ancient historical documents based on the physical materials of their writing surfaces. However, these tests, often performed on-site, require actual access to the manuscript…

Computer Vision and Pattern Recognition · Computer Science 2023-04-13 Thomas Reynolds, Maruf A. Dhali, Lambert Schomaker

Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two…

Computer Vision and Pattern Recognition · Computer Science 2023-06-22 Pit Schneider

Digitization of Document and Information Extraction using OCR

Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This document presents a combined framework for text extraction that merges Optical Character…

Computer Vision and Pattern Recognition · Computer Science 2025-06-16 Rasha Sinha, Rekha B S

Text recognition in both ancient and cartographic documents

This paper deals with the recognition and matching of text in both cartographic maps and ancient documents. The purpose of this work is to find similar text regions based on statistical and global features. A phase of normalization is done…

Computer Vision and Pattern Recognition · Computer Science 2013-08-30 Nizar Zaghden, Badreddine Khelifi, Adel M. Alimi, Remy Mullot

Historical Document Processing: A Survey of Techniques, Tools, and Trends

Historical Document Processing is the process of digitizing written material from the past for future use by historians and other scholars. It incorporates algorithms and software tools from various subfields of computer science, including…

Computer Vision and Pattern Recognition · Computer Science 2020-09-14 James P. Philips, Nasseh Tabrizi

A Survey on Optical Character Recognition System

Optical Character Recognition (OCR) has been a topic of interest for many years. It is defined as the process of digitizing a document image into its constituent characters. Despite decades of intense research, developing OCR with…

Computer Vision and Pattern Recognition · Computer Science 2017-10-17 Noman Islam, Zeeshan Islam, Nazia Noor

Text Line Segmentation of Historical Documents: a Survey

There is a huge amount of historical documents in libraries and in various National Archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such…

Computer Vision and Pattern Recognition · Computer Science 2007-05-23 Laurence Likforman-Sulem, Abderrazak Zahour, Bruno Taconet

Classification of Documents Extracted from Images with Optical Character Recognition Methods

Over the past decade, machine learning methods have given us driverless cars, voice recognition, effective web search, and a much better understanding of the human genome. Machine learning is so common today that it is used dozens of times…

Computer Vision and Pattern Recognition · Computer Science 2021-06-22 Omer Aydin

Text Detection Forgot About Document OCR

Detection and recognition of text from scans and other images, commonly denoted as Optical Character Recognition (OCR), is a widely used form of automated document processing with a number of methods available. Yet OCR systems still do not…

Computer Vision and Pattern Recognition · Computer Science 2023-01-24 Krzysztof Olejniczak, Milan Šulc

Document Image Coding and Clustering for Script Discrimination

The paper introduces a new method for discrimination of documents given in different scripts. The document is mapped into a uniformly coded text of numerical values. It is derived from the position of the letters in the text line, based on…

Computer Vision and Pattern Recognition · Computer Science 2016-09-22 Darko Brodic, Alessia Amelio, Zoran N. Milivojevic, Milena Jevtic

Efficient OCR for Building a Diverse Digital History

Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR)…

Computer Vision and Pattern Recognition · Computer Science 2024-07-29 Jacob Carlson, Tom Bryan, Melissa Dell

Words as Geometric Features: Estimating Homography using Optical Character Recognition as Compressed Image Representation

Document alignment and registration play a crucial role in numerous real-world applications, such as automated form processing, anomaly detection, and workflow automation. Traditional methods for document alignment rely on image-based…

Computer Vision and Pattern Recognition · Computer Science 2025-05-27 Ross Greer, Alisha Ukani, Katherine Izhikevich, Earlence Fernandes, Stefan Savage, Alex C. Snoeren

Leveraging GenAI for Segmenting and Labeling Centuries-old Technical Documents

Image segmentation and image recognition are well-established computational techniques in the broader discipline of image processing. Segmentation locates areas in an image, while recognition identifies specific objects within an…

Computer Vision and Pattern Recognition · Computer Science 2026-03-04 Carlos Monroy, Benjamin Navarro

A Conglomerate of Multiple OCR Table Detection and Extraction

Representing information as tables is a compact and concise method that eases searching, indexing, and storage requirements. Extracting and cloning tables from parsable documents is easier and widely used; however, industry still faces…

Information Retrieval · Computer Science 2020-10-20 Smita Pallavi, Raj Ratn Pranesh, Sumit Kumar

Font Identification in Historical Documents Using Active Learning

Identifying the type of font (e.g., Roman, Blackletter) used in historical documents can help optical character recognition (OCR) systems produce more accurate text transcriptions. Towards this end, we present an active-learning strategy…

Computer Vision and Pattern Recognition · Computer Science 2016-01-28 Anshul Gupta, Ricardo Gutierrez-Osuna, Matthew Christy, Richard Furuta, Laura Mandell

Word Spotting in Cursive Handwritten Documents using Modified Character Shape Codes

There is a large collection of handwritten English paper documents of historical and scientific importance, but paper documents cannot be recognized directly by a computer. Hence the closest way of indexing these documents is by storing their…

Computer Vision and Pattern Recognition · Computer Science 2013-10-24 Sayantan Sarkar

Locating Tables in Scanned Documents for Reconstructing and Republishing (ICIAfS14)

The pool of knowledge available to mankind depends on the source of learning resources, which can vary from ancient printed documents to present electronic material. The rapid conversion of material available in traditional libraries to…

Computer Vision and Pattern Recognition · Computer Science 2014-12-25 Akmal Jahan Mac, Roshan G Ragel