Related papers: Categorizing ancient documents
Digitized archives contain and preserve the knowledge of generations of scholars in millions of documents. The size of these archives calls for automatic analysis since a manual analysis by specialists is often too expensive. In this paper,…
Digitization projects in humanities often generate vast quantities of page images from historical documents, presenting significant challenges for manual sorting and analysis. These archives contain diverse content, including various text…
This paper is concerned with the problem of establishing an index based on word matching. It is assumed that the book was digitised as better as possible and some pre-processing techniques were already applied as line orientation correction…
Researchers continually perform corroborative tests to classify ancient historical documents based on the physical materials of their writing surfaces. However, these tests, often performed on-site, requires actual access to the manuscript…
Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two…
Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This document presents a combined framework for text extraction that merges Optical Character…
This paper deals with the recognition and matching of text in both cartographic maps and ancient documents. The purpose of this work is to find similar text regions based on statistical and global features. A phase of normalization is done…
Historical Document Processing is the process of digitizing written material from the past for future use by historians and other scholars. It incorporates algorithms and software tools from various subfields of computer science, including…
Optical Character Recognition (OCR) has been a topic of interest for many years. It is defined as the process of digitizing a document image into its constituent characters. Despite decades of intense research, developing OCR with…
There is a huge amount of historical documents in libraries and in various National Archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such…
Over the past decade, machine learning methods have given us driverless cars, voice recognition, effective web search, and a much better understanding of the human genome. Machine learning is so common today that it is used dozens of times…
Detection and recognition of text from scans and other images, commonly denoted as Optical Character Recognition (OCR), is a widely used form of automated document processing with a number of methods available. Yet OCR systems still do not…
The paper introduces a new method for discrimination of documents given in different scripts. The document is mapped into a uniformly coded text of numerical values. It is derived from the position of the letters in the text line, based on…
Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR)…
Document alignment and registration play a crucial role in numerous real-world applications, such as automated form processing, anomaly detection, and workflow automation. Traditional methods for document alignment rely on image-based…
Image segmentation and image recognition are well established computational techniques in the broader discipline of image processing. Segmentation allows to locate areas in an image, while recognition identifies specific objects within an…
Information representation as tables are compact and concise method that eases searching, indexing, and storage requirements. Extracting and cloning tables from parsable documents is easier and widely used, however industry still faces…
Identifying the type of font (e.g., Roman, Blackletter) used in historical documents can help optical character recognition (OCR) systems produce more accurate text transcriptions. Towards this end, we present an active-learning strategy…
There is a large collection of Handwritten English paper documents of Historical and Scientific importance. But paper documents are not recognized directly by computer. Hence the closest way of indexing these documents is by storing their…
Pool of knowledge available to the mankind depends on the source of learning resources, which can vary from ancient printed documents to present electronic material. The rapid conversion of material available in traditional libraries to…