Related papers: Document Image Coding and Clustering for Script Di…
The paper presents a new script classification method for the discrimination of the South Slavic medieval labels. It consists in the textural analysis of the script types. In the first step, each letter is coded by the equivalent script…
Document segmentation is a method of rending the document into distinct regions. A document is an assortment of information and a standard mode of conveying information to others. Pursuance of data from documents involves ton of human…
Document clustering is an unsupervised approach in which a large collection of documents (corpus) is subdivided into smaller, meaningful, identifiable, and verifiable sub-groups (clusters). Meaningful representation of documents and…
Information on different fields which are collected by users requires appropriate management and organization to be structured in a standard way and retrieved fast and more easily. Document classification is a conventional method to…
Importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collection of documents like World Wide Web (WWW). The next…
Text Clustering is a text mining technique which divides the given set of text documents into significant clusters. It is used for organizing a huge number of text documents into a well-organized form. In the majority of the clustering…
We propose a novel clustering pipeline to detect and characterize influence campaigns from documents. This approach clusters parts of document, detects clusters that likely reflect an influence campaign, and then identifies documents linked…
The paper proposes an algorithm for the script recognition based on the texture characteristics. The image texture is achieved by coding each letter with the equivalent script type (number code) according to its position in the text line.…
The analysis of historical documents is still a topical issue given the importance of information that can be extracted and also the importance given by the institutions to preserve their heritage. The main idea in order to characterize the…
The knowledge of source printer can help in printed text document authentication, copyright ownership, and provide important clues about the author of a fraudulent document along with his/her potential means and motives. Development of…
Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two…
In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and…
Typography and layout lead to the hierarchical organisation of text in words, text lines, paragraphs. This inherent structure is a key property of text in any script and language, which has nonetheless been minimally leveraged by existing…
A major computational burden, while performing document clustering, is the calculation of similarity measure between a pair of documents. Similarity measure is a function that assign a real number between 0 and 1 to a pair of documents,…
The forensic attribution of the handwriting in a digitized document to multiple scribes is a challenging problem of high dimensionality. Unique handwriting styles may be dissimilar in a blend of several factors including character size,…
India is a multilingual multi-script country. In every state of India there are two languages one is state local language and the other is English. For example in Andhra Pradesh, a state in India, the document may contain text words in…
Text Categorization is traditionally done by using the term frequency and inverse document frequency.This type of method is not very good because, some words which are not so important may appear in the document .The term frequency of…
Text Document Clustering is one of the fastest growing research areas because of availability of huge amount of information in an electronic form. There are several number of techniques launched for clustering documents in such a way that…
Text clustering holds significant value across various domains due to its ability to identify patterns and group related information. Current approaches which rely heavily on a computed similarity measure between documents are often limited…
Traditional clustering methods aim to group unlabeled data points based on their similarity to each other. However, clustering, in the absence of additional information, is an ill-posed problem as there may be many different, yet equally…