Related papers: SimDoc: Topic Sequence Alignment based Document Si…

An improved semantic similarity measure for document clustering based on topic maps

A major computational burden, while performing document clustering, is the calculation of similarity measure between a pair of documents. Similarity measure is a function that assigns a real number between 0 and 1 to a pair of documents,…

Information Retrieval · Computer Science 2013-03-19 Muhammad Rafi , Mohammad Shahid Shaikh

Document Clustering based on Topic Maps

Importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collection of documents like World Wide Web (WWW). The next…

Information Retrieval · Computer Science 2011-12-30 Muhammad Rafi , M. Shahid Shaikh , Amir Farooq

Content-based Text Categorization using Wikitology

A major computational burden, while performing document clustering, is the calculation of similarity measure between a pair of documents. Similarity measure is a function that assign a real number between 0 and 1 to a pair of documents,…

Information Retrieval · Computer Science 2012-08-20 Muhammad Rafi , Sundus Hassan , Mohammad Shahid Shaikh

Assigning Topics to Documents by Successive Projections

Topic models provide a useful tool to organize and understand the structure of large corpora of text documents, in particular, to discover hidden thematic structure. Clustering documents from big unstructured corpora into topics is an…

Statistics Theory · Mathematics 2021-07-09 Olga Klopp , Maxim Panov , Suzanne Sigalla , Alexandre Tsybakov

Simple is not Enough: Document-level Text Simplification using Readability and Coherence

In this paper, we present the SimDoc system, a simplification model considering simplicity, readability, and discourse aspects, such as coherence. In the past decade, the progress of the Text Simplification (TS) field has been mostly shown…

Computation and Language · Computer Science 2024-12-30 Laura Vásquez-Rodríguez , Nhung T. H. Nguyen , Piotr Przybyła , Matthew Shardlow , Sophia Ananiadou

Calculating Semantic Similarity between Academic Articles using Topic Event and Ontology

Determining semantic similarity between academic documents is crucial to many tasks such as plagiarism detection, automatic technical survey and semantic search. Current studies mostly focus on semantic similarity between concepts,…

Computation and Language · Computer Science 2017-12-01 Ming Liu , Bo Lang , Zepeng Gu

A Topological Method for Comparing Document Semantics

Comparing document semantics is one of the toughest tasks in both Natural Language Processing and Information Retrieval. To date, on one hand, the tools for this task are still rare. On the other hand, most relevant methods are devised from…

Computation and Language · Computer Science 2020-12-09 Yuqi Kong , Fanchao Meng , Benjamin Carterette

Efficient Clustering from Distributions over Topics

There are many scenarios where we may want to find pairs of textually similar documents in a large corpus (e.g. a researcher doing literature review, or an R&D project manager analyzing project proposals). To programmatically discover those…

Computation and Language · Computer Science 2020-12-16 Carlos Badenes-Olmedo , Jose-Luis Redondo García , Oscar Corcho

Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings

A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions. It is focused on capturing the word co-occurrences in a document…

Machine Learning · Computer Science 2022-03-16 Dongsheng Wang , Dandan Guo , He Zhao , Huangjie Zheng , Korawat Tanwisuth , Bo Chen , Mingyuan Zhou

SelfDoc: Self-Supervised Document Representation Learning

We propose SelfDoc, a task-agnostic pre-training framework for document image understanding. Because documents are multimodal and are intended for sequential reading, our framework exploits the positional, textual, and visual information of…

Computer Vision and Pattern Recognition · Computer Science 2021-06-08 Peizhao Li , Jiuxiang Gu , Jason Kuen , Vlad I. Morariu , Handong Zhao , Rajiv Jain , Varun Manjunatha , Hongfu Liu

Semantic-Driven Topic Modeling Using Transformer-Based Embeddings and Clustering Algorithms

Topic modeling is a powerful technique to discover hidden topics and patterns within a collection of documents without prior knowledge. Traditional topic modeling and clustering-based techniques encounter challenges in capturing contextual…

Computation and Language · Computer Science 2024-10-04 Melkamu Abay Mersha , Mesay Gemeda yigezu , Jugal Kalita

A comparison of two suffix tree-based document clustering algorithms

Document clustering as an unsupervised approach extensively used to navigate, filter, summarize and manage large collection of document repositories like the World Wide Web (WWW). Recently, focuses in this domain shifted from traditional…

Information Retrieval · Computer Science 2012-01-11 Muhammad Rafi , M. Maujood , M. M. Fazal , S. M. Ali

Measuring similarity between texts is an important task for several applications. Available approaches to measure document similarity are inadequate for document pairs that have non-comparable lengths, such as a long document and its…

Computation and Language · Computer Science 2019-03-27 Hongyu Gong , Tarek Sakakini , Suma Bhat , Jinjun Xiong

S2Doc -- Spatial-Semantic Document Format

Documents are a common way to store and share information, with tables being an important part of many documents. However, there is no real common understanding of how to model documents and tables in particular. Because of this lack of…

Digital Libraries · Computer Science 2025-11-04 Sebastian Kempf , Frank Puppe

Utilizing Embeddings for Ad-hoc Retrieval by Document-to-document Similarity

Latent semantic representations of words or paragraphs, namely the embeddings, have been widely applied to information retrieval (IR). One of the common approaches of utilizing embeddings for IR is to estimate the document-to-query (D2Q)…

Information Retrieval · Computer Science 2017-08-11 Chenhao Yang , Ben He , Yanhua Ran

Terminology-based Text Embedding for Computing Document Similarities on Technical Content

We propose in this paper a new, hybrid document embedding approach in order to address the problem of document similarities with respect to the technical content. To do so, we employ a state-of-the-art graph techniques to first extract the…

Computation and Language · Computer Science 2019-07-02 Hamid Mirisaee , Eric Gaussier , Cedric Lagnier , Agnes Guerraz

A Comparison of Document Similarity Algorithms

Document similarity is an important part of Natural Language Processing and is most commonly used for plagiarism-detection and text summarization. Thus, finding the overall most effective document similarity algorithm could have a major…

Computation and Language · Computer Science 2023-04-05 Nicholas Gahman , Vinayak Elangovan

A Novel Approach to Document Classification using WordNet

Content based Document Classification is one of the biggest challenges in the context of free text mining. Current algorithms on document classifications mostly rely on cluster analysis based on bag-of-words approach. However that method is…

Information Retrieval · Computer Science 2015-12-15 Koushiki Sarkar , Ritwika Law

Semantic Document Clustering on Named Entity Features

Keyword-based information processing has limitations due to simple treatment of words. In this paper, we introduce named entities as objectives into document clustering, which are the key elements defining document semantics and in many…

Information Retrieval · Computer Science 2018-07-23 Tru H. Cao , Vuong M. Ngo , Dung T. Hong , Tho T. Quan

SimMatch: Semi-supervised Learning with Similarity Matching

Learning with few labeled data has been a longstanding problem in the computer vision and machine learning research community. In this paper, we introduced a new semi-supervised learning framework, SimMatch, which simultaneously considers…

Computer Vision and Pattern Recognition · Computer Science 2022-03-18 Mingkai Zheng , Shan You , Lang Huang , Fei Wang , Chen Qian , Chang Xu