Related papers: Incremental Entity Resolution from Linked Document…

Document clustering with evolved multiword search queries

Text clustering holds significant value across various domains due to its ability to identify patterns and group related information. Current approaches which rely heavily on a computed similarity measure between documents are often limited…

Information Retrieval · Computer Science 2025-04-09 Laurence Hirsch, Robin Hirsch, Bayode Ogunleye

Joint Event Detection and Entity Resolution: a Virtuous Cycle

Clustering web documents has numerous applications, such as aggregating news articles into meaningful events, detecting trends and hot topics on the Web, preserving diversity in search results, etc. At the same time, the importance of named…

Computation and Language · Computer Science 2016-07-19 Matthias Galle, Jean-Michel Renders, Guillaume Jacquet

Entity Retrieval for Answering Entity-Centric Questions

The similarity between the question and indexed documents is a crucial factor in document retrieval for retrieval-augmented question answering. Although this is typically the only method for obtaining the relevant documents, it is not the…

Information Retrieval · Computer Science 2024-08-07 Hassan S. Shavarani, Anoop Sarkar

Clustering Prominent People and Organizations in Topic-Specific Text Corpora

Named entities in text documents are the names of people, organizations, locations, or other types of objects in the documents that exist in the real world. A persisting research challenge is to use computational techniques to identify such…

Computation and Language · Computer Science 2019-07-09 Abdulkareem Alsudais, Hovig Tchalian

Improving Entity Retrieval on Structured Data

The increasing amount of data on the Web, in particular of Linked Data, has led to a diverse landscape of datasets, which make entity retrieval a challenging task. Explicit cross-dataset links, for instance to indicate co-references or…

Information Retrieval · Computer Science 2017-03-31 Besnik Fetahu, Ujwal Gadiraju, Stefan Dietze

Cross-Document Contextual Coreference Resolution in Knowledge Graphs

Coreference resolution across multiple documents poses a significant challenge in natural language processing, particularly within the domain of knowledge graphs. This study introduces an innovative method aimed at identifying and resolving…

Computation and Language · Computer Science 2025-04-09 Zhang Dong, Mingbang Wang, Songhang Deng, Le Dai, Jiyuan Li, Xingzu Liu, Ruilin Nong

Information-Theoretic Generative Clustering of Documents

We present generative clustering (GC) for clustering a set of documents, X, by using texts Y generated by large language models (LLMs) instead of by clustering the original documents X. Because LLMs…

Machine Learning · Computer Science 2024-12-19 Xin Du, Kumiko Tanaka-Ishii

Query-time Entity Resolution

Entity resolution is the problem of reconciling database references corresponding to the same real-world entities. Given the abundance of publicly available databases that have unresolved entities, we motivate the problem of query-time…

Databases · Computer Science 2011-11-02 I. Bhattacharya, L. Getoor

Probability Based Clustering for Document and User Properties

Information Retrieval systems can be improved by exploiting context information such as user and document features. This article presents a model based on overlapping probabilistic or fuzzy clusters for such features. The model is applied…

Human-Computer Interaction · Computer Science 2011-02-21 Thomas Mandl, Christa Womser-Hacker

Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a…

Databases · Computer Science 2018-03-20 Yuhang Zhang, Kee Siong Ng, Michael Walker, Pauline Chou, Tania Churchill, Peter Christen

Document clustering using graph based document representation with constraints

Document clustering is an unsupervised approach in which a large collection of documents (corpus) is subdivided into smaller, meaningful, identifiable, and verifiable sub-groups (clusters). Meaningful representation of documents and…

Information Retrieval · Computer Science 2014-12-08 Muhammad Rafi, Farnaz Amin, Mohammad Shahid Shaikh

Application of Advanced Record Linkage Techniques for Complex Population Reconstruction

Record linkage is the process of identifying records that refer to the same entities from several databases. This process is challenging because commonly no unique entity identifiers are available. Linkage therefore has to rely on partially…

Databases · Computer Science 2016-12-14 Peter Christen

Sequential Cross-Document Coreference Resolution

Relating entities and events in text is a key component of natural language understanding. Cross-document coreference resolution, in particular, is important for the growing interest in multi-document analysis tasks. In this work we propose…

Computation and Language · Computer Science 2021-04-20 Emily Allaway, Shuai Wang, Miguel Ballesteros

Document Clustering based on Topic Maps

The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents like the World Wide Web (WWW). The next…

Information Retrieval · Computer Science 2011-12-30 Muhammad Rafi, M. Shahid Shaikh, Amir Farooq

Communicating and resolving entity references

Statements about entities occur everywhere, from newspapers and web pages to structured databases. Correlating references to entities across systems that use different identifiers or names for them is a widespread problem. In this paper, we…

Artificial Intelligence · Computer Science 2014-06-27 R. V. Guha

Towards Consistent Document-level Entity Linking: Joint Models for Entity Linking and Coreference Resolution

We consider the task of document-level entity linking (EL), where it is important to make consistent decisions for entity mentions over the full document jointly. We aim to leverage explicit "connections" among mentions within the document…

Computation and Language · Computer Science 2022-07-05 Klim Zaporojets, Johannes Deleu, Yiwei Jiang, Thomas Demeester, Chris Develder

Semantic Document Clustering on Named Entity Features

Keyword-based information processing has limitations due to simple treatment of words. In this paper, we introduce named entities as objectives into document clustering, which are the key elements defining document semantics and in many…

Information Retrieval · Computer Science 2018-07-23 Tru H. Cao, Vuong M. Ngo, Dung T. Hong, Tho T. Quan

How to Evaluate Entity Resolution Systems: An Entity-Centric Framework with Application to Inventor Name Disambiguation

Entity resolution (record linkage, microclustering) systems are notoriously difficult to evaluate. Looking for a needle in a haystack, traditional evaluation methods use sophisticated, application-specific sampling schemes to find matching…

Computation and Language · Computer Science 2024-04-09 Olivier Binette, Youngsoo Baek, Siddharth Engineer, Christina Jones, Abel Dasylva, Jerome P. Reiter

Detecting Privileged Documents by Ranking Connected Network Entities

This paper presents a link analysis approach for identifying privileged documents by constructing a network of human entities derived from email header metadata. Entities are classified as either counsel or non-counsel based on a predefined…

Information Retrieval · Computer Science 2025-12-10 Jianping Zhang, Han Qin, Nathaniel Huber-Fliflet

Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering

Recent advances in machine learning, particularly Large Language Models (LLMs) such as BERT and GPT, provide rich contextual embeddings that improve text representation. However, current document clustering approaches often ignore the…

Computation and Language · Computer Science 2024-12-20 Imed Keraghel, Mohamed Nadif