Related papers: CODEC: Complex Document and Entity Collection

Query-Specific Knowledge Graphs for Complex Finance Topics

Across the financial domain, researchers answer complex questions by extensively "searching" for relevant information to generate long-form reports. This workshop paper discusses automating the construction of query-specific document and…

Information Retrieval · Computer Science 2022-11-09 Iain Mackie, Jeffrey Dalton

DREQ: Document Re-Ranking Using Entity-based Query Understanding

While entity-oriented neural IR models have advanced significantly, they often overlook a key nuance: the varying degrees of influence individual entities within a document have on its overall relevance. Addressing this gap, we present…

Information Retrieval · Computer Science 2024-01-12 Shubham Chatterjee, Iain Mackie, Jeff Dalton

CoDEx: A Comprehensive Knowledge Graph Completion Benchmark

We present CoDEx, a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. In terms of scope, CoDEx comprises three…

Computation and Language · Computer Science 2020-10-07 Tara Safavi, Danai Koutra

CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents

Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual…

Computation and Language · Computer Science 2025-06-05 Fabian Karl, Ansgar Scherp

QDER: Query-Specific Document and Entity Representations for Multi-Vector Document Re-Ranking

Neural IR has advanced through two distinct paths: entity-oriented approaches leveraging knowledge graphs and multi-vector models capturing fine-grained semantics. We introduce QDER, a neural re-ranking model that unifies these approaches…

Information Retrieval · Computer Science 2025-10-14 Shubham Chatterjee, Jeff Dalton

CODER: An efficient framework for improving retrieval through COntextual Document Embedding Reranking

Contrastive learning has been the dominant approach to training dense retrieval models. In this work, we investigate the impact of ranking context - an often overlooked aspect of learning dense retrieval models. In particular, we examine…

Information Retrieval · Computer Science 2023-10-24 George Zerveas, Navid Rekabsaz, Daniel Cohen, Carsten Eickhoff

Cross-Document Contextual Coreference Resolution in Knowledge Graphs

Coreference resolution across multiple documents poses a significant challenge in natural language processing, particularly within the domain of knowledge graphs. This study introduces an innovative method aimed at identifying and resolving…

Computation and Language · Computer Science 2025-04-09 Zhang Dong, Mingbang Wang, Songhang deng, Le Dai, Jiyuan Li, Xingzu Liu, Ruilin Nong

DocReLM: Mastering Document Retrieval with Language Model

With over 200 million published academic documents and millions of new documents being written each year, academic researchers face the challenge of searching for information within this vast corpus. However, existing retrieval systems…

Information Retrieval · Computer Science 2024-05-21 Gengchen Wei, Xinle Pang, Tianning Zhang, Yu Sun, Xun Qian, Chen Lin, Han-Sen Zhong, Wanli Ouyang

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are…

Computation and Language · Computer Science 2024-11-12 Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing

CODE-ACCORD: A Corpus of building regulatory data for rule generation towards automatic compliance checking

Automatic Compliance Checking (ACC) within the Architecture, Engineering, and Construction (AEC) sector necessitates automating the interpretation of building regulations to achieve its full potential. Converting textual rules into…

Information Retrieval · Computer Science 2025-02-19 Hansi Hettiarachchi, Amna Dridi, Mohamed Medhat Gaber, Pouyan Parsafard, Nicoleta Bocaneala, Katja Breitenfelder, Gonçal Costa, Maria Hedblom, Mihaela Juganaru-Mathieu, Thamer Mecharnia, Sumee Park, He Tan, Abdel-Rahman H. Tawil, Edlira Vakaj

MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering

Multi-entity question answering (MEQA) represents significant challenges for large language models (LLM) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse…

Computation and Language · Computer Science 2025-09-25 Teng Lin, Yuyu Luo, Honglin Zhang, Jicheng Zhang, Chunlin Liu, Kaishun Wu, Nan Tang

KoRC: Knowledge oriented Reading Comprehension Benchmark for Deep Text Understanding

Deep text understanding, which requires the connections between a given document and prior knowledge beyond its text, has been highlighted by many benchmarks in recent years. However, these benchmarks have encountered two major limitations.…

Computation and Language · Computer Science 2023-07-07 Zijun Yao, Yantao Liu, Xin Lv, Shulin Cao, Jifan Yu, Lei Hou, Juanzi Li

A Semantically Enriched Dataset based on Biomedical NER for the COVID19 Open Research Dataset Challenge

Research into COVID-19 is a big challenge and highly relevant at the moment. New tools are required to assist medical experts in their research with relevant and valuable information. The COVID-19 Open Research Dataset Challenge (CORD-19)…

Digital Libraries · Computer Science 2020-05-19 Hermann Kroll, Jan Pirklbauer, Johannes Ruthmann, Wolf-Tilo Balke

Entity-Centric Query Refinement

We introduce the task of entity-centric query refinement. Given an input query whose answer is a (potentially large) collection of entities, the task output is a small set of query refinements meant to assist the user in efficient domain…

Computation and Language · Computer Science 2022-09-19 David Wadden, Nikita Gupta, Kenton Lee, Kristina Toutanova

Harvesting Events from Multiple Sources: Towards a Cross-Document Event Extraction Paradigm

Document-level event extraction aims to extract structured event information from unstructured text. However, a single document often contains limited event information and the roles of different event arguments may be biased due to the…

Computation and Language · Computer Science 2024-08-27 Qiang Gao, Zixiang Meng, Bobo Li, Jun Zhou, Fei Li, Chong Teng, Donghong Ji

CD2CR: Co-reference Resolution Across Documents and Domains

Cross-document co-reference resolution (CDCR) is the task of identifying and linking mentions to entities and concepts across many text documents. Current state-of-the-art models for this task assume that all documents are of the same type…

Computation and Language · Computer Science 2021-02-01 James Ravenscroft, Arie Cattan, Amanda Clare, Ido Dagan, Maria Liakata

CDER: Collaborative Evidence Retrieval for Document-level Relation Extraction

Document-level Relation Extraction (DocRE) involves identifying relations between entities across multiple sentences in a document. Evidence sentences, crucial for precise entity pair relationships identification, enhance focus on essential…

Computation and Language · Computer Science 2025-04-10 Khai Phan Tran, Xue Li

Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality…

Computation and Language · Computer Science 2026-05-20 Ming Zhang, Jiabao Zhuang, Wenqing Jing, Kexin Tan, Ziyu Kong, Jingyi Deng, Yujiong Shen, Yuhui Wang, Zhenghao Xiang, Qiyuan Peng, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Shihan Dou, Maxm Pan, Tao Gui, Qi Zhang, Xuanjing Huang

NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories

Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts,…

Computation and Language · Computer Science 2026-03-09 Genet Asefa Gesese, Zongxiong Chen, Shufan Jiang, Mary Ann Tan, Zhaotai Liu, Sonja Schimmler, Harald Sack

Entities as Retrieval Signals: A Systematic Study of Coverage, Supervision, and Evaluation in Entity-Oriented Ranking

Entity-oriented retrieval assumes that relevant documents exhibit query-relevant entities, yet evaluations report conflicting results. We show this inconsistency stems not from model failure, but from evaluation. On TREC Robust04, we…

Information Retrieval · Computer Science 2026-04-08 Shubham Chatterjee