Databases - Scifaro

A Systematic Review of FAIR-compliant Big Data Software Reference Architectures

To meet the standards of the Open Science movement, the FAIR Principles emphasize the importance of making scientific data Findable, Accessible, Interoperable, and Reusable. Yet, creating a repository that adheres to these principles…

Databases · Computer Science 2025-09-19 João Pedro de Carvalho Castro, Maria Júlia Soares De Grandi, Cristina Dutra de Aguiar

Spezi Data Pipeline: Streamlining FHIR-based Interoperable Digital Health Data Workflows

The increasing adoption of digital health technologies has amplified the need for robust, interoperable solutions to manage complex healthcare data. We present the Spezi Data Pipeline, an open-source Python toolkit designed to streamline…

Databases · Computer Science 2025-09-19 Vasiliki Bikia, Paul Schmiedmayer, Aydin Zahedivash, Lauren Aalami, Adrit Rao, Vishnu Ravi, Matthew Turk, Scott R. Ceresnak, Oliver Aalami

jXBW: Fast Substructure Search for Large-Scale JSONL Datasets with LLM Applications

JSON Lines (JSONL) is widely used for managing large collections of semi-structured data, ranging from large language model (LLM) prompts to chemical compound records and geospatial datasets. A key operation is substructure search, which…

Databases · Computer Science 2025-09-19 Yasuo Tabei

NLI4DB: A Systematic Review of Natural Language Interfaces for Databases

As the demand for querying databases in all areas of life continues to grow, researchers have devoted significant attention to the natural language interface for databases (NLIDB). This paper presents a comprehensive survey of recently…

Databases · Computer Science 2025-09-19 Mengyi Liu, Jianqiu Xu

XASDB -- Design and Implementation of an Open-Access Spectral Database

The increasing volume and complexity of X-ray absorption spectroscopy (XAS) data generated at synchrotron facilities worldwide require robust infrastructure for data management, sharing, and analysis. This paper introduces the XAS Database…

Databases · Computer Science 2025-09-18 Denis Spasyuk

Tractability Frontiers of the Shapley Value for Aggregate Conjunctive Queries

In recent years, the Shapley value has emerged as a general game-theoretic measure for assessing the contribution of a tuple to the result of a database query. We study the complexity of calculating the Shapley value of a tuple for an…

Databases · Computer Science 2025-09-18 Christoph Standke, Benny Kimelfeld

The NIAID Discovery Portal: A Unified Search Engine for Infectious and Immune-Mediated Disease Datasets

The NIAID Data Ecosystem Discovery Portal (https://data.niaid.nih.gov) provides a unified search interface for over 4 million datasets relevant to infectious and immune-mediated disease (IID) research. Integrating metadata from…

Databases · Computer Science 2025-09-18 Ginger Tsueng, Emily Bullen, Candice Czech, Dylan Welzel, Leandro Collares, Jason Lin, Everaldo Rodolpho, Zubair Qazi, Nichollette Acosta, Lisa M. Mayer, Sudha Venkatachari, Zorana Mitrović Vučičević, Poromendro N. Burman, Deepti Jain, Jack DiGiovanna, Maria Giovanni, Asiyah Lin, Wilbert Van Panhuis, Laura D. Hughes, Andrew I. Su, Chunlei Wu

FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference

The Key-Value (KV) cache reading latency increases significantly with context lengths, hindering the efficiency of long-context LLM inference. To address this, previous works propose retaining a small fraction of KV cache based on token…

Databases · Computer Science 2025-09-18 Dongwei Wang, Zijie Liu, Song Wang, Yuxin Ren, Jianing Deng, Jingtong Hu, Tianlong Chen, Huanrui Yang

AMAZe: A Multi-Agent Zero-shot Index Advisor for Relational Databases

Index recommendation is one of the most important problems in database management system (DBMS) optimization. Given queries and certain index-related constraints, traditional methods rely on heuristic optimization or learning-based models…

Databases · Computer Science 2025-09-17 Zhaodonghui Li, Haitao Yuan, Jiachen Shi, Hao Zhang, Yu Rong, Gao Cong

Enumeration Algorithms for Conjunctive Queries with Projection

We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees…

Databases · Computer Science 2025-09-17 Shaleen Deep, Xiao Hu, Paraschos Koutris

SQLGovernor: An LLM-powered SQL Toolkit for Real World Application

SQL queries in real-world analytical environments, whether written by humans or generated automatically, often suffer from syntax errors, inefficiency, or semantic misalignment, especially in complex OLAP scenarios. To address these…

Databases · Computer Science 2025-09-16 Jie Jiang, Siqi Shen, Haining Xie, Yang Li, Yu Shen, Danqing Huang, Bo Qian, Yinjun Wu, Wentao Zhang, Bin Cui, Peng Chen

NeurStore: Efficient In-database Deep Learning Model Management System

With the prevalence of in-database AI-powered analytics, there is an increasing demand for database systems to efficiently manage the ever-expanding number and size of deep learning models. However, existing database systems typically store…

Databases · Computer Science 2025-09-16 Siqi Xiang, Sheng Wang, Xiaokui Xiao, Cong Yue, Zhanhao Zhao, Beng Chin Ooi

Consensus-Free Spreadsheet Integration

We describe a method for merging multiple spreadsheets into one sheet, and/or exchanging data among the sheets, by expressing each sheet's formulae as an algebraic (equational) theory and each sheet's values as a model of its theory,…

Databases · Computer Science 2025-09-16 Brandon Baylor, Eric Daimler, James Hansen, Esteban Montero, Ryan Wisnesky

Space-Time Tradeoffs for Spatial Conjunctive Queries

Given a conjunctive query and a database instance, we aim to develop an index that can efficiently answer spatial queries on the results of a conjunctive query. We are interested in some commonly used spatial queries, such as range…

Databases · Computer Science 2025-09-15 Aryan Esmailpour, Xiao Hu, Stavros Sintos

Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees

Large Language Models (LLMs) are being increasingly used as a building block in data systems to process large text datasets. To do so, LLM model providers offer multiple LLMs with different sizes, spanning various cost-quality trade-offs…

Databases · Computer Science 2025-09-15 Sepanta Zeighami, Shreya Shankar, Aditya Parameswaran

Let's Simply Count: Quantifying Distributional Similarity Between Activities in Event Data

To obtain insights from event data, advanced process mining methods assess the similarity of activities to incorporate their semantic relations into the analysis. Here, distributional similarity that captures similarity from activity…

Databases · Computer Science 2025-09-12 Henrik Kirchmann, Stephan A. Fahrenkrog-Petersen, Xixi Lu, Matthias Weidlich

Koza and Koza-Hub for born-interoperable knowledge graph generation using KGX

Knowledge graph construction has become an essential domain for the future of biomedical research. But current approaches demand a high amount of redundant labor. These redundancies are the result of the lack of data standards and…

Databases · Computer Science 2025-09-12 Daniel R Korn, Patrick Golden, Aaron Odell, Katherina Cortes, Shilpa Sundar, Kevin Schaper, Sarah Gehrke, Corey Cox, Harry Caufield, Justin Reese, Evan Morris, Christopher J Mungall, Melissa Haendel

Jelly-Patch: a Fast Format for Recording Changes in RDF Datasets

Recording data changes in RDF systems is a crucial capability, needed to support auditing, incremental backups, database replication, and event-driven workflows. In large-scale and low-latency RDF applications, the high volume and frequency…

Databases · Computer Science 2025-09-12 Piotr Sowinski, Kacper Grzymkowski, Anastasiya Danilenka

Inconsistency Handling in Prioritized Databases with Universal Constraints: Complexity Analysis and Links with Active Integrity Constraints

This paper revisits the problem of repairing and querying inconsistent databases equipped with universal constraints. We adopt symmetric difference repairs, in which both deletions and additions of facts can be used to restore consistency,…

Databases · Computer Science 2025-09-12 Meghyn Bienvenu, Camille Bourgaux

Un cadre paraconsistant pour l'évaluation de similarité dans les bases de connaissances

This article proposes a paraconsistent framework for evaluating similarity in knowledge bases. Unlike classical approaches, this framework explicitly integrates contradictions, enabling a more robust and interpretable similarity measure. A…

Databases · Computer Science 2025-09-11 José-Luis Vilchis Medina