Databases - Scifaro

P-MOSS: Scheduling Main-Memory Indexes Over NUMA Servers Using Next Token Prediction

Ever since Dennard scaling broke down in the early 2000s and CPU frequencies stalled, vendors have increased the core count in each CPU chip at the expense of introducing heterogeneity, thus ushering in the era of NUMA…

Databases · Computer Science 2026-01-22 Yeasir Rayhan, Walid G. Aref

A Distributed Spatial Data Warehouse for AIS Data (DIPAAL)

AIS data from ships is excellent for analyzing single-ship movements and monitoring all ships within a specific area. However, the AIS data needs to be cleaned, processed, and stored before being usable. This paper presents a system…

Databases · Computer Science 2026-01-21 Alex S. Klitgaard, Lau E. Josefsen, Mikael V. Mikkelsen, Kristian Torp

Is Quantum Computing Ready for Real-Time Database Optimization?

Database systems encompass several performance-critical optimization tasks, such as join ordering and index tuning. As data volumes grow and workloads become more complex, these problems have become exponentially harder to solve…

Databases · Computer Science 2026-01-21 Hanwen Liu, Ibrahim Sabek

From HNSW to Information-Theoretic Binarization: Rethinking the Architecture of Scalable Vector Search

Modern semantic search and retrieval-augmented generation (RAG) systems rely predominantly on in-memory approximate nearest neighbor (ANN) indexes over high-precision floating-point vectors, resulting in escalating operational cost and…

Databases · Computer Science 2026-01-21 Seyed Moein Abtahi, Majid Fekri, Tara Khani, Akramul Azim

Uniqueness ratio as a predictor of a privacy leakage

Identity leakage can emerge when independent databases are joined, even when each dataset is anonymized individually. While previous work focuses on post-join detection or complex privacy models, little attention has been given to simple,…

Databases · Computer Science 2026-01-21 Danah A. AlSalem AlKhashti

RelServe: Fast LLM Inference Serving on Relational Data

The use of Large Language Models (LLMs) for querying relational data has given rise to relQuery, a workload pattern that applies templated LLM calls to structured tables. As relQuery services become more widely adopted in applications such…

Databases · Computer Science 2026-01-21 Xin Zhang, Shihong Gao, Yanyan Shen, Haoyang Li, Lei Chen

Knowledge Graph Construction for Stock Markets with LLM-Based Explainable Reasoning

The stock market is inherently complex, with interdependent relationships among companies, sectors, and financial indicators. Traditional research has largely focused on time-series forecasting and single-company analysis, relying on…

Databases · Computer Science 2026-01-21 Cheonsol Lee, Youngsang Jeong, Jeongyeol Shin, Huiju Kim, Jidong Kim

Relational Database Distillation: From Structured Tables to Condensed Graph Data

Relational databases (RDBs) underpin the majority of global data management systems, where information is structured into multiple interdependent tables. To effectively use the knowledge within RDBs for predictive tasks, recent advances…

Databases · Computer Science 2026-01-21 Xinyi Gao, Jingxi Zhang, Lijian Chen, Tong Chen, Lizhen Cui, Hongzhi Yin

Batch Query Processing and Optimization for Agentic Workflows

Large Language Models (LLMs) in agentic workflows combine multi-step reasoning, heterogeneous tool use, and collaboration across multiple specialized agents. Existing LLM serving engines optimize individual calls in isolation, while…

Databases · Computer Science 2026-01-21 Junyi Shen, Noppanat Wadlom, Yao Lu

Towards the Automated Extraction and Refactoring of NoSQL Schemas from Application Code

In this paper, we present a static code analysis strategy to extract logical schemas from NoSQL applications. Our solution is based on a model-driven reverse engineering process composed of a chain of platform-independent model…

Databases · Computer Science 2026-01-21 Carlos J. Fernandez-Candel, Anthony Cleve, Jesus J. Garcia-Molina

Shapley Revisited: Tractable Responsibility Measures for Query Answers

The Shapley value, originating from cooperative game theory, has been employed to define responsibility measures that quantify the contributions of database facts to obtaining a given query answer. For non-numeric queries, this is done by…

Databases · Computer Science 2026-01-19 Meghyn Bienvenu, Diego Figueira, Pierre Lafourcade

Using Color Refinement to Boost Enumeration and Counting for Acyclic CQs of Binary Schemas

We present an index structure, called the color-index, to boost the evaluation of acyclic conjunctive queries (ACQs) over binary schemas. The color-index is based on the color refinement algorithm, a widely used subroutine for graph…

Databases · Computer Science 2026-01-19 Cristian Riveros, Benjamin Scheidt, Nicole Schweikardt

Improving Database Performance by Application-side Transaction Merging

This paper explores a new opportunity to improve the performance of transaction processing at the application side by merging structurally similar statements or transactions. Concretely, we re-write transactions to 1) merge similar…

Databases · Computer Science 2026-01-16 Xueyuan Ren, Frank Li, Yang Wang

Redundancy-Driven Top-$k$ Functional Dependency Discovery

Functional dependencies (FDs) are basic constraints in relational databases and are used for many data management tasks. Most FD discovery algorithms find all valid dependencies, but this causes two problems. First, the computational cost…

Databases · Computer Science 2026-01-16 Xiaolong Wan, Xixian Han

The "I" in FAIR: Translating from Interoperability in Principle to Interoperation in Practice

The FAIR (Findable, Accessible, Interoperable, and Reusable) data principles [1] promote the interoperability of scientific data by encouraging the use of persistent identifiers, standardized vocabularies, and formal metadata structures.…

Databases · Computer Science 2026-01-16 Evan Morris, Gaurav Vaidya, Phil Owen, Jason Reilly, Karamarie Fecho, Patrick Wang, Yaphet Kebede, E. Kathleen Carter, Chris Bizon

Honesty-Aware Multi-Agent Framework for High-Fidelity Synthetic Data Generation in Digital Psychiatric Intake Doctor-Patient Interactions

Data scarcity and unreliable self-reporting -- such as concealment or exaggeration -- pose fundamental challenges to psychiatric intake and assessment. We propose a multi-agent synthesis framework that explicitly models patient deception to…

Databases · Computer Science 2026-01-15 Xinyuan Zhang, Zijian Wang, Chang Dao, Juexiao Zhou

Global Benchmark Database

This paper presents Global Benchmark Database (GBD), a comprehensive suite of tools for provisioning and sustainably maintaining benchmark instances and their metadata. The availability of benchmark metadata is essential for many tasks in…

Databases · Computer Science 2026-01-15 Ashlin Iser, Christoph Jabs

SVFusion: A CPU-GPU Co-Processing Architecture for Large-Scale Real-Time Vector Search

Approximate Nearest Neighbor Search (ANNS) underpins modern applications such as information retrieval and recommendation. With the rapid growth of vector data, efficient indexing for real-time vector search has become essential. Existing…

Databases · Computer Science 2026-01-14 Yuchen Peng, Dingyu Yang, Zhongle Xie, Ji Sun, Lidan Shou, Ke Chen, Gang Chen

CSQL: Mapping Documents into Causal Databases

We describe a novel system, CSQL, which automatically converts a collection of unstructured text documents into an SQL-queryable causal database (CDB). A CDB differs from a traditional DB: it is designed to answer "why" questions via…

Databases · Computer Science 2026-01-14 Sridhar Mahadevan

Rule Rewriting Revisited: A Fresh Look at Static Filtering for Datalog and ASP

Static filtering is a data-independent optimisation method for Datalog, which generalises algebraic query rewriting techniques from relational databases. In spite of its early discovery by Kifer and Lozinskii in 1986, the method has been…

Databases · Computer Science 2026-01-14 Philipp Hanisch, Markus Krötzsch