Databases - Scifaro

FCDB (Functorial-Categorical Database): A Compositional Framework for Information Preservation and Anti-Commutativity Reduction

Conventional database architectures often secure local consistency by discarding information, entangling correctness with loss. We introduce the Functorial-Categorical Database (FCDB), which models data operations as morphisms in a layered…

Databases · Computer Science 2025-12-03 Jun Kawasaki

SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads

Database research and development often require a large number of SQL queries for benchmarking purposes. However, acquiring real-world SQL queries is challenging due to privacy concerns, and existing SQL generation methods are limited in…

Databases · Computer Science 2025-12-03 Jiale Lao, Immanuel Trummer

Data-Semantics-Aware Recommendation of Diverse Pivot Tables

Data summarization is essential to discover insights from large datasets. In spreadsheets, pivot tables offer a convenient way to summarize tabular data by computing aggregates over some attributes, grouped by others. However, identifying…

Databases · Computer Science 2025-12-03 Whanhee Cho, Anna Fariha

Answering Constraint Path Queries over Graphs

Constraints are powerful declarative constructs that allow users to conveniently restrict variable values that potentially range over an infinite domain. In this paper, we propose a constraint path query language over property graphs, which…

Databases · Computer Science 2025-12-02 Heyang Li, Anthony Widjaja Lin, Domagoj Vrgoč

DuckDB on xNVMe

DuckDB is designed for portability. It is also designed to run anywhere, and possibly in contexts where it can be specialized for performance, e.g., as a cloud service or on a smart device. In this paper, we consider the way DuckDB…

Databases · Computer Science 2025-12-02 Marius Ottosen, Magnus Keinicke Parlo, Philippe Bonnet

PG-HIVE: Hybrid Incremental Schema Discovery for Property Graphs

Property graphs have rapidly become the de facto standard for representing and managing complex, interconnected data, powering applications across domains from knowledge graphs to social networks. Despite the advantages, their schema-free…

Databases · Computer Science 2025-12-02 Sofia Sideri, Georgia Troullinou, Elisjana Ymeralli, Vasilis Efthymiou, Dimitris Plexousakis, Haridimos Kondylakis

Efficiently Sampling Interval Patterns from Numerical Databases

Pattern sampling has emerged as a promising approach for information discovery in large databases, allowing analysts to focus on a manageable subset of patterns. In this approach, patterns are randomly drawn based on an interestingness…

Databases · Computer Science 2025-12-02 Djawad Bekkoucha, Lamine Diop, Abdelkader Ouali, Bruno Crémilleux, Patrice Boizumault

Predicate Transfer: Efficient Pre-Filtering on Multi-Join Queries

This paper presents predicate transfer, a novel method that optimizes join performance by pre-filtering tables to reduce the join input sizes. Predicate transfer generalizes Bloom join, which conducts pre-filtering within a single join…

Databases · Computer Science 2025-12-02 Yifei Yang, Hangdong Zhao, Xiangyao Yu, Paraschos Koutris

Extended Serial Safety Net: A Refined Serializability Criterion for Multiversion Concurrency Control

A long line of concurrency-control (CC) protocols argues correctness via a single serialization point (begin or commit), an assumption that is incompatible with snapshot isolation (SI), where read-write anti-dependencies arise. Serial…

Databases · Computer Science 2025-12-01 Atsushi Kitazawa, Chihaya Ito, Yuta Yoshida, Takamitsu Shioi

Structured Multi-Step Reasoning for Entity Matching Using Large Language Model

Entity matching is a fundamental task in data cleaning and data integration. With the rapid adoption of large language models (LLMs), recent studies have explored zero-shot and few-shot prompting to improve entity matching accuracy.…

Databases · Computer Science 2025-12-01 Rohan Bopardikar, Jin Wang, Jia Zou

Relation-Stratified Sampling for Shapley Values Estimation in Relational Databases

Shapley-like values, including the Shapley and Banzhaf values, provide a principled way to quantify how individual tuples contribute to a query result. Their exact computation, however, is intractable because it requires aggregating…

Databases · Computer Science 2025-12-01 Amirhossein Alizad, Mostafa Milani

A Conceptual Model for Context Awareness in Ethical Data Management

Ethics has become a major concern to the information management community, as both algorithms and data should satisfy ethical rules that guarantee not to generate dishonourable behaviours when they are used. However, these ethical rules may…

Databases · Computer Science 2025-12-01 Elisa Quintarelli, Fabio Alberto Schreiber, Kostas Stefanidis, Letizia Tanca, Barbara Oliboni

OmniRouter: Budget and Performance Controllable Multi-LLM Routing

Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can efficiently handle simpler tasks with fewer resources. LLM…

Databases · Computer Science 2025-12-01 Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, Yongfeng Zhang

MCTS-SQL: Light-Weight LLMs can Master the Text-to-SQL through Monte Carlo Tree Search

Text-to-SQL is a fundamental yet challenging task in the NLP area, aiming at translating natural language questions into SQL queries. While recent advances in large language models have greatly improved performance, most existing approaches…

Databases · Computer Science 2025-12-01 Shuozhi Yuan, Limin Chen, Miaomiao Yuan, Zhao Jin

Beyond Accuracy: An Empirical Study of Uncertainty Estimation in Imputation

Handling missing data is a central challenge in data-driven analysis. Modern imputation methods not only aim for accurate reconstruction but also differ in how they represent and quantify uncertainty. Yet, the reliability and calibration of…

Databases · Computer Science 2025-11-27 Zarin Tahia Hossain, Mostafa Milani

MatBase Algorithm for Translating Entity-Relationship Data Models into (Elementary) Mathematical Data Model Schemes

This paper presents a pseudocode algorithm for translating Entity-Relationship data models into (Elementary) Mathematical Data Model schemes. We prove that this algorithm is linear, sound, complete, and optimal. As an example, we apply this…

Databases · Computer Science 2025-11-27 Christian Mancas, Diana Christina Mancas

InferF: Declarative Factorization of AI/ML Inferences over Joins

Real-world AI/ML workflows often apply inference computations to feature vectors joined from multiple datasets. To avoid the redundant AI/ML computations caused by repeated data records in the join's output, factorized ML has been proposed…

Databases · Computer Science 2025-11-26 Kanchan Chowdhury, Lixi Zhou, Lulu Xie, Xinwei Fu, Jia Zou

The Case for Intent-Based Query Rewriting

In this work, we describe the concept of intent-based query rewriting and present a first viable solution. The aim is to allow rewrites to alter the structure and syntactic outcome of an original query while keeping the obtainable…

Databases · Computer Science 2025-11-26 Gianna Lisa Nicolai, Patrick Hansert, Sebastian Michel

Forgetting by Pruning: Data Deletion in Join Cardinality Estimation

Machine unlearning in learned cardinality estimation (CE) systems presents unique challenges due to the complex distributional dependencies in multi-table relational data. Specifically, data deletion, a core component of machine unlearning,…

Databases · Computer Science 2025-11-26 Chaowei He, Yuanjun Liu, Qingzhi Ma, Shenyuan Ren, Xizhao Luo, Lei Zhao, An Liu

An experimental study of existing tools for outlier detection and cleaning in trajectories

Outlier detection and cleaning are essential steps in data preprocessing to ensure the integrity and validity of data analyses. This paper focuses on outlier points within individual trajectories, i.e., points that deviate significantly…

Databases · Computer Science 2025-11-26 Mariana M Garcez Duarte, Mahmoud Sakr