Databases - Scifaro

A Theoretical Framework for Distribution-Aware Dataset Search

Effective data discovery is a cornerstone of modern data-driven decision-making. Yet, identifying datasets with specific distributional characteristics, such as percentiles or preferences, remains challenging. While recent proposals have…

Databases · Computer Science 2025-03-28 Aryan Esmailpour, Sainyam Galhotra, Rahul Raychaudhury, Stavros Sintos

PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees (Technical Report)

After decades of research in approximate query processing (AQP), its adoption in the industry remains limited. Existing methods struggle to simultaneously provide user-specified error guarantees, eliminate maintenance overheads, and avoid…

Databases · Computer Science 2025-03-28 Yuxuan Zhu, Tengjun Jin, Stefanos Baziotis, Chengsong Zhang, Charith Mendis, Daniel Kang

The Data Sharing Paradox of Synthetic Data in Healthcare

Synthetic data offers a promising solution to privacy concerns in healthcare by generating useful datasets in a privacy-aware manner. However, although synthetic data is typically developed with the intention of sharing said data, ambiguous…

Databases · Computer Science 2025-03-28 Jim Achterberg, Bram van Dijk, Saif ul Islam, Hafiz Muhammad Waseem, Parisis Gallos, Gregory Epiphaniou, Carsten Maple, Marcel Haas, Marco Spruit

Extended Event Log: Towards a Unified Standard for Process Mining

Process mining has grown popular given its ability to provide managers with insights into the actual business process as executed by employees. Process mining depends on event logs found in process-aware information systems to model…

Databases · Computer Science 2025-03-28 Ali Suleiman, Gamal Kassem

Tractable Conjunctive Queries over Static and Dynamic Relations

We investigate the evaluation of conjunctive queries over static and dynamic relations. While static relations are given as input and do not change, dynamic relations are subject to inserts and deletes. We characterise syntactically three…

Databases · Computer Science 2025-03-28 Ahmet Kara, Zheng Luo, Milos Nikolic, Dan Olteanu, Haozhe Zhang

RED2Hunt: an Actionable Framework for Cleaning Operational Databases with Surrogate Keys

Surrogate keys are now extensively utilized by database designers to implement keys in SQL tables. They are straightforward, easy to understand, and enable efficient access, despite lacking any real-world semantic meaning. In this context,…

Databases · Computer Science 2025-03-27 Mathilde Marcy, Jean-Marc Petit, Vasile-Marian Scuturici, Jocelyn Bonjour, Camille Fertel, Gerald Cavalier

Approximating Opaque Top-k Queries

Combining query answering and data science workloads has become prevalent. An important class of such workloads is top-k queries with a scoring function implemented as an opaque UDF - a black box whose internal structure and scores on the…

Databases · Computer Science 2025-03-27 Jiwon Chang, Fatemeh Nargesian

Toward a Cognitive Data Model: Exploring a Mind-Inspired Approach to Database Design

The Cognitive Data Model (CDM) is proposed: a novel approach to database design, inspired by the belief that the human brain operates with a logical data model independent of its anatomical structure. The study aims to identify and…

Databases · Computer Science 2025-03-27 Dhammika Pieris

Fast Matrix Multiplication meets the Submodular Width

One fundamental question in database theory is the following: Given a Boolean conjunctive query Q, what is the best complexity for computing the answer to Q in terms of the input database size N? When restricted to the class of…

Databases · Computer Science 2025-03-27 Mahmoud Abo-Khamis, Xiao Hu, Dan Suciu

HybEA: Hybrid Models for Entity Alignment

Entity Alignment (EA) aims to detect descriptions of the same real-world entities among different Knowledge Graphs (KG). Several embedding methods have been proposed to rank potentially matching entities of two KGs according to their…

Databases · Computer Science 2025-03-27 Nikolaos Fanourakis, Fatia Lekbour, Guillaume Renton, Vasilis Efthymiou, Vassilis Christophides

A General-Purpose Data Harmonization Framework: Supporting Reproducible and Scalable Data Integration in the RADx Data Hub

In the age of big data, it is important for primary research data to follow the FAIR principles of findability, accessibility, interoperability, and reusability. Data harmonization enhances interoperability and reusability by aligning…

Databases · Computer Science 2025-03-26 Jimmy K. Yu, Marcos Martínez-Romero, Matthew Horridge, Mete U. Akdogan, Mark A. Musen

Credible Intervals for Knowledge Graph Accuracy Estimation

Knowledge Graphs (KGs) are widely used in data-driven applications and downstream tasks, such as virtual assistants, recommendation systems, and semantic search. The accuracy of KGs directly impacts the reliability of the inferred knowledge…

Databases · Computer Science 2025-03-26 Stefano Marchesin, Gianmaria Silvello

Jovis: A Visualization Tool for PostgreSQL Query Optimizer

Query optimizers are essential components of relational database management systems that directly impact query performance as they transform input queries into efficient execution plans. While users can obtain the final execution plan using…

Databases · Computer Science 2025-03-26 Yoojin Choi, Juhee Han, Kyoseung Koo, Bongki Moon

Transformer-based Ranking Approaches for Keyword Queries over Relational Databases

Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without requiring schema knowledge or query-language proficiency. Although numerous R-KwS methods have been…

Databases · Computer Science 2025-03-25 Paulo Martins, Altigran da Silva, Johny Moreira, Edleno de Moura

SynchroStore: A Cost-Based Fine-Grained Incremental Compaction for Hybrid Workloads

This study proposes a novel storage engine, SynchroStore, designed to address the inefficiency of update operations in columnar storage systems based on Log-Structured Merge Trees (LSM-Trees) under hybrid workload scenarios. While columnar…

Databases · Computer Science 2025-03-25 Yinan Zhang, Huiqi Hu, Xuan Zhou

On the feasibility of semantic query metrics

We consider the problem of defining semantic metrics for relational database queries. Informally, a semantic query metric for a query language $L$ is a metric function $\delta:L\times L\to \mathbb{N}$ where $\delta(Q_1, Q_2)$ represents the…

Databases · Computer Science 2025-03-25 George Fletcher, Peter Wood, Nikolay Yakovets

Bag Semantics Conjunctive Query Containment. Four Small Steps Towards Undecidability

The Query Containment Problem (QCP) is one of the most fundamental decision problems in database query processing and optimization. The complexity of QCP for conjunctive queries (QCP-CQ) has been fully understood since the 1970s. But, as Chaudhuri and…

Databases · Computer Science 2025-03-25 Jerzy Marcinkowski, Mateusz Orda

A Generative Caching System for Large Language Models

Caching has the potential to be of significant benefit for accessing large language models (LLMs) due to their high latencies, which typically range from a few seconds to well over a minute. Furthermore, many LLMs charge money…

Databases · Computer Science 2025-03-25 Arun Iyengar, Ashish Kundu, Ramana Kompella, Sai Nandan Mamidi

An Algebraic Foundation for Knowledge Graph Construction (Extended Version)

Although they have existed for more than ten years, have attracted diverse implementations, and have been used successfully in a significant number of applications, declarative mapping languages for constructing knowledge graphs from…

Databases · Computer Science 2025-03-25 Sitt Min Oo, Olaf Hartig

AnDB: Breaking Boundaries with an AI-Native Database for Universal Semantic Analysis

In this demonstration, we present AnDB, an AI-native database that supports traditional OLTP workloads and innovative AI-driven tasks, enabling unified semantic analysis across structured and unstructured data. While structured data…

Databases · Computer Science 2025-03-25 Tianqing Wang, Xun Xue, Guoliang Li, Yong Wang