Databases - Scifaro

Parajudica: An RDF-Based Reasoner and Metamodel for Multi-Framework Context-Dependent Data Compliance Assessments

Motivated by the challenges of implementing policy-based data access control (PBAC) under multiple simultaneously applicable compliance frameworks, we present Parajudica, an open, modular, and extensible RDF/SPARQL-based rule system for…

Databases · Computer Science 2025-12-08 Luc Moreau, Alfred Rossi, Sophie Stalla-Bourdillon

PETGraphDB: A Property Evolution Temporal Graph Data Management System

Temporal graphs are graphs whose nodes and edges, together with their associated properties, continuously change over time. With the development of Internet of Things (IoT) systems, a subclass of the temporal graph, i.e., Property Evolution…

Databases · Computer Science 2025-12-08 Jinghe Song, Zongyu Zuo, Xuelian Lin, Yang Wang, Shuai Ma

Featurized-Decomposition Join: Low-Cost Semantic Joins with Guarantees

Large Language Models (LLMs) are being increasingly used within data systems to process large datasets with text fields. A broad class of such tasks involves a semantic join: joining two tables based on a natural language predicate per pair…

Databases · Computer Science 2025-12-08 Sepanta Zeighami, Shreya Shankar, Aditya Parameswaran

Integrating Wearable Data into Process Mining: Event, Case and Activity Enrichment

In this short paper, we explore the enrichment of event logs with data from wearable devices. We discuss three approaches: (1) treating wearable data as event attributes, linking them directly to individual events, (2) treating wearable…

Databases · Computer Science 2025-12-08 Vinicius Stein Dani, Xixi Lu, Iris Beerepoot

Cloud-Native Vector Search: A Comprehensive Performance Analysis

Vector search has been widely employed in recommender systems and retrieval-augmented generation pipelines, commonly performed with vector indexes to efficiently find similar items in large datasets. Recent growth in both data and task…

Databases · Computer Science 2025-12-08 Zhaoheng Li, Wei Ding, Silu Huang, Zikang Wang, Yuanjin Lin, Ke Wu, Yongjoo Park, Jianjun Chen

Enhancing SPARQL Query Rewriting for Complex Ontology Alignments

SPARQL query rewriting is a fundamental mechanism for uniformly querying heterogeneous ontologies in the Linked Data Web. However, the complexity of ontology alignments, particularly rich correspondences (c : c), makes this process…

Databases · Computer Science 2025-12-08 Anicet Lepetit Ondo, Laurence Capus, Mamadou Bousso

NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation

Natural Language to SQL (i.e., NL2SQL) translation is crucial for democratizing database access, but even state-of-the-art models frequently generate semantically incorrect SQL queries, hindering the widespread adoption of these techniques…

Databases · Computer Science 2025-12-08 Xinyu Liu, Shuyu Shen, Boyan Li, Nan Tang, Yuyu Luo

Resilience for Regular Path Queries: Towards a Complexity Classification

The resilience problem for a query and an input set or bag database is to compute the minimum number of facts to remove from the database to make the query false. In this paper, we study how to compute the resilience of Regular Path Queries…

Databases · Computer Science 2025-12-08 Antoine Amarilli, Wolfgang Gatterbauer, Neha Makhija, Mikaël Monet, Martín Muñoz

A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going?

Translating users' natural language queries (NL) into SQL queries (i.e., Text-to-SQL, a.k.a. NL2SQL) can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of…

Databases · Computer Science 2025-12-08 Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, Yuyu Luo

A Fast Ethereum-Compatible Forkless Database

The State Database of a blockchain stores account data and enables authentication. Modern blockchains use fast consensus protocols to avoid forking, improving throughput and finality. However, Ethereum's StateDB was designed for a forking…

Databases · Computer Science 2025-12-05 Herbert Jordan, Kamil Jezek, Pavle Subotic, Bernhard Scholz

Energy Profiling of Data-Sharing Pipelines: Modeling, Estimation, and Reuse Strategies

Data-sharing pipelines involve a series of stages that apply policy-based data transformations to enable secure and effective data exchange among organizations. Although numerous tools and platforms exist to manage governance and…

Databases · Computer Science 2025-12-05 Sepideh Masoudi, Sebastian Werner, Pierluigi Plebani, Stefan Tai

IBM Multilevel Process Mining vs de facto Object-Centric Process Mining approaches

The academic evolution of process mining is moving toward object-centric process mining, marking a significant shift in how processes are modeled and analyzed. IBM has developed its own distinctive approach called Multilevel Process Mining.…

Databases · Computer Science 2025-12-05 Alberto Ronzoni, Anina Antony, Anjana M R, Francesca De Leo, Jesna Jose, Mattia Freda, Nandini Narayanankutty, Rafflesia Khan, Raji RV, Thomas Diacci

ExOAR: Expert-Guided Object and Activity Recognition from Textual Data

Object-centric process mining requires structured data, but extracting it from unstructured text remains a challenge. We introduce ExOAR (Expert-Guided Object and Activity Recognition), an interactive method that combines large language…

Databases · Computer Science 2025-12-04 Iris Beerepoot, Vinicius Stein Dani, Xixi Lu

Enterprise Data Science Platform: A Unified Architecture for Federated Data Access

Organizations struggle to share data across departments that have adopted different data analytics platforms. If n datasets must serve m environments, up to n*m replicas can emerge, increasing inconsistency and cost. Traditional warehouses…

Databases · Computer Science 2025-12-04 Ryoto Miyamoto, Akira Kasuga

Continuous Prompts: LLM-Augmented Pipeline Processing over Unstructured Streams

Monitoring unstructured streams increasingly requires persistent, semantics-aware computation, yet today's LLM frameworks remain stateless and one-shot, limiting their usefulness for long-running analytics. We introduce Continuous Prompts…

Databases · Computer Science 2025-12-04 Shu Chen, Deepti Raghavan, Uğur Çetintemel

GenRewrite: Query Rewriting via Large Language Models

Query rewriting is an effective technique for refining poorly written queries before they reach the query optimizer. However, manual rewriting is not scalable, as it is prone to errors and requires deep expertise. Traditional query…

Databases · Computer Science 2025-12-04 Jie Liu, Barzan Mozafari

From Administrative Chaos to Analytical Cohorts: A Three-Stage Normalisation Pipeline for Longitudinal University Administrative Records

The growing use of longitudinal university administrative records in data-driven decision-making often overlooks a critical layer: how raw, inconsistent data are normalised before modelling. This article presents a three-stage normalisation…

Databases · Computer Science 2025-12-03 H. R. Paz

A Datalake for Data-driven Social Science Research

Social science research increasingly demands data-driven insights, yet researchers often face barriers such as lack of technical expertise, inconsistent data formats, and limited access to reliable datasets. Social science research…

Databases · Computer Science 2025-12-03 Puneet Arya, Ojas Sahasrabudhe, Adwaiya Srivastav, Partha Pratim Das, Maya Ramanath

QJoin: Transformation-aware Joinable Data Discovery Using Reinforcement Learning

Discovering which tables in large, heterogeneous repositories can be joined and by what transformations is a central challenge in data integration and data discovery. Traditional join discovery methods are largely designed for equi-joins,…

Databases · Computer Science 2025-12-03 Ning Wang, Sainyam Galhotra

Trinity: Disaggregating Vector Search from Prefill-Decode Disaggregation in LLM Serving

Prefill and decode (PD) disaggregation separates prompt prefill and token-by-token decode stages into distinct GPU pools and has become the dominant architecture for large-scale LLM serving in industry. Also, retrieval tasks via vector…

Databases · Computer Science 2025-12-03 Yi Liu, Chen Qian