Databases - Scifaro

TKHist: Cardinality Estimation for Join Queries via Histograms with Dominant Attribute Correlation Finding

Cardinality estimation has long been crucial for cost-based database optimizers in identifying optimal query execution plans, attracting significant attention over the past decades. While recent advancements have significantly improved the…

Databases · Computer Science 2025-10-20 Renrui Li, Qingzhi Ma, Jiajie Xu, Lei Zhao, An Liu

LakeVilla: A Modular and Non-Invasive Toolbox for Lakehouse Transactions

Data lakehouses (LHs) are at the core of current cloud analytics stacks by providing elastic, relational compute on data in cloud data lakes across vendors. For relational semantics, they rely on open table formats (OTFs). Unfortunately,…

Databases · Computer Science 2025-10-20 Tobias Götz, Daniel Ritter, Jana Giceva

Text2Schema: Filling the Gap in Designing Database Table Structures based on Natural Language

People without a database background usually rely on file systems or tools such as Excel for data management, which often lead to redundancy and data inconsistency. Relational databases possess strong data management capabilities, but…

Databases · Computer Science 2025-10-20 Qin Wang, Youhuan Li, Yansong Feng, Si Chen, Ziming Li, Pan Zhang, Zihui Si, Yixuan Chen, Zhichao Shi, Zebin Huang, Guo Chen, Wenqiang Jin

The Past Still Matters: A Temporally-Valid Data Discovery System

Over the past decade, the proliferation of public and enterprise data lakes has fueled intensive research into data discovery, aiming to identify the most relevant data from vast and complex corpora to support diverse user tasks.…

Databases · Computer Science 2025-10-16 Mahdi Esmailoghli, Matthias Weidlich

Experiments & Analysis of Privacy-Preserving SQL Query Sanitization Systems

Analytical SQL queries are essential for extracting insights from relational databases but concurrently introduce significant privacy risks by potentially exposing sensitive information. To mitigate these risks, numerous query sanitization…

Databases · Computer Science 2025-10-16 Loïs Ecoffet, Veronika Rehn-Sonigo, Jean-François Couchot, Catuscia Palamidessi

Towards a Standard for JSON Document Databases

In this technical report, we present a formalisation of the MongoDB aggregation framework. Our aim is to identify a fragment that could serve as the starting point for an industry-wide standard for querying JSON document databases. We…

Databases · Computer Science 2025-10-16 Elena Botoeva, Julien Corman, Norman Townsend

Aixel: A Unified, Adaptive and Extensible System for AI-powered Data Analysis

A growing trend in modern data analysis is the integration of data management with learning, guided by accuracy, latency, and cost requirements. In practice, applications draw data of different formats from many sources. Meanwhile,…

Databases · Computer Science 2025-10-15 Meihui Zhang, Liming Wang, Chi Zhang, Zhaojing Luo

Poseidon: A OneGraph Engine

We present the Poseidon engine behind the Neptune Analytics graph database service. Customers interact with Poseidon using the declarative openCypher query language, which enables requests that seamlessly combine traditional querying…

Databases · Computer Science 2025-10-14 Brad Bebee, Ümit V. Çatalyürek, Olaf Hartig, Ankesh Khandelwal, Simone Rondelli, Michael Schmidt, Lefteris Sidirourgos, Bryan Thompson

GrASP: A Generalizable Address-based Semantic Prefetcher for Scalable Transactional and Analytical Workloads

Data prefetching--loading data into the cache before it is requested--is essential for reducing I/O overhead and improving database performance. While traditional prefetchers focus on sequential patterns, recent learning-based approaches,…

Databases · Computer Science 2025-10-14 Farzaneh Zirak, Farhana Choudhury, Renata Borovica-Gajic

DriftBench: Defining and Generating Data and Query Workload Drift for Benchmarking

Data and workload drift are key to evaluating database components such as caching, cardinality estimation, indexing, and query optimization. Yet, existing benchmarks are static, offering little to no support for modeling drift. This…

Databases · Computer Science 2025-10-14 Guanli Liu, Renata Borovica-Gajic

Regular Expression Indexing for Log Analysis. Extended Version

In this paper, we present the design and architecture of REI, a novel system for indexing log data for regular expression queries. Our main contribution is an $n$-gram-based indexing strategy and an efficient storage mechanism that results…

Databases · Computer Science 2025-10-14 Ling Zhang, Shaleen Deep, Jignesh M. Patel, Karthikeyan Sankaralingam

The Hybrid Multimodal Graph Index (HMGI): A Comprehensive Framework for Integrated Relational and Vector Search

The proliferation of complex, multimodal datasets has exposed a critical gap between the capabilities of specialized vector databases and traditional graph databases. While vector databases excel at semantic similarity search, they lack the…

Databases · Computer Science 2025-10-14 Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang

Real-Time Health Analytics Using Ontology-Driven Complex Event Processing and LLM Reasoning: A Tuberculosis Case Study

Timely detection of critical health conditions remains a major challenge in public health analytics, especially in Big Data environments characterized by high volume, rapid velocity, and diverse variety of clinical data. This study presents…

Databases · Computer Science 2025-10-14 Ritesh Chandra, Sonali Agarwal, Navjot Singh

HES-SQL: Hybrid Reasoning for Efficient Text-to-SQL with Structural Skeleton Guidance

We present HES-SQL, a novel hybrid training framework that advances Text-to-SQL generation through the integration of thinking-mode-fused supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Our approach introduces…

Databases · Computer Science 2025-10-13 Suming Qiu, Jing Li, Zhicheng Zhou, Junjie Huang, Linyuan Qiu, Zhijie Sun

Comparative Performance Analysis of Modern NoSQL Data Technologies: Redis, Aerospike, and Dragonfly

The rise of distributed applications and cloud computing has created a demand for scalable, high-performance key-value storage systems. This paper presents a performance evaluation of three prominent NoSQL key-value stores: Redis,…

Databases · Computer Science 2025-10-13 Deep Bodra, Sushil Khairnar

DiskJoin: Large-scale Vector Similarity Join with SSD

Similarity join--a widely used operation in data science--finds all pairs of items that have distance smaller than a threshold. Prior work has explored distributed computation methods to scale similarity join to large data volumes but these…

Databases · Computer Science 2025-10-13 Yanqi Chen, Xiao Yan, Alexandra Meliou, Eric Lo

SigSPARQL: Signals as a First-Class Citizen When Querying Knowledge Graphs

Purpose: Cyber-Physical Systems (CPSs) integrate computation and physical processes, producing time series data from thousands of sensors. Knowledge graphs can contextualize these data, yet current approaches that are applicable to…

Databases · Computer Science 2025-10-13 Tobias Schwarzinger, Gernot Steindl, Thomas Frühwirth, Thomas Preindl, Konrad Diwold, Katrin Ehrenmüller, Fajar J. Ekaputra

Implementing Semantic Join Operators Efficiently

Semantic query processing engines often support semantic joins, enabling users to match rows that satisfy conditions specified in natural language. Such join conditions can be evaluated using large language models (LLMs) that solve novel…

Databases · Computer Science 2025-10-10 Immanuel Trummer

ZeroCard: Cardinality Estimation with Zero Dependence on Target Databases -- No Data, No Query, No Retraining

Cardinality estimation is a fundamental task in database systems and plays a critical role in query optimization. Despite significant advances in learning-based cardinality estimation methods, most existing approaches remain difficult to…

Databases · Computer Science 2025-10-10 Xianghong Xu, Rong Kang, Xiao He, Lei Zhang, Jianjun Chen, Tieying Zhang

MobilityDuck: Mobility Data Management with DuckDB

The analytics of spatiotemporal data is increasingly important for mobility analytics. Despite extensive research on moving object databases (MODs), few systems are ready for production or lightweight enough for analytics. MobilityDB is a…

Databases · Computer Science 2025-10-10 Nhu Ngoc Hoang, Ngoc Hoa Pham, Viet Phuong Hoang, Esteban Zimányi