Databases - Scifaro

Fiber-Navigable Search: A Geometric Approach to Filtered ANN

We present a geometric framework for filtered approximate nearest neighbor (ANN) search. Filtering a proximity graph by a metadata predicate produces a subgraph, a fiber, whose connectivity and geometry can differ sharply from the full…

Databases · Computer Science 2026-04-02 Thuong Dang

Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA

Table Question Answering (TQA) aims to answer natural language questions over structured tables. Large Language Models (LLMs) enable promising solutions to this problem, with operator-centric solutions that generate table manipulation…

Databases · Computer Science 2026-04-02 Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao, Tianyi Li, Christian S. Jensen

Compass: General Filtered Search across Vector and Structured Data

The increasing prevalence of hybrid vector and relational data necessitates efficient, general support for queries that combine high-dimensional vector search with complex relational filtering. However, existing filtered search solutions…

Databases · Computer Science 2026-04-02 Chunxiao Ye, Xiao Yan, Eric Lo

Benchmarking Filtered Approximate Nearest Neighbor Search Algorithms on Transformer-based Embedding Vectors

Advances in embedding models for text, image, audio, and video drive progress across multiple domains, including retrieval-augmented generation, recommendation systems, and others. Many of these applications require an efficient method to…

Databases · Computer Science 2026-04-02 Patrick Iff, Paul Bruegger, Marcin Chrapek, David Kochergin, Maciej Besta, Torsten Hoefler

SteelDB: Diagnosing Kernel-Space Bottlenecks in Cloud OLTP Databases

Modern cloud OLTP databases have sought performance primarily through user-space optimization - separating storage and compute layers, or distributing transactions across multiple nodes using consensus algorithms. This paper turns attention…

Databases · Computer Science 2026-04-01 Mitsumasa Kondo

DeepEye: A Steerable Self-driving Data Agent System

Large Language Models (LLMs) have revolutionized natural language interaction with data. The "holy grail" of data analytics is to build autonomous Data Agents that can self-drive complex data analysis workflows. However, current…

Databases · Computer Science 2026-04-01 Boyan Li, Yiran Peng, Yupeng Xie, Sirong Lu, Yizhang Zhu, Xing Mu, Xinyu Liu, Yuyu Luo

WAter: A Workload-Adaptive Knob Tuning System based on Workload Compression

Selecting appropriate values for the configurable parameters of Database Management Systems (DBMS) to improve performance is a significant challenge. Recent machine learning (ML)-based tuning systems have shown strong potential, but their…

Databases · Computer Science 2026-04-01 Yibo Wang, Jiale Lao, Chen Zhang, Cehua Yang, Jianguo Wang, Mingjie Tang

ReViSQL: Achieving Human-Level Text-to-SQL

Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have focused on enhancing SQL reasoning by developing large language models and AI agents…

Databases · Computer Science 2026-04-01 Yuxuan Zhu, Tengjun Jin, Yoojin Choi, Daniel Kang

Data-informed healthcare service design for multiple long-term conditions using online patient stories

Conventional service design methods are valuable for improving healthcare experience, but are limited in scale and information capture. Based on a constructed database of 2,320 stories from patients and carers with multiple long-term…

Databases · Computer Science 2026-03-31 Ji Han, Marta Staff, Saeema Ahmed-Kristensen

Can Large Language Models be a Cardinality Estimator? An Empirical study

Cardinality estimation (CardEst) remains a challenging problem for DBMS. Recent years have witnessed the success of ML-based cardinality estimators in outperforming traditional methods. However, these solutions suffer from poor…

Databases · Computer Science 2026-03-31 Liangzu Liu, Yiyan Wang, Yinjun Wu, Runze Su, Zhuo Chang, Peizhi Wu, Jianjun Chen, Fuxin Jiang, Rui Shi, Bin Cui, Tieying Zhang

DaiSy: A Library for Scalable Data Series Similarity Search

Exact similarity search over large collections of data series is a fundamental operation in modern applications, yet existing solutions are often fragmented, specialized, or tailored to specific execution environments. In this paper, we…

Databases · Computer Science 2026-03-31 Francesca Del Gaudio, Manos Chatzakis, Gayathiri Ravendirane, Botao Peng, Themis Palpanas

The Case for Multi-Version Experimental Evaluation (MVEE)

In the database community, we typically evaluate new methods based on experimental results, which we produce by integrating the proposed method along with a set of baselines in a single benchmarking codebase and measuring the individual…

Databases · Computer Science 2026-03-31 Simon Jörz, Felix Schuhknecht

NeedleDB: A Generative-AI Based System for Accurate and Efficient Image Retrieval using Complex Natural Language Queries

We demonstrate NeedleDB, an open-source, deployment-ready database system for answering complex natural language queries over image data. Unlike existing approaches that rely on contrastive-learning embeddings (e.g., CLIP), which degrade on…

Databases · Computer Science 2026-03-31 Mahdi Erfanian, Abolfazl Asudeh

Amalgam: Hybrid LLM-PGM Synthesis Algorithm for Accuracy and Realism

To generate synthetic datasets, e.g., in domains such as healthcare, the literature proposes approaches of two main types: Probabilistic Graphical Models (PGMs) and Deep Learning models, such as LLMs. While PGMs produce synthetic data that…

Databases · Computer Science 2026-03-31 Antheas Kapenekakis, Bent Thomsen, Katja Hose, Michele Albano

SEAR: Schema-Based Evaluation and Routing for LLM Gateways

Evaluating production LLM responses and routing requests across providers in LLM gateways requires fine-grained quality signals and operationally grounded decisions. To address this gap, we present SEAR, a schema-based evaluation and…

Databases · Computer Science 2026-03-31 Zecheng Zhang, Han Zheng, Yue Xu

Partial Partial Aggregates

We introduce partial partial aggregates (PPA), a query optimization technique for distributed engines that pushes only the local compute phase of an aggregate operation through joins. A query that aggregates after a join involves two…

Databases · Computer Science 2026-03-31 Claude Brisson

WN-Wrangle: Wireless Network Data Wrangling Assistant

Data wrangling continues to be the most time-consuming task in the data science pipeline and wireless network data is no exception. Prior approaches for automatic or assisted data-wrangling primarily target unordered, single-table data.…

Databases · Computer Science 2026-03-31 Anirudh Kamath, Dustin Maas, Jacobus Van der Merwe, Anna Fariha

CLEAR: A Knowledge-Centric Vessel Trajectory Analysis Platform

Vessel trajectory data from the Automatic Identification System (AIS) is used widely in maritime analytics. Yet, analysis is difficult for non-expert users due to the incompleteness and complexity of AIS data. We present CLEAR, a…

Databases · Computer Science 2026-03-31 Hengyu Liu, Tianyi Li, Haoyu Wang, Kristian Torp, Yushuai Li, Tiancheng Zhang, Torben Bach Pedersen, Christian S. Jensen

LMG Index: A Robust and Efficient Learned Index Framework for Multi-Dimensional Performance Balance

Index structures are fundamental for efficient query processing on large-scale datasets. Learned indexes model the indexing process as a prediction problem to overcome the inherent trade-offs of traditional indexes. However, most existing…

Databases · Computer Science 2026-03-31 Yuzhen Chen, Bin Yao

Exqutor: Extended Query Optimizer for Vector-augmented Analytical Queries

Vector similarity search is becoming increasingly important for data science pipelines, particularly in Retrieval-Augmented Generation (RAG), where it enhances large language model inference by enabling efficient retrieval of relevant…

Databases · Computer Science 2026-03-31 Hyunjoon Kim, Chaerim Lim, Hyeonjun An, Rathijit Sen, Kwanghyun Park