Databases - Scifaro

Semantic Data Processing with Holistic Data Understanding

Semantic operators have increasingly become integrated within data systems to enable processing data using Large Language Models (LLMs). Despite significant recent effort in improving these operators, their accuracy is limited due to a…

Databases · Computer Science 2026-04-06 Youran Sun, Sepanta Zeighami, Bhavya Chopra, Shreya Shankar, Aditya G. Parameswaran

Efficient Path Query Processing in Relational Database Systems

Path queries are crucial for property graphs, and there is growing interest in queries that combine regular expressions over labels with constraints on property values of vertices and edges. Efficient evaluation of such general path queries…

Databases · Computer Science 2026-04-06 Diego Rivera Correa, Mirek Riedewald

OmniTQA: A Cost-Aware System for Hybrid Query Processing over Semi-Structured Data

While recent advances in large language models have significantly improved Text-to-SQL and table question answering systems, most existing approaches assume that all query-relevant information is explicitly represented in structured…

Databases · Computer Science 2026-04-06 Nima Shahbazi, Seiji Maekawa, Nikita Bhutani, Estevam Hruschka

LitMOF: An LLM Multi-Agent for Literature-Validated Metal-Organic Frameworks Database Correction and Expansion

Metal-organic framework (MOF) databases have grown rapidly through experimental deposition and large-scale literature extraction, but recent analyses show that nearly half of their entries contain substantial structural errors. These…

Databases · Computer Science 2026-04-06 Honghui Kim, Dohoon Kim, Jihan Kim

DeepEye-SQL: A Software-Engineering-Inspired Text-to-SQL Framework

Large language models (LLMs) have advanced Text-to-SQL, yet existing solutions still fall short of system-level reliability. The limitation is not merely in individual modules -- e.g., schema linking, reasoning, and verification -- but more…

Databases · Computer Science 2026-04-06 Boyan Li, Chong Chen, Zhujun Xue, Yinan Mei, Yuyu Luo

ResidualPlanner+: a scalable matrix mechanism for marginals and beyond

Noisy marginals are a common form of confidentiality protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy…

Databases · Computer Science 2026-04-06 Yingtai Xiao, Guanlin He, Levent Toksoz, Zeyu Ding, Danfeng Zhang, Daniel Kifer

Optimizing Relational Queries over Array-Valued Data in Columnar Systems

Modern analytical workloads increasingly combine relational data with array-valued attributes. While columnar database systems efficiently process such workloads, their ability to optimize queries that interleave relational operators with…

Databases · Computer Science 2026-04-03 Maroua Zeblah, Etienne Couritas, Sarah Chlyah, Pierre Genevès, Nils Gesbert, Nabil Layaïda

GPU-RMQ: Accelerating Range Minimum Queries on Modern GPUs

Range minimum queries are frequently used in string processing and database applications including biological sequence analysis, document retrieval, and web search. Hence, various data structures have been proposed for improving their…

Databases · Computer Science 2026-04-03 Lara Kreis, Justus Henneberg, Valentin Henkys, Felix Schuhknecht, Bertil Schmidt

CogPic: A Multimodal Dataset for Early Cognitive Impairment Assessment via Picture Description Tasks

The automated evaluation of cognitive status utilizing multimedia technologies presents a promising frontier in early dementia diagnosis. However, the development of robust machine learning models for cognitive impairment detection is…

Databases · Computer Science 2026-04-03 Liuyu Wu, Rui Feng, Jie Li, Wentao Xiang, Yi Zhang, Yin Cao, Siyang Song, Xiao Gu, Jianqing Li, Wei Wang

Know Your Streams: On the Conceptualization, Characterization, and Generation of Intentional Event Streams

The shift toward IoT-enabled, sensor-driven systems has transformed how operational data is generated, favoring continuous, real-time event streams (ES) over static event logs. This evolution presents new challenges for Streaming Process…

Databases · Computer Science 2026-04-03 Andrea Maldonado, Christian Imenkamp, Hendrik Reiter, Thomas Seidl, Wilhelm Hasselbring, Martin Werner, Agnes Koschmider

SEAnet: A Deep Learning Architecture for Data Series Similarity Search

A key operation for massive data series collection analysis is similarity search. According to recent studies, SAX-based indexes offer state-of-the-art performance for similarity search tasks. However, their performance lags under…

Databases · Computer Science 2026-04-03 Qitong Wang, Themis Palpanas

Multi-Objective Agentic Rewrites for Unstructured Data Processing

One year ago, we open-sourced DocETL, a declarative system for LLM-powered data processing that, as of March 2026, has 3.7K GitHub stars and users across domains (e.g., journalism, law, medicine, policy, finance, and urban planning). In…

Databases · Computer Science 2026-04-03 Lindsey Linxi Wei, Shreya Shankar, Sepanta Zeighami, Yeounoh Chung, Fatma Ozcan, Aditya G. Parameswaran

AutoPK: Leveraging LLMs and a Hybrid Similarity Metric for Advanced Retrieval of Pharmacokinetic Data from Complex Tables and Documents

Pharmacokinetics (PK) plays a critical role in drug development and regulatory decision-making for human and veterinary medicine, directly affecting public health through drug safety and efficacy assessments. However, PK data are often…

Databases · Computer Science 2026-04-03 Hossein Sholehrasa, Amirhossein Ghanaatian, Doina Caragea, Lisa A. Tell, Jim E. Riviere, Majid Jaberi-Douraki

Towards Robustness: A Critique of Current Vector Database Assessments

Vector databases are critical infrastructure in AI systems, and average recall is the dominant metric for their evaluation. Both users and researchers rely on it to choose and optimize their systems. We show that relying on average recall…

Databases · Computer Science 2026-04-03 Zikai Wang, Qianxi Zhang, Baotong Lu, Qi Chen, Cheng Tan

Accurate and Scalable Matrix Mechanisms via Divide and Conquer

Matrix mechanisms are often used to provide unbiased differentially private query answers when publishing statistics or creating synthetic data. Recent work has developed matrix mechanisms, such as ResidualPlanner and Weighted Fourier…

Databases · Computer Science 2026-04-02 Guanlin He, Yingtai Xiao, Jiamu Bai, Xin Gu, Zeyu Ding, Wenpeng Yin, Daniel Kifer

Streaming Model Cascades for Semantic SQL

Modern data warehouses extend SQL with semantic operators that invoke large language models on each qualifying row, but the per-row inference cost is prohibitive at scale. Model cascades reduce this cost by routing most rows through a fast…

Databases · Computer Science 2026-04-02 Paweł Liskowski, Kyle Schmaus

Making Array-Based Translation Practical for Modern, High-Performance Buffer Management

Modern buffer pools must now support a broader workload mix than classic OLTP alone. In addition to B-tree lookups, database systems increasingly serve scan-heavy analytics and vector-search indexes with irregular high-fan-out graph…

Databases · Computer Science 2026-04-02 Xinjing Zhou, Jinming Hu, Andrew Pavlo, Michael Stonebraker

Inference-Aware & Privacy-Preserving Deletion in Databases

Deletion is a fundamental database operation, yet modern systems often fail to provide the privacy guarantee that users expect from it. A deleted value may disappear from query results and even from physical storage, yet remain inferable…

Databases · Computer Science 2026-04-02 Vishal Chakraborty, Youri Kaminsky, Arnav Abhijit Dhariya, Sharad Mehrotra, Felix Naumann, Sarvesh Pandey

The Data Hydration Gap: A Formal Model of Underinvestment in General-Purpose Data Products Under Decentralized Governance

When organizations decentralize data product ownership, as in the data mesh paradigm, each domain team optimizes for its immediate analytical needs, underinvesting in the cross-domain generality that enables organization-wide reuse. We…

Databases · Computer Science 2026-04-02 Gaston Besanson

Reasoning about Transactional Isolation Levels with Isolde

Most databases can be configured to operate under isolation levels weaker than serializability. These enforce fewer restrictions on the concurrent access to data and consequently allow for more performant implementations. While formal…

Databases · Computer Science 2026-04-02 Manuel Barros, Alcino Cunha, Jose Pereira, Eunsuk Kang