Databases - Scifaro

Efficient Vector Search in the Wild: One Model for Multi-K Queries

Learned top-K search is a promising approach for serving vector queries with both high accuracy and performance. However, current models trained for a specific K value fail to generalize to real-world multi-K queries: they suffer from…

Databases · Computer Science 2026-03-09 Yifan Peng, Jiafei Fan, Xingda Wei, Sijie Shen, Rong Chen, Jianning Wang, Xiaojian Luo, Wenyuan Yu, Jingren Zhou, Haibo Chen

Querying with Conflicts of Interest

Conflicts of interest often arise between data sources and their users regarding how the users' information needs should be interpreted by the data source. For example, an online product search might be biased towards presenting certain…

Databases · Computer Science 2026-03-09 Nischal Aryal, Arash Termehchy, Marianne Winslett

Space-efficient B-tree Implementation for Memory-Constrained Flash Embedded Devices

Small devices collecting data for agricultural, environmental, and industrial monitoring enable Internet of Things (IoT) applications. Given their critical role in data collection, there is a need for optimizations to improve on-device data…

Databases · Computer Science 2026-03-09 Nadir Ould-Khessal, Scott Fazackerley, Ramon Lawrence

Towards Neural Graph Data Management

While AI systems have made remarkable progress in processing unstructured text, structured data, such as graphs stored in databases, continues to grow rapidly yet remains difficult for neural models to utilize effectively. We introduce…

Databases · Computer Science 2026-03-09 Yufei Li, Yisen Gao, Jiaxin Bai, Jiaxuan Xiong, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Yangqiu Song

Publication and Maintenance of Relational Data in Enterprise Knowledge Graphs (Revised Version)

Enterprise knowledge graphs (EKGs) are a novel paradigm for consolidating and semantically integrating large numbers of heterogeneous data sources into a comprehensive dataspace. The main goal of an EKG is to provide a data layer that is…

Databases · Computer Science 2026-03-09 Vânia Maria Ponte Vidal, Valéria Magalhães Pequeno, Marco Antonio Casanova, Narciso Arruda, Carlos Brito

Efficient Query Rewrite Rule Discovery via Standardized Enumeration and Learning-to-Rank(extend)

Query rewriting is essential for database performance optimization, but existing automated rule enumeration methods suffer from exponential search spaces, severe redundancy, and poor scalability, especially when handling complex query plans…

Databases · Computer Science 2026-03-09 Yuan Zhang, Yuxing Chen, Yuekun Yu, Jinbin Huang, Rui Mao, Anqun Pan, Lixiong Zheng, Jianbin Qin

KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Discovering insights from a real-world data lake potentially containing unclean, semi-structured, and unstructured data requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and…

Databases · Computer Science 2026-03-09 Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Om Chabra, Sivaprasad Sudhir, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska

Cracking Vector Search Indexes

Retrieval Augmented Generation (RAG) uses vector databases to expand the expertise of an LLM model without having to retrain it. The idea can be applied over data lakes, leading to the notion of embedding data lakes, i.e., a pool of vector…

Databases · Computer Science 2026-03-09 Vasilis Mageirakos, Bowen Wu, Gustavo Alonso

O^3-LSM: Maximizing Disaggregated LSM Write Performance via Three-Layer Offloading

Log-Structured Merge-tree-based Key-Value Stores (LSM-KVS) have been optimized and redesigned for disaggregated storage via techniques such as compaction offloading to reduce the network I/Os between compute and storage. However, the…

Databases · Computer Science 2026-03-06 Qi Lin, Gangqi Huang, Te Guo, Chang Guo, Viraj Thakkar, Zichen Zhu, Jianguo Wang, Zhichao Cao

Bala-Join: An Adaptive Hash Join for Balancing Communication and Computation in Geo-Distributed SQL Databases

Shared-nothing geo-distributed SQL databases, such as CockroachDB, are increasingly vital for enterprise applications requiring data resilience and locality. However, we encountered significant performance degradation at the customer side,…

Databases · Computer Science 2026-03-06 Wenlong Song, Hui Li, Bingying Zhai, Jinxin Yang, Pinghui Wang, Luming Sun, Ming Li, Jiangtao Cui

CRISP: Correlation-Resilient Indexing via Subspace Partitioning

As the dimensionality of modern learned representations increases to thousands of dimensions, the state-of-the-art Approximate Nearest Neighbor (ANN) indices exhibit severe limitations. Graph-based methods (e.g., HNSW) suffer from…

Databases · Computer Science 2026-03-06 Dimitris Dimitropoulos, Achilleas Michalopoulos, Dimitrios Tsitsigkos, Nikos Mamoulis

RESYSTANCE: Unleashing Hidden Performance of Compaction in LSM-trees via eBPF

The development of high-speed storage devices such as NVMe SSDs has shifted the primary I/O bottleneck from hardware to software. Modern database systems also rely on kernel-based I/O paths, where frequent system call invocations and…

Databases · Computer Science 2026-03-06 Hongsu Byun, Seungjae Lee, Honghyeon Yoo, Myoungjoon Kim, Sungyong Park

FluxSieve: Unifying Streaming and Analytical Data Planes for Scalable Cloud Observability

Despite many advances in query optimization, indexing techniques, and data storage, modern data platforms still face difficulties in delivering robust query performance under high concurrency and computationally intensive queries. This…

Databases · Computer Science 2026-03-06 Adriano Vogel, Sören Henning, Otmar Ertl

Deterministic Preprocessing and Interpretable Fuzzy Banding for Cost-per-Student Reporting from Extracted Records

Administrative extracts are often exchanged as spreadsheets and may be read as reports in their own right during budgeting, workload review, and governance discussions. When an exported workbook becomes the reference snapshot for such…

Databases · Computer Science 2026-03-06 Shane Lee, Stella Ng

Beyond Linear LLM Invocation: An Efficient and Effective Semantic Filter Paradigm

Large language models (LLMs) are increasingly used for semantic query processing over large corpora. A set of semantic operators derived from relational algebra has been proposed to provide a unified interface for expressing such queries,…

Databases · Computer Science 2026-03-06 Nan Hou, Kangfei Zhao, Jiadong Xie, Jeffrey Xu Yu

Towards a B+-tree with Fluctuation-Free Performance

Performance predictability is critical for modern DBMSs because index maintenance can trigger rare but severe I/O spikes. In a B or B+-tree with height H, node split propagation means the cost of a single insert can vary from H + 1 to 3H +…

Databases · Computer Science 2026-03-06 Lu Xing, Walid G. Aref

stratum: A System Infrastructure for Massive Agent-Centric ML Workloads

Recent advances in large language models (LLMs) transform how machine learning (ML) pipelines are developed and evaluated. LLMs enable a new type of workload, agentic pipeline search, in which autonomous or semi-autonomous agents generate,…

Databases · Computer Science 2026-03-06 Arnab Phani, Elias Strauss, Sebastian Schelter

V3DB: Audit-on-Demand Zero-Knowledge Proofs for Verifiable Vector Search over Committed Snapshots

Dense retrieval services increasingly underpin semantic search, recommendation, and retrieval-augmented generation, yet clients typically receive only a top-$k$ list with no auditable evidence of how it was produced. We present V3DB, a…

Databases · Computer Science 2026-03-06 Zipeng Qiu, Wenjie Qu, Jiaheng Zhang, Binhang Yuan

The Case for Cardinality Lower Bounds

Despite decades of research, cardinality estimation remains the optimizer's Achilles heel, with industrial-strength systems exhibiting a systemic tendency toward underestimation. At cloud scale, this is a severe production vulnerability: in…

Databases · Computer Science 2026-03-06 Mihail Stoian, Tiemo Bang, Hangdong Zhao, Jesús Camacho-Rodríguez, Yuanyuan Tian, Andreas Kipf

Scalable Join Inference for Large Context Graphs

Context graphs are essential for modern AI applications including question answering, pattern discovery, and data analysis. Building accurate context graphs from structured databases requires inferring join relationships between entities.…

Databases · Computer Science 2026-03-05 Shivani Tripathi, Ravi Shetye, Shi Qiao, Alekh Jindal