Databases - Scifaro

The Decode-Work Law: Margin-Governed, Provably-Exact Spatial Joins over Compressed Geometry

Filter-and-refine spatial joins have always avoided touching exact geometry for certified candidate pairs, but the field never modeled the decompression cost of the pairs that survive the filter. When geometry is stored in a compressed,…

Databases · Computer Science 2026-07-01 Madhulatha Mandarapu, Sandeep Kunkunuru

From Single to Multiple Attributes: Experimental Insights on Sampling-Based Distinct Combination Estimation in GROUP-BY Queries

Estimating the number of distinct combinations in multi-attribute GROUP-BY queries remains a significant yet underexplored challenge. Current cardinality estimation techniques primarily focus on SPJ queries (i.e., selections, projections,…

Databases · Computer Science 2026-07-01 Yujie Zhang, Xiaochun Yang, Bin Wang, Yuan Sui

Generative Retrieval for Table Union Search

Modern data lakes contain heterogeneous tables whose task-relevant information is often scattered across different schemas, sources, and naming conventions. Table union search (TUS) retrieves tables that can be reliably unioned with a query…

Databases · Computer Science 2026-07-01 Shulun Zhang, Linting Wang, Yuwei Xu, Yingli Zhou, Chenhao Ma

Exploring the Semantic Gap in Agentic Data Systems: A Formative Study of Operationalization Failures in Analytical Workflows

Large language models (LLMs) are increasingly used to generate queries, invoke tools, and construct analytical workflows. Although recent advances have substantially improved workflow generation and execution, the semantic information…

Databases · Computer Science 2026-07-01 Jalal Mahmud, Eser Kandogan

RACORN-1: Adaptive Recall-Preserving Speedup for Low-Selectivity Filtered Vector Search

Filtered Vector Search (FVS), which combines vector embedding similarity with structured metadata predicates, has emerged as a core requirement in RAG and production retrieval systems. ACORN-1, the representative In-filtering algorithm that…

Databases · Computer Science 2026-07-01 Yoonseok Kim, Gyusik Choe

SessionBound: Turning Enterprise Task Approval into Budgeted Database Sessions

Enterprise AI agents are useful for internal analysis, audit, compliance review, and operational investigation, but they create a difficult authorization problem. A manager or data owner may approve a business task, while the agent later…

Databases · Computer Science 2026-07-01 Minmin Wu

When to Repair a Graph ANN Index: Navigability-Signal-Triggered Local Repair Protects Tail Recall Under Bursty Churn

Graph approximate-nearest-neighbor (ANN) indexes (HNSW, DiskANN/Vamana) lose recall under insert/delete churn, because deletions orphan the greedy-search paths that route through removed nodes. Production systems restore navigability by…

Databases · Computer Science 2026-07-01 Madhulatha Mandarapu, Sandeep Kunkunuru

Approximate Nearest Neighbor Search with Graph Range Filters

Vector databases have become a fundamental component for high-dimensional vector retrieval in artificial intelligence applications. Recent research has focused on filtered approximate nearest neighbor search (filtered ANN), which involves…

Databases · Computer Science 2026-07-01 Qian Tao, Yuntao Jiang, Yongxin Tong, Yu Sun

TVA: A Version-aware Temporal Graph Storage System for Real-time Analytics

Analyzing temporal graphs can reveal valuable insights that are typically hidden in static graphs. Unfortunately, existing graph storage systems either lack native temporal support or suffer from high latency when querying temporal graphs.…

Databases · Computer Science 2026-07-01 Wenhao Li, Zhanhao Zhao, Jinhao Dong, Jiamin Hou, Wei Lu, Yunhai Wang, Xiaoyong Du

When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers

LLM agents increasingly rely on retrieval buffers to store and reuse past experience, yet the cache management policies governing these buffers remain largely ad-hoc. We formalize this as an online semantic cache replacement problem with…

Databases · Computer Science 2026-07-01 Yushi Sun, Bowen Cao, Wai Lam

Query-Centric Optimization of AI Workflows via Approximate Query Processing and Proxy Models

Many modern AI workflows, ranging from LLM post-training pipelines to agentic reasoning tasks, can be expressed as declarative queries whose expensive predicate is evaluated by a large model or reward function. We propose a query-centric…

Databases · Computer Science 2026-06-30 Huayi Wang, Jun Xu, Gromit Yeuk-Yin Chan

Clean Me If You Can: A Large Collection of Real-World Addresses for Data Cleaning Benchmarking

There has been extensive research on automating and scaling data cleaning, i.e., the detection and correction of erroneous values in tabular data. Yet, existing approaches often perform well only within controlled environments. One of the…

Databases · Computer Science 2026-06-30 Fatemeh Ahmadi, Tobias Bernhard, Mohamed Abdelmaksoud, Luca Zecchini, Tilmann Rabl, Ziawasch Abedjan

DA-Studio: An Agentic System for End-to-End Data Analysis

Real-world data analysis is a multi-step process over heterogeneous inputs rather than merely producing a final answer. A practical system should autonomously organize multi-step workflows, execute generated code in a sandboxed and…

Databases · Computer Science 2026-06-30 Yizhe Liu, Shaolei Zhang, Ju Fan

MaDI-Bench: An End-to-End Data Integration Benchmark

Data integration combines heterogeneous data sets into a single, coherent representation. Data integration involves a sequence of interdependent tasks including schema matching, value normalization, entity blocking, entity matching, and…

Databases · Computer Science 2026-06-29 Aaron Steiner, Ralph Peeters, Christian Bizer

CLIP: Lightweight Cosine-Law-Based Inverted-List Pruning for IVF-Based Vector Search

Vector search has become a core component of modern multimodal retrieval systems. Among existing methods, inverted file (IVF)-based methods are widely adopted due to their scalability, efficient updates, and hardware friendliness. However,…

Databases · Computer Science 2026-06-29 Yitong Song, Shuhang Lu, Xuanhe Zhou, Pengcheng Zhang, Jianliang Xu

Experience Graphs: The Data Foundation for Self-Improving Agents

The database community has repeatedly advanced the state of the art by recognizing that new workloads demand new system architectures. We argue that long-horizon agentic tasks -- code generation, scientific discovery, hardware design -- are…

Databases · Computer Science 2026-06-29 Gang Liao, Yujia He, Abdullah Ozturk, Zhouyang Li, Ying Wang, Zhitong Guo, Hongsen Qin, Yaobin Qin, Tao Yang, Zewei Jiang, Dianshi Li, Jort Gemmeke, Jiangyuan Li, Liyuan Li, Nathan Yan, Masha Basmanova, Uladzimir Pashkevich, Matt Steiner, Pedro Pedreira, Rob Fergus, Anirudh Goyal, Carole-Jean Wu, Gaoxiang Liu, Andrew Witten, Daniel J. Abadi

Mandol: An Agglomerative Agent Memory System for Long-Term Conversations

Long-term conversational agents need to remember and query cross-session, multi-typed information with complex correlations. Existing agent memory systems rely on heterogeneous vector and graph databases, which fragment memory information…

Databases · Computer Science 2026-06-29 Yuhan Zhang, Zhiyuan Guo, Ziheng Zeng, Wei Wang, Wentao Wu, Lijie Xu

SemJoin: Semantic Join Optimization

Integrating unstructured data into relational database systems is increasingly important as demand grows for natural language querying and analysis. A semantic join, joining two tables under a natural-language predicate, can be evaluated…

Databases · Computer Science 2026-06-28 Christopher Gou, Aditya Banerjee, Jiaxuan Wang, Chunwei Liu

Enterprise Data Modelling Methodologies: A Comparative Analysis of Inmon, Kimball, and Data Vault

The design and governance of enterprise data warehouses constitute foundational decisions in modern data-driven organisations, with long-term impact on analytical capability, operational agility, and regulatory compliance. This paper…

Databases · Computer Science 2026-06-28 Issar Arab

CADENZA: Compiling Natural-Language Intent into Task-Specific Operator DAGs for Semantic Query Processing

Semantic query processing engines (SQPEs) extend relational query processing with semantic operators that are executed via model inference over unstructured data. Optimizing such queries is inherently multi-objective: model inference…

Databases · Computer Science 2026-06-28 Jaehyun Ha, Yongjoo Park, Wook-Shin Han