Databases - Scifaro

Efficient Model Repository for Entity Resolution: Construction, Search, and Integration

Entity resolution (ER) is a fundamental task in data integration that enables insights from heterogeneous data sources. The primary challenge of ER lies in classifying record pairs as matches or nonmatches, which in multi-source ER (MS-ER)…

Databases · Computer Science 2026-04-10 Victor Christen, Peter Christen

AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views

Text-to-SQL is the task of translating natural language queries into executable SQL for a given database, enabling non-expert users to access structured data without writing SQL manually. Despite rapid advances driven by large language…

Databases · Computer Science 2026-04-09 Minh Tam Pham, Trinh Pham, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO

Query rewriting, the process of transforming queries into semantically equivalent yet more efficient variants, is crucial for database optimization. Existing solutions predominantly rely on either rule-based heuristics or Large Language…

Databases · Computer Science 2026-04-09 Jiahui Li, Tongwang Wu, Yuren Mao, Rong Kang, Tieying Zhang, Yunjun Gao

AI-Driven Research for Databases

As the complexity of modern workloads and hardware increasingly outpaces human research and engineering capacity, existing methods for database performance optimization struggle to keep pace. To address this gap, a new class of techniques,…

Databases · Computer Science 2026-04-09 Audrey Cheng, Harald Ng, Aaron Kabcenell, Peter Bailis, Matei Zaharia, Lin Ma, Xiao Shi, Ion Stoica

Database Querying under Missing Values Governed by Missingness Mechanisms

We address the problems of giving a semantics to, and doing query answering (QA) on, a relational database (RDB) that has missing values (MVs). The causes for the latter are governed by a Missingness Mechanism that is modelled as a Bayesian…

Databases · Computer Science 2026-04-09 Leopoldo Bertossi, Farouk Toumani, Maxime Buron

Automating Database-Native Function Code Synthesis with LLMs

Database systems incorporate an ever-growing number of functions in their kernels (a.k.a., database native functions) for scenarios like new application support and business migration. This growth causes an urgent demand for automatic…

Databases · Computer Science 2026-04-09 Wei Zhou, Xuanhe Zhou, Qikang He, Guoliang Li, Bingsheng He, Quanqing Xu, Fan Wu

Ontology-based knowledge graph infrastructure for interoperable atomistic simulation data

The reuse of atomistic simulation data is often limited by heterogeneous formats, incomplete metadata, and a lack of standardized representations of workflows and provenance. Here we present an ontology-based infrastructure for representing…

Databases · Computer Science 2026-04-09 Abril Azocar Guzman, Sarath Menon, Tilmann Hickel, Stefan Sandfeld

Jaguar: A Primal Algorithm for Conjunctive Query Evaluation in Submodular-Width Time

The submodular width is a complexity measure of conjunctive queries (CQs), which assigns a nonnegative real number, subw(Q), to each CQ Q. An existing algorithm, called PANDA, performs CQ evaluation in polynomial time where the exponent is…

Databases · Computer Science 2026-04-08 Mahmoud Abo Khamis, Hubie Chen

PANDAExpress: a Simpler and Faster PANDA Algorithm

PANDA is a powerful generic algorithm for answering conjunctive queries (CQs) and disjunctive datalog rules (DDRs) given input degree constraints. In the special case where degree constraints are cardinality constraints and the query is…

Databases · Computer Science 2026-04-08 Mahmoud Abo Khamis, Hung Q. Ngo, Dan Suciu

Cortex AISQL: A Production SQL Engine for Unstructured Data

Snowflake's Cortex AISQL is a production SQL engine that integrates native semantic operations directly into SQL. This integration allows users to write declarative queries that combine relational operations with semantic reasoning,…

Databases · Computer Science 2026-04-08 Paweł Liskowski, Benjamin Han, Paritosh Aggarwal, Bowei Chen, Boxin Jiang, Nitish Jindal, Zihan Li, Aaron Lin, Kyle Schmaus, Jay Tayade, Weicheng Zhao, Anupam Datta, Nathan Wiegand, Dimitris Tsirogiannis

Advancing Object-Centric Process Mining with Multi-Dimensional Data Operations

Analyzing process data at varying levels of granularity is important to derive actionable insights and support informed decision-making. Object-Centric Event Data (OCED) enhances process mining by capturing interactions among events and…

Databases · Computer Science 2026-04-08 Shahrzad Khayatbashi, Najmeh Miri, Amin Jalali

Discovering Process Models With Long-Term Dependencies While Providing Guarantees and Filtering Infrequent Behavior Patterns

In process discovery, the goal is to find, for a given event log, the model describing the underlying process. While process models can be represented in a variety of ways, Petri nets form a theoretically well-explored description language…

Databases · Computer Science 2026-04-08 Lisa Luise Mannel, Wil M. P. van der Aalst

Query Optimization and Evaluation via Information Theory: A Tutorial

Database theory is exciting because it studies highly general and practically useful abstractions. Conjunctive query (CQ) evaluation is a prime example: it simultaneously generalizes graph pattern matching, constraint satisfaction, and…

Databases · Computer Science 2026-04-07 Mahmoud Abo Khamis, Hung Q. Ngo, Dan Suciu

Cardinality Estimation for High Dimensional Similarity Queries with Adaptive Bucket Probing

In this work, we address the problem of cardinality estimation for similarity search in high-dimensional spaces. Our goal is to design a framework that is lightweight, easy to construct, and capable of providing accurate estimates with…

Databases · Computer Science 2026-04-07 Zhonghan Chen, Qintian Guo, Ruiyuan Zhang, Xiaofang Zhou

Version Control System for Data with MatrixOne

The rapid advancement of artificial intelligence has elevated data to a cornerstone of modern software systems. As data projects become increasingly complex and dynamic, version control for data has become essential rather than merely…

Databases · Computer Science 2026-04-07 Hongshen Gou, Feng Tian, Long Wang, Nan Deng, Peng Xu

VectraFlow: Long-Horizon Semantic Processing over Data and Event Streams with LLMs

Monitoring continuous data for meaningful signals increasingly demands long-horizon, stateful reasoning over unstructured streams. However, today's LLM frameworks remain stateless and one-shot, and traditional Complex Event Processing (CEP)…

Databases · Computer Science 2026-04-07 Shu Chen, Junhan Liu, Deepti Raghavan, Ugur Cetintemel

Making Prompts First-Class Citizens for Adaptive LLM Pipelines

Modern LLM pipelines increasingly resemble complex data-centric applications: they retrieve data, correct errors, call external tools, and coordinate interactions between agents. Yet, the central element controlling this entire process --…

Databases · Computer Science 2026-04-07 Ugur Cetintemel, Shu Chen, Alexander W. Lee, Deepti Raghavan, Duo Lu, Andrew Crotty

Causality-Based Scores Alignment in Explainable Data Management

Different attribution scores have been proposed to quantify the relevance of database tuples for query answering in databases; e.g. Causal Responsibility, the Shapley Value, the Banzhaf Power-Index, and the Causal Effect. They have been…

Databases · Computer Science 2026-04-07 Felipe Azua, Leopoldo Bertossi

Unified and Efficient Approach for Multi-Vector Similarity Search

Multi-Vector Similarity Search is essential for fine-grained semantic retrieval in many real-world applications, offering richer representations than traditional single-vector paradigms. Due to the lack of a native multi-vector index,…

Databases · Computer Science 2026-04-06 Binhan Yang, Yuxiang Zeng, Hengxin Zhang, Zhuanglin Zheng, Yunzhen Chi, Yongxin Tong, Ke Xu

Distance Comparison Operations Are Not Silver Bullets in Vector Similarity Search: A Benchmark Study on Their Merits and Limits

Distance Comparison Operations (DCOs), which decide whether the distance between a data vector and a query is within a threshold, are a critical performance bottleneck in vector similarity search. Recent DCO methods that avoid…

Databases · Computer Science 2026-04-06 Zhuanglin Zheng, Yuxiang Zeng, Chenchen Liu, Yunzhen Chi, Binhan Yang, Yongxin Tong