Databases - Scifaro

Understanding Domain-Aware Distribution Alignment in Budgeted Entity Matching

Entity Matching (EM) is a core operation in the data integration pipeline, where records from different sources are compared to determine whether they refer to the same real-world entity. Recent work has incorporated domain information and…

Databases · Computer Science 2026-06-25 Nicholas Pulsone, Gregory Goren, Roee Shraga

BtrLog: Low-Latency Logging for Cloud Database Systems

Cloud database systems cannot rely on instance-local disks for write-ahead logging (WAL) durability, forcing WAL onto remote storage. Existing options are unsatisfying: remote block storage like EBS is easy to adopt but adds substantial…

Databases · Computer Science 2026-06-25 Maximilian Kuschewski, Lam-Duy Nguyen, Matthias Jasny, Tobias Ziegler, Viktor Leis, Muhammad El-Hindi

EcoTable: Cost-effective Table Integration in Data Lakes for Natural Language Queries

The diverse formats of CSV and Parquet files in data lakes pose a significant challenge to traditional ETL, which relies on data engineers to pre-define a target database schema and build a complex pipeline for data integration. Moreover,…

Databases · Computer Science 2026-06-25 Yuhui Wang, Jinqi Liu, Chengliang Chai, Hangyu Zhao, Yuhao Deng, Yuyu Luo, Xin Tang, Ye Yuan, Guoren Wang, Fengjin Wang, Lei Cao

3D Spatial Pattern Matching

Spatial pattern matching is the process of matching query entities and constraints with database entities and relations. It has many applications, including similar region search, housing market search, landmark search, and road network…

Databases · Computer Science 2026-06-25 Nicole R. Schneider, Avik Das, Lukas Arzoumanidis, Abhijeet Ghodgaonkar, Hanan Samet, Youness Dehbi

Query Cost Model Calibration in Confidential Virtual Machines

With the growing adoption of Confidential Computing, running databases in confidential virtual machines (CVMs) such as AMD SEV-SNP has become an attractive way to protect sensitive cloud data with minimal changes to legacy DBMSs. However,…

Databases · Computer Science 2026-06-24 Qihan Zhang, Mengyuan Li, Ibrahim Sabek

Zero-Scan Data Quality: Leveraging Table Format Metadata for Continuous Observability at Scale

Modern table formats such as Apache Iceberg compute and store metadata (commit timestamps, record counts, and column-level statistics such as null counts and value bounds) at write time as part of file writing. These statistics serve query…

Databases · Computer Science 2026-05-29 Mohit Verma, Shantanu Rawat, Christian Bush, Sumedh Sakdeo, Lokesh Amarnath Ravindranathan, Dwarak Bakshi

The Missing Dimensions in Geo-Distributed Database Evaluation

Geo-distributed OLTP databases are widely deployed across cloud regions, yet current evaluation practices do not cover the challenges of this aspect. Existing benchmarks assume stable network conditions; they lack explicit settings for data…

Databases · Computer Science 2026-05-29 Oto Mraz, Kyriakos Psarakis, George Christodoulou, Paris Carbone, Asterios Katsifodimos

Towards Reliable Agentic Progressive Text-to-Visualization with Verification Rules

Text-to-Visualization (Text-to-Vis) translates natural language queries into visualization query languages, enabling non-expert users to perform data analysis. However, most existing methods follow a one-shot paradigm that requires users to…

Databases · Computer Science 2026-05-29 Wenxin Xu, Chen Jason Zhang, Xiaoyong Wei, Haoyang Li, Hwanhee Kim, Yuanfeng Song, Raymond Chi-Wing Wong

One Ring to Shuffle Them All: Scalable Intra-Process Data Redistribution with Ring-Buffer Shuffle in Redpanda Oxla

As server CPUs scale to dozens and now hundreds of cores per socket, parallel query engines must rethink how they redistribute data between threads. Partitioned operators such as hash joins and aggregations require frequent data…

Databases · Computer Science 2026-05-29 Adam Szymański, Tyler Akidau

ScanTwin: Simulating Performance Regressions Without Access to Tenant Data

In cloud data platforms, developers often encounter performance regressions that occur in specific tenant datasets. However, due to confidentiality constraints, they cannot access the original data, which makes it difficult to reproduce…

Databases · Computer Science 2026-05-29 Donghyun Sohn, Jennie Rogers

IORM: Hierarchical I/O Governance for Thousands of Consolidated Databases on Oracle Exadata

Oracle Exadata consolidates thousands of tenant databases onto shared storage infrastructure deployed at hundreds of customer sites worldwide. Oracle Multitenant architecture enables this extreme density, with thousands of tenant databases…

Databases · Computer Science 2026-05-29 Rajarshi Chowdhury, Akshay Shah, Zakaria Alrmaih, Chenhao Guo, Anubhav Singh, Sue Lee

E2E: Efficient Filtered AKNN Search via Adaptive Termination

Approximate k-Nearest Neighbor (AKNN) search is widely used in vector databases. When vectors carry additional attributes (e.g., labels or numerical values), filtered AKNN search retrieves the nearest vectors to a query vector under…

Databases · Computer Science 2026-05-29 Wenxuan Xia, Mingyu Yang, Wentao Li, Wei Wang

Grain Theory: Type-Level Granularity Correctness in Data Pipelines

Data transformation correctness is a fundamental challenge in data engineering: how can we verify that pipelines produce correct results before executing on production data? Existing practice relies on iterative testing over materialized…

Databases · Computer Science 2026-05-29 Nikos Karayannidis

Redbench: Workload Synthesis From Cloud Traces

Workload traces from cloud data warehouse providers reveal that standard benchmarks such as TPC-H and TPC-DS fail to capture key characteristics of real-world workloads, including query repetition and string-heavy queries. In this paper, we…

Databases · Computer Science 2026-05-29 Johannes Wehrstein, Roman Heinrich, Mihail Stoian, Skander Krid, Martin Stemmer, Andreas Kipf, Carsten Binnig, Muhammad El-Hindi

Towards Cost-effective LLMs Routing with Batch Prompting

Large Language Model (LLM) serving systems must balance task performance against monetary cost. Two prominent optimization techniques have emerged independently: LLM routing, which directs each query to the most cost-effective model in a…

Databases · Computer Science 2026-05-28 Haotian Xu, Kangfei Zhao, Jiadong Xie

Are Diffusion Language Models Good Database Analysts?

Recent advancements in large language models (LLMs) have significantly improved Natural Language to SQL (NL2SQL) tasks, yet most NL2SQL systems continue to rely on the autoregressive (AR) paradigm. The highly structured nature of SQL makes…

Databases · Computer Science 2026-05-28 Peixian Ma, Xialie Zhuang, Jiantao Tan, Changlun Li, Ruirui Chen, Chengwei Qin

Enhancing OLAP Resilience at LinkedIn

Real-time OLAP datastores are critical infrastructure for modern enterprises, powering interactive analytics on petabyte-scale datasets with subsecond latency requirements. As these systems become integral to service architectures,…

Databases · Computer Science 2026-05-28 Praveen Chaganlal, Jia Guo, Vivek Vaidyanathan, Dino Occhialini, Sonam Mandal, Subbu Subramaniam, Siddharth Teotia, Tianqi Li, Xiaxuan Gao, Florence Zhang

AlayaLaser: Efficient Index Layout and Search Strategy for Large-scale High-dimensional Vector Similarity Search

On-disk graph-based approximate nearest neighbor search (ANNS) is essential for large-scale, high-dimensional vector retrieval, yet its performance is widely recognized to be limited by the prohibitive I/O costs. Interestingly, we observed…

Databases · Computer Science 2026-05-28 Weijian Chen, Haotian Liu, Yangshen Deng, Long Xiang, Liang Huang, Bo Tang

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) establishes that GPT-4 agents achieve 65% on 139 industrial maintenance scenarios backed by CouchDB,…

Databases · Computer Science 2026-05-27 Madhulatha Mandarapu, Sandeep Kunkunuru

RT-RkNN: Reverse k Nearest Neighbor Queries as a Graphics Ray Casting Problem

Reverse k nearest neighbor (RkNN) queries are fundamental in spatial databases, location-based analytics, and recommendation systems. Existing state-of-the-art techniques rely on spatial pruning supported by R-trees and their variants.…

Databases · Computer Science 2026-05-27 Zhengyang Bai, Peng Chen, Mohamed Wahib