Databases - Scifaro

Even Faster Geosocial Reachability Queries

Geosocial reachability queries (RangeReach) determine whether a given vertex in a geosocial network can reach any spatial vertex within a query region. The state-of-the-art 3DReach method answers such queries by encoding graph…

Databases · Computer Science 2026-02-06 Rick van der Heijden, Nikolay Yakovets, Thekla Hamm

Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration

Retrieval-augmented generation (RAG) enhances LLM reasoning in knowledge-intensive tasks, but existing RAG pipelines incur substantial retrieval and generation overhead when applied to large-scale entity matching. To address this…

Databases · Computer Science 2026-02-06 Chuangtao Ma, Zeyu Zhang, Arijit Khan, Sebastian Schelter, Paul Groth

One Size Does NOT Fit All: On the Importance of Physical Representations for Datalog Evaluation

Datalog is an increasingly popular recursive query language that is declarative by design, meaning its programs must be translated by an engine into the actual physical execution plan. When generating this plan, a central decision is how to…

Databases · Computer Science 2026-02-06 Nick Rassau, Felix Schuhknecht

Taking the Leap: Efficient and Reliable Fine-Grained NUMA Migration in User-space

Modern multi-socket architectures offer a single virtual address space, but physically divide main-memory across multiple regions, where each region is attached to a CPU and its cores. While this simplifies the usage, developers must be…

Databases · Computer Science 2026-02-06 Felix Schuhknecht, Nick Rassau

Repairing Property Graphs under PG-Constraints

Recent standardization efforts for graph databases lead to standard query languages like GQL and SQL/PGQ, and constraint languages like Property Graph Constraints (PG-Constraints). In this paper, we embark on the study of repairing property…

Databases · Computer Science 2026-02-06 Christopher Spinrath, Angela Bonifati, Rachid Echahed

DistillER: Knowledge Distillation in Entity Resolution with Large Language Models

Recent advances in Entity Resolution (ER) have leveraged Large Language Models (LLMs), achieving strong performance but at the cost of substantial computational resources or high financial overhead. Existing LLM-based ER approaches operate…

Databases · Computer Science 2026-02-06 Alexandros Zeakis, George Papadakis, Dimitrios Skoutas, Manolis Koubarakis

Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) is now standard for knowledge-intensive LLM tasks, but most systems still treat every query as fresh, repeatedly re-retrieving long passages and re-reasoning from scratch, inflating tokens, latency, and…

Databases · Computer Science 2026-02-06 Ning Wang, Kuanyan Zhu, Daniel Yuehwoon Yee, Yitang Gao, Shiying Huang, Zirun Xu, Sainyam Galhotra

Overview of Publicly Available Degradation Data Sets for Tasks within Prognostics and Health Management

Central to the efficacy of prognostics and health management methods is the acquisition and analysis of degradation data, which encapsulates the evolving health condition of engineering systems over time. Degradation data serves as a rich…

Databases · Computer Science 2026-02-06 Fabian Mauthe, Christopher Braun, Julian Raible, Peter Zeiler, Marco F. Huber

Identifying knowledge gaps in biodiversity data and their determinants at the regional level

Biodiversity open-access databases are valuable resources in the structuring and accessibility of species occurrence data. By compiling different data sources, they reveal the uneven spatial distribution of knowledge, with areas or…

Databases · Computer Science 2026-02-05 Didier Alard, Anaïs Guéry

Data Agents: Levels, State of the Art, and Open Problems

Data agents are an emerging paradigm that leverages large language models (LLMs) and tool-using agents to automate data management, preparation, and analysis tasks. However, the term "data agent" is currently used inconsistently, conflating…

Databases · Computer Science 2026-02-05 Yuyu Luo, Guoliang Li, Ju Fan, Nan Tang

LatentTune: Efficient Tuning of High Dimensional Database Parameters via Latent Representation Learning

As data volumes continue to grow, optimizing database performance has become increasingly critical, making the implementation of effective tuning methods essential. Among various approaches, database parameter tuning has proven to be a…

Databases · Computer Science 2026-02-05 Sein Kwon, Youngwan Jo, Seungyeon Choi, Jieun Lee, Huijun Jin, Sanghyun Park

Piece of CAKE: Adaptive Execution Engines via Microsecond-Scale Learning

Low-level database operators often admit multiple physical implementations ("kernels") that are semantically equivalent but have vastly different performance characteristics depending on the input data distribution. Existing database…

Databases · Computer Science 2026-02-05 Zijie Zhao, Ryan Marcus

PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models

Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints.…

Databases · Computer Science 2026-02-05 Vignesh Kothapalli, Rishabh Ranjan, Valter Hudovernik, Vijay Prakash Dwivedi, Johannes Hoffart, Carlos Guestrin, Jure Leskovec

StraTyper: Automated Semantic Type Discovery and Multi-Type Annotation for Dataset Collections

Understanding dataset semantics is crucial for effective search, discovery, and integration pipelines. To this end, column type annotation (CTA) methods associate columns of tabular datasets with semantic types that accurately describe…

Databases · Computer Science 2026-02-05 Christos Koutras, Juliana Freire

Tidehunter: Large-Value Storage With Minimal Data Relocation

Log-Structured Merge-Trees (LSM-trees) dominate persistent key-value storage but suffer from high write amplification from 10x to 30x under random workloads due to repeated compaction. This overhead becomes prohibitive for large values with…

Databases · Computer Science 2026-02-05 Andrey Chursin, Lefteris Kokoris-Kogias, Alex Orlov, Alberto Sonnino, Igor Zablotchi

GPU-Accelerated ANNS: Quantized for Speed, Built for Change

Approximate nearest neighbor search (ANNS) is a core problem in machine learning and information retrieval applications. GPUs offer a promising path to high-performance ANNS: they provide massive parallelism for distance computations, are…

Databases · Computer Science 2026-02-05 Hunter McCoy, Zikun Wang, Prashant Pandey

A Chase-based Approach to Consistent Answers of Analytic Queries in Star Schemas

We present an approach to computing consistent answers to queries possibly involving an aggregation operator in databases operating under a star schema and possibly containing missing values and inconsistent data. Our approach is based on…

Databases · Computer Science 2026-02-05 Dominique Laurent, Nicolas Spyratos

StreamShield: A Production-Proven Resiliency Solution for Apache Flink at ByteDance

Distributed Stream Processing Systems (DSPSs) form the backbone of real-time processing and analytics at ByteDance, where Apache Flink powers one of the largest production clusters worldwide. Ensuring resiliency, the ability to withstand…

Databases · Computer Science 2026-02-04 Yong Fang, Yuxing Han, Meng Wang, Yifan Zhang, Yue Ma, Chi Zhang

Skill-Based Autonomous Agents for Material Creep Database Construction

The advancement of data-driven materials science is currently constrained by a fundamental bottleneck: the vast majority of historical experimental data remains locked within the unstructured text and rasterized figures of legacy scientific…

Databases · Computer Science 2026-02-04 Yue Wu, Tianhao Su, Shunbo Hu, Deng Pan

ResQ: Realistic Performance-Aware Query Generation

Database research and development rely heavily on realistic user workloads for benchmarking, instance optimization, migration testing, and database tuning. However, acquiring real-world SQL queries is notoriously challenging due to strict…

Databases · Computer Science 2026-02-04 Zhengle Wang, Yanfei Zhang, Chunwei Liu