Databases - Scifaro

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) establishes that GPT-4 agents achieve 65% on 139 industrial maintenance scenarios backed by CouchDB,…

Databases · Computer Science 2026-05-27 Madhulatha Mandarapu, Sandeep Kunkunuru

RT-RkNN: Reverse k Nearest Neighbor Queries as a Graphics Ray Casting Problem

Reverse k nearest neighbor (RkNN) queries are fundamental in spatial databases, location-based analytics, and recommendation systems. Existing state-of-the-art techniques rely on spatial pruning supported by R-trees and their variants.…

Databases · Computer Science 2026-05-27 Zhengyang Bai, Peng Chen, Mohamed Wahib

Generalized Range Filtering Approximate Nearest Neighbor Search: Containment and Overlap [Technical Report]

Approximate nearest neighbor (ANN) search with range filters has recently garnered significant attention. This paper delves into a generalized form of this problem, i.e., ANN search with exact range-range (RR) predicates on a range-valued…

Databases · Computer Science 2026-05-27 Yingfan Liu, Tong Wu, Jiadong Xie, Yang Zhao, Jeffrey Xu Yu, Jiangtao Cui

Conceptual Schema Inference for Tabular Datasets using Large Language Models

Large collections of tabular data from data lakes, web tables and open data portals often originate from heterogeneous sources, leading to representational inconsistencies. Understanding and organizing such repositories therefore remains a…

Databases · Computer Science 2026-05-27 Zhenyu Wu, Jiaoyan Chen, Norman W. Paton

Do GPUs Really Need New Tabular File Formats?

Parquet is the de facto columnar file format in modern analytical systems, yet its configuration guidelines have largely been shaped by CPU-centric execution models. As GPU-accelerated data processing becomes increasingly prevalent, Parquet…

Databases · Computer Science 2026-05-27 Jigao Luo, Qi Chen, Carsten Binnig

Scaling GraphLLM with Bilevel-Optimized Sparse Querying

LLMs have recently shown strong potential in enhancing node-level tasks on text-attributed graphs (TAGs) by providing explanation features. However, their practical use is severely limited by the high computational and monetary cost of…

Databases · Computer Science 2026-05-27 Yangzhe Peng, Haiquan Qiu, Quanming Yao, Kun He

Same Data, Different Schemas: Robustness of LLM-based Text-to-SQL

Large language models (LLMs) consistently achieve strong results on text-to-SQL benchmarks, but their robustness to schema variations remains poorly understood. Recent work suggests that the schema structure matters, but does not provide a…

Databases · Computer Science 2026-05-26 Nitin Kanchinadam, Aditya Menachery, Amol Deshpande

CS-PQ: Cache-Friendly SIMD Product Quantization for Large-Scale ANNS Index Construction

Product Quantization (PQ) construction is deeply integrated into vector index construction for Approximate Nearest Neighbor Search (ANNS). The rapid growth in vector dimensionality and volume has significantly increased the computational…

Databases · Computer Science 2026-05-26 Y. T. Ma, K. C. Huang, X. K. Jiang, M. L. Wang, X. Yao, R. H. Chen, G. Zhang, Z. L. Shao

Top-k Approximate Functional Dependency Discovery

Approximate functional dependencies (AFDs) relax exact functional dependencies by tolerating a bounded degree of violation, making them suited for data quality auditing. Threshold-based discovery returns all dependencies above a…

Databases · Computer Science 2026-05-26 Xiaolong Wan, Xixian Han

MetaboKG: An Analysis-centric Knowledge Graph Framework for Untargeted Metabolomics

Untargeted metabolomics generates large volumes of tandem mass spectrometry (MS/MS) data and computational annotations that can reveal molecular mechanisms across organisms and environments. Public reuse has improved through harmonized…

Databases · Computer Science 2026-05-26 Matthieu Féraud, Dina Boukhajou, Fabien Gandon, Louis-Félix Nothias

LEARNT: A Practical Estimator for Cardinality of LIKE Queries with Formal Accuracy Guarantees

We study the problem of cardinality estimation for LIKE queries on string data, focusing on the most common patterns in real workloads: prefix, suffix, and substring queries. We propose LEARNT, a LIKE query Estimator with Accuracy,…

Databases · Computer Science 2026-05-26 Hai Lan, Zhifeng Bao, Divesh Srivastava, Shixun Huang, Yuwei Peng, Yang Yu

Incorporating Deep Learning Design in Database Queries

Deep learning over relational databases is conventionally realized by translating data into graph representations and applying graph-based neural networks within external frameworks. This round-trip between the database and external machine…

Databases · Computer Science 2026-05-26 Yuval Lev Lubarsky, Dean Light, Boaz Berger, Shunit Agmon, Benny Kimelfeld

AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery

We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than…

Databases · Computer Science 2026-05-26 Darek Kleczek, Fuheng Zhao, Alexander W. Lee, Julien Tissier, Pawel Liskowski, Ugur Cetintemel, Anupam Datta

The Time is Here for Just-in-Time Systems: Challenges and Opportunities

Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across deployments, paying a significant performance cost. We argue that LLM-based coding agents now make a…

Databases · Computer Science 2026-05-26 Shu Liu, Alexander Krentsel, Shubham Agarwal, Mert Cemri, Ziming Mao, Soujanya Ponnapalli, Alexandros G. Dimakis, Sylvia Ratnasamy, Matei Zaharia, Aditya Parameswaran, Ion Stoica

Extending the (Elementary) Mathematical Data Model and MatBase with two new constraint types: inexistence and anti-existence

This research paper introduces two new constraint types and four subtypes of database constraints added to our (Elementary) Mathematical Data Model, which are the duals of the existence and non-existence ones. They are formally defined,…

Databases · Computer Science 2026-05-26 Christian Mancas

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from…

Databases · Computer Science 2026-05-26 Han Chen, Zining Zhang, Wenqi Pei, Bingsheng He, Ming Wu, Jason Zeng, Michael Heinrich, Wei Wu, Hongbao Zhang

Federated Semantic Knowledge Graphs for Laboratory Workflows: A Structured Expert Elicitation Methodology Demonstrated Through Bioanalytical Workflow Twins

Laboratory workflows in pharmaceutical and biomedical research encode substantial tacit knowledge -- expert judgment about failure conditions, decision branching logic, and contextual dependencies -- that remains inaccessible to protocol…

Databases · Computer Science 2026-05-26 Luis F. Schachner, Vinith Thamizhazhagan, Sara Tanenbaum, John C. Tran, Pamela P. F. Chan, Mandy Kwong, Andy Chang, Maureen Beresini, Margaret Porter Scott

Timehash: Hierarchical Time Indexing for Efficient Business Hours Search

Temporal range filtering is critical in large-scale search systems, particularly location-based services filtering businesses by operating hours. Traditional approaches suffer from poor query performance (scope filtering), index size…

Databases · Computer Science 2026-05-26 Jinoh Kim, Jaewon Son

PiPNN: Ultra-Scalable Graph-Based Nearest Neighbor Indexing

The fastest indexes for Approximate Nearest Neighbor Search today are also the slowest to build: graph-based methods like HNSW and Vamana achieve state-of-the-art query performance but have large construction times due to relying on…

Databases · Computer Science 2026-05-26 Tobias Rubel, Richard Wen, Laxman Dhulipala, Lars Gottesbüren, Rajesh Jayaram, Jakub Łącki