Databases - Scifaro

Structure Guided Large Language Model for SQL Generation

Recent advancements in large language models (LLMs) have shown promise in bridging the gap between natural language queries and database management systems, enabling users to interact with databases without the background of SQL. However,…

Databases · Computer Science 2025-07-11 Qinggang Zhang, Hao Chen, Junnan Dong, Shengyuan Chen, Feiran Huang, Xiao Huang

Interactive Text-to-SQL via Expected Information Gain for Disambiguation

Relational databases are foundational to numerous domains, including business intelligence, scientific research, and enterprise systems. However, accessing and analyzing structured data often requires proficiency in SQL, which is a skill…

Databases · Computer Science 2025-07-10 Luyu Qiu, Jianing Li, Chi Su, Lei Chen

Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models

Generative AI is transforming business applications by enabling natural language interfaces and intelligent automation. However, the underlying large language models (LLMs) are evolving rapidly and so prompting them consistently is a…

Databases · Computer Science 2025-07-09 Shivani Tripathi, Pushpanjali Nema, Aditya Halder, Shi Qiao, Alekh Jindal

GTRSS: Graph-based Top-$k$ Representative Similar Subtrajectory Query

Trajectory mining has attracted significant attention. This paper addresses the Top-k Representative Similar Subtrajectory Query (TRSSQ) problem, which aims to find the k most representative subtrajectories similar to a query. Existing…

Databases · Computer Science 2025-07-09 Mingchang Ge, Liping Wang, Xuemin Lin, Yuang Zhang, Kunming Wang

PBE Meets LLM: When Few Examples Aren't Few-Shot Enough

Large language models (LLMs) can generate code from natural language descriptions. Their performance is typically evaluated using programming benchmarks that simulate real-world tasks. These benchmarks provide specifications in the form of…

Databases · Computer Science 2025-07-09 Shuning Zhang, Yongjoo Park

LAKEGEN: A LLM-based Tabular Corpus Generator for Evaluating Dataset Discovery in Data Lakes

How to generate a large, realistic set of tables along with joinability relationships, to stress-test dataset discovery methods? Dataset discovery methods aim to automatically identify related data assets in a data lake. The development and…

Databases · Computer Science 2025-07-09 Zhenwei Dai, Chuan Lei, Asterios Katsifodimos, Xiao Qin, Christos Faloutsos, Huzefa Rangwala

A Comprehensive Study of Shapley Value in Data Analytics

Over the recent years, Shapley value (SV), a solution concept from cooperative game theory, has found numerous applications in data analytics (DA). This paper presents the first comprehensive study of SV used throughout the DA workflow,…

Databases · Computer Science 2025-07-09 Hong Lin, Shixin Wan, Zhongle Xie, Ke Chen, Meihui Zhang, Lidan Shou, Gang Chen

The Case for Instance-Optimized LLMs in OLAP Databases

Large Language Models (LLMs) can enhance analytics systems with powerful data summarization, cleaning, and semantic transformation capabilities. However, deploying LLMs at scale -- processing millions to billions of rows -- remains…

Databases · Computer Science 2025-07-08 Bardia Mohammadi, Laurent Bindschaedler

OneDB: A Distributed Multi-Metric Data Similarity Search System

Increasingly massive volumes of multi-modal data are being accumulated in many real-world settings, including in health care and e-commerce. This development calls for effective general-purpose data management solutions for multi-modal…

Databases · Computer Science 2025-07-08 Tang Qian, Yifan Zhu, Lu Chen, Xiangyu Ke, Jingwen Zhao, Tianyi Li, Yunjun Gao, Christian S. Jensen

PFCS: Prime Factorization Cache System for Deterministic Data Relationship Discovery

Cache systems fundamentally limit modern computing performance due to their inability to precisely capture data relationships. While achieving 85-92% hit rates, traditional systems rely on statistical heuristics that cannot guarantee…

Databases · Computer Science 2025-07-08 Duy Le

LLM4Hint: Leveraging Large Language Models for Hint Recommendation in Offline Query Optimization

Query optimization is essential for efficient SQL query execution in DBMS, and remains attractive over time due to the growth of data volumes and advances in hardware. Existing traditional optimizers struggle with the cumbersome hand-tuning…

Databases · Computer Science 2025-07-08 Suchen Liu, Jun Gao, Yinjun Han, Yang Lin

Handling out-of-order input arrival in CEP engines on the edge combining optimistic, pessimistic and lazy evaluation

In Complex Event Processing, handling out-of-order, late, and duplicate events is critical for real-time analytics, especially on resource-constrained devices that process heterogeneous data from multiple sources. We present LimeCEP, a…

Databases · Computer Science 2025-07-08 Styliani Kyrama, Anastasios Gounaris

Training-Free Query Optimization via LLM-Based Plan Similarity

Large language model (LLM) embeddings offer a promising new avenue for database query optimization. In this paper, we explore how pre-trained execution plan embeddings can guide SQL query execution without the need for additional model…

Databases · Computer Science 2025-07-08 Nikita Vasilenko, Alexander Demin, Vladimir Boorlakov

Approximate Vector Set Search Inspired by Fly Olfactory Neural System

Vector set search, an underexplored similarity search paradigm, aims to find vector sets similar to a query set. This search paradigm leverages the inherent structural alignment between sets and real-world entities to model more…

Databases · Computer Science 2025-07-08 Yiqi Li, Sheng Wang, Zhiyu Chen, Shangfeng Chen, Zhiyong Peng

Datalog with First-Class Facts

Datalog is a popular logic programming language for deductive reasoning tasks in a wide array of applications, including business analytics, program analysis, and ontological reasoning. However, Datalog's restriction to flat facts over…

Databases · Computer Science 2025-07-08 Thomas Gilray, Arash Sahebolamri, Yihao Sun, Sowmith Kunapaneni, Sidharth Kumar, Kristopher Micinski

On the Convergence Rate of Linear Datalog° over Stable Semirings

Datalog° is an extension of Datalog, where instead of a program being a collection of unions of conjunctive queries over the standard Boolean semiring, a program may now be a collection of sum-product queries over an arbitrary commutative…

Databases · Computer Science 2025-07-08 Sungjin Im, Benjamin Moseley, Hung Ngo, Kirk Pruhs

Template-Based Schema Matching of Multi-Layout Tenancy Schedules: A Comparative Study of a Template-Based Hybrid Matcher and the ALITE Full Disjunction Model

The lack of standardized tabular formats for tenancy schedules across real estate firms creates significant inefficiencies in data integration. Existing automated integration methods, such as Full Disjunction (FD)-based models like ALITE,…

Databases · Computer Science 2025-07-04 Tim Uilkema, Yao Ma, Seyed Sahand Mohammadi Ziabari, Joep van Vliet

PathDB: A system for evaluating regular path queries

PathDB is a Java-based graph database designed for in-memory data loading and querying. By utilizing Regular Path Queries (RPQ) and a closed path algebra, PathDB processes paths through its three main components: the parser, the logical…

Databases · Computer Science 2025-07-04 Roberto García, Renzo Angles, Vicente Rojas, Sebastián Ferrada

Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems

Traditional Data+AI systems utilize data-driven techniques to optimize performance, but they rely heavily on human experts to orchestrate system pipelines, enabling them to adapt to changes in data, queries, tasks, and environments. For…

Databases · Computer Science 2025-07-03 Zhaoyan Sun, Jiayi Wang, Xinyang Zhao, Jiachi Wang, Guoliang Li

MobileRAG: A Fast, Memory-Efficient, and Energy-Efficient Method for On-Device RAG

Retrieval-Augmented Generation (RAG) has proven effective on server infrastructures, but its application on mobile devices is still underexplored due to limited memory and power resources. Existing vector search and RAG solutions largely…

Databases · Computer Science 2025-07-03 Taehwan Park, Geonho Lee, Min-Soo Kim