Databases - Scifaro

Automatic Metadata Extraction for Text-to-SQL

Large Language Models (LLMs) have recently become sophisticated enough to automate many tasks ranging from pattern finding to writing assistance to code generation. In this paper, we examine text-to-SQL generation. We have observed from…

Databases · Computer Science 2025-09-04 Vladislav Shkapenyuk, Divesh Srivastava, Theodore Johnson, Parisa Ghane

Integrating Knowledge Graphs and Visualization Dashboards for Advance Data Discovery in VESA

The increasing complexity and scale of scientific datasets demand advanced tools for efficient discovery and exploration. Traditional search systems often fall short in addressing the multidimensional nature of data and their intricate…

Databases · Computer Science 2025-09-04 Pawandeep Kaur Betz, Tobias Hecking, Andreas Gerndt

FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data

The growing demand for data-driven decision-making has created an urgent need for data agents that can integrate structured and unstructured data for analysis. While data agents show promise for enabling users to perform complex analytics…

Databases · Computer Science 2025-09-03 Ziting Wang, Shize Zhang, Haitao Yuan, Jinwei Zhu, Shifu Li, Wei Dong, Gao Cong

OASIS: Object-based Analytics Storage for Intelligent SQL Query Offloading in Scientific Tabular Workloads

Computation-Enabled Object Storage (COS) systems, such as MinIO and Ceph, have recently emerged as promising storage solutions for post hoc, SQL-based analysis on large-scale datasets in High-Performance Computing (HPC) environments. By…

Databases · Computer Science 2025-09-03 Soon Hwang, Junhyeok Park, Junghyun Ryu, Seonghoon Ahn, Jeoungahn Park, Jeongjin Lee, Soonyeal Yang, Jungki Noh, Woosuk Chung, Hoshik Kim, Youngjae Kim

Disentangling the schema turn: Restoring the information base to conceptual modelling

If one looks at contemporary mainstream development practices for conceptual modelling in computer science, these so clearly focus on a conceptual schema completely separated from its information base that the conceptual schema is often…

Databases · Computer Science 2025-09-03 Chris Partridge, Andrew Mitchell, Sergio de Cesare, Oscar Xiberta Soto

Diverse Unionable Tuple Search: Novelty-Driven Discovery in Data Lakes [Technical Report]

Unionable table search techniques input a query table from a user and search for data lake tables that can contribute additional rows to the query table. The definition of unionability is generally based on similarity measures which may…

Databases · Computer Science 2025-09-03 Aamod Khatiwada, Roee Shraga, Renée J. Miller

Near-Duplicate Text Alignment under Weighted Jaccard Similarity

Near-duplicate text alignment is the task of identifying, among the texts in a corpus, all the subsequences (substrings) that are similar to a given query. Traditional approaches rely on seeding-extension-filtering heuristics, which lack…

Databases · Computer Science 2025-09-03 Yuheng Zhang, Miao Qiao, Zhencan Peng, Dong Deng

BPI: A Novel Efficient and Reliable Search Structure for Hybrid Storage Blockchain

Hybrid storage solutions have emerged as potent strategies to alleviate the data storage bottlenecks prevalent in blockchain systems. These solutions harness off-chain Storage Services Providers (SPs) in conjunction with Authenticated Data…

Databases · Computer Science 2025-09-03 Xinkui Zhao, Rengrong Xiong, Guanjie Cheng, Xinhao Jin, Shawn Shi, Xiubo Liang, Gongsheng Yuan, Xiaoye Miao, Jianwei Yin, Shuiguang Deng

CRouting: Reducing Expensive Distance Calls in Graph-Based Approximate Nearest Neighbor Search

Approximate nearest neighbor search (ANNS) is a crucial problem in information retrieval and AI applications. Recently, there has been a surge of interest in graph-based ANNS algorithms due to their superior efficiency and accuracy.…

Databases · Computer Science 2025-09-03 Zhenxin Li, Shuibing He, Jiahao Guo, Xuechen Zhang, Xian-He Sun, Gang Chen

Illuminating Patterns of Divergence: DataDios SmartDiff for Large-Scale Data Difference Analysis

Data engineering workflows require reliable differencing across files, databases, and query outputs, yet existing tools falter under schema drift, heterogeneous types, and limited explainability. SmartDiff is a unified system that combines…

Databases · Computer Science 2025-09-03 Aryan Poduri, Yashwant Tailor

SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra

The emergence of large-language models (LLMs) has enabled a new class of semantic data processing systems (SDPSs) to support declarative queries against unstructured documents. Existing SDPSs are, however, lacking a unified algebraic…

Databases · Computer Science 2025-09-03 Changjae Lee, Zhuoyue Zhao, Jinjun Xiong

Efficient Computation of Trip-based Group Nearest Neighbor Queries (Full Version)

In recent years, organizing group meetups for entertainment or other necessities has gained significant importance, especially given the busy nature of daily schedules. People often combine multiple activities, such as dropping kids off at…

Databases · Computer Science 2025-09-03 Shahiduz Zaman, Tanzima Hashem, Sukarna Barua

ForeSight: A Predictive-Scheduling Deterministic Database

Deterministic databases enable scalable replicated systems by executing transactions in a predetermined order. However, existing designs fail to capture transaction dependencies, leading to insufficient scheduling, high abort rates, and…

Databases · Computer Science 2025-09-03 Junfang Huang, Yu Yan, Hongzhi Wang, Yingze Li, Jinghan Lin

SQL-Factory: A Multi-Agent Framework for High-Quality and Large-Scale SQL Generation

A high-quality SQL corpus is essential for intelligent databases. For example, Text-to-SQL requires SQL queries and corresponding natural language questions as training samples. However, collecting such a query corpus remains challenging in…

Databases · Computer Science 2025-09-03 Jiahui Li, Tongwang Wu, Yuren Mao, Yunjun Gao, Yajie Feng, Huaizhong Liu

Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems

AI-augmented data processing systems (DPSs) integrate large language models (LLMs) into query pipelines, allowing powerful semantic operations on structured and unstructured data. However, the reliability (a.k.a. trust) of these systems is…

Databases · Computer Science 2025-09-03 Alexander W. Lee, Justin Chan, Michael Fu, Nicolas Kim, Akshay Mehta, Deepti Raghavan, Ugur Cetintemel

Query Rewriting via LLMs

When complex SQL queries suffer slow executions despite query optimization, DBAs typically invoke automated query rewriting tools to recommend "lean" equivalents that are conducive to faster execution. The rewritings are usually achieved…

Databases · Computer Science 2025-09-03 Sriram Dharwada, Himanshu Devrani, Jayant Haritsa, Harish Doraiswamy

DobLIX: A Dual-Objective Learned Index for Log-Structured Merge Trees

In this paper, we introduce DobLIX, a dual-objective learned index specifically designed for Log-Structured Merge (LSM) tree-based key-value stores. Although traditional learned indexes focus exclusively on optimizing index lookups, they…

Databases · Computer Science 2025-09-03 Alireza Heidari, Amirhossein Ahmadi, Wei Zhang

Can Uncertainty Quantification Improve Learned Index Benefit Estimation?

Index tuning is crucial for optimizing database performance by selecting optimal indexes based on workload. The key to this process lies in an accurate and efficient benefit estimator. Traditional methods relying on what-if tools often…

Databases · Computer Science 2025-09-03 Tao Yu, Zhaonian Zou, Hao Xiong

Hilbert Forest in the SISAP 2025 Indexing Challenge

We report our participation in the SISAP 2025 Indexing Challenge using a novel indexing technique called the Hilbert forest. The method is based on the fast Hilbert sort algorithm, which efficiently orders high-dimensional points along a…

Databases · Computer Science 2025-09-01 Yasunobu Imamura, Takeshi Shinohara, Naoya Higuchi, Kouichi Hirata, Tetsuji Kuboyama

Database Normalization via Dual-LLM Self-Refinement

Database normalization is crucial to preserving data integrity. However, it is time-consuming and error-prone, as it is typically performed manually by data engineers. To this end, we present Miffie, a database normalization framework that…

Databases · Computer Science 2025-09-01 Eunjae Jo, Nakyung Lee, Gyuyeong Kim