Databases - Scifaro

SQLord: A Robust Enterprise Text-to-SQL Solution via Reverse Data Generation and Workflow Decomposition

Transforming natural language into SQL queries (NL2SQL) is crucial for data-driven business applications. Existing frameworks, trained on open-source datasets, struggle with complex business logic and lack domain-specific data for…

Databases · Computer Science 2025-07-16 Song Cheng, Qiannan Cheng, Linbo Jin, Lei Yi, Guannan Zhang

On the Complexity of Checking Mixed Isolation Levels for SQL Transactions

Concurrent accesses to databases are typically grouped in transactions which define units of work that should be isolated from other concurrent computations and resilient to failures. Modern databases provide different levels of isolation…

Databases · Computer Science 2025-07-16 Ahmed Bouajjani, Constantin Enea, Enrique Román-Calvo

VSAG: An Optimized Search Framework for Graph-based Approximate Nearest Neighbor Search

Approximate nearest neighbor search (ANNS) is a fundamental problem in vector databases and AI infrastructures. Recent graph-based ANNS algorithms have achieved high search accuracy with practical efficiency. Despite the advancements, these…

Databases · Computer Science 2025-07-16 Xiaoyao Zhong, Haotian Li, Jiabao Jin, Mingyu Yang, Deming Chu, Xiangyu Wang, Zhitao Shen, Wei Jia, George Gu, Yi Xie, Xuemin Lin, Heng Tao Shen, Jingkuan Song, Peng Cheng

MINE GRAPH RULE: A New Cypher-like Operator for Mining Association Rules on Property Graphs

Mining information from graph databases is becoming increasingly important. To approach this problem, current methods focus on identifying subgraphs with specific topologies; as of today, no work has been dedicated to jointly expressing the…

Databases · Computer Science 2025-07-16 Francesco Cambria, Francesco Invernici, Anna Bernasconi, Stefano Ceri

Instance-Optimized String Fingerprints

Recent research found that cloud data warehouses are text-heavy. However, their capabilities for efficiently processing string columns remain limited, relying primarily on techniques like dictionary encoding and prefix-based partition…

Databases · Computer Science 2025-07-15 Mihail Stoian, Johannes Thürauf, Andreas Zimmerer, Alexander van Renen, Andreas Kipf

LogLite: Lightweight Plug-and-Play Streaming Log Compression

Log data is a vital resource for capturing system events and states. With the increasing complexity and widespread adoption of modern software systems and IoT devices, the daily volume of log generation has surged to tens of petabytes,…

Databases · Computer Science 2025-07-15 Benzhao Tang, Shiyu Yang, Zhitao Shen, Wenjie Zhang, Xuemin Lin, Zhihong Tian

Efficient Temporal Simple Path Graph Generation

Interactions between two entities often occur at specific timestamps, which can be modeled as a temporal graph. Exploring the relationships between vertices based on temporal paths is one of the fundamental tasks. In this paper, we conduct…

Databases · Computer Science 2025-07-15 Zhiyang Tang, Yanping Wu, Xiangjun Zai, Chen Chen, Xiaoyang Wang, Ying Zhang

Rethinking LSM-tree based Key-Value Stores: A Survey

LSM-tree is a widely adopted data structure in modern key-value store systems that optimizes write performance in write-heavy applications by using append writes to achieve sequential writes. However, the unpredictability of LSM-tree…

Databases · Computer Science 2025-07-15 Yina Lv, Qiao Li, Quanqing Xu, Congming Gao, Chuanhui Yang, Xiaoli Wang, Chun Jason Xue

TRACER: Efficient Object Re-Identification in Networked Cameras through Adaptive Query Processing

Efficiently re-identifying and tracking objects across a network of cameras is crucial for applications like traffic surveillance. Spatula is the state-of-the-art video database management system (VDBMS) for processing Re-ID queries.…

Databases · Computer Science 2025-07-15 Pramod Chunduri, Yao Lu, Joy Arulraj

HedraRAG: Coordinating LLM Generation and Database Retrieval in Heterogeneous RAG Serving

This paper addresses emerging system-level challenges in heterogeneous retrieval-augmented generation (RAG) serving, where complex multi-stage workflows and diverse request patterns complicate efficient execution. We present HedraRAG, a…

Databases · Computer Science 2025-07-15 Zhengding Hu, Vibha Murthy, Zaifeng Pan, Wanlu Li, Xiaoyi Fang, Yufei Ding, Yuke Wang

Orchestration for Domain-specific Edge-Cloud Language Models

The remarkable performance of Large Language Models (LLMs) has inspired many applications, which often necessitate edge-cloud collaboration due to connectivity, privacy, and cost considerations. Traditional methods primarily focus on…

Databases · Computer Science 2025-07-15 Prasoon Patidar, Alex Crown, Kevin Hsieh, Yifei Xu, Tusher Chakraborty, Ranveer Chandra, Yuvraj Agarwal

QPET: A Versatile and Portable Quantity-of-Interest-Preservation Framework for Error-Bounded Lossy Compression

Error-bounded lossy compression has been widely adopted in many scientific domains because it can address the challenges in storing, transferring, and analyzing unprecedented amounts of scientific data. Although error-bounded lossy…

Databases · Computer Science 2025-07-15 Jinyang Liu, Pu Jiao, Kai Zhao, Xin Liang, Sheng Di, Franck Cappello

Hashing for Fast Pattern Set Selection

Pattern set mining, which is the task of finding a good set of patterns instead of all patterns, is a fundamental problem in data mining. Many different definitions of what constitutes a good set have been proposed in recent years. In this…

Databases · Computer Science 2025-07-14 Maiju Karjalainen, Pauli Miettinen

ONION: A Multi-Layered Framework for Participatory ER Design

We present ONION, a multi-layered framework for participatory Entity-Relationship (ER) modeling that integrates insights from design justice, participatory AI, and conceptual modeling. ONION introduces a five-stage methodology: Observe,…

Databases · Computer Science 2025-07-14 Viktoriia Makovska, George Fletcher, Julia Stoyanovich

xpSHACL: Explainable SHACL Validation using Retrieval-Augmented Generation and Large Language Models

Shapes Constraint Language (SHACL) is a powerful language for validating RDF data. Given the recent industry attention to Knowledge Graphs (KGs), more users need to validate linked data properly. However, traditional SHACL validation…

Databases · Computer Science 2025-07-14 Gustavo Correa Publio, José Emilio Labra Gayo

TableCopilot: A Table Assistant Empowered by Natural Language Conditional Table Discovery

The rise of LLMs has enabled natural language-based table assistants, but existing systems assume users already have a well-formed table, neglecting the challenge of table discovery in large-scale table pools. To address this, we introduce…

Databases · Computer Science 2025-07-14 Lingxi Cui, Guanyu Jiang, Huan Li, Ke Chen, Lidan Shou, Gang Chen

QUEST: Query Optimization in Unstructured Document Analysis

Most recently, researchers have started building large language model (LLM)-powered data systems that allow users to analyze unstructured text documents as if working with a database, because LLMs are very effective in extracting attributes…

Databases · Computer Science 2025-07-14 Zhaoze Sun, Qiyan Deng, Chengliang Chai, Kaisen Jin, Xinyu Guo, Han Han, Ye Yuan, Guoren Wang, Lei Cao

A Service Architecture for Dataspaces

Dataspaces are designed to support sovereign, trusted and decentralized data exchange between participants forming an ecosystem. They are standardized by initiatives such as the International Data Spaces Association or Gaia-X and have…

Databases · Computer Science 2025-07-11 Benedikt T. Arnold, Christoph Lange, Christina Gillmann, Stefan Decker

JOB-Complex: A Challenging Benchmark for Traditional & Learned Query Optimization

Query optimization is a fundamental task in database systems that is crucial to providing high performance. To evaluate the performance of learned and traditional optimizers, several benchmarks, such as the widely used JOB benchmark, are used.…

Databases · Computer Science 2025-07-11 Johannes Wehrstein, Timo Eckmann, Roman Heinrich, Carsten Binnig

Algorithmic Complexity Attacks on All Learned Cardinality Estimators: A Data-centric Approach

Learned cardinality estimators show promise in query cardinality prediction, yet they universally exhibit fragility to training data drifts, posing risks for real-world deployment. This work is the first to theoretically investigate how…

Databases · Computer Science 2025-07-11 Yingze Li, Xianglong Liu, Dong Wang, Zixuan Wang, Hongzhi Wang, Kaixing Zhang, Yiming Guan