Databases — Scifaro

OrchANN: A Unified I/O Orchestration Framework for Skewed Out-of-Core Vector Search

Approximate nearest neighbor search (ANNS) at billion scale is fundamentally an out-of-core problem: vectors and indexes live on SSD, so performance is dominated by I/O rather than compute. Under skewed semantic embeddings, existing…

Databases · Computer Science 2025-12-30 Chengying Huan, Lizheng Chen, Zhengyi Yang, Shaonan Ma, Rong Gu, Renjie Yao, Zhibin Wang, Mingxing Zhang, Fang Xi, Jie Tao, Gang Zhang, Guihai Chen, Chen Tian

Robust LLM-based Column Type Annotation via Prompt Augmentation with LoRA Tuning

Column Type Annotation (CTA) is a fundamental step towards enabling schema alignment and semantic understanding of tabular data. Existing encoder-only language models achieve high accuracy when fine-tuned on labeled columns, but their…

Databases · Computer Science 2025-12-30 Hanze Meng, Jianhao Cao, Rachel Pottinger

MonoM: Enhancing Monotonicity in Learned Cardinality Estimators

Cardinality estimation is a key component of database query optimization. Recent studies have demonstrated that learned cardinality estimation techniques can surpass traditional methods in accuracy. However, a significant barrier to their…

Databases · Computer Science 2025-12-30 Lyu Yi, Weiqi Feng, Yuanbiao Wang, Yuhong Kan

Query Carefully: Detecting the Unanswerables in Text-to-SQL Tasks

Text-to-SQL systems allow non-SQL experts to interact with relational databases using natural language. However, their tendency to generate executable SQL for ambiguous, out-of-scope, or unanswerable queries introduces a hidden risk, as…

Databases · Computer Science 2025-12-29 Jasmin Saxer, Isabella Maria Aigner, Luise Linzmeier, Andreas Weiler, Kurt Stockinger

Automated Training of Learned Database Components with Generative AI

The use of deep learning for database optimization has gained significant traction, offering improvements in indexing, cardinality estimation, and query optimization. However, acquiring high-quality training data remains a significant…

Databases · Computer Science 2025-12-24 Angjela Davitkova, Sebastian Michel

A Multi-agent Text2SQL Framework using Small Language Models and Execution Feedback

Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced…

Databases · Computer Science 2025-12-23 Thanh Dat Hoang, Thanh Trung Huynh, Matthias Weidlich, Thanh Tam Nguyen, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen

Sync Without Guesswork: Incomplete Time Series Alignment

Multivariate time series alignment is critical for ensuring coherent analysis across variables, but missing values and timestamp inconsistencies make this task highly challenging. Existing approaches often rely on prior imputation, which…

Databases · Computer Science 2025-12-23 Ding Jia, Jingyu Zhu, Yu Sun, Aoqian Zhang, Shaoxu Song, Haiwei Zhang, Xiaojie Yuan

Memelang: An Axial Grammar for LLM-Generated Vector-Relational Queries

Structured generation for LLM tool use highlights the value of compact DSL intermediate representations (IRs) that can be emitted directly and parsed deterministically. This paper introduces axial grammar: linear token sequences that…

Databases · Computer Science 2025-12-23 Bri Holt

Efficient Hypergraph Pattern Matching via Match-and-Filter and Intersection Constraint

A hypergraph is a generalization of a graph, in which a hyperedge can connect multiple vertices, modeling complex relationships involving multiple vertices simultaneously. Hypergraph pattern matching, which is to find all isomorphic…

Databases · Computer Science 2025-12-23 Siwoo Song, Wonseok Shin, Kunsoo Park, Giuseppe F. Italiano, Zhengyi Yang, Wenjie Zhang

Targeted Sequential Pattern Mining with High Average Utility

Incorporating utility into targeted pattern mining can address the practical limitations of traditional frequency-based approaches. However, utility-based methods often suffer from generating a large number of long and complicated…

Databases · Computer Science 2025-12-23 Kai Cao, Yucong Duan, Wensheng Gan

Dual Pruning and Sorting-Free Overestimation for Average-Utility Sequential Pattern Mining

In a quantitative sequential database, numerous efficient algorithms have been developed for high-utility sequential pattern mining (HUSPM). HUSPM establishes a relationship between frequency and significance in the real world and reflects…

Databases · Computer Science 2025-12-23 Kai Cao, Yucong Duan, Wensheng Gan

Rethinking Analytical Processing in the GPU Era

The era of GPU-powered data analytics has arrived. In this paper, we argue that recent advances in hardware (e.g., larger GPU memory, faster interconnect and IO, and declining cost) and software (e.g., composable data systems and mature…

Databases · Computer Science 2025-12-23 Bobbi Yogatama, Yifei Yang, Kevin Kristensen, Devesh Sarda, Abigale Kim, Adrian Cockcroft, Yu Teng, Joshua Patterson, Gregory Kimball, Wes McKinney, Weiwei Gong, Xiangyao Yu

The FAIREr Guiding Principles: Organizing data and metadata into semantically meaningful types of FAIR Digital Objects to increase their human explorability and cognitive interoperability

Ensuring the FAIRness (Findable, Accessible, Interoperable, Reusable) of data and metadata is an important goal in both research and industry. Knowledge graphs and ontologies have been central in achieving this goal, with interoperability…

Databases · Computer Science 2025-12-23 Lars Vogt

Democratizing Scalable Cloud Applications: Transactional Stateful Functions on Streaming Dataflows

Web applications underpin much of modern digital life, yet building scalable and consistent cloud applications remains difficult, requiring expertise across cloud computing, distributed systems, databases, and software engineering. These…

Databases · Computer Science 2025-12-22 Kyriakos Psarakis

Multi-granularity Spatiotemporal Flow Patterns

Analyzing flow of objects or data at different granularities of space and time can unveil interesting insights or trends. For example, transportation companies, by aggregating passenger travel data (e.g., counting passengers traveling from…

Databases · Computer Science 2025-12-22 Chrysanthi Kosyfaki, Nikos Mamoulis, Reynold Cheng, Ben Kao

Subset Sampling over Joins

Subset sampling (also known as Poisson sampling), where the decision to include any specific element in the sample is made independently of all others, is a fundamental primitive in data analytics, enabling efficient approximation by…

Databases · Computer Science 2025-12-19 Aryan Esmailpour, Xiao Hu, Jinchao Huang, Stavros Sintos
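
The definition of subset (Poisson) sampling in the abstract above can be illustrated with a minimal sketch. This is not the paper's algorithm for sampling over joins; it only shows the basic primitive on a flat list, and the function name `subset_sample` and its parameters are hypothetical names chosen for illustration.

```python
import random


def subset_sample(items, probs, rng=None):
    """Return a subset sample of `items` (hypothetical helper).

    Each items[i] is kept with probability probs[i], and every
    inclusion decision is made independently of all the others.
    """
    if len(items) != len(probs):
        raise ValueError("items and probs must have the same length")
    rng = rng or random.Random()
    # rng.random() is uniform on [0, 1), so p = 1.0 always keeps the
    # item and p = 0.0 never keeps it.
    return [x for x, p in zip(items, probs) if rng.random() < p]


if __name__ == "__main__":
    # Example: keep each of 10 rows independently with probability 0.3.
    rows = list(range(10))
    print(subset_sample(rows, [0.3] * len(rows), random.Random(0)))
```

Because the decisions are independent, the sample size is itself random, unlike fixed-size sampling without replacement. This naive version touches every element once, which is the cost a join-aware method would presumably aim to avoid.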

ModelTables: A Corpus of Tables about Models

We present ModelTables, a benchmark of tables in Model Lakes that captures the structured semantics of performance and configuration tables often overlooked by text-only retrieval. The corpus is built from Hugging Face model cards, GitHub…

Databases · Computer Science 2025-12-19 Zhengyuan Dong, Victor Zhong, Renée J. Miller

Scaling Text2SQL via LLM-efficient Schema Filtering with Functional Dependency Graph Rerankers

Most modern Text2SQL systems prompt large language models (LLMs) with entire schemas -- mostly column information -- alongside the user's question. While effective on small databases, this approach fails on real-world schemas that exceed…

Databases · Computer Science 2025-12-19 Thanh Dat Hoang, Thanh Tam Nguyen, Thanh Trung Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen

Implementing a Scalable, Redeployable and Multitiered Repository for FAIR and Secure Scientific Data Sharing: The BIG-MAP Archive

Data sharing in large consortia, such as research collaborations or industry partnerships, requires addressing both organizational and technical challenges. A common platform is essential to promote collaboration, facilitate exchange of…

Databases · Computer Science 2025-12-19 Valeria Granata, Francois Liot, Xing Wang, Steen Lysgaard, Ivano E. Castelli, Tejs Vegge, Nicola Marzari, Giovanni Pizzi

DP-Bench: A Benchmark for Evaluating Data Product Creation Systems

A data product is created with the intention of solving a specific problem, addressing a specific business use case, or meeting a particular need, going beyond just serving data as a raw asset. Data products enable end users to gain greater…

Databases · Computer Science 2025-12-19 Faisal Chowdhury, Sola Shirai, Sarthak Dash, Nandana Mihindukulasooriya, Horst Samulowitz