Databases - Scifaro

PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models

Number of Distinct Values (NDV) estimation of a multiset/column is a basis for many data management tasks, especially within databases. Despite decades of research, most existing methods require either a significant amount of samples…

Databases · Computer Science 2025-04-02 Xianghong Xu, Xiao He, Tieying Zhang, Lei Zhang, Rui Shi, Jianjun Chen

HistogramTools for Efficient Data Analysis and Distribution Representation in Large Data Sets

Histograms provide a powerful means of summarizing large data sets by representing their distribution in a compact, binned form. The HistogramTools R package enhances R built-in histogram functionality, offering advanced methods for…

Databases · Computer Science 2025-04-02 Shubham Malhotra

Sustainable Open-Data Management for Field Research: A Cloud-Based Approach in the Underlandscape Project

Field-based research projects require a robust suite of ICT services to support data acquisition, documentation, storage, and dissemination. A key challenge lies in ensuring the sustainability of data management - not only during the…

Databases · Computer Science 2025-04-02 Augusto Ciuffoletti, Letizia Chiti

DAgent: A Relational Database-Driven Data Analysis Report Generation Agent

Relational database-driven data analysis (RDB-DA) report generation, which aims to generate data analysis reports after querying relational databases, has been widely applied in fields such as finance and healthcare. Typically, these tasks…

Databases · Computer Science 2025-04-02 Wenyi Xu, Yuren Mao, Xiaolu Zhang, Chao Zhang, Xuemei Dong, Mengfei Zhang, Yunjun Gao

Unbalanced Triangle Detection and Enumeration Hardness for Unions of Conjunctive Queries

We study the enumeration of answers to Unions of Conjunctive Queries (UCQs) with optimal time guarantees. More precisely, we wish to identify the queries that can be solved with linear preprocessing time and constant delay. Despite the…

Databases · Computer Science 2025-04-02 Karl Bringmann, Nofar Carmeli

Shape Expressions with Inheritance

We formally introduce an inheritance mechanism for the Shape Expressions language (ShEx). It is inspired by inheritance in object-oriented programming languages, and provides similar advantages such as reuse, modularity, and more flexible…

Databases · Computer Science 2025-04-01 Iovka Boneva, Jose Emilio Labra Gayo, Eric Prud'hommeaux, Katherine Thornton, Andra Waagmeester

GRACEFUL: A Learned Cost Estimator For UDFs

User-defined functions (UDFs) are a pivotal feature in modern DBMSs, enabling the extension of native DBMS functionality with custom logic. However, the integration of UDFs into query optimization processes poses significant challenges,…

Databases · Computer Science 2025-04-01 Johannes Wehrstein, Tiemo Bang, Roman Heinrich, Carsten Binnig

Figaro on GPUs: Two Tables

This paper introduces the implementation of the Figaro-GPU algorithm for computing a QR and SVD decomposition over a join matrix defined by the natural join over two tables on GPUs. Figaro-GPU's main novelty is a GPU implementation of the…

Databases · Computer Science 2025-04-01 Dorde Zivanovic

PrivPetal: Relational Data Synthesis via Permutation Relations

Releasing relational databases while preserving privacy is an important research problem with numerous applications. A canonical approach is to generate synthetic data under differential privacy (DP), which provides a strong, rigorous…

Databases · Computer Science 2025-04-01 Kuntai Cai, Xiaokui Xiao, Yin Yang

Output-Sensitive Evaluation of Regular Path Queries

We study the classical evaluation problem for regular path queries: Given an edge-labeled graph and a regular path query, compute the set of pairs of vertices that are connected by paths that match the query. The Product Graph (PG) is the…

Databases · Computer Science 2025-04-01 Mahmoud Abo Khamis, Ahmet Kara, Dan Olteanu, Dan Suciu

A Lower Bound on Unambiguous Context Free Grammars via Communication Complexity

Motivated by recent connections to factorised databases, we analyse the efficiency of representations by context free grammars (CFGs). Concretely, we prove a recent conjecture by Kimelfeld, Martens, and Niewerth (ICDT 2025), that for finite…

Databases · Computer Science 2025-04-01 Stefan Mengel, Harry Vinall-Smeeth

Kishu: Time-Traveling for Computational Notebooks

Computational notebooks (e.g., Jupyter, Google Colab) are widely used by data scientists. A key feature of notebooks is the interactive computing model of iteratively executing cells (i.e., a set of statements) and observing the result…

Databases · Computer Science 2025-04-01 Zhaoheng Li, Supawit Chockchowwat, Ribhav Sahu, Areet Sheth, Yongjoo Park

Automatic Data Repair: Are We Ready to Deploy?

Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from…

Databases · Computer Science 2025-04-01 Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, Jianwei Yin

Distributed Evaluation of Graph Queries using Recursive Relational Algebra

We present a system called Dist-$\mu$-RA for the distributed evaluation of recursive graph queries. Dist-$\mu$-RA builds on the recursive relational algebra and extends it with evaluation plans suited for the distributed setting. The goal…

Databases · Computer Science 2025-04-01 Sarah Chlyah, Pierre Genevès, Nabil Layaïda

A Graph-native Optimization Framework for Complex Graph Queries

This technical report extends the SIGMOD 2025 paper "A Modular Graph-Native Query Optimization Framework" by providing a comprehensive exposition of GOpt's advanced technical mechanisms, implementation strategies, and extended evaluations.…

Databases · Computer Science 2025-03-31 Bingqing Lyu, Xiaoli Zhou, Longbin Lai, Yufan Yang, Yunkai Lou, Wenyuan Yu, Jingren Zhou

Taxonomy Inference for Tabular Data Using Large Language Models

Taxonomy inference for tabular data is a critical task of schema inference, aiming at discovering entity types (i.e., concepts) of the tables and building their hierarchy. It can play an important role in data management, data exploration,…

Databases · Computer Science 2025-03-31 Zhenyu Wu, Jiaoyan Chen, Norman W. Paton

Workshop Scientific HPC in the pre-Exascale era (part of ITADATA 2024) Proceedings

The proceedings of the Workshop Scientific HPC in the pre-Exascale era (SHPC), held in Pisa, Italy, on September 18, 2024, are part of the 3rd Italian Conference on Big Data and Data Science (ITADATA2024) proceedings (arXiv: 2503.14937). The main…

Databases · Computer Science 2025-03-31 Nicola Bena, Claudia Diamantini, Michela Natilli, Luigi Romano, Giovanni Stilo, Valentina Pansanella, Claudio A. Ardagna, Anna Monreale, Roberto Trasarti, Valentina Cesare, Gianluca Mittone, Emanuele De Rubeis, Alberto Vecchiato

CONCERTO: Complex Query Execution Mechanism-Aware Learned Cost Estimation

With the growing demand for massive data analysis, many DBMSs have adopted complex underlying query execution mechanisms, including vectorized operators, parallel execution, and dynamic pipeline modifications. However, there remains a lack…

Databases · Computer Science 2025-03-31 Kaixin Zhang, Hongzhi Wang, Kunkai Gu, Ziqi Li, Chunyu Zhao, Yingze Li, Yu Yan

A logic-based framework for database repairs

We introduce a general abstract framework for database repairs, where the repair notions are defined using formal logic. We distinguish between integrity constraints and so-called query constraints. The former are used to model consistency…

Databases · Computer Science 2025-03-31 Nicolas Fröhlich, Arne Meier, Nina Pardal, Jonni Virtema

Algebraic Data Integration

In this paper we develop an algebraic approach to data integration by combining techniques from functional programming, category theory, and database theory. In our formalism, database schemas and instances are algebraic (multi-sorted…

Databases · Computer Science 2025-03-31 Patrick Schultz, Ryan Wisnesky