数据库
Conventional database architectures often secure local consistency by discarding information, entangling correctness with loss. We introduce the Functorial-Categorical Database (FCDb), which models data operations as morphisms in a layered…
Database research and development often require a large number of SQL queries for benchmarking purposes. However, acquiring real-world SQL queries is challenging due to privacy concerns, and existing SQL generation methods are limited in…
Data summarization is essential to discover insights from large datasets. In a spreadsheets, pivot tables offer a convenient way to summarize tabular data by computing aggregates over some attributes, grouped by others. However, identifying…
Constraints are powerful declarative constructs that allow users to conveniently restrict variable values that potentially range over an infinite domain. In this paper, we propose a constraint path query language over property graphs, which…
DuckDB is designed for portability. It is also designed to run anywhere, and possibly in contexts where it can be specialized for performance, e.g., as a cloud service or on a smart device. In this paper, we consider the way DuckDB…
Property graphs have rapidly become the de facto standard for representing and managing complex, interconnected data, powering applications across domains from knowledge graphs to social networks. Despite the advantages, their schema-free…
Pattern sampling has emerged as a promising approach for information discovery in large databases, allowing analysts to focus on a manageable subset of patterns. In this approach, patterns are randomly drawn based on an interestingness…
This paper presents predicate transfer, a novel method that optimizes join performance by pre-filtering tables to reduce the join input sizes. Predicate transfer generalizes Bloom join, which conducts pre-filtering within a single join…
A long line of concurrency-control (CC) protocols argues correctness via a single serialization point (begin or commit), an assumption that is incompatible with snapshot isolation (SI), where read-write anti-dependencies arise. Serial…
Entity matching is a fundamental task in data cleaning and data integration. With the rapid adoption of large language models (LLMs), recent studies have explored zero-shot and few-shot prompting to improve entity matching accuracy.…
Shapley-like values, including the Shapley and Banzhaf values, provide a principled way to quantify how individual tuples contribute to a query result. Their exact computation, however, is intractable because it requires aggregating…
Ethics has become a major concern to the information management community, as both algorithms and data should satisfy ethical rules that guarantee not to generate dishonourable behaviours when they are used. However, these ethical rules may…
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can efficiently handle simpler tasks with fewer resources. LLM…
Text-to-SQL is a fundamental yet challenging task in the NLP area, aiming at translating natural language questions into SQL queries. While recent advances in large language models have greatly improved performance, most existing approaches…
Handling missing data is a central challenge in data-driven analysis. Modern imputation methods not only aim for accurate reconstruction but also differ in how they represent and quantify uncertainty. Yet, the reliability and calibration of…
This paper presents a pseudocode algorithm for translating Entity-Relationship data models into (Elementary) Mathematical Data Model schemes. We prove that this algorithm is linear, sound, complete, and optimal. As an example, we apply this…
Real-world AI/ML workflows often apply inference computations to feature vectors joined from multiple datasets. To avoid the redundant AI/ML computations caused by repeated data records in the join's output, factorized ML has been proposed…
With this work, we describe the concept of intent-based query rewriting and present a first viable solution. The aim is to allow rewrites to alter the structure and syntactic outcome of an original query while keeping the obtainable…
Machine unlearning in learned cardinality estimation (CE) systems presents unique challenges due to the complex distributional dependencies in multi-table relational data. Specifically, data deletion, a core component of machine unlearning,…
Outlier detection and cleaning are essential steps in data preprocessing to ensure the integrity and validity of data analyses. This paper focuses on outlier points within individual trajectories, i.e., points that deviate significantly…