Databases - Scifaro

AI-Driven Generation of Data Contracts in Modern Data Engineering Systems

Data contracts formalize agreements between data producers and consumers regarding schema, semantics, and quality expectations. As data pipelines grow in complexity, manual authoring and maintenance of contracts becomes error-prone and…

Databases · Computer Science 2025-07-30 Harshraj Bhoite

Towards Next Generation Data Engineering Pipelines

Data engineering pipelines are a widespread way to provide high-quality data for all kinds of data science applications. However, numerous challenges still remain in the composition and operation of such pipelines. Data engineering…

Databases · Computer Science 2025-07-30 Kevin M. Kramer, Valerie Restat, Sebastian Strasser, Uta Störl, Meike Klettke

Data Cleaning of Data Streams

Streaming data can arise from a variety of contexts. Important use cases are continuous sensor measurements such as temperature, light or radiation values. In the process, streaming data may also contain data errors that should be cleaned…

Databases · Computer Science 2025-07-29 Valerie Restat, Niklas Rodenhausen, Carina Antonin, Uta Störl

MVIAnalyzer: A Holistic Approach to Analyze Missing Value Imputation

Missing values often limit the usage of data analysis or cause falsification of results. Therefore, methods of missing value imputation (MVI) are of great significance. However, in general, there is no universal, fair MVI method for…

Databases · Computer Science 2025-07-29 Valerie Restat, Kai Tejkl, Uta Störl

A Functional Data Model and Query Language is All You Need

We propose the vision of a functional data model (FDM) and an associated functional query language (FQL). Our proposal has far-reaching consequences: we show a path to come up with a modern QL that solves (almost if not) all problems of SQL…

Databases · Computer Science 2025-07-29 Jens Dittrich

TIMEST: Temporal Information Motif Estimator Using Sampling Trees

The mining of pattern subgraphs, known as motifs, is a core task in the field of graph mining. Edges in real-world networks often have timestamps, so there is a need for temporal motif mining. A temporal motif is a richer structure that…

Databases · Computer Science 2025-07-29 Yunjie Pan, Omkar Bhalerao, C. Seshadhri, Nishil Talati

SoftPipe: A Soft-Guided Reinforcement Learning Framework for Automated Data Preparation

Data preparation is a foundational yet notoriously challenging component of the machine learning lifecycle, characterized by a vast combinatorial search space. While reinforcement learning (RL) offers a promising direction, state-of-the-art…

Databases · Computer Science 2025-07-29 Jing Chang, Chang Liu, Jinbin Huang, Shuyuan Zheng, Rui Mao, Jianbin Qin

Towards Automated Cross-domain Exploratory Data Analysis through Large Language Models

Exploratory data analysis (EDA), coupled with SQL, is essential for data analysts involved in data exploration and analysis. However, data analysts often encounter two primary challenges: (1) the need to craft SQL queries skillfully, and…

Databases · Computer Science 2025-07-29 Jun-Peng Zhu, Boyan Niu, Peng Cai, Zheming Ni, Jianwei Wan, Kai Xu, Jiajun Huang, Shengbo Ma, Bing Wang, Xuan Zhou, Guanglei Bao, Donghui Zhang, Liu Tang, Qi Liu

SiriusBI: A Comprehensive LLM-Powered Solution for Data Analytics in Business Intelligence

With the proliferation of Large Language Models (LLMs) in Business Intelligence (BI), existing solutions face critical challenges in industrial deployments: functionality deficiencies from legacy systems failing to meet evolving LLM-era…

Databases · Computer Science 2025-07-29 Jie Jiang, Haining Xie, Siqi Shen, Yu Shen, Zihan Zhang, Meng Lei, Yifeng Zheng, Yang Li, Chunyou Li, Danqing Huang, Yinjun Wu, Wentao Zhang, Xiaofeng Yang, Bin Cui, Peng Chen

Learning-Augmented Online Caching: New Upper Bounds

We address the problem of learning-augmented online caching in the scenario when each request is accompanied by a prediction of the next occurrence of the requested page. We improve currently known bounds on the competitive ratio of the…

Databases · Computer Science 2025-07-29 Daniel Skachkov, Denis Ponomaryov, Yuri Dorn, Alexander Demin

Towards Evolution Capabilities in Data Pipelines

Evolutionary change over time in the context of data pipelines is certain, especially with regard to the structure and semantics of data as well as to the pipeline operators. Dealing with these changes, i.e. providing long-term maintenance,…

Databases · Computer Science 2025-07-29 Kevin M. Kramer

DBMS-LLM Integration Strategies in Industrial and Business Applications: Current Status and Future Challenges

Modern enterprises are increasingly driven by the DATA+AI paradigm, in which Database Management Systems (DBMSs) and Large Language Models (LLMs) have become two foundational infrastructures powering a wide range of industrial and business…

Databases · Computer Science 2025-07-28 Zhengtong Yan, Gongsheng Yuan, Qingsong Guo, Jiaheng Lu

Big Data Energy Systems: A Survey of Practices and Associated Challenges

Energy systems generate vast amounts of data in extremely short time intervals, creating challenges for efficient data management. Traditional data management methods often struggle with scalability and accessibility, limiting their…

Databases · Computer Science 2025-07-28 Lunodzo J. Mwinuka, Massimo Cafaro, Lucas Pereira, Hugo Morais

ApproxJoin: Approximate Matching for Efficient Verification in Fuzzy Set Similarity Join

The set similarity join problem is a fundamental problem in data processing and discovery, relying on exact similarity measures between sets. In the presence of alterations, such as misspellings on string data, the fuzzy set similarity join…

Databases · Computer Science 2025-07-28 Michael Mandulak, S M Ferdous, Sayan Ghosh, Mahantesh Halappanavar, George Slota

An advanced AI driven database system

Contemporary database systems, while effective, suffer severe issues related to complexity and usability, especially among individuals who lack technical expertise and are unfamiliar with query languages like Structured Query Language…

Databases · Computer Science 2025-07-25 M. Tedeschi, S. Rizwan, C. Shringi, V. Devram Chandgir, S. Belich

Multi-Relational Algebra for Multi-Granular Data Analytics

In modern data analytics, analysts frequently face the challenge of searching for desirable entities by evaluating, for each entity, a collection of its feature relations to derive key analytical properties. This search is challenging…

Databases · Computer Science 2025-07-25 Xi Wu, Eugene Wu, Zichen Zhu, Fengan Li, Jeffrey F. Naughton

SHINE: A Scalable HNSW Index in Disaggregated Memory

Approximate nearest neighbor (ANN) search is a fundamental problem in computer science for which in-memory graph-based methods, such as Hierarchical Navigable Small World (HNSW), perform exceptionally well. To scale beyond billions of…

Databases · Computer Science 2025-07-24 Manuel Widmoser, Daniel Kocher, Nikolaus Augsten

Unfolding Data Quality Dimensions in Practice: A Survey

Data quality describes the degree to which data meet specific requirements and are fit for use by humans and/or downstream tasks (e.g., artificial intelligence). Data quality can be assessed across multiple high-level concepts called…

Databases · Computer Science 2025-07-24 Vasileios Papastergios, Lisa Ehrlinger, Anastasios Gounaris

Triadic First-Order Logic Queries in Temporal Networks

Motif counting is a fundamental problem in network analysis, and there is a rich literature of theoretical and applied algorithms for this problem. Given a large input network $G$, a motif $H$ is a small "pattern" graph indicative of…

Databases · Computer Science 2025-07-24 Omkar Bhalerao, Yunjie Pan, C. Seshadhri, Nishil Talati

Stitching Inner Product and Euclidean Metrics for Topology-aware Maximum Inner Product Search

Maximum Inner Product Search (MIPS) is a fundamental challenge in machine learning and information retrieval, particularly in high-dimensional data applications. Existing approaches to MIPS either rely solely on Inner Product (IP)…

Databases · Computer Science 2025-07-24 Tingyang Chen, Cong Fu, Xiangyu Ke, Yunjun Gao, Yabo Ni, Anxiang Zeng