Databases
Data contracts formalize agreements between data producers and consumers regarding schema, semantics, and quality expectations. As data pipelines grow in complexity, manual authoring and maintenance of contracts becomes error-prone and…
Data engineering pipelines are a widespread way to provide high-quality data for all kinds of data science applications. However, numerous challenges still remain in the composition and operation of such pipelines. Data engineering…
Streaming data can arise from a variety of contexts. Important use cases are continuous sensor measurements such as temperature, light or radiation values. In the process, streaming data may also contain data errors that should be cleaned…
Missing values often limit the use of data analysis or distort its results. Therefore, methods for missing value imputation (MVI) are of great significance. However, in general, there is no universal, fair MVI method for…
We propose the vision of a functional data model (FDM) and an associated functional query language (FQL). Our proposal has far-reaching consequences: we show a path to come up with a modern QL that solves almost all, if not all, problems of SQL…
The mining of pattern subgraphs, known as motifs, is a core task in the field of graph mining. Edges in real-world networks often have timestamps, so there is a need for temporal motif mining. A temporal motif is a richer structure that…
Data preparation is a foundational yet notoriously challenging component of the machine learning lifecycle, characterized by a vast combinatorial search space. While reinforcement learning (RL) offers a promising direction, state-of-the-art…
Exploratory data analysis (EDA), coupled with SQL, is essential for data analysts involved in data exploration and analysis. However, data analysts often encounter two primary challenges: (1) the need to craft SQL queries skillfully, and…
With the proliferation of Large Language Models (LLMs) in Business Intelligence (BI), existing solutions face critical challenges in industrial deployments: functionality deficiencies from legacy systems failing to meet evolving LLM-era…
We address the problem of learning-augmented online caching in the scenario when each request is accompanied by a prediction of the next occurrence of the requested page. We improve currently known bounds on the competitive ratio of the…
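To make the setting concrete, here is a minimal sketch of a cache that blindly follows next-occurrence predictions: on a miss with a full cache, it evicts the page whose predicted next request is furthest away. This is a common baseline for this setting, not the algorithm the abstract proposes. The function name `simulate_cache` and the `(page, predicted_next_time)` request format are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def simulate_cache(requests: List[Tuple[str, int]], capacity: int) -> int:
    """Count cache misses when evicting the page with the furthest predicted next use.

    Each request is (page, predicted_next_time); this input format is an
    assumption for illustration. Assumes capacity >= 1.
    """
    cache: Dict[str, int] = {}  # page -> most recent prediction of its next occurrence
    misses = 0
    for page, predicted_next in requests:
        if page not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the cached page predicted to be requested furthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        # Insert the page, or refresh its prediction on a hit.
        cache[page] = predicted_next
    return misses


# Requests a, b, c, a, b at times 0..4 with exact predictions;
# 10**9 stands in for "never requested again".
requests = [("a", 3), ("b", 4), ("c", 10**9), ("a", 10**9), ("b", 10**9)]
misses_small = simulate_cache(requests, capacity=2)
misses_large = simulate_cache(requests, capacity=3)
```

With exact predictions this rule behaves like the offline furthest-in-future policy. With inaccurate predictions it can perform poorly, which is why work in this area studies how the competitive ratio depends on prediction quality.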
Change over time is inevitable in data pipelines, especially with regard to the structure and semantics of data as well as the pipeline operators. Dealing with these changes, i.e., providing long-term maintenance,…
Modern enterprises are increasingly driven by the DATA+AI paradigm, in which Database Management Systems (DBMSs) and Large Language Models (LLMs) have become two foundational infrastructures powering a wide range of industrial and business…
Energy systems generate vast amounts of data in extremely short time intervals, creating challenges for efficient data management. Traditional data management methods often struggle with scalability and accessibility, limiting their…
The set similarity join problem is a fundamental problem in data processing and discovery, relying on exact similarity measures between sets. In the presence of alterations, such as misspellings on string data, the fuzzy set similarity join…
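As a point of reference for the exact (non-fuzzy) problem, the sketch below joins two collections of sets by brute force, keeping every pair whose Jaccard similarity meets a threshold. The choice of Jaccard, the quadratic pairwise comparison, and the names `jaccard` and `set_similarity_join` are assumptions made for illustration; the abstract does not specify a measure or an algorithm.

```python
from itertools import product
from typing import FrozenSet, List, Tuple


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def set_similarity_join(
    left: List[FrozenSet[str]],
    right: List[FrozenSet[str]],
    threshold: float,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with jaccard(left[i], right[j]) >= threshold.

    Brute force over all pairs: a readable baseline, not an efficient join.
    """
    return [
        (i, j)
        for (i, r), (j, s) in product(enumerate(left), enumerate(right))
        if jaccard(r, s) >= threshold
    ]


left = [frozenset({"a", "b", "c"}), frozenset({"x"})]
right = [frozenset({"a", "b"}), frozenset({"x", "y", "z"})]
pairs = set_similarity_join(left, right, threshold=0.5)
```

Because the elements are compared for exact equality, a misspelled token such as "databse" contributes nothing to the intersection. That limitation is what motivates the fuzzy variant mentioned in the abstract.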
Contemporary database systems, while effective, suffer from severe issues related to complexity and usability, especially among individuals who lack technical expertise and are unfamiliar with query languages like Structured Query Language…
In modern data analytics, analysts frequently face the challenge of searching for desirable entities by evaluating, for each entity, a collection of its feature relations to derive key analytical properties. This search is challenging…
Approximate nearest neighbor (ANN) search is a fundamental problem in computer science for which in-memory graph-based methods, such as Hierarchical Navigable Small World (HNSW), perform exceptionally well. To scale beyond billions of…
Data quality describes the degree to which data meet specific requirements and are fit for use by humans and/or downstream tasks (e.g., artificial intelligence). Data quality can be assessed across multiple high-level concepts called…
Motif counting is a fundamental problem in network analysis, and there is a rich literature of theoretical and applied algorithms for this problem. Given a large input network $G$, a motif $H$ is a small "pattern" graph indicative of…
Maximum Inner Product Search (MIPS) is a fundamental challenge in machine learning and information retrieval, particularly in high-dimensional data applications. Existing approaches to MIPS either rely solely on Inner Product (IP)…