数据库
Finding the densest subgraph (DS) from a graph is a fundamental problem in graph databases. The DS obtained, which reveals closely related entities, has been found to be useful in various application domains such as e-commerce, social…
Trajectory similarity is a cornerstone of trajectory data management and analysis. Traditional similarity functions often suffer from high computational complexity and a reliance on specific distance metrics, prompting a shift towards deep…
Query reverse engineering (QRE) aims to synthesize a SQL query to connect a given database and result instance. A recent variation of QRE is where an additional input, an opaque executable containing a ground-truth query, is provided, and…
Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first…
Practitioners are increasingly turning to Extract-Load-Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent…
Streaming data pipelines remain challenging and expensive to build and maintain, despite significant advancements in stronger consistency, event time semantics, and SQL support over the last decade. Persistent obstacles continue to hinder…
We propose novel techniques that exploit data and computation sharing to improve the performance of complex stateful parallel computations, like agent-based simulations. Parallel computations are translated into behavioral equations, a…
Multi-model databases are designed to store, manage, and query data in various models, such as relational, hierarchical, and graph data, simultaneously. In this paper, we provide a theoretical basis for querying categorical databases. We…
Detecting fraudulent activities in financial and e-commerce transaction networks is crucial. One effective method for this is Densest Subgraph Discovery (DSD). However, deploying DSD methods in production systems faces substantial…
This paper presents a novel AI-powered framework designed to streamline database management and query optimization for PostgreSQL systems. Structured in three phases: Natural Language to SQL Translation, Query Execution and Analysis, and…
Finding relevant tables among databases, lakes, and repositories is the first step in extracting value from data. Such a task remains difficult because assessing whether a table is relevant to a problem does not always depend only on its…
In the era o fdat commodification,the pricing o fgraph data presents unique challenges that differ significantly from traditional data markets. This paper addresses the critical issue of node pricing within graph structures, an area that…
Existing data visualization formalisms are restricted to single-table inputs, which makes existing visualization grammars like Vega-lite or ggplot2 tedious to use, have overly complex APIs, and unsound when visualization multi-table data.…
In this paper, we study circuits and formulas for provenance polynomials of Datalog programs. We ask the following question: given an absorptive semiring and a fact of a Datalog program, what is the optimal depth and size of a…
Table integration aims to create a comprehensive table by consolidating tuples containing relevant information. In this work, we investigate the challenge of integrating multiple tables from a data lake, focusing on three core tasks: 1)…
The study of graph queries in database theory has spanned more than three decades, resulting in a multitude of proposals for graph query languages. These languages differ in the mechanisms. We can identify three main families of languages,…
Process discovery generates process models from event logs. Traditionally, an event log is defined as a multiset of traces, where each trace is a sequence of events. The total order of the events in a sequential trace is typically based on…
Generative AI (GenAI) is transforming industries by enabling intelligent content generation, automation, and decision-making. However, the effectiveness of GenAI applications depends significantly on efficient data storage, retrieval, and…
We present LearnedKV, a novel tiered key-value store that seamlessly integrates a Log-Structured Merge (LSM) tree with a Learned Index to achieve superior read and write performance on storage systems. While existing approaches use learned…
MatBase is a prototype intelligent data and knowledge base management system based on the Relational, Entity-Relationship, and (Elementary) Mathematical Data Models. The latter distinguishes itself especially by its rich panoply of…