数据库
Data has become a foundational asset driving innovation across domains such as finance, healthcare, and e-commerce. In these areas, predictive modeling over relational tables is commonly employed, with increasing emphasis on reducing manual…
Large language models (LLMs) have become essential for applications such as text summarization, sentiment analysis, and automated question-answering. Recently, LLMs have also been integrated into relational database management systems to…
Having access to realistic workloads for a given database instance is extremely important to enable stress and vulnerability testing, as well as to optimize for cost and performance. Recent advances in learned cost models have shown that…
Efficient multi-core parallel processing of recursive join queries is critical for achieving good performance in graph database management systems (GDBMSs). Prior work adopts two broad approaches. First is the state of the art morsel-driven…
Object-Centric Process Mining enables the analysis of complex operational behavior by capturing interactions among multiple business objects (e.g., orders, items, deliveries). These interactions are recorded using Object-Centric Event Data…
Efficient querying and analysis of large tabular datasets remain significant challenges, especially for users without expertise in programming languages like SQL. Text-to-SQL approaches have shown promising performance on benchmark data;…
Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix…
Dense subgraph search in bipartite graphs is a fundamental problem in graph analysis, with wide-ranging applications in fraud detection, recommendation systems, and social network analysis. The recently proposed $(\alpha, \beta)$-dense…
The problem of hotspots remains a critical challenge in high-contention workloads for concurrency control (CC) protocols. Traditional concurrency control approaches encounter significant difficulties under high contention, resulting in…
Data valuation and data monetisation are complex subjects but essential to most organisations today. Unfortunately, they still lack standard procedures and frameworks for organisations to follow. In this survey, we introduce the reader to…
We study the temporal k-core component search (TCCS), which outputs the k-core containing the query vertex in the snapshot over an arbitrary query time window in a temporal graph. The problem has been shown to be critical for tasks such as…
Join ordering is a key factor in query performance, yet traditional cost-based optimizers often produce sub-optimal plans due to inaccurate cardinality estimates in multi-predicate, multi-join queries. Existing alternatives such as…
Approximate Nearest Neighbor Search (ANNS) plays a crucial role in many key areas. Proximity graphs (PGs) are the leading method for ANNS, offering the best balance between query efficiency and accuracy. However, their performance heavily…
High-dimensional vector similarity search (HVSS) is critical for many data processing and AI applications. However, traditional HVSS methods often require extensive data access for distance calculations, leading to inefficiencies.…
We present RubikSQL, a novel NL2SQL system designed to address key challenges in real-world enterprise-level NL2SQL, such as implicit intents and domain-specific terminology. RubikSQL frames NL2SQL as a lifelong learning task, demanding…
Query optimization is a crucial problem in database systems that has been studied for decades. Learned query optimizers (LQOs) can improve performance over time by incorporating feedback; however, they suffer from cold-start issues and…
Tables are a prevalent format for structured data, yet their metadata, such as semantic types and column relationships, is often incomplete or ambiguous. Column annotation tasks, including Column Type Annotation (CTA) and Column Property…
Data cleaning is a time-consuming and error-prone manual process, even with modern workflow tools such as OpenRefine. We present AutoDCWorkflow, an LLM-based pipeline for automatically generating data-cleaning workflows. The pipeline takes…
Background. In the last decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased…
Operational consistent query answering (CQA) is a recent framework for CQA based on revised definitions of repairs, which are built by applying a sequence of operations (e.g., fact deletions) starting from an inconsistent database until we…