数据库
Ever since the Dennard scaling broke down in the early 2000s and the frequency of the CPUs stalled, vendors have started to increase the core count in each CPU chip at the expense of introducing heterogeneity, thus ushering the era of NUMA…
AIS data from ships is excellent for analyzing single-ship movements and monitoring all ships within a specific area. However, the AIS data needs to be cleaned, processed, and stored before being usable. This paper presents a system…
Database systems encompass several performance-critical optimization tasks, such as join ordering and index tuning. As data volumes grow and workloads become more complex, these problems have become exponentially harder to solve…
Modern semantic search and retrieval-augmented generation (RAG) systems rely predominantly on in-memory approximate nearest neighbor (ANN) indexes over high-precision floating-point vectors, resulting in escalating operational cost and…
Identity leakage can emerge when independent databases are joined, even when each dataset is anonymized individually. While previous work focuses on post-join detection or complex privacy models, little attention has been given to simple,…
The use of Large Language Models (LLMs) for querying relational data has given rise to relQuery, a workload pattern that applies templated LLM calls to structured tables. As relQuery services become more widely adopted in applications such…
The stock market is inherently complex, with interdependent relationships among companies, sectors, and financial indicators. Traditional research has largely focused on time-series forecasting and single-company analysis, relying on…
Relational databases (RDBs) underpin the majority of global data management systems, where information is structured into multiple interdependent tables. To effectively use the knowledge within RDBs for predictive tasks, recent advances…
Large Language Models (LLMs) in agentic workflows combine multi-step reasoning, heterogeneous tool use, and collaboration across multiple specialized agents. Existing LLM serving engines optimize individual calls in isolation, while…
In this paper, we present a static code analysis strategy to extract logical schemas from NoSQL applications. Our solution is based on a model-driven reverse engineering process composed of a chain of platform-independent model…
The Shapley value, originating from cooperative game theory, has been employed to define responsibility measures that quantify the contributions of database facts to obtaining a given query answer. For non-numeric queries, this is done by…
We present an index structure, called the color-index, to boost the evaluation of acyclic conjunctive queries (ACQs) over binary schemas. The color-index is based on the color refinement algorithm, a widely used subroutine for graph…
This paper explores a new opportunity to improve the performance of transaction processing at the application side by merging structurely similar statements or transactions. Concretely, we re-write transactions to 1) merge similar…
Functional dependencies (FDs) are basic constraints in relational databases and are used for many data management tasks. Most FD discovery algorithms find all valid dependencies, but this causes two problems. First, the computational cost…
The FAIR (Findable, Accessible, Interoperable, and Reusable) data principles [1] promote the interoperability of scientific data by encouraging the use of persistent identifiers, standardized vocabularies, and formal metadata structures.…
Data scarcity and unreliable self-reporting -- such as concealment or exaggeration -- pose fundamental challenges to psychiatric intake and assessment. We propose a multi-agent synthesis framework that explicitly models patient deception to…
This paper presents Global Benchmark Database (GBD), a comprehensive suite of tools for provisioning and sustainably maintaining benchmark instances and their metadata. The availability of benchmark metadata is essential for many tasks in…
Approximate Nearest Neighbor Search (ANNS) underpins modern applications such as information retrieval and recommendation. With the rapid growth of vector data, efficient indexing for real-time vector search has become rudimentary. Existing…
We describe a novel system, CSQL, which automatically converts a collection of unstructured text documents into an SQL-queryable causal database (CDB). A CDB differs from a traditional DB: it is designed to answer "why'' questions via…
Static filtering is a data-independent optimisation method for Datalog, which generalises algebraic query rewriting techniques from relational databases. In spite of its early discovery by Kifer and Lozinskii in 1986, the method has been…