数据库
Cardinality estimation has long been crucial for cost-based database optimizers in identifying optimal query execution plans, attracting significant attention over the past decades. While recent advancements have significantly improved the…
Data lakehouses (LHs) are at the core of current cloud analytics stacks by providing elastic, relational compute on data in cloud data lakes across vendors. For relational semantics, they rely on open table formats (OTFs). Unfortunately,…
People without a database background usually rely on file systems or tools such as Excel for data management, which often lead to redundancy and data inconsistency. Relational databases possess strong data management capabilities, but…
Over the past decade, the proliferation of public and enterprise data lakes has fueled intensive research into data discovery, aiming to identify the most relevant data from vast and complex corpora to support diverse user tasks.…
Analytical SQL queries are essential for extracting insights from relational databases but concurrently introduce significant privacy risks by potentially exposing sensitive information. To mitigate these risks, numerous query sanitization…
In this technical report, we present a formalisation of the MongoDB aggregation framework. Our aim is to identify a fragment that could serve as the starting point for an industry-wide standard for querying JSON document databases. We…
A growing trend in modern data analysis is the integration of data management with learning, guided by accuracy, latency, and cost requirements. In practice, applications draw data of different formats from many sources. In the meanwhile,…
We present the Poseidon engine behind the Neptune Analytics graph database service. Customers interact with Poseidon using the declarative openCypher query language, which enables requests that seamlessly combine traditional querying…
Data prefetching--loading data into the cache before it is requested--is essential for reducing I/O overhead and improving database performance. While traditional prefetchers focus on sequential patterns, recent learning-based approaches,…
Data and workload drift are key to evaluating database components such as caching, cardinality estimation, indexing, and query optimization. Yet, existing benchmarks are static, offering little to no support for modeling drift. This…
In this paper, we present the design and architecture of REI, a novel system for indexing log data for regular expression queries. Our main contribution is an $n$-gram-based indexing strategy and an efficient storage mechanism that results…
The proliferation of complex, multimodal datasets has exposed a critical gap between the capabilities of specialized vector databases and traditional graph databases. While vector databases excel at semantic similarity search, they lack the…
Timely detection of critical health conditions remains a major challenge in public health analytics, especially in Big Data environments characterized by high volume, rapid velocity, and diverse variety of clinical data. This study presents…
We present HES-SQL, a novel hybrid training framework that advances Text-to-SQL generation through the integration of thinking-mode-fused supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Our approach introduces…
The rise of distributed applications and cloud computing has created a demand for scalable, high-performance key-value storage systems. This paper presents a performance evaluation of three prominent NoSQL key-value stores: Redis,…
Similarity join--a widely used operation in data science--finds all pairs of items that have distance smaller than a threshold. Prior work has explored distributed computation methods to scale similarity join to large data volumes but these…
Purpose: Cyber-Physical Systems (CPSs) integrate computation and physical processes, producing time series data from thousands of sensors. Knowledge graphs can contextualize these data, yet current approaches that are applicably to…
Semantic query processing engines often support semantic joins, enabling users to match rows that satisfy conditions specified in natural language. Such join conditions can be evaluated using large language models (LLMs) that solve novel…
Cardinality estimation is a fundamental task in database systems and plays a critical role in query optimization. Despite significant advances in learning-based cardinality estimation methods, most existing approaches remain difficult to…
The analytics of spatiotemporal data is increasingly important for mobility analytics. Despite extensive research on moving object databases (MODs), few systems are ready on production or lightweight enough for analytics. MobilityDB is a…