Databases
Large Language Models (LLMs) have recently become sophisticated enough to automate many tasks ranging from pattern finding to writing assistance to code generation. In this paper, we examine text-to-SQL generation. We have observed from…
The increasing complexity and scale of scientific datasets demand advanced tools for efficient discovery and exploration. Traditional search systems often fall short in addressing the multidimensional nature of data and their intricate…
The growing demand for data-driven decision-making has created an urgent need for data agents that can integrate structured and unstructured data for analysis. While data agents show promise for enabling users to perform complex analytics…
Computation-Enabled Object Storage (COS) systems, such as MinIO and Ceph, have recently emerged as promising storage solutions for post hoc, SQL-based analysis on large-scale datasets in High-Performance Computing (HPC) environments. By…
If one looks at contemporary mainstream development practices for conceptual modelling in computer science, these so clearly focus on a conceptual schema completely separated from its information base that the conceptual schema is often…
Unionable table search techniques input a query table from a user and search for data lake tables that can contribute additional rows to the query table. The definition of unionability is generally based on similarity measures which may…
Near-duplicate text alignment is the task of identifying, among the texts in a corpus, all the subsequences (substrings) that are similar to a given query. Traditional approaches rely on seeding-extension-filtering heuristics, which lack…
Hybrid storage solutions have emerged as potent strategies to alleviate the data storage bottlenecks prevalent in blockchain systems. These solutions harness off-chain Storage Services Providers (SPs) in conjunction with Authenticated Data…
Approximate nearest neighbor search (ANNS) is a crucial problem in information retrieval and AI applications. Recently, there has been a surge of interest in graph-based ANNS algorithms due to their superior efficiency and accuracy.…
Data engineering workflows require reliable differencing across files, databases, and query outputs, yet existing tools falter under schema drift, heterogeneous types, and limited explainability. SmartDiff is a unified system that combines…
The emergence of large language models (LLMs) has enabled a new class of semantic data processing systems (SDPSs) to support declarative queries against unstructured documents. Existing SDPSs, however, lack a unified algebraic…
In recent years, organizing group meetups for entertainment or other necessities has gained significant importance, especially given the busy nature of daily schedules. People often combine multiple activities, such as dropping kids off at…
Deterministic databases enable scalable replicated systems by executing transactions in a predetermined order. However, existing designs fail to capture transaction dependencies, leading to insufficient scheduling, high abort rates, and…
A high-quality SQL corpus is essential for intelligent databases. For example, Text-to-SQL requires SQL queries and their corresponding natural language questions as training samples. However, collecting such a query corpus remains challenging in…
AI-augmented data processing systems (DPSs) integrate large language models (LLMs) into query pipelines, allowing powerful semantic operations on structured and unstructured data. However, the reliability (a.k.a. trust) of these systems is…
When complex SQL queries suffer slow executions despite query optimization, DBAs typically invoke automated query rewriting tools to recommend "lean" equivalents that are conducive to faster execution. The rewritings are usually achieved…
In this paper, we introduce DobLIX, a dual-objective learned index specifically designed for Log-Structured Merge (LSM) tree-based key-value stores. Although traditional learned indexes focus exclusively on optimizing index lookups, they…
Index tuning is crucial for optimizing database performance by selecting optimal indexes based on workload. The key to this process lies in an accurate and efficient benefit estimator. Traditional methods relying on what-if tools often…
We report our participation in the SISAP 2025 Indexing Challenge using a novel indexing technique called the Hilbert forest. The method is based on the fast Hilbert sort algorithm, which efficiently orders high-dimensional points along a…
Database normalization is crucial to preserving data integrity. However, it is time-consuming and error-prone, as it is typically performed manually by data engineers. To this end, we present Miffie, a database normalization framework that…