数据库
Human-data interaction (HDI) presents fundamentally different challenges from traditional data management. HDI systems must meet latency, correctness, and consistency needs that stem from usability rather than query semantics; failing to…
Disaggregating memory from compute offers the opportunity to better utilize stranded memory in cloud data centers. It is important to cache data in the compute nodes and maintain cache coherence across multiple compute nodes. However, the…
IVF is one of the most widely used ANNS (Approximate Nearest Neighbors Search) methods in vector databases. The idea of redundant assignment is to assign a data vector to more than one IVF lists for reducing the chance of missing true…
We investigate the problem of identifying database repairs for missing tuples in query answers. We show that when the query is part of the input - the combined complexity setting - determining whether or not a repair exists is…
The rapid integration of vector search into AI applications, particularly for Retrieval Augmented Generation (RAG), has catalyzed the emergence of a diverse ecosystem of specialized vector databases. While this innovation offers a rich…
Graph database query languages cannot express algorithms like PageRank, forcing costly data wrangling, while existing solutions such as algorithm libraries, vertex-centric APIs, and recursive CTEs lack the necessary combination of…
Robust text-to-SQL over complex, real-world databases remains brittle even with modern LLMs: iterative refinement often introduces syntactic and semantic drift, corrections tend to be non-transferable across queries, and naive use of large…
Embedding-based dense retrieval has become the cornerstone of many critical applications, where approximate nearest neighbor search (ANNS) queries are often combined with filters on labels such as dates and price ranges. Graph-based indexes…
Yannakakis' seminal algorithm is optimal for acyclic joins, yet it has not been widely adopted due to its poor performance in practice. This paper briefly surveys recent advancements in making Yannakakis' algorithm more practical, in terms…
In recent years, there has been significant progress in the development of deep learning models over relational databases, including architectures based on heterogeneous graph neural networks (hetero-GNNs) and heterogeneous graph…
Approximate Nearest Neighbor Search (ANNS) is a crucial operation in databases and artificial intelligence. While graph-based ANNS methods like HNSW and NSG excel in performance, they assume uniform query distribution. However, in…
Although machine learning (ML) shows potential in improving query optimization by generating and selecting more efficient plans, ensuring the robustness of learning-based cost models (LCMs) remains challenging. These LCMs currently lack…
The digital economy is powered by a continuous and massive exchange of personal data. Individuals provide data to platforms in return for services, from social networking and search to health monitoring, entertainment, and access to LLMs.…
How important is the weight of a given column in determining the ranking of tuples in a table? To address such an explanation question about a ranking function, we investigate the computation of SHAP scores for column weights, adopting a…
Translating SQL dialects across different relational database management systems (RDBMSs) is crucial for migrating RDBMS-based applications to the cloud. Traditional SQL dialect translation tools rely on manually-crafted rules,…
Modern database systems allow users to query or process unstructured text or document columns using LLM-powered functions. Users can express an operation in natural language (e.g., "identify if this review mentions billing issues"), with…
Maintaining spatial data (points in two or three dimensions) is crucial and has a wide range of applications, such as graphics, GIS, and robotics. To handle spatial data, many data structures, called spatial indexes, have been proposed,…
Many real-world matrix datasets arrive as high-throughput vector streams, making it impractical to store or process them in their entirety. To enable real-time analytics under limited computational, memory, and communication resources,…
We introduce QueryGym, an interactive environment for building, testing, and evaluating LLM-based query planning agents. Existing frameworks often tie agents to specific query language dialects or obscure their reasoning; QueryGym instead…
Time series decomposition into trend, seasonal structure, and residual components is a core primitive for downstream analytics such as anomaly detection, change-point detection, and forecasting. However, most existing seasonal-trend…