Databases
AI-driven analytics are increasingly crucial to data-centric decision-making. The practice of exporting data to machine learning runtimes incurs high overhead, limits robustness to data drift, and expands the attack surface, especially in…
In this paper, we introduce GraphLake, a purpose-built graph compute engine for Lakehouse. GraphLake is built on top of the commercial graph database TigerGraph. It maps Lakehouse tables to vertex and edge types in a labeled property graph…
Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation…
The query optimizer is a fundamental component of database management systems. Recent studies have shown that learned query optimizers outperform traditional cost-based query optimizers. However, they fail to exploit valuable runtime…
In this paper, we introduce TigerVector, a system that integrates vector search and graph query within TigerGraph, a Massively Parallel Processing (MPP) native graph database. We extend the vertex attribute type with the embedding type. To…
Tiered memory architectures have gained significant traction in the database community in recent years. In these architectures, the on-chip DRAM of the host processor is typically referred to as local memory, and forms the primary tier.…
This paper analyzes execution instability in traditional cost-based database management systems (DBMS) and identifies a structural timing misalignment between optimization and execution stages that contributes to tail-latency amplification.…
Table-text retrieval aims to retrieve relevant tables and text to support open-domain question answering. Existing studies use either early or late fusion, but face limitations. Early fusion pre-aligns a table row with its associated…
Tabular reasoning benchmarks mix semantic inference, numerical computation, and brittle table formatting, yet evaluations for small models remain vulnerable to contamination, dataset artifacts, and retrieval failures. We propose GLEAN, a…
Scientific research relies on well-structured, standardized data; however, much of it is stored in formats such as free-text lab notebooks, non-standardized spreadsheets, or data repositories. This lack of structure challenges…
Graph-based indexing is the dominant approach for approximate nearest neighbor search in vector databases, offering high recall with low latency across billions of vectors. However, in such indices, the edge set of the proximity graph is…
With millisecond-level latency and support for mutable objects, recent low-latency object storage services such as Amazon S3 Express One Zone have become an attractive option for OLTP engines to directly commit transactions and…
Traditional query processing relies on engines that are carefully optimized and engineered by many experts. However, new techniques and user requirements evolve rapidly, and existing systems often cannot keep pace. At the same time, these…
Modern OLAP engines are designed to support arbitrary analytical workloads, but this generality incurs structural overhead, including runtime schema interpretation, indirection layers, and abstraction boundaries, even in highly optimized…
As data volumes grow while memory capacity remains limited, disk-resident graph-based approximate nearest neighbor (ANN) methods have become a practical alternative to memory-resident designs, shifting the bottleneck from computation to…
Benchmark workloads are extremely important to the database management research community, especially as more machine learning components are integrated into database systems. Here, we propose a Bayesian optimization technique to…
Approximate nearest neighbor search (ANNS) has become a cornerstone in modern vector database systems. Given a query vector, ANNS retrieves the closest vectors from a set of base vectors. In real-world applications, vectors are often…
Data quality (DQ) and transparency of secondary data are critical factors that delay the adoption of clinical AI models and affect clinician trust in them. Many DQ studies fail to clarify where, along the lifecycle, quality checks occur,…
Modern distributed databases face challenges in achieving transactional consistency across distributed partitions. Traditional two-phase commit (2PC) protocols incur high coordination overhead and latency, and require complex recovery for…
Blockchain provides a decentralized and tamper-resistant ledger for securely recording transactions across a network of untrusted nodes. While its transparency and integrity are beneficial, the substantial storage requirements for…