数据库
Geosocial reachability queries (\textsc{RangeReach}) determine whether a given vertex in a geosocial network can reach any spatial vertex within a query region. The state-of-the-art 3DReach method answers such queries by encoding graph…
Retrieval-augmented generation (RAG) enhances LLM reasoning in knowledge-intensive tasks, but existing RAG pipelines incur substantial retrieval and generation overhead when applied to large-scale entity matching. To address this…
Datalog is an increasingly popular recursive query language that is declarative by design, meaning its programs must be translated by an engine into the actual physical execution plan. When generating this plan, a central decision is how to…
Modern multi-socket architectures offer a single virtual address space, but physically divide main-memory across multiple regions, where each region is attached to a CPU and its cores. While this simplifies the usage, developers must be…
Recent standardization efforts for graph databases lead to standard query languages like GQL and SQL/PGQ, and constraint languages like Property Graph Constraints (PG-Constraints). In this paper, we embark on the study of repairing property…
Recent advances in Entity Resolution (ER) have leveraged Large Language Models (LLMs), achieving strong performance but at the cost of substantial computational resources or high financial overhead. Existing LLM-based ER approaches operate…
Retrieval-augmented generation (RAG) is now standard for knowledge-intensive LLM tasks, but most systems still treat every query as fresh, repeatedly re-retrieving long passages and re-reasoning from scratch, inflating tokens, latency, and…
Central to the efficacy of prognostics and health management methods is the acquisition and analysis of degradation data, which encapsulates the evolving health condition of engineering systems over time. Degradation data serves as a rich…
Biodiversity open-access databases are valuable resources in the structuring and accessibility of species occurrence data. By compiling different data sources, they reveal the uneven spatial distribution of knowledge, with areas or…
Data agents are an emerging paradigm that leverages large language models (LLMs) and tool-using agents to automate data management, preparation, and analysis tasks. However, the term "data agent" is currently used inconsistently, conflating…
As data volumes continue to grow, optimizing database performance has become increasingly critical, making the implementation of effective tuning methods essential. Among various approaches, database parameter tuning has proven to be a…
Low-level database operators often admit multiple physical implementations ("kernels") that are semantically equivalent but have vastly different performance characteristics depending on the input data distribution. Existing database…
Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints.…
Understanding dataset semantics is crucial for effective search, discovery, and integration pipelines. To this end, column type annotation (CTA) methods associate columns of tabular datasets with semantic types that accurately describe…
Log-Structured Merge-Trees (LSM-trees) dominate persistent key-value storage but suffer from high write amplification from 10x to 30x under random workloads due to repeated compaction. This overhead becomes prohibitive for large values with…
Approximate nearest neighbor search (ANNS) is a core problem in machine learning and information retrieval applications. GPUs offer a promising path to high-performance ANNS: they provide massive parallelism for distance computations, are…
We present an approach to computing consistent answers to queries possibly involving an aggregation operator in databases operating under a star schema and possibly containing missing values and inconsistent data. Our approach is based on…
Distributed Stream Processing Systems (DSPSs) form the backbone of real-time processing and analytics at ByteDance, where Apache Flink powers one of the largest production clusters worldwide. Ensuring resiliency, the ability to withstand…
The advancement of data-driven materials science is currently constrained by a fundamental bottleneck: the vast majority of historical experimental data remains locked within the unstructured text and rasterized figures of legacy scientific…
Database research and development rely heavily on realistic user workloads for benchmarking, instance optimization, migration testing, and database tuning. However, acquiring real-world SQL queries is notoriously challenging due to strict…