数据库
LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) establishes that GPT-4 agents achieve 65% on 139 industrial maintenance scenarios backed by CouchDB,…
Reverse k nearest neighbor (RkNN) queries are fundamental in spatial databases, location-based analytics, and recommendation systems. Existing state-of-the-art techniques rely on spatial pruning supported by R-trees and their variants.…
Approximate nearest neighbor (ANN) search with range filters has recently garnered significant attention. This paper delves into a generalized form of this problem, i.e., ANN search with exact range-range (RR) predicates on a range-valued…
Large collections of tabular data from data lakes, web tables and open data portals often originate from heterogeneous sources, leading to representational inconsistencies. Understanding and organizing such repositories therefore remains a…
Parquet is the de facto columnar file format in modern analytical systems, yet its configuration guidelines have largely been shaped by CPU-centric execution models. As GPU-accelerated data processing becomes increasingly prevalent, Parquet…
LLMs have recently shown strong potential in enhancing node-level tasks on text-attributed graphs (TAGs) by providing explanation features. However, their practical use is severely limited by the high computational and monetary cost of…
Large collections of tabular data from data lakes, web tables and open data portals often originate from heterogeneous sources, leading to representational inconsistencies. Understanding and organizing such repositories therefore remains a…
Large language models (LLMs) consistently achieve strong results on text-to-SQL benchmarks, but their robustness to schema variations remains poorly understood. Recent work suggests that the schema structure matters, but does not provide a…
Product Quantization (PQ) construction is deeply integrated into vector index construction for Approximate Nearest Neighbor Search (ANNS). The rapid growth in vector dimensionality and volume has significantly increased the computational…
Approximate functional dependencies (AFDs) relax exact functional dependencies by tolerating a bounded degree of violation, making them suited for data quality auditing. Threshold-based discovery returns all dependencies above a…
Untargeted metabolomics generates large volumes of tandem mass spectrometry (MS/MS) data and computational annotations that can reveal molecular mechanisms across organisms and environments. Public reuse has improved through harmonized…
We study the problem of cardinality estimation for LIKE queries on string data, focusing on the most common patterns in real workloads: prefix, suffix, and substring queries. We propose LEARNT, a LIKE query Estimator with Accuracy,…
Deep learning over relational databases is conventionally realized by translating data into graph representations and applying graph-based neural networks within external frameworks. This round-trip between the database and external machine…
We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through \emph{latent world recovery}. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than…
Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across deployments, paying a significant performance cost. We argue that LLM-based coding agents now make a…
This research paper introduces two new constraint types and four subtypes of database constraints added to our (Elementary) Mathematical Data Model, which are the duals of the existence and non-existence ones. They are formally defined,…
Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from…
Laboratory workflows in pharmaceutical and biomedical research encode substantial tacit knowledge -- expert judgment about failure conditions, decision branching logic, and contextual dependencies -- that remains inaccessible to protocol…
Temporal range filtering is critical in large-scale search systems, particularly location-based services filtering businesses by operating hours. Traditional approaches suffer from poor query performance (scope filtering), index size…
The fastest indexes for Approximate Nearest Neighbor Search today are also the slowest to build: graph-based methods like HNSW and Vamana achieve state-of-the-art query performance but have large construction times due to relying on…