数据库
Data lakes spend a significant fraction of query execution time on scanning data from remote, disaggregated storage. Decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files. To address this bottleneck, we…
Modern database management systems (DBMS) face significant challenges in maintaining performance and availability under dynamic workloads. This paper proposes a novel self-healing framework that integrates Model-Agnostic Meta-Learning…
Whales are an important part of the oceanic ecosystem. Although historic commercial whale hunting a.k.a. whaling has severely threatened whale populations, whale researchers are looking at historical whaling data to inform current whale…
JDBC remains a key technology for database access in Java applications. Since the database dictionary and the Java type system have distinct scopes, developers inevitably need to deal with bugs in SQL-to-Java type mappings. We propose an…
Modern database systems increasingly co-schedule time-sensitive and background tasks. In such mixed workloads, background tasks should ideally utilize only spare CPU capacity without interfering with latency-critical requests. While some…
Despite unprecedented growth in biodiversity data, a persistent gap remains between what is known and what is acted upon. Existing frameworks such as the FAIR and CLEAR Principles have improved data accessibility and interpretability but do…
Large-scale search engines face a fundamental tension: the index must be updated frequently to maintain freshness, yet updates create resource contention that inflates query latency. In the dominant Lucene-based architecture, segment merges…
Knowledge Graph Question Answering (KGQA) has advanced through structured query generation, yet most efforts target RDF/SPARQL, leaving Cypher and property graphs underexplored, despite increasing demand for unified KGQA in industry…
Although Approximate Nearest Neighbor (ANN) search has been extensively studied, large-k ANN queries that aim to retrieve a large number of nearest neighbors remain underexplored, despite their numerous real-world applications. Existing ANN…
Materialized views are a core construct in database systems, used to accelerate analytical queries and optimize batch pipelines for extract-transform-load (ETL) workflows. Maintaining view consistency as underlying data evolves is a…
Range-filtered approximate nearest neighbor (RFANN) search is a fundamental operation in modern data systems. Given a set of objects, each with a vector and a numerical attribute, an RFANN query retrieves the nearest neighbors to a query…
Column matching is a central task in reconciling schemas for data integration. Column names and descriptions are valuable for this task. LLMs can leverage such natural-language schema metadata. However, in many datasets, correct matching…
Businesses increasingly rely on large language models (LLMs) to automate simple repetitive tasks instead of developing custom machine learning models. LLMs require few, if any, training examples and can be utilized by users without…
Vector search plays a crucial role in many real-world applications. In addition to single-vector search, multi-vector search becomes important for multi-modal and multi-feature scenarios today. In a multi-vector database, each row is an…
Differential privacy (DP) enables private data analysis. In a typical DP deployment, controllers manage individuals' sensitive data and are responsible for answering analysts' queries while protecting individuals' privacy. They do so by…
Team project-based learning (TPBL) combines two learning techniques: project-based learning (PBL) and teamwork. This combination leverages the learning outcomes of both methods and places students in a real work situation where they must…
Databases, and datasets more generally, evolve continuously through updates, transformations, versioning, schema changes, streaming operations, and other mechanisms. While prior work has noted connections among some of these areas, they…
Text-to-SQL enables non-expert users to query databases in natural language, yet real-world schemas often suffer from ambiguous, abbreviated, or inconsistent naming conventions that degrade model accuracy. Existing approaches treat schemas…
The paper analyzes and characterizes the algebraic and logical structure of the multiset semantics for SPARQL patterns involving AND, UNION, FILTER, EXCEPT, and SELECT. To do this, we align SPARQL with two well-established query languages:…
Big data platforms are widely used in modern enterprises, and an in-production intelligent assistant is increasingly important to help users quickly find actionable guidance and reduce operational burden. While recent LLM+RAG assistants…