数据库
The integration of large-scale chemical databases represents a critical bottleneck in modern cheminformatics research, particularly for machine learning applications requiring high-quality, multi-source validated datasets. This paper…
Direct access asks for the retrieval of query answers by their ranked position, given a query and a desired order. While the time complexity of data structures supporting such accesses has been studied in depth, and efficient algorithms for…
Evolving phenomena, often complex, can be represented using knowledge graphs, which have the capability to model heterogeneous data from multiple sources. Nowadays, a considerable amount of sources delivering periodic updates to knowledge…
We study the limits of linear time evaluation of conjunctive queries under constraints expressed as tuple-generating dependencies (TGDs), across several modes of query evaluation: single-testing, all-testing, counting, lexicographic direct…
The management of versioned knowledge graphs presents significant challenges, particularly in querying data across multiple versions efficiently. This paper introduces QuaQue, a key component of the ConVer-G system, which addresses this…
During research, domain experts often ask analytical questions whose answers require integrating data from a wide range of web sources. Thus, they must spend substantial effort searching, extracting, and organizing raw data before analysis…
This work presents a highly optimized implementation of PAC-DB, a recent and promising database privacy model. We prove that our SIMD-PAC-DB can compute the same privatized answer with just a single query, instead of the 128 stochastic…
Answering Select-Join-Aggregate queries with DP is a fundamental problem with important applications in various domains. The current SOTA methods ensure user-level DP (i.e., the adversary cannot infer the presence or absence of any given…
Elastic similarity measures are fundamental to time series similarity search because of their ability to handle temporal misalignments. These measures are inherently computationally expensive, therefore necessitating the use of lower bounds…
Software transactional memory (STM) allows programmers to easily implement concurrent data structures. STMs simplify atomicity. Recent STMs can achieve good performance for some workloads but they have some limitations. In particular, STMs…
Index recommendation is crucial for optimizing database performance. However, existing heuristic- and learning-based methods often rely on inefficient exhaustive search and estimated costs, leading to low efficiency (due to the vast search…
Long-context question answering (QA) over lengthy documents is critical for applications such as financial analysis, legal review, and scientific research. Current approaches, such as processing entire documents via a single LLM call or…
We consider database schemas consisting of a single binary relation, with key constraints and inclusion dependencies. Over this space of 20 schemas, we completely characterize when one schema is generically dominated by another schema.…
In data lakes, information on the same subject is often fragmented across multiple tables. Table union search aims to find the top-k tables that can be unioned with a query table to extend it with more rows, without relying on metadata or…
Semantic operators abstract large language model (LLM) calls in SQL clauses. It is gaining traction as an easy method to analyze semi-structured, unstructured, and multimodal datasets. While a plethora of recent works optimize various…
Traditional GPU hash tables preserve every inserted key -- a dictionary assumption that wastes scarce High Bandwidth Memory (HBM) when embedding tables routinely exceed single-GPU capacity. We challenge this assumption with cache semantics,…
Biomedical knowledge is fragmented across siloed databases -- Reactome for pathways, STRING for protein interactions, ClinicalTrials.gov for study registries, DrugBank for drug vocabularies, DGIdb for drug-gene interactions, SIDER for side…
This paper demonstrates that adopting out-of-place writes is essential for database systems to fully leverage SSD performance and extend SSD lifespan. We propose a set of out-of-place optimizations that collectively reduce write…
The rapid advancement of AI is transforming human-centered systems, with profound implications for human-AI interaction, human-data interaction, and visual analytics. In the AI era, data analysis increasingly involves large-scale,…
Causal analysis on relational databases is challenging, as analysis datasets must be repeatedly queried from complex schemas. Recent LLM systems can automate individual steps, but they hardly manage dependencies across analysis stages,…