数据库
To meet the standards of the Open Science movement, the FAIR Principles emphasize the importance of making scientific data Findable, Accessible, Interoperable, and Reusable. Yet, creating a repository that adheres to these principles…
The increasing adoption of digital health technologies has amplified the need for robust, interoperable solutions to manage complex healthcare data. We present the Spezi Data Pipeline, an open-source Python toolkit designed to streamline…
JSON Lines (JSONL) is widely used for managing large collections of semi-structured data, ranging from large language model (LLM) prompts to chemical compound records and geospatial datasets. A key operation is substructure search, which…
As the demand for querying databases in all areas of life continues to grow, researchers have devoted significant attention to the natural language interface for databases (NLIDB). This paper presents a comprehensive survey of recently…
The increasing volume and complexity of X-ray absorption spectroscopy (XAS) data generated at synchrotron facilities worldwide require robust infrastructure for data management, sharing, and analysis. This paper introduces the XAS Database…
In recent years, the Shapley value has emerged as a general game-theoretic measure for assessing the contribution of a tuple to the result of a database query. We study the complexity of calculating the Shapley value of a tuple for an…
The NIAID Data Ecosystem Discovery Portal (https://data.niaid.nih.gov) provides a unified search interface for over 4 million datasets relevant to infectious and immune-mediated disease (IID) research. Integrating metadata from…
The Key-Value (KV) cache reading latency increases significantly with context lengths, hindering the efficiency of long-context LLM inference. To address this, previous works propose retaining a small fraction of KV cache based on token…
Index recommendation is one of the most important problems in database management system (DBMS) optimization. Given queries and certain index-related constraints, traditional methods rely on heuristic optimization or learning-based models…
We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees…
SQL queries in real world analytical environments, whether written by humans or generated automatically often suffer from syntax errors, inefficiency, or semantic misalignment, especially in complex OLAP scenarios. To address these…
With the prevalence of in-database AI-powered analytics, there is an increasing demand for database systems to efficiently manage the ever-expanding number and size of deep learning models. However, existing database systems typically store…
We describe a method for merging multiple spreadsheets into one sheet, and/or exchanging data among the sheets, by expressing each sheet's formulae as an algebraic (equational) theory and each sheet's values as a model of its theory,…
Given a conjunctive query and a database instance, we aim to develop an index that can efficiently answer spatial queries on the results of a conjunctive query. We are interested in some commonly used spatial queries, such as range…
Large Language Models (LLMs) are being increasingly used as a building block in data systems to process large text datasets. To do so, LLM model providers offer multiple LLMs with different sizes, spanning various cost-quality trade-offs…
To obtain insights from event data, advanced process mining methods assess the similarity of activities to incorporate their semantic relations into the analysis. Here, distributional similarity that captures similarity from activity…
Knowledge graph construction has become an essential domain for the future of biomedical research. But current approaches demand a high amount of redundant labor. These redundancies are the result of the lack of data standards and…
Recording data changes in RDF systems is a crucial capability, needed to support auditing, incremental backups, database replication, and event-driven workflows. In large-scale and low-latency RDF applications, the high volume and frequency…
This paper revisits the problem of repairing and querying inconsistent databases equipped with universal constraints. We adopt symmetric difference repairs, in which both deletions and additions of facts can be used to restore consistency,…
This article proposes a paraconsistent framework for evaluating similarity in knowledge bases. Unlike classical approaches, this framework explicitly integrates contradictions, enabling a more robust and interpretable similarity measure. A…