数据库
The fitting problem for conjunctive queries (CQs) is the problem to construct a CQ that fits a given set of labeled data examples. When a fitting CQ exists, it is in general not unique. This leads us to proposing natural refinements of the…
The $ k $-plex model, which allows each vertex to miss connections with up to $ k $ neighbors, serves as a relaxation of the clique. Its adaptability makes it more suitable for analyzing real-world graphs where noise and imperfect data are…
Continuous and reliable access to curated biological data repositories is indispensable for accelerating rigorous scientific inquiry and fostering reproducible research. Centralized repositories, though widely used, are vulnerable to single…
Research data management (RDM) is a key data literacy skill that chemistry students must acquire. Concepts such as the FAIR data principles (Findable, Accessible, Interoperable, Reusable) should be taught and applied in undergraduate…
Embedding models capture both semantic and syntactic structures of queries, often mapping different queries to similar regions in vector space. This results in non-uniform cluster access patterns in modern disk-based vector databases. While…
Graph analytics is widely used in many fields to analyze various complex patterns. However, in most cases, important data in companies is stored in RDBMS's, and so, it is necessary to extract graphs from relational databases to perform…
We propose an interactive system using foundation models and user-provided technical documents to generate Failure Mode and Effects Analyses (FMEA) for industrial equipment. Our system aggregates unstructured content across documents to…
In the digital era, data spaces are emerging as key ecosystems for the secure and controlled exchange of information among participants. To achieve this, components such as metadata catalogs and data space connectors are essential. This…
We present EPIC, an AI-driven platform designed to augment operational data analytics. EPIC employs a hierarchical multi-agent architecture where a top-level large language model provides query processing, reasoning and synthesis…
When query evaluation produces too many tuples, a new approach in query answering is to retrieve a diverse subset of them. The standard approach for measuring the diversity of a set of tuples is to use a distance function between tuples,…
With the growing integration of structured and unstructured data, new methods have emerged for performing similarity searches on vectors while honoring structured attribute constraints, i.e., a process known as Filtering Approximate Nearest…
Electronic health records (EHRs) contain richly structured, longitudinal data essential for predictive modeling, yet stringent privacy regulations (e.g., HIPAA, GDPR) often restrict access to individual-level records. We introduce…
Deploying Large Language Models (LLMs) on resource-constrained devices remains challenging due to limited memory, lack of GPUs, and the complexity of existing runtimes. In this paper, we introduce TranSQL+, a template-based code generator…
The Causal Effect (CE) is a numerical measure of causal influence of variables on observed results. Despite being widely used in many areas, only preliminary attempts have been made to use CE as an attribution score in data management, to…
The transport sector is a major contributor to greenhouse gas emissions in Europe. Shifting to electric vehicles (EVs) powered by a low-carbon energy mix would reduce carbon emissions. However, to support the development of electric…
With the advent of big data, periodic pattern mining has demonstrated significant value in real-world applications, including smart home systems, healthcare systems, and the medical field. However, advances in network technology have…
With a user-specified minimum utility threshold (minutil), periodic high-utility pattern mining (PHUPM) aims to identify high-utility patterns that occur periodically in a transaction database. A pattern is deemed periodic if its period…
In this study, we optimize SQL+ML queries on top of OpenMLDB, an open-source database that seamlessly integrates offline and online feature computations. The work used feature-rich synthetic dataset experiments in Docker, which acted like…
In the past few decades, the life sciences have experienced an unprecedented accumulation of data, ranging from genomic sequences and proteomic profiles to heavy-content imaging, clinical assays, and commercial biological products for…
Unstructured data, such as text, images, audio, and video, comprises the vast majority of the world's information, yet it remains poorly supported by traditional data systems that rely on structured formats for computation. We argue for a…