数据库
The Graph Edit Distance (GED) is an important metric for measuring the similarity between two (labeled) graphs. It is defined as the minimum cost required to convert one graph into another through a series of (elementary) edit operations.…
In this paper we propose an approach to implement specific relation-ship set between two entities called combinatorial relationship set. For the combinatorial relationship set B between entity sets G and I the mapping cardinality is…
There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into a end-user applications…
In recent years, querying semantic web data using SPARQL has remained challenging, especially for non-expert users, due to the language's complex syntax and the prerequisite of understanding intricate data structures. To address these…
Process mining analyzes and improves processes by examining transactional data stored in event logs, which record sequences of events with timestamps. However, the effectiveness of process mining, especially when combined with machine or…
Large Language Models (LLMs) have demonstrated remarkable progress in translating natural language to SQL, but a significant semantic gap persists between their general knowledge and domain-specific semantics of databases. Historical…
In the real business world, data is stored in a variety of sources, including structured relational databases, unstructured databases (e.g., NoSQL databases), or even CSV/excel files. The ability to extract reasonable insights across these…
Reachability queries ask whether there exists a path from the source vertex to the target vertex on a graph. Recently, several powerful reachability queries, such as Label-Constrained Reachability (LCR) queries and Regular Path Queries…
Filtered approximate nearest neighbor search (ANNS) restricts the search to data objects whose attributes satisfy a given filter and retrieves the top-$K$ objects that are most semantically similar to the query object. Many graph-based ANNS…
In many real-world scenarios, query results must satisfy domain-specific constraints. For instance, a minimum percentage of interview candidates selected based on their qualifications should be female. These requirements can be expressed as…
Simpson's paradox, a long-standing statistical phenomenon, describes the reversal of an observed association when data are disaggregated into sub-populations. It has critical implications across statistics, epidemiology, economics, and…
Object Centric Event Data (OCED) has gained attention in recent years within the field of process mining. However, there are still many challenges, such as connecting the XES format to object-centric approaches to enable more insightful…
To discover new insights from data, there is a growing need to share information that is often held by different organisations. One key task in data integration is the calculation of similarities between records in different databases to…
Hybrid search, the integration of lexical and semantic retrieval, has become a cornerstone of modern information retrieval systems, driven by demanding applications like Retrieval-Augmented Generation (RAG). The architectural design space…
Publicly available vessel trajectory data is emitted continuously from the global AIS system. Continuous trajectory similarity search on this data has applications in, e.g., maritime navigation and safety. Existing proposals typically…
We study the problem of statically optimizing select-project-join-union (SPJU) plans where unary key constraints are allowed. A natural measure of a plan, which we call the output degree and which has been studied previously, is the minimum…
Approximate $k$-nearest neighbor search (A$k$-NNS) is a core operation in vector databases, underpinning applications such as retrieval-augmented generation (RAG) and image retrieval. In these scenarios, users often prefer diverse result…
Manually conducting real-world data analyses is labor-intensive and inefficient. Despite numerous attempts to automate data science workflows, none of the existing paradigms or systems fully demonstrate all three key capabilities required…
Automated data preparation pipeline construction is critical for machine learning success, yet existing methods suffer from two fundamental limitations: they treat pipeline construction as black-box optimization without quantifying…
Nowadays, the explosion of unstructured data presents immense analytical value. Leveraging the remarkable capability of large language models (LLMs) in extracting attributes of structured tables from unstructured data, researchers are…