数据库
The unification of large language models (LLMs) and knowledge graphs (KGs) has emerged as a hot topic. At the LLM+KG'24 workshop, held in conjunction with VLDB 2024 in Guangzhou, China, one of the key themes explored was important data…
The Provenance Ontology (PROV-O) is a World Wide Web Consortium (W3C) recommended ontology used to structure data about provenance across a wide variety of domains. Basic Formal Ontology (BFO) is a top-level ontology ISO/IEC standard used…
Spatial joins are among the most time-consuming spatial queries, remaining costly even in parallel and distributed systems. In this paper, we explore hardware acceleration for spatial joins by proposing SwiftSpatial, an FPGA-based…
The emerging paradigm of data economy can constitute an unmissable and attractive opportunity for companies that aim to consider their data as valuable assets. To fully leverage this opportunity, data owners need to have specific and…
The Big Data landscape poses challenges in managing diverse data formats, requiring efficient storage and processing for high-quality analysis. Effective metadata management is crucial for organizing, accessing, and reusing data within…
Open science represents a transformative research approach essential for enhancing sustainability and impact. Data generation encompasses various methods, from automated processes to human-driven inputs, creating a rich and diverse…
Cardinality estimation is a fundamental functionality in database systems. Most existing cardinality estimators focus on handling predicates over numeric or categorical data. They have largely omitted an important data type, set-valued…
The shift toward high-quality urbanization has brought increased attention to the issue of "urban villages", which has become a prominent social problem in China. However, there is a lack of available geospatial data on urban villages,…
The mathematics of redistricting is an area of study that has exploded in recent years. In particular, many different research groups and expert witnesses in court cases have used outlier analysis to argue that a proposed map is a…
Spatio-Temporal (ST) data science, which includes sensing, managing, and mining large-scale data across space and time, is fundamental to understanding complex systems in domains such as urban computing, climate science, and intelligent…
CARDS (Corpus of Acyclic Repositories and Dependency Systems) is a collection of directed graphs which express dependency relations, extracted from diverse real-world sources such as package managers, version control systems, and event…
Context: Entity resolution (ER) plays a pivotal role in data management by determining whether multiple records correspond to the same real-world entity. Because of its critical importance across domains such as healthcare, finance, and…
The graph-based index has been widely adopted to meet the demand for approximate nearest neighbor search (ANNS) for high-dimensional vectors. However, in dynamic scenarios involving frequent vector insertions and deletions, existing systems…
The same real-world entity (e.g., a movie, a restaurant, a person) may be described in various ways on different datasets. Entity Resolution (ER) aims to find such different descriptions of the same entity, this way improving data quality…
With the rapid advancement of brain-computer interface (BCI) technology, the volume of physiological data generated in related research and applications has grown significantly. Data is a critical resource in BCI research and a key factor…
Preparing high-quality datasets required by various data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets towards a single…
Data quality is crucial in machine learning (ML) applications, as errors in the data can significantly impact the prediction accuracy of the underlying ML model. Therefore, data cleaning is an integral component of any ML pipeline. However,…
JSON Schema is a logical language used to define the structure of JSON values. JSON Schema syntax is based on nested schema objects. In all versions of JSON Schema until Draft-07, collectively known as Classical JSON Schema, the semantics…
The task of joining two tables is fundamental for querying databases. In this paper, we focus on the equi-join problem, where a pair of records from the two joined tables are part of the join results if equality holds between their values…
One of the most celebrated results of computing join-aggregate queries defined over commutative semi-rings is the classic Yannakakis algorithm proposed in 1981. It is known that the runtime of the Yannakakis algorithm is $O(N + \OUT)$ for…