Databases
LLMs enable an exciting new class of data processing applications over large collections of unstructured documents. Several new programming frameworks have enabled developers to build these applications by composing them out of semantic…
Approximate $k$-nearest neighbor (AKNN) search is a fundamental problem with wide applications. To reduce memory and accelerate search, vector quantization is widely adopted. However, existing quantization methods either rely on codebooks…
Vector databases have become a cornerstone of modern information retrieval, powering applications in recommendation, search, and retrieval-augmented generation (RAG) pipelines. However, scaling approximate nearest neighbor (ANN) search to…
Machine learning models depend critically on feature quality, yet useful features are often scattered across multiple relational tables. Feature augmentation enriches a base table by discovering and integrating features from related tables…
Large Language Models have recently shown impressive capabilities in reasoning and code generation, making them promising tools for natural language interfaces to relational databases. However, existing approaches often fail to generalize…
Cross-domain data integration drives interdisciplinary data reuse and knowledge transfer across domains. However, each discipline maintains its own metadata schemas and domain ontologies, employing distinct conceptual models and application…
As artificial intelligence grows in popularity, vectors have become one of the most widely used data structures for services such as information retrieval and recommendation. Approximate Nearest Neighbor Search (ANNS), which generally…
NoSQL databases are widely used in modern applications due to their scalability and schema flexibility, yet they often rely on eventual consistency models that limit reliable transaction processing. This study proposes a four-stage…
Location-based services rely heavily on efficient methods that search for relevant points-of-interest (POIs) near a given location. A k Nearest Neighbor (kNN) query is one such example that finds the k closest POIs from an agent's location.…
In data mining, a critical objective is the discovery of sequential rules that exhibit both high utility and strong confidence, which are valuable in real-world…
Utility-driven mining is an essential task in data science, as it can provide deeper insight into the real world. High-utility sequential rule mining (HUSRM) aims at discovering sequential rules with high utility and high confidence. It can…
For 10 years now, Action Learning has introduced employees of the University of Angers and of private and public companies to database design through projects financed by professional organizations. These innovative training…
The rise of cryptocurrencies like Bitcoin and Ethereum has driven interest in blockchain database technology, with smart contracts enabling the growth of decentralized finance (DeFi). However, research has shown that adversaries exploit…
Modern blockchain applications benefit from the ability to specify sequencing constraints on the transactions that interact with them. This paper proposes a principled and axiomatically justified way of adding sequencing constraints on…
Entity Resolution (ER) is a critical task for data integration, yet state-of-the-art supervised deep learning models remain impractical for many real-world applications due to their need for massive, expensive-to-obtain labeled datasets.…
Modern storage systems, often deployed to support multiple tenants in the cloud, must provide performance isolation. Unfortunately, traditional approaches such as fair sharing do not provide performance isolation for storage systems,…
DBTuneSuite is a suite of experiments on four widely deployed free database systems to test their performance under various query/upsert loads and under various tuning options. The suite provides: (i) scripts to generate data and to install…
Key-value stores underpin a wide range of applications due to their simplicity and efficiency. Log-Structured Merge Trees (LSM-trees) dominate as their underlying structure, excelling at handling rapidly growing data. Recent research has…
Subset repair is an important data cleaning technique that enforces integrity constraints by deleting a minimal number of conflicting tuples, yet multiple minimal repairs often exist. Density-based methods address this ambiguity by favoring…
Data lakes have emerged as a flexible and scalable solution for storing and analyzing large volumes of heterogeneous data, including structured, semi-structured, and unstructured formats. Despite their growing adoption in both industry and…