Databases - Scifaro

Abacus: A Cost-Based Optimizer for Semantic Operator Systems

LLMs enable an exciting new class of data processing applications over large collections of unstructured documents. Several new programming frameworks have enabled developers to build these applications by composing them out of semantic…

Databases · Computer Science 2026-02-04 Matthew Russo, Chunwei Liu, Sivaprasad Sudhir, Gerardo Vitagliano, Michael Cafarella, Tim Kraska, Samuel Madden

Quantization Meets Projection: A Happy Marriage for Approximate k-Nearest Neighbor Search

Approximate $k$-nearest neighbor (AKNN) search is a fundamental problem with wide applications. To reduce memory and accelerate search, vector quantization is widely adopted. However, existing quantization methods either rely on codebooks…

Databases · Computer Science 2026-02-04 Mingyu Yang, Liuchang Jing, Wentao Li, Wei Wang

QVCache: A Query-Aware Vector Cache

Vector databases have become a cornerstone of modern information retrieval, powering applications in recommendation, search, and retrieval-augmented generation (RAG) pipelines. However, scaling approximate nearest neighbor (ANN) search to…

Databases · Computer Science 2026-02-03 Anıl Eren Göçer, Ioanna Tsakalidou, Hamish Nicholson, Kyoungmin Kim, Anastasia Ailamaki

Hippasus: Effective and Efficient Automatic Feature Augmentation for Machine Learning Tasks on Relational Data

Machine learning models depend critically on feature quality, yet useful features are often scattered across multiple relational tables. Feature augmentation enriches a base table by discovering and integrating features from related tables…

Databases · Computer Science 2026-02-03 Serafeim Papadias, Kostas Patroumpas, Dimitrios Skoutas

SQLAgent: Learning to Explore Before Generating as a Data Engineer

Large Language Models have recently shown impressive capabilities in reasoning and code generation, making them promising tools for natural language interfaces to relational databases. However, existing approaches often fail to generalize…

Databases · Computer Science 2026-02-03 Wenjia Jiang, Yiwei Wang, Boyan Han, Joey Tianyi Zhou, Chi Zhang

ChemDCAT-AP: Enabling Semantic Interoperability with a Contextual Extension of DCAT-AP

Cross-domain data integration drives interdisciplinary data reuse and knowledge transfer across domains. However, each discipline maintains its own metadata schemas and domain ontologies, employing distinct conceptual models and application…

Databases · Computer Science 2026-02-03 Philip Stroemert, Hendrik Borgelt, David Linke, Mark Doerr, Bhavin Katabathuni, Oliver Koepler, Norbert Kockmann

Updatable Balanced Index for Stable Streaming Similarity Search over Large-Scale Fresh Vectors

As artificial intelligence gains more and more popularity, vectors are one of the most widely used data structures for services such as information retrieval and recommendation. Approximate Nearest Neighbor Search (ANNS), which generally…

Databases · Computer Science 2026-02-03 Yuhui Lai, Shixun Huang, Sheng Wang

A Scalable Transaction Management Framework for Consistent Document-Oriented NoSQL Databases

NoSQL databases are widely used in modern applications due to their scalability and schema flexibility, yet they often rely on eventual consistency models that limit reliable transaction processing. This study proposes a four-stage…

Databases · Computer Science 2026-02-03 Adam A. E. Alflahi, Mohammed A. Y. Mohammed, Abdallah Alsammani

COL-Trees: Efficient Hierarchical Object Search in Road Networks

Location-based services rely heavily on efficient methods that search for relevant points-of-interest (POIs) near a given location. A k Nearest Neighbor (kNN) query is one such example that finds the k closest POIs from an agent's location.…

Databases · Computer Science 2026-02-02 Tenindra Abeywickrama, Muhammad Aamir Cheema, Sabine Storandt

High-utility Sequential Rule Mining Utilizing Segmentation Guided by Confidence

Within the domain of data mining, one critical objective is the discovery of sequential rules with high utility. The goal is to discover sequential rules that exhibit both high utility and strong confidence, which are valuable in real-world…

Databases · Computer Science 2026-02-02 Chunkai Zhang, Jiarui Deng, Maohua Lyu, Wensheng Gan, Philip S. Yu

Discovering High-utility Sequential Rules with Increasing Utility Ratio

Utility-driven mining is an essential task in data science, as it can provide deeper insight into the real world. High-utility sequential rule mining (HUSRM) aims at discovering sequential rules with high utility and high confidence. It can…

Databases · Computer Science 2026-02-02 Zhenqiang Ye, Wensheng Gan, Gengsen Huang, Tianlong Gu, Philip S. Yu

An innovating approach to teaching applied to database design. Improvement of Action Learning in Lifelong Learning

For 10 years now, Action Learning has allowed employees of the University of Angers and of private and public companies to be introduced to database design through projects financed by professional structures. These innovative training…

Databases · Computer Science 2026-02-02 Christophe Béchade

FairDAG: Consensus Fairness over Multi-Proposer Causal Design

The rise of cryptocurrencies like Bitcoin and Ethereum has driven interest in blockchain database technology, with smart contracts enabling the growth of decentralized finance (DeFi). However, research has shown that adversaries exploit…

Databases · Computer Science 2026-01-30 Dakai Kang, Junchao Chen, Tien Tuan Anh Dinh, Mohammad Sadoghi

The Monotone Priority System: Foundations of Contract-Specific Sequencing

Modern blockchain applications benefit from the ability to specify sequencing constraints on the transactions that interact with them. This paper proposes a principled and axiomatically justified way of adding sequencing constraints on…

Databases · Computer Science 2026-01-29 Naveen Durvasula

ALER: An Active Learning Hybrid System for Efficient Entity Resolution

Entity Resolution (ER) is a critical task for data integration, yet state-of-the-art supervised deep learning models remain impractical for many real-world applications due to their need for massive, expensive-to-obtain labeled datasets.…

Databases · Computer Science 2026-01-29 Dimitrios Karapiperis, Leonidas Akritidis, Panayiotis Bozanis, Vassilios Verykios

Delta Fair Sharing: Performance Isolation for Multi-Tenant Storage Systems

Modern storage systems, often deployed to support multiple tenants in the cloud, must provide performance isolation. Unfortunately, traditional approaches such as fair sharing do not provide performance isolation for storage systems,…

Databases · Computer Science 2026-01-29 Tyler Griggs, Soujanya Ponnapalli, Dev Bali, Wenjie Ma, James DeLoye, Audrey Cheng, Jaewan Hong, Natacha Crooks, Scott Shenker, Ion Stoica, Matei Zaharia

DBTuneSuite: An Extendible Experimental Suite to Test the Time Performance of Multi-layer Tuning Options on Database Management Systems

DBTuneSuite is a suite of experiments on four widely deployed free database systems to test their performance under various query/upsert loads and under various tuning options. The suite provides: (i) scripts to generate data and to install…

Databases · Computer Science 2026-01-29 Amani Agrawal, Tianxin Wang, Dennis Shasha

[Extended Version] ArceKV: Towards Workload-driven LSM-compactions for Key-Value Store Under Dynamic Workloads

Key-value stores underpin a wide range of applications due to their simplicity and efficiency. Log-Structured Merge Trees (LSM-trees) dominate as their underlying structure, excelling at handling rapidly growing data. Recent research has…

Databases · Computer Science 2026-01-29 Junfeng Liu, Haoxuan Xie, Siqiang Luo

Topology-Aware Subset Repair via Entropy-Guided Density and Graph Decomposition

Subset repair is an important data cleaning technique that enforces integrity constraints by deleting a minimal number of conflicting tuples, yet multiple minimal repairs often exist. Density-based methods address this ambiguity by favoring…

Databases · Computer Science 2026-01-28 Guoqi Zhao, Xixian Han, Xiaolong Wan

Create Benchmarks for Data Lakes

Data lakes have emerged as a flexible and scalable solution for storing and analyzing large volumes of heterogeneous data, including structured, semi-structured, and unstructured formats. Despite their growing adoption in both industry and…

Databases · Computer Science 2026-01-28 Yi Lyu, Pei-Chieh Lo, Natan Lidukhover