Databases - Scifaro

A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data

The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leveraging GPU parallelism for this task…

Databases · Computer Science 2025-11-12 Zheng Li, Weiyan Wang, Ruiyuan Li, Chao Chen, Xianlei Long, Linjiang Zheng, Quanqing Xu, Chuanhui Yang

Towards a Multimodal Stream Processing System

In this paper, we present a vision for a new generation of multimodal streaming systems that embed MLLMs as first-class operators, enabling real-time query processing across multiple modalities. Achieving this is non-trivial: while recent…

Databases · Computer Science 2025-11-12 Uélison Jean Lopes dos Santos, Alessandro Ferri, Szilard Nistor, Riccardo Tommasini, Carsten Binnig, Manisha Luthra

Trading Vector Data in Vector Databases

Vector data trading is essential for cross-domain learning with vector databases, yet it remains largely unexplored. We study this problem under online learning, where sellers face uncertain retrieval costs and buyers provide stochastic…

Databases · Computer Science 2025-11-11 Jin Cheng, Xiangxiang Dai, Ningning Ding, John C. S. Lui, Jianwei Huang

OntoTune: Ontology-Driven Learning for Query Optimization with Convolutional Models

Query optimization has been studied using machine learning, reinforcement learning, and, more recently, graph-based convolutional networks. Ontology, as a structured, information-rich knowledge representation, can provide context,…

Databases · Computer Science 2025-11-11 Songhui Yue, Yang Shao, Sean Hayes

A Multi-Agent System for Semantic Mapping of Relational Data to Knowledge Graphs

Enterprises often maintain multiple databases for storing critical business data in siloed systems, resulting in inefficiencies and challenges with data interoperability. A key to overcoming these challenges lies in integrating disparate…

Databases · Computer Science 2025-11-11 Milena Trajanoska, Riste Stojanov, Dimitar Trajanov

MemoriesDB: A Temporal-Semantic-Relational Database for Long-Term Agent Memory / Modeling Experience as a Graph of Temporal-Semantic Surfaces

We introduce MemoriesDB, a unified data architecture designed to avoid decoherence across time, meaning, and relation in long-term computational memory. Each memory is a time-semantic-relational entity, a structure that simultaneously…

Databases · Computer Science 2025-11-11 Joel Ward

Don't Forget Range Delete! Enhancing LSM-based Key-Value Stores with More Compatible Lookups and Deletes

LSM-trees are featured by out-of-place updates, where key deletion is handled by inserting a tombstone to mark its staleness instead of removing it in place. This defers actual removal to compactions with greatly reduced overhead. However,…

Databases · Computer Science 2025-11-11 Fan Wang, Dingheng Mo, Siqiang Luo

RF-Behavior: A Multimodal Radio-Frequency Dataset for Human Behavior and Emotion Analysis

Recent research has demonstrated the complementary nature of camera-based and inertial data for modeling human gestures, activities, and sentiment. Yet, despite its growing importance for environmental sensing as well as the advance of…

Databases · Computer Science 2025-11-11 Si Zuo, Yuqing Song, Sahar Golipoor, Ying Liu, Xujun Ma, Stephan Sigg

ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression

Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques -- such as deduplication and…

Databases · Computer Science 2025-11-11 Zirui Wang, Tingfeng Lan, Zhaoyuan Su, Juncheng Yang, Yue Cheng

GPC: A Pattern Calculus for Property Graphs

The development of practical query languages for graph databases runs well ahead of the underlying theory. The ISO committee in charge of database query languages is currently developing a new standard called Graph Query Language (GQL) as…

Databases · Computer Science 2025-11-11 Nadime Francis, Amélie Gheerbrant, Paolo Guagliardo, Leonid Libkin, Victor Marsault, Wim Martens, Filip Murlak, Liat Peterfreund, Alexandra Rogova, Domagoj Vrgoč

An Efficient Proximity Graph-based Approach to Table Union Search

Neural embedding models are extensively employed in the table union search problem, which aims to find semantically compatible tables that can be merged with a given query table. In particular, multi-vector models, which represent a table…

Databases · Computer Science 2025-11-10 Yiming Xie, Hua Dai, Mingfeng Jiang, Pengyue Li, Zhengkai Zhang, Bohan Li

L2T-Tune: LLM-Guided Hybrid Database Tuning with LHS and TD3

Configuration tuning is critical for database performance. Although recent advancements in database tuning have shown promising results in throughput and latency improvement, challenges remain. First, the vast knob space makes direct…

Databases · Computer Science 2025-11-10 Xinyue Yang, Chen Zheng, Yaoyang Hou, Renhao Zhang, Yinyan Zhang, Yanjun Wu, Heng Zhang

SHARP: Shared State Reduction for Efficient Matching of Sequential Patterns

The detection of sequential patterns in data is a basic functionality of modern data processing systems for complex event processing (CEP), OLAP, and retrieval-augmented generation (RAG). In practice, pattern matching is challenging, since…

Databases · Computer Science 2025-11-07 Cong Yu, Tuo Shi, Matthias Weidlich, Bo Zhao

TCSR-SQL: Towards Table Content-aware Text-to-SQL with Self-retrieval

Large Language Model-based (LLM-based) Text-to-SQL methods have achieved important progress in generating SQL queries for real-world applications. When confronted with table content-aware questions in real-world scenarios, ambiguous data…

Databases · Computer Science 2025-11-07 Wenbo Xu, Liang Yan, Chuanyi Liu, Peiyi Han, Haifeng Zhu, Yong Xu, Yingwei Liang, Bob Zhang

Analytical Queries for Unstructured Data

Unstructured data, in the form of text, images, video, and audio, is produced at exponentially higher rates. In tandem, machine learning (ML) methods have become increasingly powerful at analyzing unstructured data. Modern ML methods can…

Databases · Computer Science 2025-11-06 Daniel Kang

In-Memory Indexing and Querying of Provenance in Data Preparation Pipelines

Data provenance has numerous applications in the context of data preparation pipelines. It can be used for debugging faulty pipelines, interpreting results, verifying fairness, and identifying data quality issues, which may affect the…

Databases · Computer Science 2025-11-06 Khalid Belhajjame, Haroun Mezrioui, Yuyan Zhao

Formalizing ETLT and ELTL Design Patterns and Proposing Enhanced Variants: A Systematic Framework for Modern Data Engineering

Traditional ETL and ELT design patterns struggle to meet modern requirements of scalability, governance, and real-time data processing. Hybrid approaches such as ETLT (Extract-Transform-Load-Transform) and ELTL (Extract-Load-Transform-Load)…

Databases · Computer Science 2025-11-06 Chiara Rucco, Motaz Saad, Antonella Longo

Differentially Private Data Generation with Missing Data

Despite several works that succeed in generating synthetic data with differential privacy (DP) guarantees, they are inadequate for generating high-quality synthetic data when the input data has missing values. In this work, we formalize the…

Databases · Computer Science 2025-11-06 Shubhankar Mohapatra, Jianqiao Zong, Florian Kerschbaum, Xi He

Relational Deep Dive: Error-Aware Queries Over Unstructured Data

Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to…

Databases · Computer Science 2025-11-05 Daren Chao, Kaiwen Chen, Naiqing Guan, Nick Koudas

EasyTUS: A Comprehensive Framework for Fast and Accurate Table Union Search across Data Lakes

Data lakes enable easy maintenance of heterogeneous data in its native form. While this flexibility can accelerate data ingestion, it shifts the complexity of data preparation and query processing to data discovery tasks. One such task is…

Databases · Computer Science 2025-11-05 Tim Otto