Databases - Scifaro

The Impact of Data Compression in Real-Time and Historical Data Acquisition Systems on the Accuracy of Analytical Solutions

In industrial and IoT environments, massive amounts of real-time and historical process data are continuously generated and archived. With sensors and devices capturing every operational detail, the volume of time-series data has become a…

Databases · Computer Science 2025-11-03 Reham Faqehi, Haya Alhuraib, Hamad Saiari, Zyad Bamigdad

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

LLM serving systems process heterogeneous query workloads where different categories exhibit different characteristics. Code queries cluster densely in embedding space while conversational queries distribute sparsely. Content staleness…

Databases · Computer Science 2025-11-03 Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen

RADAR: Benchmarking Language Models on Imperfect Tabular Data

Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and…

Databases · Computer Science 2025-11-03 Ken Gu, Zhihan Zhang, Kate Lin, Yuwei Zhang, Akshay Paruchuri, Hong Yu, Mehran Kazemi, Kumar Ayush, A. Ali Heydari, Maxwell A. Xu, Girish Narayanswamy, Yun Liu, Ming-Zher Poh, Yuzhe Yang, Mark Malhotra, Shwetak Patel, Hamid Palangi, Xuhai Xu, Daniel McDuff, Tim Althoff, Xin Liu

One Join Order Does Not Fit All: Reducing Intermediate Results with Per-Split Query Plans

Minimizing intermediate results is critical for efficient multi-join query processing. Although the seminal Yannakakis algorithm offers strong guarantees for acyclic queries, cyclic queries remain an open challenge. In this paper, we…

Databases · Computer Science 2025-10-30 Yujun He, Hangdong Zhao, Simon Frisk, Yifei Yang, Kevin Kristensen, Paraschos Koutris, Xiangyao Yu

StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems

Automatically configuring storage systems is hard: parameter spaces are large and conditions vary across workloads, deployments, and versions. Heuristic and ML tuners are often system specific, require manual glue, and degrade under…

Databases · Computer Science 2025-10-30 Qi Lin, Zhenyu Zhang, Viraj Thakkar, Zhenjie Sun, Mai Zheng, Zhichao Cao

ODataX: A Progressive Evolution of the Open Data Protocol

The Open Data Protocol (OData) provides a standardized approach for building and consuming RESTful APIs with rich query capabilities. Despite its power and maturity, OData adoption remains confined primarily to enterprise environments,…

Databases · Computer Science 2025-10-30 Anirudh Ganesh, Nitin Sood

Odyssey: An End-to-End System for Pareto-Optimal Serverless Query Processing

Running data analytics queries on serverless (FaaS) workers has been shown to be cost- and performance-efficient for a variety of real-world scenarios, including intermittent query arrival patterns, sudden load spikes and management…

Databases · Computer Science 2025-10-30 Shyam Jesalpura, Shengda Zhu, Amir Shaikhha, Antonio Barbalace, Boris Grot

Evaluating Joinable Column Discovery Approaches for Context-Aware Search

Joinable Column Discovery is a critical challenge in automating enterprise data analysis. While existing approaches focus on syntactic overlap and semantic similarity, there remains limited understanding of which methods perform best for…

Databases · Computer Science 2025-10-29 Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Haritha Ananthakrishnan, Oktie Hassanzadeh, Horst Samulowitz, Kavitha Srinivas

Dynamically Detect and Fix Hardness for Efficient Approximate Nearest Neighbor Search

Approximate Nearest Neighbor Search (ANNS) has become a fundamental component in many real-world applications. Among various ANNS algorithms, graph-based methods are state-of-the-art. However, ANNS often suffers from a significant drop in…

Databases · Computer Science 2025-10-28 Zhiyuan Hua, Qiji Mo, Zebin Yao, Lixiao Cui, Xiaoguang Liu, Gang Wang, Zijing Wei, Xinyu Liu, Tianxiao Tang, Shaozhi Liu, Lin Qu

Determining Window Sizes using Species Estimation for Accurate Process Mining over Streams

Streaming process mining deals with the real-time analysis of event streams. A common approach for it is to adopt windowing mechanisms that select event data from a stream for subsequent analysis. However, the size of these windows denotes…

Databases · Computer Science 2025-10-28 Christian Imenkamp, Martin Kabierski, Hendrik Reiter, Matthias Weidlich, Wilhelm Hasselbring, Agnes Koschmider

Leveraging Approximate Caching for Faster Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) improves the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since looking for relevant documents from large…

Databases · Computer Science 2025-10-28 Shai Bergman, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos, Ji Zhang

A Unified Approach for Multi-Granularity Search over Spatial Datasets

There has been increased interest in data search as a means to find relevant datasets or data points in data lakes and repositories. Although approaches have been proposed to support spatial dataset search and data point search, they…

Databases · Computer Science 2025-10-28 Wenzhe Yang, Sheng Wang, Shixun Huang, Hao Liu, Yuan Sun, Juliana Freire, Zhiyong Peng

SurVigilance: An Application for Accessing Global Pharmacovigilance Data

Even though several publicly accessible pharmacovigilance databases are available, extracting data from them is a technically challenging process. Existing tools typically focus on a single database. We present SurVigilance, an open-source…

Databases · Computer Science 2025-10-27 Raktim Mukhopadhyay, Marianthi Markatou

World-POI: Global Point-of-Interest Data Enriched from Foursquare and OpenStreetMap as Tabular and Graph Data

Recently, Foursquare released a global dataset with more than 100 million points of interest (POIs), each representing a real-world business on its platform. However, many entries lack complete metadata such as addresses or categories, and…

Databases · Computer Science 2025-10-27 Hossein Amiri, Mohammad Hashemi, Andreas Züfle

Transformer-Gather, Fuzzy-Reconsider: A Scalable Hybrid Framework for Entity Resolution

Entity resolution plays a significant role in enterprise systems where data integrity must be rigorously maintained. Traditional methods often struggle with handling noisy data or semantic understanding, while modern methods suffer from…

Databases · Computer Science 2025-10-27 Mohammadreza Sharifi, Danial Ahmadzadeh

The billboard advertisement has emerged as an effective out-of-home advertisement technique where the objective is to choose a limited number of slots to play some advertisement content (e.g., animation, video, etc.) with the hope that the…

Databases · Computer Science 2025-10-24 Dildar Ali, Suman Banerjee, Yamuna Prasad

An Empirical Study on Database Usage in Microservices

Microservices architectures are an integral part of modern software development. Their adoption brings significant changes to database management. Instead of relying on a single database, a microservices architecture is typically composed…

Databases · Computer Science 2025-10-24 Maxime André, Marco Raglianti, Souhaila Serbout, Anthony Cleve, Michele Lanza

Hybrid Mixed Integer Linear Programming for Large-Scale Join Order Optimisation

Finding optimal join orders is among the most crucial steps to be performed by query optimisers. Though extensively studied in data management research, the problem remains far from solved: While query optimisers rely on exhaustive search…

Databases · Computer Science 2025-10-24 Manuel Schönberger, Immanuel Trummer, Wolfgang Mauerer

RAG-Stack: Co-Optimizing RAG Quality and Performance From the Vector Database Perspective

Retrieval-augmented generation (RAG) has emerged as one of the most prominent applications of vector databases. By integrating documents retrieved from a database into the prompt of a large language model (LLM), RAG enables more reliable…

Databases · Computer Science 2025-10-24 Wenqi Jiang

UREM: A High-performance Unified and Resilient Enhancement Method for Multi- and High-Dimensional Indexes

Numerous multi- or high-dimensional indexes with distinct advantages have been proposed on various platforms to meet application requirements. To achieve higher-performance queries, most indexes employ enhancement methods, including…

Databases · Computer Science 2025-10-24 Ming Sheng, Shuliang Wang, Yong Zhang, Yi Luo, Xianbo Liu, Zeming Li