Databases - Scifaro

TableVault: Managing Dynamic Data Collections for LLM-Augmented Workflows

Large Language Models (LLMs) have emerged as powerful tools for automating and executing complex data tasks. However, their integration into more complex data workflows introduces significant management challenges. In response, we present…

Databases · Computer Science 2025-06-24 Jinjin Zhao, Sanjay Krishnan

Fast Capture of Cell-Level Provenance in Numpy

Effective provenance tracking enhances reproducibility, governance, and data quality in array workflows. However, significant challenges arise in capturing this provenance, including: (1) rapidly evolving APIs, (2) diverse operation types,…

Databases · Computer Science 2025-06-24 Jinjin Zhao, Sanjay Krishnan

Learning Lineage Constraints for Data Science Operations

Data science workflows often integrate functionalities from a diverse set of libraries and frameworks. Tasks such as debugging require data lineage that crosses library boundaries. The problem is that the way that "lineage" is represented…

Databases · Computer Science 2025-06-24 Jinjin Zhao

Dual-Hierarchy Labelling: Scaling Up Distance Queries on Dynamic Road Networks

Computing the shortest-path distance between any two given vertices in road networks is an important problem. A tremendous amount of research has been conducted to address this problem, most of which is limited to static road networks.…

Databases · Computer Science 2025-06-24 Muhammad Farhan, Henning Koehler, Qing Wang

Lower Bounds for Conjunctive Query Evaluation

In this tutorial, we will survey known results on the complexity of conjunctive query evaluation in different settings, ranging from Boolean queries over counting to more complex models like enumeration and direct access. A particular focus…

Databases · Computer Science 2025-06-24 Stefan Mengel

Transient Concepts in Streaming Graphs

Concept Drift (CD) occurs when a change in a hidden context can induce changes in a target concept. CD is a natural phenomenon in non-stationary settings such as data streams. Understanding, detection, and adaptation to CD in streaming data…

Databases · Computer Science 2025-06-24 Aida Sheshbolouki, M. Tamer Ozsu

DCMF: A Dynamic Context Monitoring and Caching Framework for Context Management Platforms

The rise of context-aware IoT applications has increased the demand for timely and accurate context information. Context is derived by aggregating and inferring from dynamic IoT data, making it highly volatile and posing challenges in…

Databases · Computer Science 2025-06-24 Ashish Manchanda, Prem Prakash Jayaraman, Abhik Banerjee, Kaneez Fizza, Arkady Zaslavsky

EnhanceGraph: A Continuously Enhanced Graph-based Index for High-dimensional Approximate Nearest Neighbor Search

Recently, Approximate Nearest Neighbor Search in high-dimensional vector spaces has garnered considerable attention due to the rapid advancement of deep learning techniques. We observed that a substantial amount of search and construction…

Databases · Computer Science 2025-06-24 Xiaoyao Zhong, Jiabao Jin, Peng Cheng, Mingyu Yang, Haoyang Li, Zhitao Shen, Heng Tao Shen, Jingkuan Song

LaPuda: LLM-Enabled Policy-Based Query Optimizer for Multi-modal Data

Large language models (LLMs) have marked a pivotal moment in the field of machine learning and deep learning. Recently, their capability for query planning has been investigated, including both single-modal and multi-modal queries. However, there…

Databases · Computer Science 2025-06-24 Yifan Wang, Haodi Ma, Daisy Zhe Wang

When Large Language Models Meet Vector Databases: A Survey

This survey explores the synergistic potential of Large Language Models (LLMs) and Vector Databases (VecDBs), a burgeoning but rapidly evolving research area. With the proliferation of LLMs comes a host of challenges, including…

Databases · Computer Science 2025-06-24 Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang

PUL: Pre-load in Software for Caches Wouldn't Always Play Along

Memory latencies and bandwidth are major factors limiting system performance and scalability. Modern CPUs aim at hiding latencies by employing large caches, out-of-order execution, or complex hardware prefetchers. However, software-based…

Databases · Computer Science 2025-06-23 Arthur Bernhardt, Sajjad Tamimi, Florian Stock, Andreas Koch, Ilia Petrov

Advancing Fact Attribution for Query Answering: Aggregate Queries and Novel Algorithms

In this paper, we introduce a novel approach to computing the contribution of input tuples to the result of the query, quantified by the Banzhaf and Shapley values. In contrast to prior algorithmic work that focuses on…

Databases · Computer Science 2025-06-23 Omer Abramovich, Daniel Deutch, Nave Frost, Ahmet Kara, Dan Olteanu

PBench: Workload Synthesizer with Real Statistics for Cloud Analytics Benchmarking

Cloud service providers commonly use standard benchmarks like TPC-H and TPC-DS to evaluate and optimize cloud data analytics systems. However, these benchmarks rely on fixed query patterns and fail to capture the real execution statistics…

Databases · Computer Science 2025-06-23 Yan Zhou, Chunwei Liu, Bhuvan Urgaonkar, Zhengle Wang, Magnus Mueller, Chao Zhang, Songyue Zhang, Pascal Pfeil, Dominik Horn, Zhengchun Liu, Davide Pagano, Tim Kraska, Samuel Madden, Ju Fan

Data-Agnostic Cardinality Learning from Imperfect Workloads

Cardinality estimation (CardEst) is a critical aspect of query optimization. Traditionally, it leverages statistics built directly over the data. However, organizational policies (e.g., regulatory compliance) may restrict global data…

Databases · Computer Science 2025-06-23 Peizhi Wu, Rong Kang, Tieying Zhang, Jianjun Chen, Ryan Marcus, Zachary G. Ives

Filter-Centric Vector Indexing: Geometric Transformation for Efficient Filtered Vector Search

The explosive growth of vector search applications demands efficient handling of combined vector similarity and attribute filtering, a challenge where current approaches force an unsatisfying choice between performance and accuracy. We…

Databases · Computer Science 2025-06-23 Alireza Heidari, Wei Zhang

Empowering Graph-based Approximate Nearest Neighbor Search with Adaptive Awareness Capabilities

Approximate Nearest Neighbor Search (ANNS) in high-dimensional spaces finds extensive applications in databases, information retrieval, recommender systems, etc. While graph-based methods have emerged as the leading solution for ANNS due to…

Databases · Computer Science 2025-06-23 Jiancheng Ruan, Tingyang Chen, Renchi Yang, Xiangyu Ke, Yunjun Gao

Delta: A Learned Mixed Cost-based Query Optimization Framework

The query optimizer is a crucial module for database management systems. Existing optimizers exhibit two flawed paradigms: (1) cost-based optimizers use dynamic programming with cost models but face search space explosion and heuristic pruning…

Databases · Computer Science 2025-06-23 Jiazhen Peng, Zheng Qu, Xiaoye Miao, Rong Zhu

AQETuner: Reliable Query-level Configuration Tuning for Analytical Query Engines

Modern analytical query engines (AQEs) are essential for large-scale data analysis and processing. These systems usually provide numerous query-level tunable knobs that significantly affect individual query performance. While several…

Databases · Computer Science 2025-06-23 Lixiang Chen, Yuxing Han, Yu Chen, Xing Chen, Chengcheng Yang, Weining Qian

Pruning in Snowflake: Working Smarter, Not Harder

Modern cloud-based data analytics systems must efficiently process petabytes of data residing on cloud storage. A key optimization technique in state-of-the-art systems like Snowflake is partition pruning - skipping chunks of data that do…

Databases · Computer Science 2025-06-23 Andreas Zimmerer, Damien Dam, Jan Kossmann, Juliane Waack, Ismail Oukid, Andreas Kipf

Selective Use of Yannakakis' Algorithm to Improve Query Performance: Machine Learning to the Rescue

Query optimization has played a central role in database research for decades. However, more often than not, the proposed optimization techniques lead to a performance improvement in some, but not in all, situations. Therefore, we urgently…

Databases · Computer Science 2025-06-23 Daniela Böhm, Georg Gottlob, Matthias Lanzinger, Davide Longo, Cem Okulmus, Reinhard Pichler, Alexander Selzer