Databases - Scifaro

Optimal Bounds-Only Pruning for Spatial AkNN Joins

We propose a bounds-only pruning test for exact Euclidean AkNN joins on partitioned spatial datasets. Data warehouses commonly partition large tables and store row group statistics for them to accelerate searches and joins, rather than…

Databases · Computer Science 2026-02-11 Dominik Winecki

SciDataCopilot: An Agentic Data Preparation Framework for AGI-driven Scientific Discovery

The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical…

Databases · Computer Science 2026-02-11 Jiyong Rao, Yicheng Qiu, Jiahui Zhang, Juntao Deng, Shangquan Sun, Fenghua Ling, Hao Chen, Nanqing Dong, Zhangyang Gao, Siqi Sun, Yuqiang Li, Dongzhan Zhou, Guangyu Wang, Lijun Wu, Conghui He, Xuhong Wang, Jing Shao, Xiang Liu, Yu Zhu, Mianxin Liu, Qihao Zheng, Yinghui Zhang, Jiamin Wu, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Bo Zhang, Wanli Ouyang, Runkai Zhao, Chunfeng Song, Lei Bai, Chi Zhang

Efficient Distance Pruning for Process Suffix Comparison in Prescriptive Process Monitoring

Prescriptive process monitoring seeks to recommend actions that improve process outcomes by analyzing possible continuations of ongoing cases. A key obstacle is the heavy computational cost of large-scale suffix comparisons, which grows…

Databases · Computer Science 2026-02-11 Sarra Madad

Beyond Text-to-SQL: Autonomous Research-Driven Database Exploration with DAR

Large language models can already query databases, yet most existing systems remain reactive: they rely on explicit user prompts and do not actively explore data. We introduce DAR (Data Agnostic Researcher), a multi-agent system that…

Databases · Computer Science 2026-02-11 Ostap Vykhopen, Viktoria Skorik, Maksym Tereshchenko, Veronika Solopova

EntroGD: Scalable Generalized Deduplication for Efficient Direct Analytics on Compressed IoT Data

Massive data streams from IoT and cyber-physical systems must be processed under strict bandwidth, latency, and resource constraints. Generalized Deduplication (GD) is a promising lossless compression framework, as it supports random access…

Databases · Computer Science 2026-02-11 Xiaobo Zhao, Daniel E. Lucani

MAPS: A Multilingual Benchmark for Agent Performance and Security

Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in…

Databases · Computer Science 2026-02-11 Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Trisha Starostina, Kelly Marchisio, Seraphina Goldfarb-Tarrant, Roman Vainshtein

DET-LSH: A Locality-Sensitive Hashing Scheme with Dynamic Encoding Tree for Approximate Nearest Neighbor Search

Locality-sensitive hashing (LSH) is a well-known solution for approximate nearest neighbor (ANN) search in high-dimensional spaces due to its robust theoretical guarantee on query accuracy. Traditional LSH-based methods mainly focus on…

Databases · Computer Science 2026-02-11 Jiuqi Wei, Botao Peng, Xiaodong Lee, Themis Palpanas

MMTS-BENCH: A Comprehensive Benchmark for Time Series Understanding and Reasoning

Time series data are central to domains such as finance, healthcare, and cloud computing, yet existing benchmarks for evaluating various large language models (LLMs) on temporal tasks remain scattered and unsystematic. To bridge this gap,…

Databases · Computer Science 2026-02-10 Yao Yin, Zhenyu Xiao, Musheng Li, Yiwen Liu, Sutong Nan, Yiting He, Ruiqi Wang, Zhenwei Zhang, Qingmin Liao, Yuantao Gu

Semantics and Multi-Query Optimization Algorithms for the Analyze Operator

In their hunt for highlights, i.e., interesting patterns in the data, data analysts have to issue groups of related queries and manually combine their results. To the extent that the analyst's goals are based on an intention on what to…

Databases · Computer Science 2026-02-10 Marios Iakovidis, Panos Vassiliadis

ZipFlow: a Compiler-based Framework to Unleash Compressed Data Movement for Modern GPUs

In GPU-accelerated data analytics, the overhead of data transfer from CPU to GPU becomes a performance bottleneck when the data scales beyond GPU memory capacity due to the limited PCIe bandwidth. Data compression has come to the rescue for…

Databases · Computer Science 2026-02-10 Gwangoo Yeo, Zhiyang Shen, Wei Cui, Matteo Interlandi, Rathijit Sen, Bailu Ding, Qi Chen, Minsoo Rhu

Nexus: Inferring Join Graphs from Metadata Alone via Iterative Low-Rank Matrix Completion

Automatically inferring join relationships is a critical task for effective data discovery, integration, querying and reuse. However, accurately and efficiently identifying these relationships in large and complex schemas can be…

Databases · Computer Science 2026-02-10 Tianji Cong, Yuanyuan Tian, Andreas Mueller, Rathijit Sen, Yeye He, Fotis Psallidas, Shaleen Deep, H. V. Jagadish

How to evaluate NoSQL Database Paradigms for Knowledge Graph Processing

Knowledge Graph (KG) processing faces critical infrastructure challenges in selecting optimal NoSQL database paradigms, as traditional performance evaluations rely on static benchmarks that fail to capture the complexity of real-world KG…

Databases · Computer Science 2026-02-10 Rosario Napoli, Antonio Celesti, Massimo Villari, Maria Fazio

Building an OceanBase-based Distributed Nearly Real-time Analytical Processing Database System

The growing demand for database systems capable of efficiently managing massive datasets while delivering real-time transaction processing and advanced analytical capabilities has become critical in modern data infrastructure. While…

Databases · Computer Science 2026-02-10 Quanqing Xu, Chuanhui Yang, Ruijie Li, Dongdong Xie, Hui Cao, Yi Xiao, Junquan Chen, Yanzuo Wang, Saitong Zhao, Fusheng Han, Bin Liu, Guoping Wang, Yuzhong Zhao, Mingqiang Zhuang

DeepPrep: An LLM-Powered Agentic System for Autonomous Data Preparation

Data preparation, which aims to transform heterogeneous and noisy raw tables into analysis-ready data, remains a major bottleneck in data science. Recent approaches leverage large language models (LLMs) to automate data preparation from…

Databases · Computer Science 2026-02-10 Meihao Fan, Ju Fan, Yuxin Zhang, Shaolei Zhang, Xiaoyong Du, Jie Song, Peng Li, Fuxin Jiang, Tieying Zhang, Jianjun Chen

Learned Query Optimizer in Alibaba MaxCompute: Challenges, Analysis, and Solutions

Existing learned query optimizers remain ill-suited to modern distributed, multi-tenant data warehouses due to idealized modeling assumptions and design choices. Using Alibaba's MaxCompute as a representative, we surface four fundamental,…

Databases · Computer Science 2026-02-10 Lianggui Weng, Dandan Liu, Wenzhuang Zhu, Rong Zhu, Junzheng Zheng, Bolin Ding, Zhiguo Zhang, Jingren Zhou

Towards Scalable Visual Data Wrangling via Direct Manipulation

Data wrangling, the process of cleaning, transforming, and preparing data for analysis, is a well-known bottleneck in data science workflows. A wide range of data wrangling techniques have been proposed to mitigate this challenge. Of…

Databases · Computer Science 2026-02-10 El Kindi Rezig, Mir Mahathir Mohammad, Nicolas Baret, Ricardo Mayerhofer, Andrew McNutt, Paul Rosen

Machine Learning Practitioners' Views on Data Quality in Light of EU Regulatory Requirements: A European Online Survey

Understanding how data quality aligns with regulatory requirements in machine learning (ML) systems presents a critical challenge for practitioners navigating the evolving EU regulatory landscape. To address this, we first propose a…

Databases · Computer Science 2026-02-09 Yichun Wang, Kristina Irion, Paul Groth, Hazar Harmouch

The Stretto Execution Engine for LLM-Augmented Data Systems

LLM-augmented data systems enable semantic querying over structured and unstructured data, but executing queries with LLM-powered operators introduces a fundamental runtime-accuracy trade-off. In this paper, we present Stretto, a new…

Databases · Computer Science 2026-02-09 Gabriele Sanmartino, Matthias Urban, Paolo Papotti, Carsten Binnig

Heterogeneity in Entity Matching: A Survey and Experimental Analysis

Entity matching (EM) is a fundamental task in data integration and analytics, essential for identifying records that refer to the same real-world entity across diverse sources. In practice, datasets often differ widely in structure, format,…

Databases · Computer Science 2026-02-09 Mohammad Hossein Moslemi, Amir Mousavi, Behshid Behkamal, Mostafa Milani

"Detective Work We Shouldn't Have to Do": Practitioner Challenges in Regulatory-Aligned Data Quality in Machine Learning Systems

Ensuring data quality in machine learning (ML) systems has become increasingly complex as regulatory requirements expand. In the European Union (EU), frameworks such as the General Data Protection Regulation (GDPR) and the Artificial…

Databases · Computer Science 2026-02-06 Yichun Wang, Kristina Irion, Paul Groth, Hazar Harmouch