数据库
We propose a bounds-only pruning test for exact Euclidean AkNN joins on partitioned spatial datasets. Data warehouses commonly partition large tables and store row group statistics for them to accelerate searches and joins, rather than…
The current landscape of AI for Science (AI4S) is predominantly anchored in large-scale textual corpora, where generative AI systems excel at hypothesis generation, literature search, and multi-modal reasoning. However, a critical…
Prescriptive process monitoring seeks to recommend actions that improve process outcomes by analyzing possible continuations of ongoing cases. A key obstacle is the heavy computational cost of large-scale suffix comparisons, which grows…
Large language models can already query databases, yet most existing systems remain reactive: they rely on explicit user prompts and do not actively explore data. We introduce DAR (Data Agnostic Researcher), a multi-agent system that…
Massive data streams from IoT and cyber-physical systems must be processed under strict bandwidth, latency, and resource constraints. Generalized Deduplication (GD) is a promising lossless compression framework, as it supports random access…
Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in…
Locality-sensitive hashing (LSH) is a well-known solution for approximate nearest neighbor (ANN) search in high-dimensional spaces due to its robust theoretical guarantee on query accuracy. Traditional LSH-based methods mainly focus on…
Time series data are central to domains such as finance, healthcare, and cloud computing, yet existing benchmarks for evaluating various large language models (LLMs) on temporal tasks remain scattered and unsystematic. To bridge this gap,…
In their hunt for highlights, i.e., interesting patterns in the data, data analysts have to issue groups of related queries and manually combine their results. To the extent that the analyst's goals are based on an intention on what to…
In GPU-accelerated data analytics, the overhead of data transfer from CPU to GPU becomes a performance bottleneck when the data scales beyond GPU memory capacity due to the limited PCIe bandwidth. Data compression has come to rescue for…
Automatically inferring join relationships is a critical task for effective data discovery, integration, querying and reuse. However, accurately and efficiently identifying these relationships in large and complex schemas can be…
Knowledge Graph (KG) processing faces critical infrastructure challenges in selecting optimal NoSQL database paradigms, as traditional performance evaluations rely on static benchmarks that fail to capture the complexity of real-world KG…
The growing demand for database systems capable of efficiently managing massive datasets while delivering real-time transaction processing and advanced analytical capabilities has become critical in modern data infrastructure. While…
Data preparation, which aims to transform heterogeneous and noisy raw tables into analysis-ready data, remains a major bottleneck in data science. Recent approaches leverage large language models (LLMs) to automate data preparation from…
Existing learned query optimizers remain ill-suited to modern distributed, multi-tenant data warehouses due to idealized modeling assumptions and design choices. Using Alibaba's MaxCompute as a representative, we surface four fundamental,…
Data wrangling, the process of cleaning, transforming, and preparing data for analysis, is a well-known bottleneck in data science workflows. A wide range of data wrangling techniques have been proposed to mitigate this challenge. Of…
Understanding how data quality aligns with regulatory requirements in machine learning (ML) systems presents a critical challenge for practitioners navigating the evolving EU regulatory landscape. To address this, we first propose a…
LLM-augmented data systems enable semantic querying over structured and unstructured data, but executing queries with LLM-powered operators introduces a fundamental runtime-accuracy trade-off. In this paper, we present Stretto, a new…
Entity matching (EM) is a fundamental task in data integration and analytics, essential for identifying records that refer to the same real-world entity across diverse sources. In practice, datasets often differ widely in structure, format,…
Ensuring data quality in machine learning (ML) systems has become increasingly complex as regulatory requirements expand. In the European Union (EU), frameworks such as the General Data Protection Regulation (GDPR) and the Artificial…