Databases - Scifaro

Finding Locally Densest Subgraphs: Convex Programming with Edge and Triangle Density

Finding the densest subgraph (DS) from a graph is a fundamental problem in graph databases. The DS obtained, which reveals closely related entities, has been found to be useful in various application domains such as e-commerce, social…

Databases · Computer Science 2025-04-16 Yi Yang, Chenhao Ma, Reynold Cheng, Laks V. S. Lakshmanan, Xiaolin Han

Towards Robust Trajectory Embedding for Similarity Computation: When Triangle Inequality Violations in Distance Metrics Matter

Trajectory similarity is a cornerstone of trajectory data management and analysis. Traditional similarity functions often suffer from high computational complexity and a reliance on specific distance metrics, prompting a shift towards deep…

Databases · Computer Science 2025-04-16 Jianing Si, Haitao Yuan, Nan Jiang, Minxiao Chen, Xiao Ma, Shangguang Wang

Xpose: Bi-directional Engineering for Hidden Query Extraction

Query reverse engineering (QRE) aims to synthesize a SQL query to connect a given database and result instance. A recent variation of QRE is where an additional input, an opaque executable containing a ground-truth query, is provided, and…

Databases · Computer Science 2025-04-16 Ahana Pradhan, Jayant Haritsa

Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables

Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first…

Databases · Computer Science 2025-04-16 Qixu Chen, Yeye He, Raymond Chi-Wing Wong, Weiwei Cui, Song Ge, Haidong Zhang, Dongmei Zhang, Surajit Chaudhuri

ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines

Practitioners are increasingly turning to Extract-Load-Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent…

Databases · Computer Science 2025-04-16 Tengjun Jin, Yuxuan Zhu, Daniel Kang

Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables

Streaming data pipelines remain challenging and expensive to build and maintain, despite significant advancements in stronger consistency, event time semantics, and SQL support over the last decade. Persistent obstacles continue to hinder…

Databases · Computer Science 2025-04-15 Daniel Sotolongo, Daniel Mills, Tyler Akidau, Anirudh Santhiar, Attila-Péter Tóth, Ilaria Battiston, Ankur Sharma, Botong Huang, Boyuan Zhang, Dzmitry Pauliukevich, Enrico Sartorello, Igor Belianski, Ivan Kalev, Lawrence Benson, Leon Papke, Ling Geng, Matt Uhlar, Nikhil Shah, Niklas Semmler, Olivia Zhou, Saras Nowak, Sasha Lionheart, Till Merker, Vlad Lifliand, Wendy Grus, Yi Huang, Yiwen Zhu

Using Process Calculus for Optimizing Data and Computation Sharing in Complex Stateful Parallel Computations

We propose novel techniques that exploit data and computation sharing to improve the performance of complex stateful parallel computations, like agent-based simulations. Parallel computations are translated into behavioral equations, a…

Databases · Computer Science 2025-04-15 Zilu Tian, Dan Olteanu, Christoph Koch

A Categorical Unification for Multi-Model Data: Part II Categorical Algebra and Calculus

Multi-model databases are designed to store, manage, and query data in various models, such as relational, hierarchical, and graph data, simultaneously. In this paper, we provide a theoretical basis for querying categorical databases. We…

Databases · Computer Science 2025-04-15 Jiaheng Lu

Dupin: A Parallel Framework for Densest Subgraph Discovery in Fraud Detection on Massive Graphs (Technical Report)

Detecting fraudulent activities in financial and e-commerce transaction networks is crucial. One effective method for this is Densest Subgraph Discovery (DSD). However, deploying DSD methods in production systems faces substantial…

Databases · Computer Science 2025-04-15 Jiaxin Jiang, Siyuan Yao, Yuchen Li, Qiange Wang, Bingsheng He, Min Chen

Enhancing Productivity in Database Management Through AI: A Three-Phase Approach for Database

This paper presents a novel AI-powered framework designed to streamline database management and query optimization for PostgreSQL systems. Structured in three phases: Natural Language to SQL Translation, Query Execution and Analysis, and…

Databases · Computer Science 2025-04-15 Kushagra Parashar, Ajay Dev, Aditya Kumar, Darpan Khatri

Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System

Finding relevant tables among databases, lakes, and repositories is the first step in extracting value from data. Such a task remains difficult because assessing whether a table is relevant to a problem does not always depend only on its…

Databases · Computer Science 2025-04-15 Muhammad Imam Luthfi Balaka, David Alexander, Qiming Wang, Yue Gong, Adila Krisnadhi, Raul Castro Fernandez

Substitutability-Based Graph Node Pricing

In the era of data commodification, the pricing of graph data presents unique challenges that differ significantly from traditional data markets. This paper addresses the critical issue of node pricing within graph structures, an area that…

Databases · Computer Science 2025-04-15 Huiju Wang, Yuanyuan Gao, Zhengkui Wang, Xiao Yue

A Formalism and Library for Database Visualization

Existing data visualization formalisms are restricted to single-table inputs, which makes existing visualization grammars like Vega-lite or ggplot2 tedious to use, overly complex in their APIs, and unsound when visualizing multi-table data.…

Databases · Computer Science 2025-04-15 Eugene Wu, Xiang Yu Tuang, Antonio Li, Vareesh Bainwala

Circuits and Formulas for Datalog over Semirings

In this paper, we study circuits and formulas for provenance polynomials of Datalog programs. We ask the following question: given an absorptive semiring and a fact of a Datalog program, what is the optimal depth and size of a…

Databases · Computer Science 2025-04-15 Austen Z. Fan, Paraschos Koutris, Sudeepa Roy

Table Integration in Data Lakes Unleashed: Pairwise Integrability Judgment, Integrable Set Discovery, and Multi-Tuple Conflict Resolution

Table integration aims to create a comprehensive table by consolidating tuples containing relevant information. In this work, we investigate the challenge of integrating multiple tables from a data lake, focusing on three core tasks: 1)…

Databases · Computer Science 2025-04-15 Daomin Ji, Hui Luo, Zhifeng Bao, Shane Culpepper

Revisiting the Expressiveness Landscape of Data Graph Queries

The study of graph queries in database theory has spanned more than three decades, resulting in a multitude of proposals for graph query languages. These languages differ in the mechanisms. We can identify three main families of languages,…

Databases · Computer Science 2025-04-15 Michael Benedikt, Anthony Widjaja Lin, Di-De Yen

eST$^2$ Miner -- Process Discovery Based on Firing Partial Orders

Process discovery generates process models from event logs. Traditionally, an event log is defined as a multiset of traces, where each trace is a sequence of events. The total order of the events in a sequential trace is typically based on…

Databases · Computer Science 2025-04-14 Sabine Folz-Weinstein, Christian Rennert, Lisa Luise Mannel, Robin Bergenthum, Wil van der Aalst

Role of Databases in GenAI Applications

Generative AI (GenAI) is transforming industries by enabling intelligent content generation, automation, and decision-making. However, the effectiveness of GenAI applications depends significantly on efficient data storage, retrieval, and…

Databases · Computer Science 2025-04-14 Santosh Bhupathi

LearnedKV: Integrating LSM and Learned Index for Superior Performance on Storage

We present LearnedKV, a novel tiered key-value store that seamlessly integrates a Log-Structured Merge (LSM) tree with a Learned Index to achieve superior read and write performance on storage systems. While existing approaches use learned…

Databases · Computer Science 2025-04-14 Wenlong Wang, David Hung-Chang Du

MatBase Metadata Catalog Management

MatBase is a prototype intelligent data and knowledge base management system based on the Relational, Entity-Relationship, and (Elementary) Mathematical Data Models. The latter distinguishes itself especially by its rich panoply of…

Databases · Computer Science 2025-04-11 Christian Mancas