Databases - Scifaro

SciEGQA: A Dataset for Scientific Evidence-Grounded Question Answering and Reasoning

Scientific documents contain complex multimodal structures, which makes evidence localization and scientific reasoning in Document Visual Question Answering particularly challenging. However, most existing benchmarks evaluate models only at…

Databases · Computer Science 2026-03-31 Wenhan Yu, Zhaoxi Zhang, Wang Chen, Guanqiang Qi, Weikang Li, Lei Sha, Deguo Xia, Jizhou Huang

Finding a Fair Scoring Function for Top-$k$ Selection: From Hardness to Practice

Selecting a subset of the $k$ "best" items from a dataset of $n$ items, based on a scoring function, is a key task in decision-making. Given the rise of automated decision-making software, it is important that the outcome of this process,…

Databases · Computer Science 2026-03-31 Guangya Cai

Fair Data Pre-Processing with Imperfect Attribute Space

Fair data pre-processing is a widely used strategy for mitigating bias in machine learning. A promising line of research focuses on calibrating datasets to satisfy a designed fairness policy so that sensitive attributes influence outcomes…

Databases · Computer Science 2026-03-30 Ying Zheng, Yangfan Jiang, Kian-Lee Tan

Query-Specific Pruning of RML Mappings (Extended Version)

Current approaches for knowledge graph construction with RML focus on full RDF graph materialization without considering user queries. As a result, mapping engines are inefficient in dynamic query environments, materializing large graphs…

Databases · Computer Science 2026-03-30 Sitt Min Oo, Olaf Hartig

Responsibility Measures for Conjunctive Queries with Negation

We contribute to the recent line of work on responsibility measures that quantify the contributions of database facts to obtaining a query result. In contrast to existing work which has almost exclusively focused on monotone queries, here…

Databases · Computer Science 2026-03-30 Meghyn Bienvenu, Diego Figueira, Pierre Lafourcade

Are LLMs Overkill for Databases?: A Study on the Finiteness of SQL

Translating natural language to SQL for data retrieval has become more accessible thanks to code generation LLMs. But how hard is it to generate SQL code? While databases can become unbounded in complexity, the complexity of queries is…

Databases · Computer Science 2026-03-27 Yue Li, David Mimno, Unso Eun Seo Jo

Enabling Homomorphic Analytical Operations on Compressed Scientific Data with Multi-stage Decompression

Error-controlled lossy compressors have been widely used in scientific applications to reduce the unprecedented size of scientific data while keeping data distortion within a user-specified threshold. While they significantly mitigate the…

Databases · Computer Science 2026-03-27 Xuan Wu, Sheng Di, Tripti Agarwal, Kai Zhao, Xin Liang, Franck Cappello

PDET-LSH: Scalable In-Memory Indexing for High-Dimensional Approximate Nearest Neighbor Search with Quality Guarantees

Locality-sensitive hashing (LSH) is a well-known solution for approximate nearest neighbor (ANN) search with theoretical guarantees. Traditional LSH-based methods mainly focus on improving the efficiency and accuracy of the query phase by…

Databases · Computer Science 2026-03-27 Jiuqi Wei, Xiaodong Lee, Botao Peng, Quanqing Xu, Chuanhui Yang, Themis Palpanas

TaCo: Data-adaptive and Query-aware Subspace Collision for High-dimensional Approximate Nearest Neighbor Search

Approximate Nearest Neighbor Search (ANNS) in high-dimensional Euclidean spaces is a fundamental problem with broad applications. Subspace Collision is a newly proposed ANNS framework that provides a novel paradigm for similarity search and…

Databases · Computer Science 2026-03-27 Jiuqi Wei, Zhenyu Liao, Ruoyu Han, Quanqing Xu, Chuanhui Yang, Themis Palpanas

Zero-Cost NDV Estimation from Columnar File Metadata

We present a method for estimating the number of distinct values (NDV) of a column in columnar file formats, using only existing file metadata, with no extra storage and no data access. Two complementary signals are exploited: (1) inverting the…

Databases · Computer Science 2026-03-27 Claude Brisson

A Hypergraph-Based Framework for Exploratory Business Intelligence

Business Intelligence (BI) analysis is evolving towards Exploratory BI, an iterative, multi-round exploration paradigm where analysts progressively refine their understanding. However, traditional BI systems impose critical limits for…

Databases · Computer Science 2026-03-27 Yunkai Lou, Shunyang Li, Longbin Lai, Jianke Yu, Wenyuan Yu, Ying Zhang

SIVF: GPU-Resident IVF Index for Streaming Vector Search

GPU-accelerated Inverted File (IVF) index is one of the industry standards for large-scale vector search but relies on static VRAM layouts that hinder real-time mutability. Our benchmark and analysis reveal that existing designs of GPU IVF…

Databases · Computer Science 2026-03-27 Dongfang Zhao

A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge

As high-dimensional vector data increasingly surpasses the processing capabilities of traditional database management systems, Vector Databases (VDBs) have emerged and become tightly integrated with large language models, being widely…

Databases · Computer Science 2026-03-27 Le Ma, Ran Zhang, Yikun Han, Shirui Yu, Zaitian Wang, Zhiyuan Ning, Jinghan Zhang, Ping Xu, Pengjiang Li, Ziyue Qiao, Wei Ju, Chong Chen, Dongjie Wang, Kunpeng Liu, Pengyang Wang, Pengfei Wang, Yanjie Fu, Chunjiang Liu, Yuanchun Zhou, Chang-Tien Lu

Hierarchical Spatial-Temporal Graph-Enhanced Model for Map-Matching

The integration of GNSS data into portable devices has led to the generation of vast amounts of trajectory data, which is crucial for applications such as map-matching. To tackle the limitations of rule-based methods, recent works in deep…

Databases · Computer Science 2026-03-26 Anjun Gao, Zhenglin Wan, Pingfu Chao, Shunyu Yao

An In-Depth Study of Filter-Agnostic Vector Search on a PostgreSQL Database System: [Experiments and Analysis]

Filtered Vector Search (FVS) is critical for supporting semantic search and GenAI applications in modern database systems. However, existing research most often evaluates algorithms in specialized libraries, making optimistic assumptions…

Databases · Computer Science 2026-03-26 Duo Lu, Helena Caminal, Manos Chatzakis, Yannis Papakonstantinou, Yannis Chronis, Vaibhav Jain, Fatma Özcan

Structure Selection for Fairness-Constrained Differentially Private Data Synthesis

Differential privacy (DP) enables safe data release, with synthetic data generation emerging as a common approach in recent years. Yet standard synthesizers preserve all dependencies in the data, including spurious correlations between…

Databases · Computer Science 2026-03-26 Naeim Ghahramanpour, Mostafa Milani

Graph-centric Cross-model Data Integration and Analytics in a Unified Multi-model Database

Graph-centric cross-model data integration and analytics (GCDIA) refer to tasks that leverage the graph model as a central paradigm to integrate relevant information across heterogeneous data models, such as relational and document, and…

Databases · Computer Science 2026-03-26 Zepeng Liu, Sheng Wang, Shixun Huang, Hailang Qiu, Yuwei Peng, Jiale Feng, Shunan Liao, Yushuai Ji, Zhiyong Peng

The Human Factor in Data Cleaning: Exploring Preferences and Biases

Data cleaning is often framed as a technical preprocessing step, yet in practice it relies heavily on human judgment. We report results from a controlled survey study in which participants performed error detection, data repair and…

Databases · Computer Science 2026-03-26 Hazim AbdElazim, Shadman Islam, Mostafa Milani

ByteHouse: ByteDance's Cloud-Native Data Warehouse for Real-Time Multimodal Data Analytics

With the rapid rise of intelligent data services, modern enterprises increasingly require efficient, multimodal, and cost-effective data analytics infrastructures. However, in ByteDance's production environments, existing systems fall short…

Databases · Computer Science 2026-03-26 Yuxing Han, Yu Lin, Yifeng Dong, Xuanhe Zhou, Xindong Peng, Xinhui Tian, Zhiyuan You, Yingzhong Guo, Xi Chen, Weiping Qu, Tao Meng, Dayue Gao, Haoyu Wang, Liuxi Wei, Huanchen Zhang, Fan Wu

Automated Discovery of Test Oracles for Database Management Systems Using LLMs

Since 2020, automated testing for Database Management Systems (DBMSs) has flourished, uncovering hundreds of bugs in widely used systems. A cornerstone of these techniques is the test oracle, which typically implements a mechanism to generate…

Databases · Computer Science 2026-03-26 Qiuyang Mang, Runyuan He, Suyang Zhong, Xiaoxuan Liu, Huanchen Zhang, Alvin Cheung