Databases - Scifaro

Decomposition of contexts into independent subcontexts based on thresholds

The process of decomposing databases into smaller datasets, with the objective of extrapolating the information obtained in the smaller ones to the original database, represents a relevant and complex challenge in real applications. It is…

Databases · Computer Science 2026-04-16 Roberto G. Aragón, Jesús Medina, Eloísa Ramírez-Poussa

Independent subcontexts and blocks of concept lattices. Definitions and relationships to decompose fuzzy contexts

The decomposition of datasets is a useful mechanism in the processing of large datasets and it is required in many cases. In formal concept analysis (FCA), the dataset is interpreted as a context and the notion of independent context is…

Databases · Computer Science 2026-04-16 Roberto G. Aragón, Jesús Medina, Eloísa Ramírez-Poussa

OVT-MLCS: An Online Visual Tool for MLCS Mining from Long or Big Sequences

Mining multiple longest common subsequences (MLCS) from a set of three or more sequences over a finite alphabet $\Sigma$ (a classical NP-hard problem) is an important task in a wide variety of application fields. Unfortunately,…

Databases · Computer Science 2026-04-16 Zhi Wang, Yanni Li, Tihua Duan, Bing Liu, Liyong Zhang, Hui Li

100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models

Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby significantly broadening the kinds of…

Databases · Computer Science 2026-04-16 Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves-Laurent Kom Samo, Pushkar Khadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, Alon Halevy, Yannis Papakonstantinou

NeurBench: A Benchmark Suite for Learned Database Components with Drift Modeling

Learned database components, which deeply integrate machine learning into their design, have been extensively studied in recent years. Given the dynamism of databases, where data and workloads continuously drift, it is crucial for learned…

Databases · Computer Science 2026-04-16 Zhanhao Zhao, Haotian Gao, Naili Xing, Lingze Zeng, Meihui Zhang, Gang Chen, Manuel Rigger, Beng Chin Ooi

ROSE: An Intent-Centered Evaluation Metric for NL2SQL

Execution Accuracy (EX), the widely used metric for evaluating the effectiveness of Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores that questions may admit…

Databases · Computer Science 2026-04-15 Wenqi Pei, Shizheng Hou, Boyan Li, Han Chen, Zhichao Shi, Yuyu Luo

Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining

We present Lit2Vec, a reproducible workflow for constructing and validating a chemistry corpus from the Semantic Scholar Open Research Corpus using conservative, metadata-based license screening. Using this workflow, we assembled an…

Databases · Computer Science 2026-04-15 Mahmoud Amiri, Jamile Mohammad Jafari, Sara Mostafapour, Thomas Bocklitz

GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, their immense number of parameters and complex transformer-based architectures result in significant resource…

Databases · Computer Science 2026-04-15 Tianhao Tang, Haoyang Li, Lei Chen

Foundations of the GraphAlg Language

The GraphAlg domain-specific language for graph algorithms enables user-defined algorithms in graph databases. In this work we show how GraphAlg is built on top of the formal MATLANG language for matrix manipulation. Starting from MATLANG,…

Databases · Computer Science 2026-04-14 Daan de Graaf, Robert Brijder, Nikolay Yakovets

Natural Language to What? A Vision for Intermediate Representations in NL-to-X Querying

Natural-language-initiated querying is usually framed as translation into a predetermined backend language such as SQL, Cypher, or SPARQL. That framing is appropriate when the semantic target is known in advance, but it does not cover the…

Databases · Computer Science 2026-04-14 Shengqi Li, Amarnath Gupta

gMatch: Fine-Grained and Hardware-Efficient Subgraph Matching on GPUs

Subgraph matching is a core operation in graph analytics, supporting a broad spectrum of applications from social network analysis to bioinformatics. Recent GPU-based approaches accelerate subgraph matching by leveraging parallelism but…

Databases · Computer Science 2026-04-14 Weitian Chen, Shixuan Sun, Cheng Chen, Yongmin Hu, Yingqian Hu, Minyi Guo

JSON Schema Inclusion through Refutational Normalization: Reconciling Efficiency and Completeness

JSON Schema is the de facto standard for describing the structure of JSON documents. Reasoning about JSON Schema inclusion -- whether every instance satisfying a schema S1 also satisfies a schema S2 -- is a key building block for a variety…

Databases · Computer Science 2026-04-14 Mohamed-Amine Baazizi, Nour El Houda Ben Ali, Dario Colazzo, Giorgio Ghelli, Stefan Klessinger, Carlo Sartiani, Stefanie Scherzinger

DGAI: Decoupled On-Disk Graph-Based ANN Index for Efficient Updates and Queries

On-disk graph-based indexes are favored for billion-scale Approximate Nearest Neighbor Search (ANNS) due to their high performance and cost-efficiency. However, existing systems typically rely on a coupled storage architecture that…

Databases · Computer Science 2026-04-14 Jiahao Lou, Shufeng Gong, Quan Yu, Hao Guo, Youyou Lu, Song Yu, Yanfeng Zhang, Tiezheng Nie, Ge Yu

A Catalog of Data Errors

Data errors are widespread in real-world databases and severely impact downstream applications, such as machine learning pipelines or business analytics reports. Causes of such errors are manifold and can arise during both the design phase…

Databases · Computer Science 2026-04-13 Divya Bhadauria, Hazar Harmouch, Felix Naumann, Divesh Srivastava, Lisa Ehrlinger

Evaluating Data Quality Tools: Measurement Capabilities and LLM Integration

High data quality is critical for reliable analytics and operational efficiency. A growing ecosystem of tools has emerged to support data quality management, ranging from lightweight open-source libraries to comprehensive enterprise…

Databases · Computer Science 2026-04-13 Tobias Rehberger, Thomas Hütter, Lisa Ehrlinger, Wolfram Wöß

STIndex: A Context-Aware Multi-Dimensional Spatiotemporal Information Extraction System

Extracting structured knowledge from unstructured data still faces practical limitations: entity and event extraction pipelines remain brittle, knowledge graph construction requires costly ontology engineering, and cross-domain…

Databases · Computer Science 2026-04-13 Wenxiao Zhang, Yu Liu, Qiang Sun, Yihao Ding, Sirui Li, Yanbing Liu, Jin B. Hong, Wei Liu

QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference

Cache fusion accelerates the generation process of LLMs equipped with RAG through KV caching and selective token recomputation, thereby reducing computational costs and improving efficiency. However, existing methods primarily rely on local…

Databases · Computer Science 2026-04-13 Jianxin Yan, Zeheng Qian, Wangze Ni, Zhitao Shen, Zhiping Wang, Haoyang Li, Jia Zhu, Lei Chen, Kui Ren

Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. When reporting guidelines exist, they typically lack machine-actionable representations.…

Databases · Computer Science 2026-04-13 Josef Hardi, Martin J. O'Connor, Marcos Martinez-Romero, Jean G. Rosario, Stephen A. Fisher, Mark A. Musen

SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking

Database research and the development of learned query optimisers rely heavily on realistic SQL workloads. Acquiring real-world queries is increasingly difficult, however, due to strict privacy regulations, and publicly released anonymised…

Databases · Computer Science 2026-04-10 Kahan Mehta, Amit Mankodi

CobbleDB: Modelling Levelled Storage by Composition

We present a composition-based approach to building correct-by-construction database backing stores. In previous work, we specified the behaviour of several store variants and proved their correctness and equivalence. Here, we derive a Java…

Databases · Computer Science 2026-04-10 Emilie Ma, Ayush Pandey, Annette Bieniusa, Marc Shapiro