Computer Science

Zero-Scan Data Quality: Leveraging Table Format Metadata for Continuous Observability at Scale

Modern table formats such as Apache Iceberg compute and store metadata (commit timestamps, record counts, and column-level statistics such as null counts and value bounds) at write time as part of file writing. These statistics serve query…

Databases · Computer Science 2026-05-29 Mohit Verma, Shantanu Rawat, Christian Bush, Sumedh Sakdeo, Lokesh Amarnath Ravindranathan, Dwarak Bakshi

The Missing Dimensions in Geo-Distributed Database Evaluation

Geo-distributed OLTP databases are widely deployed across cloud regions, yet current evaluation practices do not cover the challenges of geo-distribution. Existing benchmarks assume stable network conditions; they lack explicit settings for data…

Databases · Computer Science 2026-05-29 Oto Mraz, Kyriakos Psarakis, George Christodoulou, Paris Carbone, Asterios Katsifodimos

Towards Reliable Agentic Progressive Text-to-Visualization with Verification Rules

Text-to-Visualization (Text-to-Vis) translates natural language queries into visualization query languages, enabling non-expert users to perform data analysis. However, most existing methods follow a one-shot paradigm that requires users to…

Databases · Computer Science 2026-05-29 Wenxin Xu, Chen Jason Zhang, Xiaoyong Wei, Haoyang Li, Hwanhee Kim, Yuanfeng Song, Raymond Chi-Wing Wong

One Ring to Shuffle Them All: Scalable Intra-Process Data Redistribution with Ring-Buffer Shuffle in Redpanda Oxla

As server CPUs scale to dozens and now hundreds of cores per socket, parallel query engines must rethink how they redistribute data between threads. Partitioned operators such as hash joins and aggregations require frequent data…

Databases · Computer Science 2026-05-29 Adam Szymański, Tyler Akidau

ScanTwin: Simulating Performance Regressions Without Access to Tenant Data

In cloud data platforms, developers often encounter performance regressions that occur in specific tenant datasets. However, due to confidentiality constraints, they cannot access the original data, which makes it difficult to reproduce…

Databases · Computer Science 2026-05-29 Donghyun Sohn, Jennie Rogers

IORM: Hierarchical I/O Governance for Thousands of Consolidated Databases on Oracle Exadata

Oracle Exadata consolidates thousands of tenant databases onto shared storage infrastructure deployed at hundreds of customer sites worldwide. Oracle Multitenant architecture enables this extreme density, with thousands of tenant databases…

Databases · Computer Science 2026-05-29 Rajarshi Chowdhury, Akshay Shah, Zakaria Alrmaih, Chenhao Guo, Anubhav Singh, Sue Lee

Cone-Induced Observation Congruences for Vector-Valued Quantitative Languages

We study the observation congruences induced by rational polyhedral cones on vector-valued quantitative languages. The extreme rays of the dual cone define intrinsic covectors, and these covectors classify every incremental residual future…

Formal Languages and Automata Theory · Computer Science 2026-05-29 Faruk Alpay, Baris Basaran

Extending QuAK with Nested Quantitative Automata

Quantitative automata (QAs) extend finite-state automata on infinite words with weighted transitions to specify quantitative system properties. However, their finite weight sets rule out properties like average response time, where response…

Formal Languages and Automata Theory · Computer Science 2026-05-29 Thomas A. Henzinger, Nicolas Mazzocchi, N. Ege Saraç, Harun Yılmaz

E2E: Efficient Filtered AKNN Search via Adaptive Termination

Approximate k-Nearest Neighbor (AKNN) search is widely used in vector databases. When vectors carry additional attributes (e.g., labels or numerical values), filtered AKNN search retrieves the nearest vectors to a query vector under…

Databases · Computer Science 2026-05-29 Wenxuan Xia, Mingyu Yang, Wentao Li, Wei Wang

Grammar-Aware Literate Generative Mathematical Programming with Compiler-in-the-Loop

Mathematical programming is widely employed across various sectors (such as logistics, energy, and workforce planning) to model and solve industrial optimisation problems, but its use requires substantial domain expertise. Large language…

Programming Languages · Computer Science 2026-05-29 Roberto Rossi, Steven D. Prestwich

Grain Theory: Type-Level Granularity Correctness in Data Pipelines

Data transformation correctness is a fundamental challenge in data engineering: how can we verify that pipelines produce correct results before executing on production data? Existing practice relies on iterative testing over materialized…

Databases · Computer Science 2026-05-29 Nikos Karayannidis

Redbench: Workload Synthesis From Cloud Traces

Workload traces from cloud data warehouse providers reveal that standard benchmarks such as TPC-H and TPC-DS fail to capture key characteristics of real-world workloads, including query repetition and string-heavy queries. In this paper, we…

Databases · Computer Science 2026-05-29 Johannes Wehrstein, Roman Heinrich, Mihail Stoian, Skander Krid, Martin Stemmer, Andreas Kipf, Carsten Binnig, Muhammad El-Hindi

Smallest Suffixient Sets: Effectiveness, Resilience, and Calculation

A suffixient set is a novel combinatorial object that captures the essential information of repetitive strings in a way that, provided with a random access mechanism, supports various forms of pattern matching. In this paper, we study the…

Formal Languages and Automata Theory · Computer Science 2026-05-29 Hiroto Fujimaru, Gonzalo Navarro, Giuseppe Romana, Cristian Urbina

Resolving Nondeterminism with Randomness

While determinisation provides a standard route to solving many common problems in automata theory, some weak forms of nondeterminism can be handled in some problems without costly determinisation. For example, the…

Formal Languages and Automata Theory · Computer Science 2026-05-29 Thomas A. Henzinger, Keya Prakash, K. S. Thejaswini

CompilerDream: Learning a Compiler World Model for General Code Optimization

Effective code optimization in compilers is crucial for computer and software engineering. The success of these optimizations primarily depends on the selection and ordering of the optimization passes applied to the code. While most…

Programming Languages · Computer Science 2026-05-29 Chaoyi Deng, Jialong Wu, Ningya Feng, Jianmin Wang, Mingsheng Long

E-Path: Equality Saturation for Control-Flow Graphs

Modern equality saturation systems excel at expression-level rewrites by exploring large spaces of equivalent programs without suffering from the phase-ordering problem. However, these systems struggle to represent equivalence directly…

Programming Languages · Computer Science 2026-05-28 Guillermo Garcia

Towards Cost-effective LLMs Routing with Batch Prompting

Large Language Model (LLM) serving systems must balance task performance against monetary cost. Two prominent optimization techniques have emerged independently: LLM routing, which directs each query to the most cost-effective model in a…

Databases · Computer Science 2026-05-28 Haotian Xu, Kangfei Zhao, Jiadong Xie

Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents

Markdown skill libraries for LLM agents ship as free-form prose, forcing the agent to re-derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a "confused -> re-retrieve ->…

Programming Languages · Computer Science 2026-05-28 Xinze Li, Yuhang Zang, Yixin Cao, Aixin Sun

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even…

Programming Languages · Computer Science 2026-05-28 Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong

Are Diffusion Language Models Good Database Analysts?

Recent advancements in large language models (LLMs) have significantly improved Natural Language to SQL (NL2SQL) tasks, yet most NL2SQL systems continue to rely on the autoregressive (AR) paradigm. The highly structured nature of SQL makes…

Databases · Computer Science 2026-05-28 Peixian Ma, Xialie Zhuang, Jiantao Tan, Changlun Li, Ruirui Chen, Chengwei Qin