Computer Science

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a…

Artificial Intelligence · Computer Science 2026-05-29 Nhat-Minh Nguyen

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination…

Computation and Language · Computer Science 2026-05-29 Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen

SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations

Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from…

Artificial Intelligence · Computer Science 2026-05-29 Qinpei Luo, Ruichun Ma, Xinyu Zhang, Lili Qiu

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal models to finding abnormal patterns in…

Artificial Intelligence · Computer Science 2026-05-29 Xiaona Zhou, Muntasir Wahed, Tianjiao Yu, Constantin Brif, Ismini Lourentzou

Unlocking the Working Memory of Large Language Models for Latent Reasoning

To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby…

Computation and Language · Computer Science 2026-05-29 Lukas Aichberger, Sepp Hochreiter

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

Test-time finetuning (TTFT) is a rapidly evolving paradigm that adapts a language model to each prompt by retrieving related sequences, updating the model on them, and then evaluating the prompt. However, TTFT is only practical if it is…

Machine Learning · Computer Science 2026-05-29 Alaa Khamis, Alaa Maalouf

Fairness-Aware Federated Learning with Trajectory Shapley Value

Federated learning is an emerging distributed paradigm that addresses the challenges posed by heterogeneous, privacy-sensitive data. It enables multiple clients to train a model collaboratively by aggregating their local updates at a…

Machine Learning · Computer Science 2026-05-29 Daniel Kuznetsov, Ziqi Wang

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this…

Artificial Intelligence · Computer Science 2026-05-29 Anany Kotawala

Demystifying Data Organization for Enhanced LLM Training

Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily reliant on effective data curation. While data selection has been widely studied, the strategic data organization for enhanced…

Artificial Intelligence · Computer Science 2026-05-29 Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang, Wenshan Wu, Qihao Zhao, Hao Li, Yuanyuan Gao, Kim-Hui Yap, Scarlett Li

COMPOSE: Composing Future Theorems from Citations and Formal Structure

A plausible future mathematical claim must satisfy two constraints: it should follow the direction of prior work and respect the formal dependencies that constrain what can validly follow. Existing approaches typically model only one of…

Computation and Language · Computer Science 2026-05-29 David Busbib, Michael Werman

When, why, and how do diffusion posterior samplers fail? A finite-sample lens

Diffusion models have excellent capacity to model complex distributions of natural data, which has made them a popular and effective choice for posterior sampling in imaging inverse problems. Existing methods can incorporate any measurement…

Machine Learning · Computer Science 2026-05-29 Benjamin A. Burns, Sara Fridovich-Keil

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language…

Machine Learning · Computer Science 2026-05-29 Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, Furong Huang

Reasoning with Sampling: Cutting at Decision Points

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power…

Machine Learning · Computer Science 2026-05-29 Felix Zhou, Anay Mehrotra, Quanquan C. Liu

In-Context Reward Adaptation for Robust Preference Modeling

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model…

Machine Learning · Computer Science 2026-05-29 Zhenyu Sun, Zheng Xu, Ermin Wei

Gram: Assessing sabotage propensities via automated alignment auditing

We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage. We find Gemini…

Machine Learning · Computer Science 2026-05-29 David Lindner, Victoria Krakovna, Sebastian Farquhar

Resolution Diagnostics for Paired LLM Evaluation

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9…

Computation and Language · Computer Science 2026-05-29 Anany Kotawala

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or…

Computation and Language · Computer Science 2026-05-29 Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

Self-Trained Verification for Training- and Test-Time Self-Improvement

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are…

Machine Learning · Computer Science 2026-05-29 Chen Henry Wu, Aditi Raghunathan

Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

Numeric tabular datasets are the dominant data format in scientific practice, yet large language models lack native mechanisms for representing numeric datasets in a meaningful way across heterogeneous feature spaces. Existing approaches…

Machine Learning · Computer Science 2026-05-29 M. Ross Kunz, John Merickel, Keith Wilson

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a…

Artificial Intelligence · Computer Science 2026-05-29 Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu