Computer Science

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination…

Computation and Language · Computer Science 2026-05-29 Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen

Unlocking the Working Memory of Large Language Models for Latent Reasoning

To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby…

Computation and Language · Computer Science 2026-05-29 Lukas Aichberger, Sepp Hochreiter

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

Test-time finetuning (TTFT) is a rapidly evolving paradigm that adapts a language model to each prompt by retrieving related sequences, updating the model on them, and then evaluating the prompt. However, TTFT is only practical if it is…

Machine Learning · Computer Science 2026-05-29 Alaa Khamis, Alaa Maalouf

Fairness-Aware Federated Learning with Trajectory Shapley Value

Federated learning is an emerging distributed paradigm that addresses the challenges posed by heterogeneous, privacy-sensitive data. It enables multiple clients to train a model collaboratively by aggregating their local updates at a…

Machine Learning · Computer Science 2026-05-29 Daniel Kuznetsov, Ziqi Wang

COMPOSE: Composing Future Theorems from Citations and Formal Structure

A plausible future mathematical claim must satisfy two constraints: it should follow the direction of prior work and respect the formal dependencies that constrain what can validly follow. Existing approaches typically model only one of…

Computation and Language · Computer Science 2026-05-29 David Busbib, Michael Werman

When, why, and how do diffusion posterior samplers fail? A finite-sample lens

Diffusion models have excellent capacity to model complex distributions of natural data, which has made them a popular and effective choice for posterior sampling in imaging inverse problems. Existing methods can incorporate any measurement…

Machine Learning · Computer Science 2026-05-29 Benjamin A. Burns, Sara Fridovich-Keil

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language…

Machine Learning · Computer Science 2026-05-29 Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, Furong Huang

Reasoning with Sampling: Cutting at Decision Points

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power…

Machine Learning · Computer Science 2026-05-29 Felix Zhou, Anay Mehrotra, Quanquan C. Liu

In-Context Reward Adaptation for Robust Preference Modeling

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model…

Machine Learning · Computer Science 2026-05-29 Zhenyu Sun, Zheng Xu, Ermin Wei

Gram: Assessing sabotage propensities via automated alignment auditing

We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage. We find Gemini…

Machine Learning · Computer Science 2026-05-29 David Lindner, Victoria Krakovna, Sebastian Farquhar

Resolution Diagnostics for Paired LLM Evaluation

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9…

Computation and Language · Computer Science 2026-05-29 Anany Kotawala

Generalizing a Highly Configurable Analytics Pipeline to Replicate and Support Educational Research Across Multiple Domains

Artificial intelligence assistants deployed in online learning environments create new opportunities to collect large volumes of learner interaction data and generate insights to improve student outcomes. Architecture for AI-Augmented…

Computers and Society · Computer Science 2026-05-29 Yallen Bai, Ploy Thajchayapong, Ashok Goel

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or…

Computation and Language · Computer Science 2026-05-29 Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

Self-Trained Verification for Training- and Test-Time Self-Improvement

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are…

Machine Learning · Computer Science 2026-05-29 Chen Henry Wu, Aditi Raghunathan

Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

Numeric tabular datasets are the dominant data format in scientific practice, yet large language models lack native mechanisms for representing numeric datasets in a meaningful way across heterogeneous feature spaces. Existing approaches…

Machine Learning · Computer Science 2026-05-29 M. Ross Kunz, John Merickel, Keith Wilson

Neural Operator-Based Surrogate Model for CFD: Helical Coil Steam Generator in Small Modular Reactor

Real-time thermal-hydraulic simulation is essential for digital twin (DT) technology that supports the safe and efficient operation of small modular reactors (SMRs). Computational fluid dynamics (CFD) provides high-fidelity flow analysis,…

Machine Learning · Computer Science 2026-05-29 Minseo Lee, Seongmin Oh, Chaehyeon Song, Bumjin Cho, Shilaj Baral, Sangam Khanal, Minseop Song, Joongoo Jeon

Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories

Earlier detection of pancreatic cancer is key to enabling wider access to curative treatment and reducing cancer deaths; however, screening is presently not viable. Latent indicators of pathology are evident in an individual's disease and…

Machine Learning · Computer Science 2026-05-29 Chris Varghese, Leo Y. Li-Han, Richa Bisht, Ellen Larson, Frank Lee, Ryan M. Carr, Tanios S. Bekaii-Saab, Shounak Majumder, John D. Halamka, Mark Truty, Ajit H. Goenka, Hojjat Salehinejad, Cornelius A. Thiels

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual…

Computation and Language · Computer Science 2026-05-29 Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang, Shimin Tao, Daimeng Wei, Min Zhang

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on…

Computation and Language · Computer Science 2026-05-29 Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the…

Computation and Language · Computer Science 2026-05-29 Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo