Computer Science

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination…

Computation and Language · Computer Science 2026-05-29 Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen

Unlocking the Working Memory of Large Language Models for Latent Reasoning

To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby…

Computation and Language · Computer Science 2026-05-29 Lukas Aichberger, Sepp Hochreiter

COMPOSE: Composing Future Theorems from Citations and Formal Structure

A plausible future mathematical claim must satisfy two constraints: it should follow the direction of prior work and respect the formal dependencies that constrain what can validly follow. Existing approaches typically model only one of…

Computation and Language · Computer Science 2026-05-29 David Busbib, Michael Werman

Resolution Diagnostics for Paired LLM Evaluation

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9…

Computation and Language · Computer Science 2026-05-29 Anany Kotawala

On abelian periodicity of purely morphic words

The decidability of periodicity for infinite words generated by morphisms is a classical result in combinatorics on words, established in the 1980s by Harju, Linna, and Pansiot. In this paper, we are interested in this question in the abelian setting. Two words are…

Discrete Mathematics · Computer Science 2026-05-29 Arina Filimonova, Svetlana Puzynina

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or…

Computation and Language · Computer Science 2026-05-29 Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual…

Computation and Language · Computer Science 2026-05-29 Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang, Shimin Tao, Daimeng Wei, Min Zhang

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on…

Computation and Language · Computer Science 2026-05-29 Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the…

Computation and Language · Computer Science 2026-05-29 Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo

Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning

Current plan-based reasoning methods improve large language models (LLMs) by inserting a planning stage before execution, giving rise to the question $\rightarrow$ plan $\rightarrow$ cot paradigm. While effective, a closer examination…

Computation and Language · Computer Science 2026-05-29 Shaojie Wang, Liang Zhang

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability. We introduce CommunityFact, a refreshable benchmark for…

Computation and Language · Computer Science 2026-05-29 Sahajpreet Singh, Insyirah Mujtahid, Min-Yen Kan, Kokil Jaidka

Do Language Models Track Entities Across State Changes?

Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding $\textit{without}$…

Computation and Language · Computer Science 2026-05-29 Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya, Aaron Mueller, Sebastian Schuster, Najoung Kim

GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with…

Computation and Language · Computer Science 2026-05-29 Fabian Mewes, Anne Lauscher, Vagrant Gautam

A Dual-Path Architecture for Scaling Compute and Capacity in LLMs

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We…

Computation and Language · Computer Science 2026-05-29 Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali

Neural Network Verification using Partial Multi-Neuron Relaxation

The increasing integration of deep neural networks in critical systems has spawned a theoretical and practical interest in formally guaranteeing safety properties about their behavior. To achieve this, contemporary verification algorithms…

Logic in Computer Science · Computer Science 2026-05-29 Ido Shmuel, Guy Katz

Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?

Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating…

Computation and Language · Computer Science 2026-05-29 Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen, Siheng Xiong, Xiaoqian Wang, Jing Gao

CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution

We introduce CorPipe 26, our winning submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. The fifth edition of this shared task focuses mainly on the comparison of generative LLMs and specialized systems;…

Computation and Language · Computer Science 2026-05-29 Milan Straka

CCS: Clinical Consensus Selection for Radiology Report Generation

Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by…

Computation and Language · Computer Science 2026-05-29 Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking

Creating spoken dialogue datasets is methodologically challenging, and these challenges are amplified when the goal is to build multilingual, multi-parallel datasets at scale. This work introduces HEALTHDIAL, a large-scale, multilingual,…

Computation and Language · Computer Science 2026-05-29 Songbo Hu, Yinhong Liu, Ej Zhou, Evgeniia Razumovskaia, Xiaobin Wang, Alexander Fraser, Ivan Vulić, Anna Korhonen

A Rust-to-Lean Verification Pipeline with AI Provers: An Experience Report

We describe a verification pipeline that takes production Rust cryptographic code and produces machine-checked correctness proofs in Lean 4. The pipeline combines three components: symbolic extraction tools (Charon and Aeneas, or Hax) that…

Logic in Computer Science · Computer Science 2026-05-29 Natalia Klaus, Palina Tolmach, Juan Conejero