Related papers: Code Representation Pre-training with Complements …

Understanding Programs by Exploiting (Fuzzing) Test Cases

Semantic understanding of programs has attracted great attention in the community. Inspired by recent successes of large language models (LLMs) in natural language understanding, tremendous progress has been made by treating programming…

Machine Learning · Computer Science 2023-06-13 Jianyu Zhao , Yuyang Rong , Yiwen Guo , Yifeng He , Hao Chen

Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models

Current language models tailored for code tasks often adopt the pre-training-then-fine-tuning paradigm from natural language processing, modeling source code as plain text. This approach, however, overlooks the unambiguous structures…

Computation and Language · Computer Science 2024-01-22 Mayank Agarwal , Yikang Shen , Bailin Wang , Yoon Kim , Jie Chen

Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs

Code-generating Large Language Models (LLMs) have become essential tools in modern software development, enhancing productivity and accelerating development. This paper aims to investigate the fine-tuning of code-generating LLMs using…

Software Engineering · Computer Science 2025-05-06 Marina Sakharova , Abhinav Anand , Mira Mezini

Benchmarking Language Models for Code Syntax Understanding

Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, which represent the input as a token sequence without explicitly modeling its structure. Some prior works…

Computation and Language · Computer Science 2022-10-27 Da Shen , Xinyun Chen , Chenguang Wang , Koushik Sen , Dawn Song

Code Execution with Pre-trained Language Models

Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and…

Programming Languages · Computer Science 2023-05-10 Chenxiao Liu , Shuai Lu , Weizhu Chen , Daxin Jiang , Alexey Svyatkovskiy , Shengyu Fu , Neel Sundaresan , Nan Duan

How Does Code Pretraining Affect Language Model Task Performance?

Large language models are increasingly trained on corpora containing both natural language and non-linguistic data like source code. Aside from aiding programming-related tasks, anecdotal evidence suggests that including code in pretraining…

Computation and Language · Computer Science 2025-02-26 Jackson Petty , Sjoerd van Steenkiste , Tal Linzen

Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis

Recent work using auxiliary prediction task classifiers to investigate the properties of LSTM representations has begun to shed light on why pretrained representations, like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017), are so…

Computation and Language · Computer Science 2019-01-08 Kelly W. Zhang , Samuel R. Bowman

The Program Testing Ability of Large Language Models for Code

Recent development of large language models (LLMs) for code like CodeX and CodeT5+ demonstrates tremendous promise in achieving code intelligence. Their ability of synthesizing code that completes a program for performing a pre-defined task…

Computation and Language · Computer Science 2023-10-10 Weimin Xiong , Yiwen Guo , Hao Chen

SynthCoder: A Synthetical Strategy to Tune LLMs for Code Completion

Code completion is a prominent application of Large Language Models (LLMs) in software engineering. Due to the near real-time response requirements of this task, base models with small to medium-sized parameters are typically employed,…

Software Engineering · Computer Science 2025-09-18 Dongjun Yu , Xiao Yan , Zhenrui Li , Jipeng Xiao , Haochuan He , Yongda Yu , Hao Zhang , Guoping Rong , Xiaobo Huang

FuzzCoder: Byte-level Fuzzing Test via Large Language Model

Fuzzing is an important dynamic program analysis technique designed for finding vulnerabilities in complex software. Fuzzing involves presenting a target program with crafted malicious input to cause crashes, buffer overflows, memory…

Computation and Language · Computer Science 2024-09-04 Liqun Yang , Jian Yang , Chaoren Wei , Guanglin Niu , Ge Zhang , Yunli Wang , Linzheng ChaI , Wanxu Xia , Hongcheng Guo , Shun Zhang , Jiaheng Liu , Yuwei Yin , Junran Peng , Jiaxin Ma , Liang Sun , Zhoujun Li

Self-Execution Simulation Improves Coding Models

A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code…

Computation and Language · Computer Science 2026-04-07 Gallil Maimon , Ori Yoran , Felix Kreuk , Michael Hassid , Gal Cohen , Pierre Chambon , Yossi Adi

Learning Code Preference via Synthetic Evolution

Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we…

Machine Learning · Computer Science 2024-10-25 Jiawei Liu , Thanh Nguyen , Mingyue Shang , Hantian Ding , Xiaopeng Li , Yu Yu , Varun Kumar , Zijian Wang

Can Large Language Models Generalize Procedures Across Representations?

Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these…

Computation and Language · Computer Science 2026-02-04 Fangru Lin , Valentin Hofmann , Xingchen Wan , Weixing Wang , Zifeng Ding , Anthony G. Cohn , Janet B. Pierrehumbert

CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning

Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via…

Artificial Intelligence · Computer Science 2025-05-30 Huimu Yu , Xing Wu , Haotian Xu , Debing Zhang , Songlin Hu

Specification-Driven Code Translation Powered by Large Language Models: How Far Are We?

Large Language Models (LLMs) are increasingly being applied across various domains, including code-related tasks such as code translation. Previous studies have explored using LLMs for translating code between different programming…

Software Engineering · Computer Science 2026-05-05 Soumit Kanti Saha , Fazle Rabbi , Song Wang , Jinqiu Yang

A Comparison of Code Embeddings and Beyond

Program representation learning is a fundamental task in software engineering applications. With the availability of "big code" and the development of deep learning techniques, various program representation learning models have been…

Software Engineering · Computer Science 2021-09-17 Siqi Han , DongXia Wang , Wanting Li , Xuesong Lu

Emergent Representations of Program Semantics in Language Models Trained on Programs

We present evidence that language models (LMs) of code can learn to represent the formal semantics of programs, despite being trained only to perform next-token prediction. Specifically, we train a Transformer model on a synthetic corpus of…

Machine Learning · Computer Science 2024-08-06 Charles Jin , Martin Rinard

Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?

Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of…

Software Engineering · Computer Science 2023-12-04 Weisong Sun , Chunrong Fang , Yun Miao , Yudu You , Mengzhe Yuan , Yuchen Chen , Quanjun Zhang , An Guo , Xiang Chen , Yang Liu , Zhenyu Chen

Exploring Large Language Models for Code Explanation

Automating code documentation through explanatory text can prove highly beneficial in code understanding. Large Language Models (LLMs) have made remarkable strides in Natural Language Processing, especially within software engineering tasks…

Software Engineering · Computer Science 2023-10-26 Paheli Bhattacharya , Manojit Chakraborty , Kartheek N S N Palepu , Vikas Pandey , Ishan Dindorkar , Rakesh Rajpurohit , Rishabh Gupta

Procedural Pretraining: Warming Up Language Models with Abstract Data

Pretraining language models directly on web-scale corpora is the de facto paradigm. We study an alternative where the model is initially exposed to abstract structured data to ease the subsequent acquisition of rich semantic knowledge, much…

Computation and Language · Computer Science 2026-05-29 Liangze Jiang , Zachary Shinnick , Anton van den Hengel , Hemanth Saratchandran , Damien Teney