Related papers: SemRep: Generative Code Representation Learning wi…

CCRep: Learning Code Change Representations via Pre-Trained Code Model and Query Back

Representing code changes as numeric feature vectors, i.e., code change representations, is usually an essential step to automate many software engineering tasks related to code changes, e.g., commit message generation and just-in-time…

Software Engineering · Computer Science 2023-02-09 Zhongxin Liu, Zhijie Tang, Xin Xia, Xiaohu Yang

Code Copycat Conundrum: Demystifying Repetition in LLM-based Code Generation

Despite recent advances in Large Language Models (LLMs) for code generation, the quality of LLM-generated code still faces significant challenges. One significant issue is code repetition, which refers to the model's tendency to generate…

Software Engineering · Computer Science 2025-04-18 Mingwei Liu, Juntao Li, Ying Wang, Xueying Du, Zuoyu Ou, Qiuyuan Chen, Bingxu An, Zhao Wei, Yong Xu, Fangming Zou, Xin Peng, Yiling Lou

What matters for Representation Alignment: Global Information or Spatial Structure?

Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target…

Computer Vision and Pattern Recognition · Computer Science 2025-12-12 Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, Saining Xie

LOOPRAG: Enhancing Loop Transformation Optimization with Retrieval-Augmented Large Language Models

Loop transformations are semantics-preserving optimization techniques, widely used to maximize objectives such as parallelism. Despite decades of research, applying the optimal composition of loop transformations remains challenging due to…

Programming Languages · Computer Science 2025-12-19 Yijie Zhi, Yayu Cao, Jianhua Dai, Xiaoyang Han, Jingwen Pu, Qingran Wu, Sheng Cheng, Ming Cai

GramTrans: A Better Code Representation Approach in Code Generation

Code generation has shown great promise in assisting software development. A fundamental yet underexplored question is how the choice of code representation affects model performance. While existing studies employ various representations,…

Software Engineering · Computer Science 2025-10-06 Zhao Zhang, Qingyuan Liang, Zeyu Sun, Yizhou Chen, Guoqing Wang, Yican Sun, Lu Zhang, Ge Li, Yingfei Xiong

CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning

Code reasoning is a fundamental capability for large language models (LLMs) in the code domain. It involves understanding and predicting a program's execution behavior, such as determining the output for a given input or whether a specific…

Software Engineering · Computer Science 2025-07-24 Lingxiao Tang, He Ye, Zhongxin Liu, Xiaoxue Ren, Lingfeng Bao

CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval

Despite the success of text retrieval in many NLP tasks, code retrieval remains a largely underexplored area. Most text retrieval systems are tailored for natural language queries, often neglecting the specific challenges of retrieving…

Software Engineering · Computer Science 2025-08-11 Ye Liu, Rui Meng, Shafiq Joty, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz

CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs

Machine Learning (ML) for software engineering (SE) has gained prominence due to its ability to significantly enhance the performance of various SE applications. This progress is largely attributed to the development of generalizable source…

Software Engineering · Computer Science 2024-11-25 Alex Mathai, Kranthi Sedamaki, Debeshee Das, Noble Saji Mathews, Srikanth Tamilselvam, Sridhar Chimalakonda, Atul Kumar

Toward Code Generation: A Survey and Lessons from Semantic Parsing

With the growth of natural language processing techniques and demand for improved software engineering efficiency, there is an emerging interest in translating intention from human languages to programming languages. In this survey paper,…

Software Engineering · Computer Science 2021-05-20 Celine Lee, Justin Gottschlich, Dan Roth

A Comparison of Code Embeddings and Beyond

Program representation learning is a fundamental task in software engineering applications. With the availability of "big code" and the development of deep learning techniques, various program representation learning models have been…

Software Engineering · Computer Science 2021-09-17 Siqi Han, DongXia Wang, Wanting Li, Xuesong Lu

SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning

Code Large Language Models (Code LLMs) have excelled at tasks like code completion but often miss deeper semantics such as execution effects and dynamic states. This paper aims to bridge the gap between Code LLMs' reliance on static text…

Computation and Language · Computer Science 2024-11-04 Yangruibo Ding, Jinjun Peng, Marcus J. Min, Gail Kaiser, Junfeng Yang, Baishakhi Ray

Self-Supervised Learning via Maximum Entropy Coding

A mainstream type of current self-supervised learning methods pursues a general-purpose representation that can be well transferred to downstream tasks, typically by optimizing on a given pretext task such as instance discrimination. In…

Computer Vision and Pattern Recognition · Computer Science 2022-10-21 Xin Liu, Zhongdao Wang, Yali Li, Shengjin Wang

Generative Semantic Communications with Foundation Models: Perception-Error Analysis and Semantic-Aware Power Allocation

Generative foundation models can revolutionize the design of semantic communication (SemCom) systems allowing high fidelity exchange of semantic information at ultra low rates. In this work, a generative SemCom framework with pretrained…

Signal Processing · Electrical Eng. & Systems 2025-01-15 Chunmei Xu, Mahdi Boloursaz Mashhadi, Yi Ma, Rahim Tafazolli, Jiangzhou Wang

Structural Embedding Projection for Contextual Large Language Model Inference

Structured embedding transformations offer a promising approach for enhancing the efficiency and coherence of language model inference. The introduction of Structural Embedding Projection (SEP) provides a mechanism for refining token…

Computation and Language · Computer Science 2025-08-11 Vincent Enoasmo, Cedric Featherstonehaugh, Xavier Konstantinopoulos, Zacharias Huntington

Generative Code Modeling with Graphs

Generative models for source code are an interesting structured prediction problem, requiring to reason about both hard syntactic and semantic constraints as well as about natural, likely programs. We present a novel model for this problem…

Machine Learning · Computer Science 2019-04-18 Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, Oleksandr Polozov

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, thus…

Computation and Language · Computer Science 2026-05-12 Yan Sun, Guoxia Wang, Jinle Zeng, JiaBin Yang, Shuai Li, Li Shen, Dacheng Tao, DianHai Yu, Haifeng Wang

ReCode: Reinforcing Code Generation with Reasoning-Process Rewards

In practice, rigorous reasoning is often a key driver of correct code, while Reinforcement Learning (RL) for code generation often neglects optimizing reasoning quality. Bringing process-level supervision into RL is appealing, but it faces…

Software Engineering · Computer Science 2026-05-06 Lishui Fan, Yu Zhang, Mouxiang Chen, Zhongxin Liu

An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities

Code generation aims to automatically generate code snippets of specific programming language according to natural language descriptions. The continuous advancements in deep learning, particularly pre-trained models, have empowered the code…

Software Engineering · Computer Science 2025-01-24 Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia

Learning Program Semantics with Code Representations: An Empirical Study

Program semantics learning is the core and fundamental for various code intelligent tasks e.g., vulnerability detection, clone detection. A considerable amount of existing works propose diverse approaches to learn the program semantics for…

Software Engineering · Computer Science 2022-03-23 Jing Kai Siow, Shangqing Liu, Xiaofei Xie, Guozhu Meng, Yang Liu

SEMAG: Self-Evolutionary Multi-Agent Code Generation

Large Language Models (LLMs) have made significant progress in handling complex programming tasks. However, current methods rely on manual model selection and fixed workflows, which limit their ability to adapt to changing task…

Software Engineering · Computer Science 2026-03-18 Yulin Peng, Haowen Hou, Xinxin Zhu, Ying Tiffany He, F. Richard Yu