Related papers: Structured Code Representations Enable Data-Effici…

What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code

Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These…

Software Engineering · Computer Science 2022-02-15 Yao Wan, Wei Zhao, Hongyu Zhang, Yulei Sui, Guandong Xu, Hai Jin

Benchmarking Language Models for Code Syntax Understanding

Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, which represent the input as a token sequence without explicitly modeling its structure. Some prior works…

Computation and Language · Computer Science 2022-10-27 Da Shen, Xinyun Chen, Chenguang Wang, Koushik Sen, Dawn Song

Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding

With the great success of pre-trained models, the pretrain-then-finetune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to costly training a large-scale model from scratch, how to…

Software Engineering · Computer Science 2022-03-16 Deze Wang, Zhouyang Jia, Shanshan Li, Yue Yu, Yun Xiong, Wei Dong, Xiangke Liao

GraphCodeBERT: Pre-training Code Representations with Data Flow

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code…

Software Engineering · Computer Science 2021-09-14 Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, Ming Zhou

Code Representation Learning with Prüfer Sequences

An effective and efficient encoding of the source code of a computer program is critical to the success of sequence-to-sequence deep neural network models for tasks in computer program comprehension, such as automated code summarization and…

Artificial Intelligence · Computer Science 2021-11-16 Tenzin Jinpa, Yong Gao

Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?

Programming language understanding and representation (a.k.a. code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of…

Software Engineering · Computer Science 2023-12-04 Weisong Sun, Chunrong Fang, Yun Miao, Yudu You, Mengzhe Yuan, Yuchen Chen, Quanjun Zhang, An Guo, Xiang Chen, Yang Liu, Zhenyu Chen

Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond

Recently, fine-tuning pre-trained code models such as CodeBERT on downstream tasks has achieved great success in many software testing and analysis tasks. While effective and prevalent, fine-tuning the pre-trained parameters incurs a large…

Software Engineering · Computer Science 2023-04-12 Ensheng Shi, Yanlin Wang, Hongyu Zhang, Lun Du, Shi Han, Dongmei Zhang, Hongbin Sun

Comparative Code Structure Analysis using Deep Learning for Performance Prediction

Performance analysis has always been an afterthought during the application development process, focusing on application correctness first. The learning curve of the existing static and dynamic analysis tools is steep, which requires…

Machine Learning · Computer Science 2021-04-23 Nathan Pinnow, Tarek Ramadan, Tanzima Z. Islam, Chase Phelps, Jayaraman J. Thiagarajan

Text-to-Code Generation with Modality-relative Pre-training

Large pre-trained language models have recently been expanded and applied to programming language tasks with great success, often through further pre-training of a strictly-natural language model--where training sequences typically contain…

Computation and Language · Computer Science 2024-02-13 Fenia Christopoulou, Guchun Zhang, Gerasimos Lampouras

Code Representation Pre-training with Complements from Program Executions

Large language models (LLMs) for natural language processing have been grafted onto programming language modeling for advancing code intelligence. Although it can be represented in the text format, code is syntactically more rigorous in…

Software Engineering · Computer Science 2023-09-20 Jiabo Huang, Jianyu Zhao, Yuyang Rong, Yiwen Guo, Yifeng He, Hao Chen

Synchromesh: Reliable code generation from pre-trained language models

Large pre-trained language models have been used to generate code, providing a flexible interface for synthesizing programs from natural language specifications. However, they often violate syntactic and semantic rules of their output…

Machine Learning · Computer Science 2022-01-28 Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, Sumit Gulwani

Towards Understanding What Code Language Models Learned

Pre-trained language models are effective in a variety of natural language tasks, but it has been argued their capabilities fall short of fully learning meaning or understanding language. To understand the extent to which language models…

Software Engineering · Computer Science 2024-02-29 Toufique Ahmed, Dian Yu, Chengxuan Huang, Cathy Wang, Prem Devanbu, Kenji Sagae

AST-Probe: Recovering abstract syntax trees from hidden representations of pre-trained language models

The objective of pre-trained language models is to learn contextual representations of textual data. Pre-trained language models have become mainstream in natural language processing and code modeling. Using probes, a technique to study the…

Computation and Language · Computer Science 2022-09-13 José Antonio Hernández López, Martin Weyssow, Jesús Sánchez Cuadrado, Houari Sahraoui

TreeBERT: A Tree-Based Pre-Trained Model for Programming Language

Source code can be parsed into the abstract syntax tree (AST) based on defined syntax rules. However, in pre-training, little work has considered the incorporation of tree structure into the learning process. In this paper, we present…

Machine Learning · Computer Science 2021-07-16 Xue Jiang, Zhuoran Zheng, Chen Lyu, Liang Li, Lei Lyu

Language-Agnostic Representation Learning of Source Code from Structure and Context

Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure…

Machine Learning · Computer Science 2021-03-23 Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, Stephan Günnemann

Code Execution with Pre-trained Language Models

Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and…

Programming Languages · Computer Science 2023-05-10 Chenxiao Liu, Shuai Lu, Weizhu Chen, Daxin Jiang, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan, Nan Duan

CCT5: A Code-Change-Oriented Pre-Trained Model

Software is constantly changing, requiring developers to perform several derived tasks in a timely manner, such as writing a description for the intention of the code change, or identifying the defect-prone code changes. Considering that…

Software Engineering · Computer Science 2023-05-19 Bo Lin, Shangwen Wang, Zhongxin Liu, Yepang Liu, Xin Xia, Xiaoguang Mao

Endowing Protein Language Models with Structural Knowledge

Understanding the relationships between protein sequence, structure and function is a long-standing biological challenge with manifold implications from drug design to our understanding of evolution. Recently, protein language models have…

Quantitative Methods · Quantitative Biology 2024-01-29 Dexiong Chen, Philip Hartout, Paolo Pellizzoni, Carlos Oliver, Karsten Borgwardt

SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations

Recent years have seen the successful application of large pre-trained models to code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding their application…

Software Engineering · Computer Science 2022-05-26 Changan Niu, Chuanyi Li, Vincent Ng, Jidong Ge, Liguo Huang, Bin Luo

Structure-aware Fine-tuning for Code Pre-trained Models

Over the past few years, we have witnessed remarkable advancements in Code Pre-trained Models (CodePTMs). These models achieved excellent representation capabilities by designing structure-based pre-training tasks for code. However, how to…

Software Engineering · Computer Science 2024-04-12 Jiayi Wu, Renyu Zhu, Nuo Chen, Qiushi Sun, Xiang Li, Ming Gao