Related papers: Unified Abstract Syntax Tree Representation Learni…
Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of…
The lexical and syntactic disparities among different programming languages (e.g., Java and Python) pose significant challenges for multi-language software engineering tasks such as cross-language code clone detection and code retrieval,…
Code summarization aims to generate concise natural language descriptions of source code, which can help improve program comprehension and maintenance. Recent studies show that syntactic and structural information extracted from abstract…
An effective and efficient encoding of the source code of a computer program is critical to the success of sequence-to-sequence deep neural network models for tasks in computer program comprehension, such as automated code summarization and…
Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure…
Program source code contains complex structure information, which can be represented in structured data forms like trees or graphs. To acquire the structural information in source code, most existing researches use abstract syntax trees…
Automatic code summarization frees software developers from the heavy burden of manual commenting and benefits software development and maintenance. Abstract Syntax Tree (AST), which depicts the source code's syntactic structure, has been…
The objective of pre-trained language models is to learn contextual representations of textual data. Pre-trained language models have become mainstream in natural language processing and code modeling. Using probes, a technique to study the…
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models.…
Code clones are semantically similar code fragments pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches of detecting code…
Program representation learning is a fundamental task in software engineering applications. With the availability of "big code" and the development of deep learning techniques, various program representation learning models have been…
Performance analysis has always been an afterthought during the application development process, focusing on application correctness first. The learning curve of the existing static and dynamic analysis tools are steep, which requires…
This paper presents Tree Notation, a new simple, universal syntax. Language designers can invent new programming languages, called Tree Languages, on top of Tree Notation. Tree Languages have a number of advantages over traditional…
The application of deep learning techniques in software engineering becomes increasingly popular. One key problem is developing high-quality and easy-to-use source code representations for code-related tasks. The research community has…
During software maintenance, programmers spend a lot of time on code comprehension. Reading comments is an effective way for programmers to reduce the reading and navigating time when comprehending source code. Therefore, as a critical task…
Tasks like code generation and semantic parsing require mapping unstructured (or partially structured) inputs to well-formed, executable outputs. We introduce abstract syntax networks, a modeling framework for these problems. The outputs…
Code summarization aims to generate brief natural language descriptions for source code. As source code is highly structured and follows strict programming language grammars, its Abstract Syntax Tree (AST) is often leveraged to inform the…
Learning representation for source code is a foundation of many program analysis tasks. In recent years, neural networks have already shown success in this area, but most existing models did not make full use of the unique structural…
We introduce the MultiLang Code Parser Dataset (MLCPD), a large-scale, language-agnostic dataset unifying syntactic and structural representations of code across ten major programming languages. MLCPD contains over seven million parsed…
Program semantics learning is the core and fundamental for various code intelligent tasks e.g., vulnerability detection, clone detection. A considerable amount of existing works propose diverse approaches to learn the program semantics for…