Related papers: Text-to-Code Generation with Modality-relative Pre…

How Does Code Pretraining Affect Language Model Task Performance?

Large language models are increasingly trained on corpora containing both natural language and non-linguistic data like source code. Aside from aiding programming-related tasks, anecdotal evidence suggests that including code in pretraining…

Computation and Language · Computer Science 2025-02-26 Jackson Petty, Sjoerd van Steenkiste, Tal Linzen

Towards Understanding What Code Language Models Learned

Pre-trained language models are effective in a variety of natural language tasks, but it has been argued their capabilities fall short of fully learning meaning or understanding language. To understand the extent to which language models…

Software Engineering · Computer Science 2024-02-29 Toufique Ahmed, Dian Yu, Chengxuan Huang, Cathy Wang, Prem Devanbu, Kenji Sagae

Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding

With the great success of pre-trained models, the pretrain-then-finetune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to the costly training of a large-scale model from scratch, how to…

Software Engineering · Computer Science 2022-03-16 Deze Wang, Zhouyang Jia, Shanshan Li, Yue Yu, Yun Xiong, Wei Dong, Xiangke Liao

Sequence-to-Sequence Models for Data-to-Text Natural Language Generation: Word- vs. Character-based Processing and Output Diversity

We present a comparison of word-based and character-based sequence-to-sequence models for data-to-text natural language generation, which generate natural language descriptions for structured inputs. On the datasets of two recent generation…

Computation and Language · Computer Science 2018-10-12 Glorianna Jagfeld, Sabrina Jenne, Ngoc Thang Vu

Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences

Training code-switched language models is difficult due to lack of data and complexity in the grammatical structure. Linguistic constraint theories have been used for decades to generate artificial code-switching sentences to cope with this…

Computation and Language · Computer Science 2019-09-19 Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, Pascale Fung

Benchmarking Language Models for Code Syntax Understanding

Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, which represent the input as a token sequence without explicitly modeling its structure. Some prior works…

Computation and Language · Computer Science 2022-10-27 Da Shen, Xinyun Chen, Chenguang Wang, Koushik Sen, Dawn Song

An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities

Code generation aims to automatically generate code snippets in a specific programming language according to natural language descriptions. The continuous advancements in deep learning, particularly pre-trained models, have empowered the code…

Software Engineering · Computer Science 2025-01-24 Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia

Improving Tree-Structured Decoder Training for Code Generation via Mutual Learning

Code generation aims to automatically generate a piece of code given an input natural language utterance. Currently, among dominant models, it is treated as a sequence-to-tree task, where a decoder outputs a sequence of actions…

Artificial Intelligence · Computer Science 2021-06-01 Binbin Xie, Jinsong Su, Yubin Ge, Xiang Li, Jianwei Cui, Junfeng Yao, Bin Wang

How to get better embeddings with code pre-trained models? An empirical study

Pre-trained language models have demonstrated powerful capabilities in the field of natural language processing (NLP). Recently, code pre-trained models (PTMs), which draw on the experience of the NLP field, have also achieved…

Software Engineering · Computer Science 2023-11-15 Yu Zhao, Lina Gong, Haoxiang Zhang, Yaoshen Yu, Zhiqiu Huang

Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models

Current language models tailored for code tasks often adopt the pre-training-then-fine-tuning paradigm from natural language processing, modeling source code as plain text. This approach, however, overlooks the unambiguous structures…

Computation and Language · Computer Science 2024-01-22 Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, Jie Chen

Toward Code Generation: A Survey and Lessons from Semantic Parsing

With the growth of natural language processing techniques and demand for improved software engineering efficiency, there is an emerging interest in translating intention from human languages to programming languages. In this survey paper,…

Software Engineering · Computer Science 2021-05-20 Celine Lee, Justin Gottschlich, Dan Roth

What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code

Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These…

Software Engineering · Computer Science 2022-02-15 Yao Wan, Wei Zhao, Hongyu Zhang, Yulei Sui, Guandong Xu, Hai Jin

Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their…

Computer Vision and Pattern Recognition · Computer Science 2024-09-26 Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev

Automatic Code Generation using Pre-Trained Language Models

Recent advancements in natural language processing (GPT-2, BERT) have led to near-human performance in multiple natural language tasks. In this paper, we seek to understand whether similar techniques can be applied to a highly…

Computation and Language · Computer Science 2021-02-23 Luis Perez, Lizi Ottens, Sudharshan Viswanathan

End-to-End Content and Plan Selection for Data-to-Text Generation

Learning to generate fluent natural language from structured data with neural networks has become a common approach for NLG. This problem can be challenging when the form of the structured data varies between examples. This paper presents…

Computation and Language · Computer Science 2018-10-12 Sebastian Gehrmann, Falcon Z. Dai, Henry Elder, Alexander M. Rush

Multi-task Learning based Pre-trained Language Model for Code Completion

Code completion is one of the most useful features in Integrated Development Environments (IDEs), accelerating software development by suggesting the next probable token based on the contextual code in real time. Recent studies…

Software Engineering · Computer Science 2021-01-01 Fang Liu, Ge Li, Yunfei Zhao, Zhi Jin

Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence.…

Computation and Language · Computer Science 2026-04-17 Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras

Exploring and Evaluating Personalized Models for Code Generation

Large Transformer models have achieved state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large…

Software Engineering · Computer Science 2022-09-21 Andrei Zlotchevski, Dawn Drain, Alexey Svyatkovskiy, Colin Clement, Neel Sundaresan, Michele Tufano

Towards Multi-Task Multi-Modal Models: A Video Generative Perspective

Advancements in language foundation models have primarily fueled the recent surge in artificial intelligence. In contrast, generative learning of non-textual modalities, especially videos, significantly trails behind language modeling. This…

Computer Vision and Pattern Recognition · Computer Science 2024-05-28 Lijun Yu

Token-wise Curriculum Learning for Neural Machine Translation

Existing curriculum learning approaches to Neural Machine Translation (NMT) require sampling sufficient amounts of "easy" samples from training data at the early training stage. This is not always achievable for low-resource languages where…

Computation and Language · Computer Science 2021-03-23 Chen Liang, Haoming Jiang, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Tuo Zhao