Related papers: How Does Code Pretraining Affect Language Model Ta…

To Code, or Not To Code? Exploring Impact of Code in Pre-training

Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLMs pre-training. While there has been anecdotal consensus among practitioners that code data plays a…

Computation and Language · Computer Science 2024-08-21 Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

Code Pretraining Improves Entity Tracking Abilities of Language Models

Recent work has provided indirect evidence that pretraining language models on code improves the ability of models to track state changes of discourse entities expressed in natural language. In this work, we systematically test this claim…

Computation and Language · Computer Science 2024-06-03 Najoung Kim, Sebastian Schuster, Shubham Toshniwal

Exploring Data Augmentation for Code Generation Tasks

Advances in natural language processing, such as transfer learning from pre-trained language models, have impacted how models are trained for programming language tasks too. Previous research primarily explored code pre-training and…

Computation and Language · Computer Science 2023-02-08 Pinzhen Chen, Gerasimos Lampouras

Text-to-Code Generation with Modality-relative Pre-training

Large pre-trained language models have recently been expanded and applied to programming language tasks with great success, often through further pre-training of a strictly-natural language model--where training sequences typically contain…

Computation and Language · Computer Science 2024-02-13 Fenia Christopoulou, Guchun Zhang, Gerasimos Lampouras

Exploring the Relationship Between Algorithm Performance, Vocabulary, and Run-Time in Text Classification

Text classification is a significant branch of natural language processing, and has many applications including document classification and sentiment analysis. Unsurprisingly, those who do text classification are concerned with the run-time…

Computation and Language · Computer Science 2021-04-09 Wilson Fearn, Orion Weller, Kevin Seppi

Code Needs Comments: Enhancing Code LLMs with Comment Augmentation

The programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data…

Computation and Language · Computer Science 2024-02-21 Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin

Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training

Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training…

Computation and Language · Computer Science 2025-04-23 Zhijun Wang, Jiahuan Li, Hao Zhou, Rongxiang Weng, Jingang Wang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Shujian Huang

Training Bilingual LMs with Data Constraints in the Targeted Language

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high…

Computation and Language · Computer Science 2025-02-07 Skyler Seto, Maartje ter Hoeve, Richard He Bai, Natalie Schluter, David Grangier

Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

While large language models have achieved remarkable performance on various code generation benchmarks, there have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and…

Software Engineering · Computer Science 2024-03-11 Martin Riddell, Ansong Ni, Arman Cohan

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Pretraining data of large language models composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or…

Computation and Language · Computer Science 2025-03-21 Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, Xipeng Qiu

Benchmarking Language Models for Code Syntax Understanding

Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, which represent the input as a token sequence without explicitly modeling its structure. Some prior works…

Computation and Language · Computer Science 2022-10-27 Da Shen, Xinyun Chen, Chenguang Wang, Koushik Sen, Dawn Song

Revisiting Multilingual Data Mixtures in Language Model Pretraining

The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the…

Computation and Language · Computer Science 2025-10-31 Negar Foroutan, Paul Teiletche, Ayush Kumar Tarun, Antoine Bosselut

Data Similarity is Not Enough to Explain Language Model Performance

Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's…

Computation and Language · Computer Science 2023-11-16 Gregory Yauney, Emily Reif, David Mimno

Exploring the Curious Case of Code Prompts

Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural…

Computation and Language · Computer Science 2023-04-27 Li Zhang, Liam Dugan, Hainiu Xu, Chris Callison-Burch

On Code-Induced Reasoning in LLMs

Code data has been shown to enhance the reasoning capabilities of large language models (LLMs), but it remains unclear which aspects of code are most responsible. We investigate this question with a systematic, data-centric framework. We…

Computation and Language · Computer Science 2025-10-03 Abdul Waheed, Zhen Wu, Carolyn Rosé, Daphne Ippolito

Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study

Transformer-based pre-trained models have recently achieved great results in solving many software engineering tasks including automatic code completion which is a staple in a developer's toolkit. While many have striven to improve the…

Computation and Language · Computer Science 2023-04-25 Tim van Dam, Maliheh Izadi, Arie van Deursen

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget…

Computation and Language · Computer Science 2026-04-21 Zhuo Chen, Yuxuan Miao, Supryadi, Deyi Xiong

Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text

Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents…

Computation and Language · Computer Science 2024-05-08 Frances A. Laureano De Leon, Harish Tayyar Madabushi, Mark Lee

When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning…

Computation and Language · Computer Science 2025-07-29 Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji Suzuki

Curriculum Learning for Small Code Language Models

Code language models have emerged as useful tools for various programming tasks, yet they often struggle when it comes to complex ones. In this paper, we explore the potential of curriculum learning in enhancing the performance of these…

Machine Learning · Computer Science 2024-07-16 Marwa Naïr, Kamel Yamani, Lynda Said Lhadj, Riyadh Baghdadi