English
Related papers

Related papers: How Does Code Pretraining Affect Language Model Ta…

200 papers

Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLMs pre-training. While there has been anecdotal consensus among practitioners that code data plays a…

Computation and Language · Computer Science 2024-08-21 Viraat Aryabumi , Yixuan Su , Raymond Ma , Adrien Morisot , Ivan Zhang , Acyr Locatelli , Marzieh Fadaee , Ahmet Üstün , Sara Hooker

Recent work has provided indirect evidence that pretraining language models on code improves the ability of models to track state changes of discourse entities expressed in natural language. In this work, we systematically test this claim…

Computation and Language · Computer Science 2024-06-03 Najoung Kim , Sebastian Schuster , Shubham Toshniwal

Advances in natural language processing, such as transfer learning from pre-trained language models, have impacted how models are trained for programming language tasks too. Previous research primarily explored code pre-training and…

Computation and Language · Computer Science 2023-02-08 Pinzhen Chen , Gerasimos Lampouras

Large pre-trained language models have recently been expanded and applied to programming language tasks with great success, often through further pre-training of a strictly-natural language model--where training sequences typically contain…

Computation and Language · Computer Science 2024-02-13 Fenia Christopoulou , Guchun Zhang , Gerasimos Lampouras

Text classification is a significant branch of natural language processing, and has many applications including document classification and sentiment analysis. Unsurprisingly, those who do text classification are concerned with the run-time…

Computation and Language · Computer Science 2021-04-09 Wilson Fearn , Orion Weller , Kevin Seppi

The programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data…

Computation and Language · Computer Science 2024-02-21 Demin Song , Honglin Guo , Yunhua Zhou , Shuhao Xing , Yudong Wang , Zifan Song , Wenwei Zhang , Qipeng Guo , Hang Yan , Xipeng Qiu , Dahua Lin

Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training…

Computation and Language · Computer Science 2025-04-23 Zhijun Wang , Jiahuan Li , Hao Zhou , Rongxiang Weng , Jingang Wang , Xin Huang , Xue Han , Junlan Feng , Chao Deng , Shujian Huang

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high…

Computation and Language · Computer Science 2025-02-07 Skyler Seto , Maartje ter Hoeve , Richard He Bai , Natalie Schluter , David Grangier

While large language models have achieved remarkable performance on various code generation benchmarks, there have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and…

Software Engineering · Computer Science 2024-03-11 Martin Riddell , Ansong Ni , Arman Cohan

Pretraining data of large language models composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or…

Computation and Language · Computer Science 2025-03-21 Jiasheng Ye , Peiju Liu , Tianxiang Sun , Jun Zhan , Yunhua Zhou , Xipeng Qiu

Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, which represent the input as a token sequence without explicitly modeling its structure. Some prior works…

Computation and Language · Computer Science 2022-10-27 Da Shen , Xinyun Chen , Chenguang Wang , Koushik Sen , Dawn Song

The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the…

Computation and Language · Computer Science 2025-10-31 Negar Foroutan , Paul Teiletche , Ayush Kumar Tarun , Antoine Bosselut

Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's…

Computation and Language · Computer Science 2023-11-16 Gregory Yauney , Emily Reif , David Mimno

Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural…

Computation and Language · Computer Science 2023-04-27 Li Zhang , Liam Dugan , Hainiu Xu , Chris Callison-Burch

Code data has been shown to enhance the reasoning capabilities of large language models (LLMs), but it remains unclear which aspects of code are most responsible. We investigate this question with a systematic, data-centric framework. We…

Computation and Language · Computer Science 2025-10-03 Abdul Waheed , Zhen Wu , Carolyn Rosé , Daphne Ippolito

Transformer-based pre-trained models have recently achieved great results in solving many software engineering tasks including automatic code completion which is a staple in a developer's toolkit. While many have striven to improve the…

Computation and Language · Computer Science 2023-04-25 Tim van Dam , Maliheh Izadi , Arie van Deursen

Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget…

Computation and Language · Computer Science 2026-04-21 Zhuo Chen , Yuxuan Miao , Supryadi , Deyi Xiong

Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents…

Computation and Language · Computer Science 2024-05-08 Frances A. Laureano De Leon , Harish Tayyar Madabushi , Mark Lee

The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning…

Code language models have emerged as useful tools for various programming tasks, yet they often struggle when it comes to complex ones. In this paper, we explore the potential of curriculum learning in enhancing the performance of these…

Machine Learning · Computer Science 2024-07-16 Marwa Naïr , Kamel Yamani , Lynda Said Lhadj , Riyadh Baghdadi
‹ Prev 1 2 3 10 Next ›