Related papers: CodeBPE: Investigating Subtokenization Options for…

How Does Code Pretraining Affect Language Model Task Performance?

Large language models are increasingly trained on corpora containing both natural language and non-linguistic data like source code. Aside from aiding programming-related tasks, anecdotal evidence suggests that including code in pretraining…

Computation and Language · Computer Science · 2025-02-26 · Jackson Petty, Sjoerd van Steenkiste, Tal Linzen

From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution

Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides extra defense against jailbreak…

Computation and Language · Computer Science · 2026-04-16 · Pavel Chizhov, Egor Bogomolov, Ivan P. Yamshchikov

Data-efficient LLM Fine-tuning for Code Generation

Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically…

Computation and Language · Computer Science · 2025-04-18 · Weijie Lv, Xuan Xia, Sheng-Jun Huang

AdaptBPE: From General Purpose to Specialized Tokenizers

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly…

Computation and Language · Computer Science · 2026-01-30 · Vijini Liyanage, François Yvon

Training Language Models with homotokens Leads to Delayed Overfitting

Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness,…

Computation and Language · Computer Science · 2026-01-14 · Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we decouple the effects of subword…

Computation and Language · Computer Science · 2026-05-15 · Théo Gigant, Bowen Peng, Jeffrey Quesnelle

Can Perplexity Predict Fine-tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali

The impact of subword tokenization on language model performance is well-documented for perplexity, with finer granularity consistently reducing this intrinsic metric. However, research on how different tokenization schemes affect a model's…

Computation and Language · Computer Science · 2025-08-12 · Nishant Luitel, Nirajan Bekoju, Anand Kumar Sah, Subarna Shakya

Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models

Pre-trained Large Language Models (LLM) have achieved remarkable successes in several domains. However, code-oriented LLMs are heavy in computational complexity, and quadratically with the length of the input. Toward simplifying the input…

Software Engineering · Computer Science · 2024-05-21 · Yan Wang, Xiaoning Li, Tien Nguyen, Shaohua Wang, Chao Ni, Ling Ding

What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code

Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These…

Software Engineering · Computer Science · 2022-02-15 · Yao Wan, Wei Zhao, Hongyu Zhang, Yulei Sui, Guandong Xu, Hai Jin

The Hidden Cost of Readability: How Code Formatting Silently Consumes Your LLM Budget

Source code is usually formatted with elements like indentation and newlines to improve readability for human developers. However, these visual aids do not seem to be beneficial for large language models (LLMs) in the same way since the…

Software Engineering · Computer Science · 2025-08-21 · Dangfeng Pan, Zhensu Sun, Cenyuan Zhang, David Lo, Xiaoning Du

Reducing Sequence Length by Predicting Edit Operations with Large Language Models

Large Language Models (LLMs) have demonstrated remarkable performance in various tasks and gained significant attention. LLMs are also used for local sequence transduction tasks, including grammatical error correction (GEC) and formality…

Computation and Language · Computer Science · 2023-10-24 · Masahiro Kaneko, Naoaki Okazaki

Getting the most out of your tokenizer for pre-training and domain adaptation

Tokenization is an understudied and often neglected component of modern LLMs. Most published works use a single tokenizer for all experiments, often borrowed from another model, without performing ablations or analysis to optimize…

Computation and Language · Computer Science · 2024-02-08 · Gautier Dagan, Gabriel Synnaeve, Baptiste Rozière

On the Effect of Token Merging on Pre-trained Models for Code

Tokenization is a fundamental component of language models for code. It involves breaking down the input into units that are later passed to the language model stack to learn high-dimensional representations used in various contexts, from…

Software Engineering · Computer Science · 2025-07-22 · Mootez Saad, Hao Li, Tushar Sharma, Ahmed E. Hassan

Better Language Models of Code through Self-Improvement

Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is…

Computation and Language · Computer Science · 2023-05-11 · Hung Quoc To, Nghi D. Q. Bui, Jin Guo, Tien N. Nguyen

Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond

Recently, fine-tuning pre-trained code models such as CodeBERT on downstream tasks has achieved great success in many software testing and analysis tasks. While effective and prevalent, fine-tuning the pre-trained parameters incurs a large…

Software Engineering · Computer Science · 2023-04-12 · Ensheng Shi, Yanlin Wang, Hongyu Zhang, Lun Du, Shi Han, Dongmei Zhang, Hongbin Sun

LEANCODE: Understanding Models Better for Code Simplification of Pre-trained Large Language Models

Large Language Models for code often entail significant computational complexity, which grows significantly with the length of the input code sequence. We propose LeanCode for code simplification to reduce training and prediction time,…

Software Engineering · Computer Science · 2026-02-06 · Yan Wang, Ling Ding, Tien N Nguyen, Shaohua Wang, Yanan Zheng

Enhancing Code Generation for Low-Resource Languages: No Silver Bullet

The advent of Large Language Models (LLMs) has significantly advanced the field of automated code generation. LLMs rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource…

Software Engineering · Computer Science · 2025-02-03 · Alessandro Giagnorio, Alberto Martin-Lopez, Gabriele Bavota

Token Sugar: Making Source Code Sweeter for LLMs through Token-Efficient Shorthand

Large language models (LLMs) have shown exceptional performance in code generation and understanding tasks, yet their high computational costs hinder broader adoption. One important factor is the inherent verbosity of programming languages,…

Software Engineering · Computer Science · 2025-12-10 · Zhensu Sun, Chengran Yang, Xiaoning Du, Zhou Yang, Li Li, David Lo

Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques…

Software Engineering · Computer Science · 2019-03-15 · Rafael-Michael Karampatsis, Charles Sutton

Compressed code: the hidden effects of quantization and distillation on programming tokens

Large Language Models (LLMs) have demonstrated exceptional code generation capabilities, yet their token-level mechanisms remain underexplored, particularly in compressed models. Through systematic analysis of programming language token…

Software Engineering · Computer Science · 2026-02-10 · Viacheslav Siniaev, Iaroslav Chelombitko, Aleksey Komissarov