Related papers: Language Models Without a Trainable Input Embeddin…

HashFormers: Towards Vocabulary-independent Pre-trained Transformers

Transformer-based pre-trained language models are vocabulary-dependent, mapping by default each token to its corresponding embedding. This one-to-one mapping results into embedding matrices that occupy a lot of memory (i.e. millions of…

Computation and Language · Computer Science 2022-11-01 Huiyin Xue , Nikolaos Aletras

Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models

Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic…

Computation and Language · Computer Science 2026-05-29 Rohan Shravan

Language Model Memory and Memory Models for Language

The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of `memory', is widely employed but not well characterized. We find that language model embeddings typically…

Computation and Language · Computer Science 2026-05-20 Benjamin L. Badger

Learning to Recall with Transformers Beyond Orthogonal Embeddings

Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training…

Machine Learning · Statistics 2026-03-18 Nuri Mert Vural , Alberto Bietti , Mahdi Soltanolkotabi , Denny Wu

On the Role of Pre-trained Embeddings in Binary Code Analysis

Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing…

Machine Learning · Computer Science 2025-02-14 Alwin Maier , Felix Weissberg , Konrad Rieck

FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to…

Computation and Language · Computer Science 2026-05-14 Abraham Toluwase Owodunni , Orevaoghene Ahia , Sachin Kumar

Leviathan: Decoupling Input and Output Representations in Language Models

Modern language models use a single matrix for input embedding and output projection. This couples two distinct objectives: token representation and discrimination over a vocabulary. This work introduces Leviathan, a Transformer…

Computation and Language · Computer Science 2026-05-08 Reza T. Batley , Sourav Saha

Spelling Bee Embeddings for Language Modeling

We introduce a simple modification to the embedding layer. The key change is to infuse token embeddings with information about their spelling. Models trained with these embeddings improve not only on spelling, but also across standard…

Machine Learning · Computer Science 2026-01-27 Markus N. Rabe , Judith Clymo , Zheren Dong

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons…

Computation and Language · Computer Science 2022-05-19 Jonathan H. Clark , Dan Garrette , Iulia Turc , John Wieting

Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

We study a constrained training regime for decoder-only Transformers in which the token interface is fixed, previously trained dense blocks are not reopened, and the active trainable parameter set is kept approximately constant as depth…

Machine Learning · Computer Science 2026-05-05 A. Bochkov

Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens

Standard pretrained language models operate on sequences of subword tokens without direct access to the characters that compose each token's string representation. We probe the embedding layer of pretrained language models and show that…

Computation and Language · Computer Science 2022-06-09 Itay Itzhak , Omer Levy

Parameter-Efficient Transformer Embeddings

Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which…

Computation and Language · Computer Science 2025-05-06 Henry Ndubuaku , Mouad Talhi

Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information

Commonly-used transformer language models depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations,…

Computation and Language · Computer Science 2021-08-03 Yuval Pinter , Amanda Stent , Mark Dredze , Jacob Eisenstein

Lexinvariant Language Models

Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM). However, lexical symbol meanings can also be determined and even redefined by their structural role in a long…

Computation and Language · Computer Science 2023-05-29 Qian Huang , Eric Zelikman , Sarah Li Chen , Yuhuai Wu , Gregory Valiant , Percy Liang

From Fully Trained to Fully Random Embeddings: Improving Neural Machine Translation with Compact Word Embedding Tables

Embedding matrices are key components in neural natural language processing (NLP) models that are responsible to provide numerical representations of input tokens.\footnote{In this paper words and subwords are referred to as \textit{tokens}…

Computation and Language · Computer Science 2022-04-19 Krtin Kumar , Peyman Passban , Mehdi Rezagholizadeh , Yiu Sing Lau , Qun Liu

Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the…

Computation and Language · Computer Science 2026-02-04 Brian Siyuan Zheng , Alisa Liu , Orevaoghene Ahia , Jonathan Hayase , Yejin Choi , Noah A. Smith

Bridging the Gap for Tokenizer-Free Language Models

Purely character-based language models (LMs) have been lagging in quality on large scale datasets, and current state-of-the-art LMs rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the…

Computation and Language · Computer Science 2019-08-28 Dokook Choe , Rami Al-Rfou , Mandy Guo , Heeyoung Lee , Noah Constant

Tree-Regularized Tabular Embeddings

Tabular neural network (NN) has attracted remarkable attentions and its recent advances have gradually narrowed the performance gap with respect to tree-based models on many public datasets. While the mainstreams focus on calibrating NN to…

Machine Learning · Computer Science 2024-03-05 Xuan Li , Yun Wang , Bo Li

Embedding Inversion via Conditional Masked Diffusion Language Models

We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation. A masked diffusion language model is conditioned on the target…

Computation and Language · Computer Science 2026-02-19 Han Xiao

Understanding and Mitigating Tokenization Bias in Language Models

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction.…

Computation and Language · Computer Science 2024-07-09 Buu Phan , Marton Havasi , Matthew Muckley , Karen Ullrich