Related papers: Kronecker Embeddings: Byte-Level Structured Token …

Kronecker Decomposition for Knowledge Graph Embeddings

Knowledge graph embedding research has mainly focused on learning continuous representations of entities and relations tailored towards the link prediction problem. Recent results indicate an ever increasing predictive ability of current…

Machine Learning · Computer Science 2022-05-16 Caglar Demir, Julian Lienen, Axel-Cyrille Ngonga Ngomo

Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes

Trainable input embedding tables are a standard component of modern language models. We ask whether they are actually necessary at the input interface. For a vocabulary of size $V$, exact token identity requires only $K=\lceil \log_2…

Computation and Language · Computer Science 2026-05-12 A. Bochkov

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Word-Level Representation From Bytes For Language Modeling

Modern language models mostly take sub-words as input, a design that balances the trade-off between vocabulary size, number of parameters, and performance. However, sub-word tokenization still has disadvantages like not being robust to…

Computation and Language · Computer Science 2022-11-24 Chu-Tak Lee, Qipeng Guo, Xipeng Qiu

When Every Token Counts: Optimal Segmentation for Low-Resource Language Models

Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding…

Computation and Language · Computer Science 2025-05-05 Bharath Raj, Garvit Suri, Vikrant Dewangan, Raghav Sonavane

Near-lossless Binarization of Word Embeddings

Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performances. However, with a large vocabulary and many dimensions, these floating-point representations are expensive both in terms of…

Computation and Language · Computer Science 2020-01-23 Julien Tissier, Christophe Gravier, Amaury Habrard

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing…

Computation and Language · Computer Science 2025-04-15 Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich

KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation

The development of over-parameterized pre-trained language models has made a significant contribution toward the success of natural language processing. While over-parameterization of these models is the key to their generalization power,…

Computation and Language · Computer Science 2021-09-15 Marzieh S. Tahaei, Ella Charlaix, Vahid Partovi Nia, Ali Ghodsi, Mehdi Rezagholizadeh

Elementwise Language Representation

We propose a new technique for computational language representation called elementwise embedding, in which a material (semantic unit) is abstracted into a horizontal concatenation of lower-dimensional element (character) embeddings. While…

Computation and Language · Computer Science 2023-02-28 Dunam Kim, Jeeeun Kim

From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and…

Computation and Language · Computer Science 2025-06-18 Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

Parameter-Efficient Transformer Embeddings

Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which…

Computation and Language · Computer Science 2025-05-06 Henry Ndubuaku, Mouad Talhi

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

Modeling Order in Neural Word Embeddings at Scale

Natural Language Processing (NLP) systems commonly leverage bag-of-words co-occurrence techniques to capture semantic and syntactic word relationships. The resulting word-level distributed representations often ignore morphological…

Computation and Language · Computer Science 2015-06-12 Andrew Trask, David Gilmore, Matthew Russell

MorphTok: Morphologically Grounded Tokenization for Indian Languages

Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm…

Computation and Language · Computer Science 2025-11-10 Maharaj Brahma, N J Karthika, Atul Singh, Devaraj Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar

Spelling Bee Embeddings for Language Modeling

We introduce a simple modification to the embedding layer. The key change is to infuse token embeddings with information about their spelling. Models trained with these embeddings improve not only on spelling, but also across standard…

Machine Learning · Computer Science 2026-01-27 Markus N. Rabe, Judith Clymo, Zheren Dong

Neural Machine Translation with Byte-Level Subwords

Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters from noisy text or character-rich languages such as Japanese and Chinese however can…

Computation and Language · Computer Science 2019-12-09 Changhan Wang, Kyunghyun Cho, Jiatao Gu

Neural Machine Translation without Embeddings

Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes…

Computation and Language · Computer Science 2021-04-13 Uri Shaham, Omer Levy

MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies

Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme…

Computation and Language · Computer Science 2025-02-04 Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri

Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such…

Computation and Language · Computer Science 2025-06-23 Yifan Hu, Frank Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W. Schmidt, Chris Tanner

ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in…

Computation and Language · Computer Science 2026-03-05 Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout, Xian Li