English
Related papers

Related papers: Kronecker Embeddings: Byte-Level Structured Token …

200 papers

Knowledge graph embedding research has mainly focused on learning continuous representations of entities and relations tailored towards the link prediction problem. Recent results indicate an ever increasing predictive ability of current…

Machine Learning · Computer Science 2022-05-16 Caglar Demir , Julian Lienen , Axel-Cyrille Ngonga Ngomo

Trainable input embedding tables are a standard component of modern language models. We ask whether they are actually necessary at the input interface. For a vocabulary of size $V$, exact token identity requires only $K=\lceil \log_2…

Computation and Language · Computer Science 2026-05-12 A. Bochkov

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani , Saurabh Ghanekar , Xiaoyuan Zhu , Jay Pujara

Modern language models mostly take sub-words as input, a design that balances the trade-off between vocabulary size, number of parameters, and performance. However, sub-word tokenization still has disadvantages like not being robust to…

Computation and Language · Computer Science 2022-11-24 Chu-Tak Lee , Qipeng Guo , Xipeng Qiu

Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding…

Computation and Language · Computer Science 2025-05-05 Bharath Raj , Garvit Suri , Vikrant Dewangan , Raghav Sonavane

Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performances. However, with a large vocabulary and many dimensions, these floating-point representations are expensive both in terms of…

Computation and Language · Computer Science 2020-01-23 Julien Tissier , Christophe Gravier , Amaury Habrard

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing…

Computation and Language · Computer Science 2025-04-15 Buu Phan , Brandon Amos , Itai Gat , Marton Havasi , Matthew Muckley , Karen Ullrich

The development of over-parameterized pre-trained language models has made a significant contribution toward the success of natural language processing. While over-parameterization of these models is the key to their generalization power,…

Computation and Language · Computer Science 2021-09-15 Marzieh S. Tahaei , Ella Charlaix , Vahid Partovi Nia , Ali Ghodsi , Mehdi Rezagholizadeh

We propose a new technique for computational language representation called elementwise embedding, in which a material (semantic unit) is abstracted into a horizontal concatenation of lower-dimensional element (character) embeddings. While…

Computation and Language · Computer Science 2023-02-28 Dunam Kim , Jeeeun Kim

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and…

Computation and Language · Computer Science 2025-06-18 Mathurin Videau , Badr Youbi Idrissi , Alessandro Leite , Marc Schoenauer , Olivier Teytaud , David Lopez-Paz

Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which…

Computation and Language · Computer Science 2025-05-06 Henry Ndubuaku , Mouad Talhi

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt , Varshini Reddy , Haoran Zhang , Alec Alameddine , Omri Uzan , Yuval Pinter , Chris Tanner

Natural Language Processing (NLP) systems commonly leverage bag-of-words co-occurrence techniques to capture semantic and syntactic word relationships. The resulting word-level distributed representations often ignore morphological…

Computation and Language · Computer Science 2015-06-12 Andrew Trask , David Gilmore , Matthew Russell

Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm…

Computation and Language · Computer Science 2025-11-10 Maharaj Brahma , N J Karthika , Atul Singh , Devaraj Adiga , Smruti Bhate , Ganesh Ramakrishnan , Rohit Saluja , Maunendra Sankar Desarkar

We introduce a simple modification to the embedding layer. The key change is to infuse token embeddings with information about their spelling. Models trained with these embeddings improve not only on spelling, but also across standard…

Machine Learning · Computer Science 2026-01-27 Markus N. Rabe , Judith Clymo , Zheren Dong

Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters from noisy text or character-rich languages such as Japanese and Chinese however can…

Computation and Language · Computer Science 2019-12-09 Changhan Wang , Kyunghyun Cho , Jiatao Gu

Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes…

Computation and Language · Computer Science 2021-04-13 Uri Shaham , Omer Levy

Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme…

Computation and Language · Computer Science 2025-02-04 Ehsaneddin Asgari , Yassine El Kheir , Mohammad Ali Sadraei Javaheri

Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such…

Computation and Language · Computer Science 2025-06-23 Yifan Hu , Frank Liang , Dachuan Zhao , Jonathan Geuter , Varshini Reddy , Craig W. Schmidt , Chris Tanner

Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in…

Computation and Language · Computer Science 2026-03-05 Chunyuan Deng , Sanket Lokegaonkar , Colin Lockard , Besnik Fetahu , Nasser Zalmout , Xian Li
‹ Prev 1 2 3 10 Next ›