Related papers: Unsupervised Morphological Tree Tokenizer

200 papers

We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings…

Computation and Language · Computer Science 2024-10-04 Jindřich Libovický, Jindřich Helcl

As opposed to general English, many concepts in biomedical terminology have been designed in recent history by biomedical professionals with the goal of being precise and concise. This is often achieved by concatenating meaningful…

Computation and Language · Computer Science 2023-07-11 Bernal Jiménez Gutiérrez, Huan Sun, Yu Su

Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be…

Computation and Language · Computer Science 2023-03-28 Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, Oguzhan Ozcelik

The best performing transformer-based language models use subword tokenization techniques, such as Byte-Pair-Encoding (BPE). However, these approaches often overlook linguistic principles, such as morphological segmentation, which we…

Computation and Language · Computer Science 2025-04-03 Mikkel Wildner Kildeberg, Emil Allerslev Schledermann, Nicolaj Larsen, Rob van der Goot

Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, without much consideration to the linguistic features. I propose a…

Computation and Language · Computer Science 2024-02-06 Haris Jabbar

Canonical morphological segmentation is the process of analyzing words into the standard (aka underlying) forms of their constituent morphemes. This is a core task in language documentation, and NLP systems have the potential to…

Computation and Language · Computer Science 2024-10-16 Enora Rice, Ali Marashian, Luke Gessler, Alexis Palmer, Katharina von der Wense

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often…

Computation and Language · Computer Science 2015-08-19 Jan A. Botha

Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens. This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes.…

Computation and Language · Computer Science 2023-05-31 Li Sun, Florian Luisier, Kayhan Batmanghelich, Dinei Florencio, Cha Zhang

Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair…

Computation and Language · Computer Science 2024-06-11 Yanis Labrak, Adrien Bazoge, Beatrice Daille, Mickael Rouvier, Richard Dufour

This paper presents a joint model for performing unsupervised morphological analysis on words, and learning a character-level composition function from morphemes to word embeddings. Our model splits individual words into segments, and…

Computation and Language · Computer Science 2016-06-09 Kris Cao, Marek Rei

Large language model (LLM) tokenizers act as structured compressors: by mapping text to discrete token sequences, they determine token count (and thus compute and context usage) and the statistical structure seen by downstream models.…

Information Theory · Computer Science 2026-01-15 Mete Erdogan, Abhiram Gorle, Shubham Chandak, Mert Pilanci, Tsachy Weissman

Tokenization - the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary - is a critical step in the NLP pipeline. The use of token representations is widely credited with increased model…

Computation and Language · Computer Science 2025-04-04 Juan Luis Gastaldi, John Terilla, Luca Malagutti, Brian DuSell, Tim Vieira, Ryan Cotterell

Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness,…

Computation and Language · Computer Science 2026-01-14 Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu

Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme…

Computation and Language · Computer Science 2025-02-04 Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri

Token representations influence the efficiency and adaptability of language models, yet conventional tokenization strategies impose rigid segmentation boundaries that do not adjust dynamically to evolving contextual relationships. The…

Computation and Language · Computer Science 2025-08-11 Alistair Dombrowski, Beatrix Engelhardt, Dimitri Fairbrother, Henry Evidail

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of…

Computation and Language · Computer Science 2024-10-07 Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li

Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. While learned subword tokenizers have become the de-facto standard, they present challenges such as large…

Computation and Language · Computer Science 2025-01-22 Pit Neitemeier, Björn Deiseroth, Constantin Eichenberg, Lukas Balles

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and…

Computation and Language · Computer Science 2025-06-18 Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

Tokenization is an important preprocessing step in the training and inference of large language models (LLMs). While there has been extensive research on the expressive power of the neural architectures used in LLMs, the impact of…

Computation and Language · Computer Science 2024-12-05 Saibo Geng, Sankalp Gambhir, Chris Wendler, Robert West