Computation and Language · Computer Science
A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models
Jimin Sun, Patrick Fernandes, Xinyi Wang, Graham Neubig
2022-10-14
Computation and Language · Computer Science
Beyond Text Compression: Evaluating Tokenizers Across Scales
Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan +1
2025-06-04
Computation and Language · Computer Science
Understanding and Mitigating Tokenization Bias in Language Models
Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich
2024-07-09
Computation and Language · Computer Science
A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning
Md Mofijul Islam, Gustavo Aguilar, Pragaash Ponnusamy, Clint Solomon Mathialagan +2
2022-04-25
Computation and Language · Computer Science
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng +3
2025-05-26
Computation and Language · Computer Science
Beyond Literal Token Overlap: Token Alignability for Multilinguality
Katharina Hämmerl, Tomasz Limisiewicz, Jindřich Libovický, Alexander Fraser
2025-02-11
Computation and Language · Computer Science
Unsupervised Cross-Lingual Transfer of Structured Predictors without Source Data
Kemal Kurniawan, Lea Frermann, Philip Schulz, Trevor Cohn
2021-10-11
Computation and Language · Computer Science
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase +2
2026-02-04
Computation and Language · Computer Science
T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting +1
2025-01-08
Computation and Language · Computer Science
Explaining and Mitigating Crosslingual Tokenizer Inequities
Catherine Arnett, Tyler A. Chang, Stella Biderman, Benjamin K. Bergen
2025-10-28
Computation and Language · Computer Science
To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer
Md Mushfiqur Rahman, Fardin Ahsan Sakib, Fahim Faisal, Antonios Anastasopoulos
2023-10-13
Computation and Language · Computer Science
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models
Julie Kallini, Dan Jurafsky, Christopher Potts, Martijn Bartelds
2025-09-26
Computation and Language · Computer Science
On Systematic Style Differences between Unsupervised and Supervised MT and an Application for High-Resource Machine Translation
Kelly Marchisio, Markus Freitag, David Grangier
2022-04-15
Computation and Language · Computer Science
Overcoming Vocabulary Constraints with Pixel-level Fallback
Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva
2025-08-12
Computation and Language · Computer Science
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara
2023-10-19
Computation and Language · Computer Science
Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages
Verena Blaschke, Hinrich Schütze, Barbara Plank
2023-04-21
Computation and Language · Computer Science
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles
Buu Phan, Brandon Amos, Itai Gat, Marton Havasi +2
2025-04-15