Related papers: Duncode Characters Shorter

Making compression algorithms for Unicode text

The majority of online content is written in languages other than English, and is most commonly encoded in UTF-8, the world's dominant Unicode character encoding. Traditional compression algorithms typically operate on individual bytes.…

Information Theory · Computer Science 2017-01-17 Adam Gleave, Christian Steinruecken

Back to Bytes: Revisiting Tokenization Through UTF-8

We present UTF8Tokenizer, a minimalist byte-level tokenizer that maps text exactly to IDs corresponding to the bytes underlying the text's UTF-8 encoding (e.g., byte x09 is token ID 9). Unlike prior byte-level approaches (Xue et al., 2021;…

Computation and Language · Computer Science 2025-10-21 Amit Moryossef, Clara Meister, Pavel Stepachev, Desmond Elliott

Unicode at Gigabytes per Second

We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating…

Programming Languages · Computer Science 2023-05-23 Daniel Lemire

MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation

Byte-based machine translation systems have shown significant potential in massively multilingual settings. Unicode encoding, which maps each character to specific byte(s), eliminates the emergence of unknown words, even in new languages.…

Computation and Language · Computer Science 2025-02-10 Langlin Huang, Mengyu Bu, Yang Feng

Transcoding Billions of Unicode Characters per Second with SIMD Instructions

In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-16 Daniel Lemire, Wojciech Muła

Transcoding Unicode Characters with AVX-512 Instructions

Intel includes in its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these…

Data Structures and Algorithms · Computer Science 2024-08-06 Robert Clausecker, Daniel Lemire

Encryption by using base-n systems with many characters

It is possible to interpret text as numbers (and vice versa) if one interprets letters and other characters as digits and assumes that they have an inherent immutable ordering. This is demonstrated by the conventional digit set of the…

Cryptography and Security · Computer Science 2023-06-06 Armin Hoenen

Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation

In language processing, transformers benefit greatly from text being condensed. This is achieved through a larger vocabulary that captures word fragments instead of plain characters. This is often done with Byte Pair Encoding. In the…

Computer Vision and Pattern Recognition · Computer Science 2024-11-18 Tim Elsner, Paula Usinger, Julius Nehring-Wirxel, Gregor Kobsik, Victor Czech, Yanjiang He, Isaak Lim, Leif Kobbelt

Neural Machine Translation with Characters and Hierarchical Encoding

Most existing Neural Machine Translation models use groups of characters or whole words as their unit of input and output. We propose a model with a hierarchical char2word encoder, that takes individual characters both as input and output.…

Computation and Language · Computer Science 2016-10-21 Alexander Rosenberg Johansen, Jonas Meinertz Hansen, Elias Khazen Obeid, Casper Kaae Sønderby, Ole Winther

Treatment of Unicode canoncal decomposition among operating systems

This article shows how the text characters that have multiple representations under the Unicode standard are treated by popular operating systems. Whilst most characters have a unique representation in Unicode, some characters such as the…

Other Computer Science · Computer Science 2017-11-30 Efstratios Rappos

Local Byte Fusion for Neural Machine Translation

Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in…

Computation and Language · Computer Science 2023-06-29 Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, Junjie Hu

Sub-Character Tokenization for Chinese Pretrained Language Models

Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system…

Computation and Language · Computer Science 2023-02-16 Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, Maosong Sun

A Character-Level Approach to the Text Normalization Problem Based on a New Causal Encoder

Text normalization is a ubiquitous process that appears as the first step of many Natural Language Processing problems. However, previous Deep Learning approaches have suffered from so-called silly errors, which are undetectable on…

Computation and Language · Computer Science 2019-03-08 Adrián Javaloy Bornás, Ginés García Mateos

BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition

Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its…

Computation and Language · Computer Science 2026-02-03 Hyunsik Kim, Haeri Kim, Munhak Lee, Kyungmin Lee

Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?

This article offers an empirical study on the different ways of encoding Chinese, Japanese, Korean (CJK) and English languages for text classification. Different encoding levels are studied, including UTF-8 bytes, characters, words,…

Computation and Language · Computer Science 2017-08-18 Xiang Zhang, Yann LeCun

Discrete Cosine Transform as Universal Sentence Encoder

Modern sentence encoders are used to generate dense vector representations that capture the underlying linguistic characteristics for a sequence of words, including phrases, sentences, or paragraphs. These kinds of representations are ideal…

Computation and Language · Computer Science 2021-06-03 Nada Almarwani, Mona Diab

FontCode: Embedding Information in Text Documents using Glyph Perturbation

We introduce FontCode, an information embedding technique for text documents. Provided a text document with specific fonts, our method embeds user-specified information in the text by perturbing the glyphs of text characters while…

Computer Vision and Pattern Recognition · Computer Science 2019-06-12 Chang Xiao, Cheng Zhang, Changxi Zheng

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias…

Computation and Language · Computer Science 2024-11-12 Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer

SAFE: Scale Aware Feature Encoder for Scene Text Recognition

In this paper, we address the problem of having characters with different scales in scene text recognition. We propose a novel scale aware feature encoder (SAFE) that is designed specifically for encoding characters with different scales.…

Computer Vision and Pattern Recognition · Computer Science 2019-01-18 Wei Liu, Chaofeng Chen, Kwan-Yee K. Wong

Adaptive Decoding of LDPC Codes with Binary Messages

A novel adaptive binary decoding algorithm for LDPC codes is proposed, which reduces the decoding complexity while having a comparable or even better performance than corresponding non-adaptive alternatives. In each iteration the variable…

Information Theory · Computer Science 2009-04-24 Ingmar Land, Gottfried Lechner, Lars K. Rasmussen