Related papers: Constructing a BPE Tokenization DFA

Tokenization as Finite-State Transduction

Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode…

Computation and Language · Computer Science 2024-10-22 Marco Cognetta, Naoaki Okazaki

A closer look at TDFA

We present an algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata. The algorithm works with different disambiguation policies. We give detailed pseudocode for the algorithm,…

Formal Languages and Automata Theory · Computer Science 2026-03-31 Angelo Borsotti, Ulya Trafimovich

On Complementation of Nondeterministic Finite Automata without Full Determinization (Technical Report)

Complementation of finite automata is a basic operation used in numerous applications. The standard way to complement a nondeterministic finite automaton (NFA) is to transform it into an equivalent deterministic finite automaton (DFA) and…

Formal Languages and Automata Theory · Computer Science 2025-07-16 Lukáš Holík, Ondřej Lengál, Juraj Major, Adéla Štěpková, Jan Strejček

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

Formalizing BPE Tokenization

In this paper, we formalize practical byte pair encoding tokenization as it is used in large language models and other NLP systems, in particular we formally define and investigate the semantics of the SentencePiece and HuggingFace…

Formal Languages and Automata Theory · Computer Science 2023-09-19 Martin Berglund, Brink van der Merwe

On the Complexity of Flanked Finite State Automata

We define a new subclass of nondeterministic finite automata for prefix-closed languages called Flanked Finite Automata (FFA). We show that this class enjoys good complexity properties while preserving the succinctness of nondeterministic…

Formal Languages and Automata Theory · Computer Science 2015-09-23 Florent Avellaneda, Silvano Dal Zilio, Jean-Baptiste Raclet

Sublinear Matching With Finite Automata Using Reverse Suffix Scanning

We give algorithms to accelerate the computation of deterministic finite automata (DFA) by calculating the state of a DFA n positions ahead utilizing a reverse scan of the next n characters. Often this requires scanning fewer than n…

Data Structures and Algorithms · Computer Science 2015-01-16 Steven M. Kearns

Finding the Minimal DFA of Very Large Finite State Automata with an Application to Token Passing Networks

Finite state automata (FSA) are ubiquitous in computer science. Two of the most important algorithms for FSA processing are the conversion of a non-deterministic finite automaton (NFA) to a deterministic finite automaton (DFA), and then the…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-03-30 Vlad Slavici, Daniel Kunkle, Gene Cooperman, Stephen Linton

Minimal Synthesis of String To String Functions From Examples

We study the problem of synthesizing string to string transformations from a set of input/output examples. The transformations we consider are expressed using deterministic finite automata (DFA) that read pairs of letters, one letter from…

Formal Languages and Automata Theory · Computer Science 2018-06-06 Jad Hamza, Viktor Kunčak

Ordering Regular Languages and Automata: Complexity

Given an order of the underlying alphabet we can lift it to the states of a finite deterministic automaton: to compare states we use the order of the strings reaching them. When the order on strings is the co-lexicographic one and…

Formal Languages and Automata Theory · Computer Science 2022-03-24 Giovanna D'Agostino, Davide Martincigh, Alberto Policriti

A Formal Framework for the Explanation of Finite Automata Decisions

Finite automata (FA) are a fundamental computational abstraction that is widely used in practice for various tasks in computer science, linguistics, biology, electrical engineering, and artificial intelligence. Given an input word, an FA…

Artificial Intelligence · Computer Science 2026-04-22 Jaime Cuartas Granada, Alexey Ignatiev, Peter J. Stuckey

Some new Features and Algorithms for the Study of DFA

The work presents some new algorithms realized recently in the package TESTAS. They decide whether or not a deterministic finite automaton (DFA) is synchronizing, several procedures find relatively short synchronizing words and a…

Formal Languages and Automata Theory · Computer Science 2020-11-12 Avraham N. Trahtman

Deciding minimal distinguishing DFAs is NP-complete

In this paper, we present a proof of the NP-completeness of computing the smallest Deterministic Finite Automaton (DFA) that distinguishes two given regular languages as DFAs. A distinguishing DFA is an automaton that recognizes a language…

Formal Languages and Automata Theory · Computer Science 2023-06-07 Jan Martens

Generating Tokenizers with Flat Automata

We introduce flat automata for automatic generation of tokenizers. Flat automata are a simple representation of standard finite automata. Using the flat representation, automata can be easily constructed, combined and printed. Due to the…

Formal Languages and Automata Theory · Computer Science 2022-09-22 Hans de Nivelle, Dina Muktubayeva

Synthesising Asynchronous Automata from Fair Specifications

Asynchronous automata are a model of distributed finite state processes synchronising on shared actions. A celebrated result by Zielonka shows how a deterministic asynchronous automaton (AA) can be synthesised, starting from two inputs: a…

Formal Languages and Automata Theory · Computer Science 2026-02-02 Béatrice Bérard, Benjamin Monmege, B Srivathsan, Arnab Sur

Efficient Decomposition Identification of Deterministic Finite Automata from Examples

The identification of deterministic finite automata (DFAs) from labeled examples is a cornerstone of automata learning, yet traditional methods focus on learning monolithic DFAs, which often yield a large DFA lacking simplicity and…

Software Engineering · Computer Science 2025-10-14 Junjie Meng, Jie An, Yong Li, Andrea Turrini, Fanjiang Xu, Naijun Zhan, Miaomiao Zhang

Batching BPE Tokenization Merges

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training…

Computation and Language · Computer Science 2024-08-12 Alexander P. Morgan

From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare…

Computation and Language · Computer Science 2025-10-20 Rares Dolga, Lucas Maystre, Tudor Berariu, David Barber

On the Equivalence Checking Problem for Deterministic Top-Down Tree Automata

We present an efficient algorithm for checking language equivalence of states in top-down deterministic finite tree automata (DFTAs). Unlike string automata, tree automata operate over hierarchical structures, posing unique challenges for…

Formal Languages and Automata Theory · Computer Science 2025-04-08 Zhibo Deng, Vladimir A. Zakharov

Polynomially Ambiguous Probabilistic Automata on Restricted Languages

We consider the computability and complexity of decision questions for Probabilistic Finite Automata (PFA) with sub-exponential ambiguity. We show that the emptiness problem for strict and non-strict cut-points of polynomially ambiguous…

Formal Languages and Automata Theory · Computer Science 2020-07-30 Paul C. Bell