Related papers: Constructing a BPE Tokenization DFA
Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode…
We present an algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata. The algorithm works with different disambiguation policies. We give detailed pseudocode for the algorithm,…
Complementation of finite automata is a basic operation used in numerous applications. The standard way to complement a nondeterministic finite automaton (NFA) is to transform it into an equivalent deterministic finite automaton (DFA) and…
Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…
In this paper, we formalize practical byte pair encoding tokenization as it is used in large language models and other NLP systems, in particular we formally define and investigate the semantics of the SentencePiece and HuggingFace…
We define a new subclass of nondeterministic finite automata for prefix-closed languages called Flanked Finite Automata (FFA). We show that this class enjoys good complexity properties while preserving the succinctness of nondeterministic…
We give algorithms to accelerate the computation of deterministic finite automata (DFA) by calculating the state of a DFA n positions ahead utilizing a reverse scan of the next n characters. Often this requires scanning fewer than n…
Finite state automata (FSA) are ubiquitous in computer science. Two of the most important algorithms for FSA processing are the conversion of a non-deterministic finite automaton (NFA) to a deterministic finite automaton (DFA), and then the…
We study the problem of synthesizing string to string transformations from a set of input/output examples. The transformations we consider are expressed using deterministic finite automata (DFA) that read pairs of letters, one letter from…
Given an order of the underlying alphabet we can lift it to the states of a finite deterministic automaton: to compare states we use the order of the strings reaching them. When the order on strings is the co-lexicographic one \emph{and}…
Finite automata (FA) are a fundamental computational abstraction that is widely used in practice for various tasks in computer science, linguistics, biology, electrical engineering, and artificial intelligence. Given an input word, an FA…
The work presents some new algorithms realized recently in the package TESTAS. They decide whether or not deterministic finite automaton (DFA) is synchronizing, several procedures find relatively short synchronizing words and a…
In this paper, we present a proof of the NP-completeness of computing the smallest Deterministic Finite Automaton (DFA) that distinguishes two given regular languages as DFAs. A distinguishing DFA is an automaton that recognizes a language…
We introduce flat automata for automatic generation of tokenizers. Flat automata are a simple representation of standard finite automata. Using the flat representation, automata can be easily constructed, combined and printed. Due to the…
Asynchronous automata are a model of distributed finite state processes synchronising on shared actions. A celebrated result by Zielonka shows how a deterministic asynchronous automaton (AA) can be synthesised, starting from two inputs: a…
The identification of deterministic finite automata (DFAs) from labeled examples is a cornerstone of automata learning, yet traditional methods focus on learning monolithic DFAs, which often yield a large DFA lacking simplicity and…
The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training…
Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare…
We present an efficient algorithm for checking language equivalence of states in top-down deterministic finite tree automata (DFTAs). Unlike string automata, tree automata operate over hierarchical structures, posing unique challenges for…
We consider the computability and complexity of decision questions for Probabilistic Finite Automata (PFA) with sub-exponential ambiguity. We show that the emptiness problem for strict and non-strict cut-points of polynomially ambiguous…