Yuval Pinter — Scifaro

Tokenization with Split Trees

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed…

Computation and Language · Computer Science · 2026-05-28 · Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner

Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark

While multilingual language models promise to bring the benefits of LLMs to speakers of many languages, gold-standard evaluation benchmarks in most languages to interrogate these assumptions remain scarce. The Universal NER project, now…

Computation and Language · Computer Science · 2026-04-15 · Terra Blevins, Stephen Mayhew, Marek Šuppa, Hila Gonen, Shachar Mirkin, Vasile Pais, Kaja Dobrovoljc, Voula Giouli, Jun Kevin, Eugene Jang, Eungseo Kim, Jeongyeon Seo, Xenophon Gialis, Yuval Pinter

Which Pieces Does Unigram Tokenization Really Need?

The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece…

Computation and Language · Computer Science · 2026-04-13 · Sander Land, Yuval Pinter

Faster Superword Tokenization

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend…

Computation and Language · Computer Science · 2026-04-08 · Craig W. Schmidt, Chris Tanner, Yuval Pinter

CharBench: Evaluating the Role of Tokenization in Character-Level Tasks

Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models' reliance on subword units, rather than…

Computation and Language · Computer Science · 2026-04-08 · Omri Uzan, Yuval Pinter

The Degree of Language Diacriticity and Its Effect on Tasks

Diacritics are orthographic marks that clarify pronunciation, distinguish similar words, or alter meaning. They play a central role in many writing systems, yet their impact on language technology has not been systematically quantified…

Computation and Language · Computer Science · 2026-03-31 · Adi Cohen, Yuval Pinter

Hebrew Diacritics Restoration using Visual Representation

Diacritics restoration in Hebrew is a fundamental task for ensuring accurate word pronunciation and disambiguating textual meaning. Despite the language's high degree of ambiguity when unvocalized, recent machine learning approaches have…

Computation and Language · Computer Science · 2026-02-05 · Yair Elboher, Yuval Pinter

The Effect of Scripts and Formats on LLM Numeracy

Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been given to how these models perform when numerical…

Computation and Language · Computer Science · 2026-01-22 · Varshini Reddy, Craig W. Schmidt, Seth Ebner, Adam Wiemerslage, Yuval Pinter, Chris Tanner

Leveraging NTPs for Efficient Hallucination Detection in VLMs

Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-17 · Ofir Azachi, Kfir Eliyahu, Eyal El Ani, Rom Himelstein, Roi Reichart, Yuval Pinter, Nitay Calderon

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as…

Computation and Language · Computer Science · 2025-10-03 · Craig W. Schmidt, Varshini Reddy, Chris Tanner, Yuval Pinter

How Much is Enough? The Diminishing Returns of Tokenization Training Data

Tokenization, a crucial initial step in natural language processing, is governed by several key parameters, such as the tokenization algorithm, vocabulary size, pre-tokenization strategy, inference strategy, and training data corpus. This…

Computation and Language · Computer Science · 2025-06-17 · Varshini Reddy, Craig W. Schmidt, Yuval Pinter, Chris Tanner

Splintering Nonconcatenative Languages for Better Tokenization

Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in…

Computation and Language · Computer Science · 2025-06-04 · Bar Gazit, Shaltiel Shmidman, Avi Shmidman, Yuval Pinter

Probing Subphonemes in Morphology Models

Transformers have achieved state-of-the-art performance in morphological inflection tasks, yet their ability to generalize across languages and morphological rules remains limited. One possible explanation for this behavior can be the…

Computation and Language · Computer Science · 2025-06-03 · Gal Astrach, Yuval Pinter

Token-Level Privacy in Large Language Models

The use of language models as remote services requires transmitting private information to external providers, raising significant privacy concerns. This process not only risks exposing sensitive data to untrusted service providers but also…

Computation and Language · Computer Science · 2025-03-06 · Re'em Harel, Niv Gilboa, Yuval Pinter

Information Types in Product Reviews

Information in text is communicated in a way that supports a goal for its reader. Product reviews, for example, contain opinions, tips, product descriptions, and many other types of information that provide both direct insights, as well as…

Computation and Language · Computer Science · 2025-02-21 · Ori Shapira, Yuval Pinter

Don't Touch My Diacritics

The common practice of preprocessing text before feeding it into NLP models introduces many decision points which have unintended consequences on model performance. In this opinion piece, we focus on the handling of diacritics in texts…

Computation and Language · Computer Science · 2025-02-20 · Kyle Gorman, Yuval Pinter

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science · 2024-10-08 · Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

OMPar: Automatic Parallelization with AI-Driven Source-to-Source Compilation

Manual parallelization of code remains a significant challenge due to the complexities of modern software systems and the widespread adoption of multi-core architectures. This paper introduces OMPar, an AI-driven tool designed to automate…

Computation and Language · Computer Science · 2024-09-24 · Tal Kadosh, Niranjan Hasabnis, Prema Soundararajan, Vy A. Vo, Mihai Capota, Nesreen Ahmed, Yuval Pinter, Gal Oren

MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks

With easier access to powerful compute resources, there is a growing trend in AI for software development to develop large language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the…

Programming Languages · Computer Science · 2024-09-23 · Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

Protecting Privacy in Classifiers by Token Manipulation

Using language models as a remote service entails sending private information to an untrusted provider. In addition, potential eavesdroppers can intercept the messages, thereby exposing the information. In this work, we explore the…

Computation and Language · Computer Science · 2024-07-04 · Re'em Harel, Yair Elboher, Yuval Pinter