Related papers: Optimized Tokenization for Transcribed Error Corre…

Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition

With recent advances in speech synthesis, synthetic data is becoming a viable alternative to real data for training speech recognition models. However, machine learning with synthetic data is not trivial due to the gap between the synthetic…

Audio and Speech Processing · Electrical Eng. & Systems 2021-10-25 Ting-Yao Hu, Mohammadreza Armandpour, Ashish Shrivastava, Jen-Hao Rick Chang, Hema Koppula, Oncel Tuzel

Using Synthetic Data to estimate the True Error is theoretically and practically doable

Accurately evaluating model performance is crucial for deploying machine learning systems in real-world applications. Traditional methods often require a sufficiently large labeled test set to ensure a reliable evaluation. However, in many…

Machine Learning · Computer Science 2025-11-04 Hai Hoang Thanh, Duy-Tung Nguyen, Hung The Tran, Khoat Than

Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms

Semi-supervised learning that leverages synthetic data for training has been widely adopted for developing automatic post-editing (APE) models due to the lack of training data. With this aim, we focus on data-synthesis methods to create…

Computation and Language · Computer Science 2024-06-04 Wonkee Lee, Seong-Hwan Heo, Jong-Hyeok Lee

Improving Grammatical Error Correction via Contextual Data Augmentation

Nowadays, data augmentation through synthetic data has been widely used in the field of Grammatical Error Correction (GEC) to alleviate the problem of data scarcity. However, these synthetic data are mainly used in the pre-training phase…

Computation and Language · Computer Science 2024-06-26 Yixuan Wang, Baoxin Wang, Yijun Liu, Qingfu Zhu, Dayong Wu, Wanxiang Che

From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution

Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides extra defense against jailbreak…

Computation and Language · Computer Science 2026-04-16 Pavel Chizhov, Egor Bogomolov, Ivan P. Yamshchikov

AdaptBPE: From General Purpose to Specialized Tokenizers

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly…

Computation and Language · Computer Science 2026-01-30 Vijini Liyanage, François Yvon

Towards Improved Speech Recognition through Optimized Synthetic Data Generation

Supervised training of speech recognition models requires access to transcribed audio data, which often is not possible due to confidentiality issues. Our approach to this problem is to generate synthetic audio from a text-only corpus using…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-01 Yanis Perrin, Gilles Boulianne

Training Text-to-Speech Model with Purely Synthetic Data: Feasibility, Sensitivity, and Generalization Capability

The potential of synthetic data in text-to-speech (TTS) model training has gained increasing attention, yet its rationality and effectiveness require systematic validation. In this study, we systematically investigate the feasibility of…

Sound · Computer Science 2025-12-22 Tingxiao Zhou, Leying Zhang, Zhengyang Chen, Yanmin Qian

How to Synthesize Text Data without Model Collapse?

Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem.…

Computation and Language · Computer Science 2025-05-29 Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou

Training on Synthetic Noise Improves Robustness to Natural Noise in Machine Translation

We consider the problem of making machine translation more robust to character-level variation at the source side, such as typos. Existing methods achieve greater coverage by applying subword models such as byte-pair encoding (BPE) and…

Computation and Language · Computer Science 2019-02-06 Vladimir Karpukhin, Omer Levy, Jacob Eisenstein, Marjan Ghazvininejad

Towards Selection of Text-to-speech Data to Augment ASR Training

This paper presents a method for selecting appropriate synthetic speech samples from a given large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network,…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-05 Shuo Liu, Leda Sarı, Chunyang Wu, Gil Keren, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal…

Computation and Language · Computer Science 2024-09-10 Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova, Ivan P. Yamshchikov

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

When Every Token Counts: Optimal Segmentation for Low-Resource Language Models

Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding…

Computation and Language · Computer Science 2025-05-05 Bharath Raj, Garvit Suri, Vikrant Dewangan, Raghav Sonavane

Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis

Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-28 Karren Yang, Ting-Yao Hu, Jen-Hao Rick Chang, Hema Swetha Koppula, Oncel Tuzel

Controllable Data Synthesis Method for Grammatical Error Correction

Due to the lack of parallel data in current Grammatical Error Correction (GEC) task, models based on Sequence to Sequence framework cannot be adequately trained to obtain higher performance. We propose two data synthesis methods which can…

Computation and Language · Computer Science 2021-12-28 Liner Yang, Chencheng Wang, Yun Chen, Yongping Du, Erhong Yang

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-21 Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

Bias-Corrected Data Synthesis for Imbalanced Learning

Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates. A common approach to…

Machine Learning · Statistics 2026-02-17 Pengfei Lyu, Zhengchi Ma, Linjun Zhang, Anru R. Zhang

Real-Fake: Effective Training Data Synthesis Through Distribution Matching

Synthetic training data has gained prominence in numerous learning tasks and scenarios, offering advantages such as dataset augmentation, generalization evaluation, and privacy preservation. Despite these benefits, the efficiency of…

Machine Learning · Computer Science 2024-03-21 Jianhao Yuan, Jie Zhang, Shuyang Sun, Philip Torr, Bo Zhao

Acoustic BPE for Speech Generation with Discrete Tokens

Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of…

Sound · Computer Science 2024-01-17 Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu