Related papers: A Unified Transformer-based Framework for Duplex T…

Improving Robustness of Neural Inverse Text Normalization via Data-Augmentation, Semi-Supervised Learning, and Post-Aligning Method

Inverse text normalization (ITN) is crucial for converting spoken-form into written-form, especially in the context of automatic speech recognition (ASR). While most downstream tasks of ASR rely on written-form, ASR systems often output…

Computation and Language · Computer Science 2023-09-19 Juntae Kim, Minkyu Lim, Seokjin Hong

Improving Data Driven Inverse Text Normalization using Data Augmentation

Inverse text normalization (ITN) is used to convert the spoken form output of an automatic speech recognition (ASR) system to a written form. Traditional handcrafted ITN rules can be complex to transcribe and maintain. Meanwhile neural…

Computation and Language · Computer Science 2022-07-21 Laxmi Pandey, Debjyoti Paul, Pooja Chitkara, Yutong Pang, Xuedong Zhang, Kjell Schubert, Mark Chou, Shu Liu, Yatharth Saraf

Thutmose Tagger: Single-pass neural model for Inverse Text Normalization

Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition (ASR). It converts numbers, dates, abbreviations, and other semiotic classes from the spoken form generated by ASR to their written forms.…

Computation and Language · Computer Science 2022-08-02 Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg

Neural Inverse Text Normalization

While there have been several contributions exploring state of the art techniques for text normalization, the problem of inverse text normalization (ITN) remains relatively unexplored. The best known approaches leverage finite state…

Computation and Language · Computer Science 2021-02-15 Monica Sunkara, Chaitanya Shivade, Sravan Bodapati, Katrin Kirchhoff

Four-in-One: A Joint Approach to Inverse Text Normalization, Punctuation, Capitalization, and Disfluency for Automatic Speech Recognition

Features such as punctuation, capitalization, and formatting of entities are important for readability, understanding, and natural language processing tasks. However, Automatic Speech Recognition (ASR) systems produce spoken-form text…

Computation and Language · Computer Science 2022-10-28 Sharman Tan, Piyush Behre, Nick Kibre, Issac Alphonso, Shuangyu Chang

MultiSpeech: Multi-Speaker Text to Speech with Transformer

Transformer-based text to speech (TTS) model (e.g., Transformer TTS, FastSpeech) has shown the advantages of training and inference efficiency over RNN-based model (e.g.,…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-04 Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin, Tie-Yan Liu

PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech

Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but involve substantial engineering…

Computation and Language · Computer Science 2025-11-06 Michel Wong, Ali Alshehri, Sophia Kao, Haotian He

Language Agnostic Data-Driven Inverse Text Normalization

With the emergence of automatic speech recognition (ASR) models, converting the spoken form text (from ASR) to the written form is in urgent need. This inverse text normalization (ITN) problem attracts the attention of researchers from…

Computation and Language · Computer Science 2023-01-25 Szu-Jui Chen, Debjyoti Paul, Yutong Pang, Peng Su, Xuedong Zhang

Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Developing Text Normalization (TN) systems for Text-to-Speech (TTS) on new languages is hard. We propose a novel architecture to facilitate it for multiple languages while using data less than 3% of the size of the data used by the state of…

Computation and Language · Computer Science 2021-04-19 Shubhi Tyagi, Antonio Bonafonte, Jaime Lorenzo-Trueba, Javier Latorre

Streaming, fast and accurate on-device Inverse Text Normalization for Automatic Speech Recognition

Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer a written form output. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous works, Weighted…

Computation and Language · Computer Science 2022-11-08 Yashesh Gaur, Nick Kibre, Jian Xue, Kangyuan Shu, Yuhui Wang, Issac Alphanso, Jinyu Li, Yifan Gong

Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization

Inverse Text Normalization (ITN) is crucial for converting spoken Automatic Speech Recognition (ASR) outputs into well-formatted written text, enhancing both readability and usability. Despite its importance, the integration of streaming…

Computation and Language · Computer Science 2025-06-02 Luong Ho, Khanh Le, Vinh Pham, Bao Nguyen, Tan Tran, Duc Chau

RNN Approaches to Text Normalization: A Challenge

This paper presents a challenge to the community: given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. We present a data set of general text where the…

Computation and Language · Computer Science 2017-01-26 Richard Sproat, Navdeep Jaitly

Transformer-based Models of Text Normalization for Speech Applications

Text normalization, or the process of transforming text into a consistent, canonical form, is crucial for speech applications such as text-to-speech synthesis (TTS). In TTS, the system must decide whether to verbalize "1995" as "nineteen…

Machine Learning · Computer Science 2022-02-02 Jae Hun Ro, Felix Stahlberg, Ke Wu, Shankar Kumar

Multilevel Text Normalization with Sequence-to-Sequence Networks and Multisource Learning

We define multilevel text normalization as sequence-to-sequence processing that transforms naturally noisy text into a sequence of normalized units of meaning (morphemes) in three steps: 1) writing normalization, 2) lemmatization, 3)…

Computation and Language · Computer Science 2019-04-01 Tatyana Ruzsics, Tanja Samardžić

NeMo Inverse Text Normalization: From Development To Production

Inverse text normalization (ITN) converts spoken-domain automatic speech recognition (ASR) output into written-domain text to improve the readability of the ASR output. Many state-of-the-art ITN systems use hand-written weighted…

Computation and Language · Computer Science 2021-05-18 Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg

DeepNorm-A Deep Learning Approach to Text Normalization

This paper presents a simple yet sophisticated approach to the challenge by Sproat and Jaitly (2016): given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function.…

Computation and Language · Computer Science 2017-12-20 Maryam Zare, Shaurya Rohatgi

Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition

An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection…

Audio and Speech Processing · Electrical Eng. & Systems 2022-08-17 Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework

Text-to-speech synthesis (TTS) has witnessed rapid progress in recent years, where neural methods became capable of producing audios with high naturalness. However, these efforts still suffer from two types of latencies: (a) the…

Computation and Language · Computer Science 2020-10-08 Mingbo Ma, Baigong Zheng, Kaibo Liu, Renjie Zheng, Hairong Liu, Kainan Peng, Kenneth Church, Liang Huang

TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We…

Computation and Language · Computer Science 2025-06-09 Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro

UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

Text-to-speech (TTS) and voice conversion (VC) are two different tasks both aiming at generating high quality speaking voice according to different input modality. Due to their similarity, this paper proposes UnifySpeech, which brings TTS…

Sound · Computer Science 2023-01-11 Haogeng Liu, Tao Wang, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Jianhua Tao