Related papers: Noisy Parallel Data Alignment

Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

Many real-world applications involve the use of Optical Character Recognition (OCR) engines to transform handwritten images into transcripts on which downstream Natural Language Processing (NLP) models are applied. In this process, OCR…

Computation and Language · Computer Science 2021-07-16 Guowei Xu, Wenbiao Ding, Weiping Fu, Zhongqin Wu, Zitao Liu

Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling

Despite recent advances, standard sequence labeling systems often fail when processing noisy user-generated text or consuming the output of an Optical Character Recognition (OCR) process. In this paper, we improve the noise-aware training…

Computation and Language · Computer Science 2021-05-26 Marcin Namysl, Sven Behnke, Joachim Köhler

Evaluating Robustness of LLMs in Question Answering on Multilingual Noisy OCR Data

Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors (imperfect extraction of text, including character insertion, deletion, and substitution) can significantly impact…

Computation and Language · Computer Science 2025-09-22 Bhawna Piryani, Jamshid Mozafari, Abdelrahman Abdallah, Antoine Doucet, Adam Jatowt

An Assessment of the Impact of OCR Noise on Language Models

Neural language models are the backbone of modern-day natural language processing applications. Their use on textual heritage collections which have undergone Optical Character Recognition (OCR) is therefore also increasing. Nevertheless,…

Computation and Language · Computer Science 2022-02-02 Konstantin Todorov, Giovanni Colavizza

Robustification of Multilingual Language Models to Real-world Noise in Crosslingual Zero-shot Settings with Robust Contrastive Pretraining

Advances in neural modeling have achieved state-of-the-art (SOTA) results on public natural language processing (NLP) benchmarks, at times surpassing human performance. However, there is a gap between public benchmarks and real-world…

Computation and Language · Computer Science 2023-02-14 Asa Cooper Stickland, Sailik Sengupta, Jason Krone, Saab Mansour, He He

Understanding Model Robustness to User-generated Noisy Texts

Sensitivity of deep-neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage…

Computation and Language · Computer Science 2021-11-18 Jakub Náplava, Martin Popel, Milan Straka, Jana Straková

How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation

The massive amounts of web-mined parallel data contain large amounts of noise. Semantic misalignment, as the primary source of the noise, poses a challenge for training machine translation systems. In this paper, we first introduce a…

Computation and Language · Computer Science 2025-02-10 Yan Meng, Di Wu, Christof Monz

Robust Neural Machine Translation: Modeling Orthographic and Interpunctual Variation

Neural machine translation systems typically are trained on curated corpora and break when faced with non-standard orthography or punctuation. Resilience to spelling mistakes and typos, however, is crucial as machine translation systems are…

Computation and Language · Computer Science 2020-09-15 Toms Bergmanis, Artūrs Stafanovičs, Mārcis Pinnis

Building a Noisy Audio Dataset to Evaluate Machine Learning Approaches for Automatic Speech Recognition Systems

Automatic speech recognition systems are part of people's daily lives, embedded in personal assistants and mobile phones, helping as a facilitator for human-machine interaction while allowing access to information in a practically intuitive…

Sound · Computer Science 2021-10-05 Julio Cesar Duarte, Sérgio Colcher

Learning Noise-Invariant Representations for Robust Speech Recognition

Despite rapid advances in speech recognition, current models remain brittle to superficial perturbations to their inputs. Small amounts of noise can destroy the performance of an otherwise state-of-the-art model. To harden models against…

Audio and Speech Processing · Electrical Eng. & Systems 2018-07-19 Davis Liang, Zhiheng Huang, Zachary C. Lipton

Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding

Recently, deep end-to-end learning has been studied for intent classification in Spoken Language Understanding (SLU). However, end-to-end models require a large amount of speech data with intent labels, and highly optimized models are…

Computation and Language · Computer Science 2024-05-27 Suyoung Kim, Jiyeon Hwang, Ho-Young Jung

Frustratingly Easy Noise-aware Training of Acoustic Models

Environmental noises and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it…

Audio and Speech Processing · Electrical Eng. & Systems 2021-02-03 Desh Raj, Jesus Villalba, Daniel Povey, Sanjeev Khudanpur

Resilience of Large Language Models for Noisy Instructions

In the rapidly advancing domain of natural language processing (NLP), large language models (LLMs) have emerged as powerful tools for interpreting human commands and generating text across various tasks. Nonetheless, the resilience of LLMs…

Computation and Language · Computer Science 2024-10-04 Bin Wang, Chengwei Wei, Zhengyuan Liu, Geyu Lin, Nancy F. Chen

Visual Cues and Error Correction for Translation Robustness

Neural Machine Translation models are sensitive to noise in the input texts, such as misspelled words and ungrammatical constructions. Existing robustness techniques generally fail when faced with unseen types of noise and their performance…

Computation and Language · Computer Science 2022-05-03 Zhenhao Li, Marek Rei, Lucia Specia

Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning

For real-world speech recognition applications, noise robustness is still a challenge. In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition…

Audio and Speech Processing · Electrical Eng. & Systems 2019-03-19 Ladislav Mošner, Minhua Wu, Anirudh Raju, Sree Hari Krishnan Parthasarathi, Kenichi Kumatani, Shiva Sundaram, Roland Maas, Björn Hoffmeister

Robust Neural Machine Translation for Clean and Noisy Speech Transcripts

Neural machine translation models have been shown to achieve high quality when trained and fed with well structured and punctuated input texts. Unfortunately, the latter condition is not met in spoken language translation, where the input is…

Computation and Language · Computer Science 2019-10-24 Mattia Antonino Di Gangi, Robert Enyedi, Alessandra Brusadin, Marcello Federico

An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

A large fraction of textual data available today contains various types of 'noise', such as OCR noise in digitized documents, noise due to informal writing style of users on microblogging sites, and so on. To enable tasks such as…

Information Retrieval · Computer Science 2021-01-12 Anurag Roy, Shalmoli Ghosh, Kripabandhu Ghosh, Saptarshi Ghosh

VAIS ASR: Building a conversational speech recognition system using language model combination

Automatic Speech Recognition (ASR) systems have been evolving quickly and reaching human parity in certain cases. The systems usually perform pretty well on reading style and clean speech, however, most of the available systems suffer from…

Computation and Language · Computer Science 2019-10-15 Quang Minh Nguyen, Thai Binh Nguyen, Ngoc Phuong Pham, The Loc Nguyen

PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning

Cross-modal retrieval aims to align different modalities via semantic similarity. However, existing methods often assume that image-text pairs are perfectly aligned, overlooking Noisy Correspondences in real data. These misaligned pairs…

Computer Vision and Pattern Recognition · Computer Science 2026-04-09 Zhuoyao Liu, Yang Liu, Wentao Feng, Shudong Huang

A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

Modern neural networks have greatly improved performance across speech recognition benchmarks. However, gains are often driven by frequent words with limited semantic weight, which can obscure meaningful differences in word error rate, the…

Computation and Language · Computer Science 2026-04-21 Lasse Borgholt, Jakob Havtorn, Christian Igel, Lars Maaløe, Zheng-Hua Tan