Related papers: Normalizing Text using Language Modelling based on…

Neural text normalization leveraging similarities of strings and sounds

We propose neural models that can normalize text by considering the similarities of word strings and sounds. We experimentally compared a model that considers the similarities of both word strings and sounds, a model that considers only the…

Computation and Language · Computer Science 2020-11-05 Riku Kawamura, Tatsuya Aoki, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura

Adapting Sequence to Sequence models for Text Normalization in Social Media

Social media offer an abundant source of valuable raw data; however, informal writing can quickly become a bottleneck for many natural language processing (NLP) tasks. Off-the-shelf tools are usually trained on formal text and cannot…

Computation and Language · Computer Science 2019-04-15 Ismini Lourentzou, Kabir Manghnani, ChengXiang Zhai

Improving Text Normalization by Optimizing Nearest Neighbor Matching

Text normalization is an essential task in the processing and analysis of social media, which is dominated by informal writing. It aims to map informal words to their intended standard forms. Previously proposed text normalization…

Computation and Language · Computer Science 2017-12-29 Salman Ahmad Ansari, Usman Zafar, Asim Karim

Iterative Mask Filling: An Effective Text Augmentation Method Using Masked Language Modeling

Data augmentation is an effective technique for improving the performance of machine learning models. However, it has not been explored as extensively in natural language processing (NLP) as it has in computer vision. In this paper, we…

Computation and Language · Computer Science 2024-01-04 Himmet Toprak Kesgin, Mehmet Fatih Amasyali

A Chat About Boring Problems: Studying GPT-based text normalization

Text normalization - the conversion of text from written to spoken form - is traditionally assumed to be an ill-formed task for language models. In this work, we argue otherwise. We empirically show the capacity of Large-Language Models…

Computation and Language · Computer Science 2024-01-18 Yang Zhang, Travis M. Bartley, Mariana Graterol-Fuenmayor, Vitaly Lavrukhin, Evelina Bakhturina, Boris Ginsburg

Sequence-to-Sequence Lexical Normalization with Multilingual Transformers

Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day to day digital communication. This discrepancy has led to severe performance degradation of…

Computation and Language · Computer Science 2021-10-13 Ana-Maria Bucur, Adrian Cosma, Liviu P. Dinu

Naturalization of Text by the Insertion of Pauses and Filler Words

In this article, we introduce a set of methods to naturalize text based on natural human speech. Voice-based interactions provide a natural way of interfacing with electronic systems and have seen widespread adoption of late. These…

Computation and Language · Computer Science 2020-11-10 Richa Sharma, Parth Vipul Shah, Ashwini M. Joshi

Text normalization using memory augmented neural networks

We perform text normalization, i.e. the transformation of words from the written to the spoken form, using a memory augmented neural network. With the addition of dynamic memory access and storage mechanism, we present a neural architecture…

Computation and Language · Computer Science 2019-04-05 Subhojeet Pramanik, Aman Hussain

SocialBERT -- Transformers for Online SocialNetwork Language Modelling

The ubiquity of the contemporary language understanding tasks gives relevance to the development of generalized, yet highly efficient models that utilize all knowledge, provided by the data source. In this work, we present SocialBERT - the…

Computation and Language · Computer Science 2021-11-16 Ilia Karpov, Nick Kartashev

Enhancing Hate Speech Detection on Social Media: A Comparative Analysis of Machine Learning Models and Text Transformation Approaches

The proliferation of hate speech on social media platforms has necessitated the development of effective detection and moderation tools. This study evaluates the efficacy of various machine learning models in identifying hate speech and…

Computation and Language · Computer Science 2026-02-25 Saurabh Mishra, Shivani Thakur, Radhika Mamidi

Adversarial Text Normalization

Text-based adversarial attacks are becoming more commonplace and accessible to general internet users. As these attacks proliferate, the need to address the gap in model robustness becomes imminent. While retraining on adversarial data may…

Computation and Language · Computer Science 2022-06-10 Joanna Bitton, Maya Pavlova, Ivan Evtimov

Two Spelling Normalization Approaches Based on Large Language Models

The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue,…

Computation and Language · Computer Science 2025-07-01 Miguel Domingo, Francisco Casacuberta

SimpleBERT: A Pre-trained Model That Learns to Generate Simple Words

Pre-trained models are widely used in the tasks of natural language processing nowadays. However, in the specific field of text simplification, the research on improving pre-trained models is still blank. In this work, we propose a…

Computation and Language · Computer Science 2022-04-19 Renliang Sun, Xiaojun Wan

RNN Approaches to Text Normalization: A Challenge

This paper presents a challenge to the community: given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. We present a data set of general text where the…

Computation and Language · Computer Science 2017-01-26 Richard Sproat, Navdeep Jaitly

Research on Violent Text Detection System Based on BERT-fasttext Model

In the digital age of today, the internet has become an indispensable platform for people's lives, work, and information exchange. However, the problem of violent text proliferation in the network environment has arisen, which has brought…

Computation and Language · Computer Science 2024-12-24 Yongsheng Yang, Xiaoying Wang

Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics

Recent advances in pre-trained language modeling have facilitated significant progress across various natural language processing (NLP) tasks. Word masking during model training constitutes a pivotal component of language modeling in…

Computation and Language · Computer Science 2024-02-27 Anas Belfathi, Ygor Gallina, Nicolas Hernandez, Richard Dufour, Laura Monceaux

Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages

Previous studies have shown that health reports in social media, such as DailyStrength and Twitter, have potential for monitoring health conditions (e.g. adverse drug reactions, infectious diseases) in particular communities. However, in…

Computation and Language · Computer Science 2015-08-11 Nut Limsopatham, Nigel Collier

Text Detoxification using Large Pre-trained Neural Models

We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models…

Computation and Language · Computer Science 2021-11-04 David Dale, Anton Voronov, Daryna Dementieva, Varvara Logacheva, Olga Kozlova, Nikita Semenov, Alexander Panchenko

Historical German Text Normalization Using Type- and Token-Based Language Modeling

Historic variations of spelling pose a challenge for full-text search or natural language processing on historical digitized texts. To minimize the gap between the historic orthography and contemporary spelling, usually an automatic…

Computation and Language · Computer Science 2025-02-26 Anton Ehrmanntraut

On the performance of phonetic algorithms in microtext normalization

User-generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, thus making its processing…

Computation and Language · Computer Science 2024-02-06 Yerai Doval, Manuel Vilares, Jesús Vilares