Related papers: Sequence-to-Sequence Lexical Normalization with Mu…

Adapting Sequence to Sequence models for Text Normalization in Social Media

Social media offer an abundant source of valuable raw data, however informal writing can quickly become a bottleneck for many natural language processing (NLP) tasks. Off-the-shelf tools are usually trained on formal text and cannot…

Computation and Language · Computer Science 2019-04-15 Ismini Lourentzou, Kabir Manghnani, ChengXiang Zhai

Multilevel Text Normalization with Sequence-to-Sequence Networks and Multisource Learning

We define multilevel text normalization as sequence-to-sequence processing that transforms naturally noisy text into a sequence of normalized units of meaning (morphemes) in three steps: 1) writing normalization, 2) lemmatization, 3)…

Computation and Language · Computer Science 2019-04-01 Tatyana Ruzsics, Tanja Samardžić

Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models

Text normalization is an important enabling technology for several NLP tasks. Recently, neural-network-based approaches have outperformed well-established models in this task. However, in languages other than English, there has been little…

Computation and Language · Computer Science 2018-09-06 Daniel Watson, Nasser Zalmout, Nizar Habash

Normalizing Text using Language Modelling based on Phonetics and String Similarity

Social media networks and chatting platforms often use an informal version of natural text. Adversarial spelling attacks also tend to alter the input text by modifying the characters in the text. Normalizing these texts is an essential step…

Computation and Language · Computer Science 2020-06-26 Fenil Doshi, Jimit Gandhi, Deep Gosalia, Sudhir Bagul

MultiLexNorm++: A Unified Benchmark and a Generative Model for Lexical Normalization for Asian Languages

Social media data has been of interest to Natural Language Processing (NLP) practitioners for over a decade, because of its richness in information, but also challenges for automatic processing. Since language use is more informal,…

Computation and Language · Computer Science 2026-01-26 Weerayut Buaphet, Thanh-Nhi Nguyen, Risa Kondo, Tomoyuki Kajiwara, Yumin Kim, Jimin Lee, Hwanhee Lee, Holy Lovenia, Peerat Limkonchotiwat, Sarana Nutanong, Rob Van der Goot

Sequence-to-Sequence Learning with Latent Neural Grammars

Sequence-to-sequence learning with neural networks has become the de facto standard for sequence prediction tasks. This approach typically models the local distribution over the next word with a powerful neural network that can condition on…

Computation and Language · Computer Science 2021-11-17 Yoon Kim

BERTwich: Extending BERT's Capabilities to Model Dialectal and Noisy Text

Real-world NLP applications often deal with nonstandard text (e.g., dialectal, informal, or misspelled text). However, language models like BERT deteriorate in the face of dialect variation or noise. How do we push BERT's modeling…

Computation and Language · Computer Science 2023-11-02 Aarohi Srivastava, David Chiang

Lexicon Learning for Few-Shot Neural Sequence Modeling

Sequence-to-sequence transduction is the core problem in language processing applications as diverse as semantic parsing, machine translation, and instruction following. The neural network models that provide the dominant solution to these…

Computation and Language · Computer Science 2021-06-09 Ekin Akyürek, Jacob Andreas

Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?

What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the…

Computation and Language · Computer Science 2022-05-03 En-Shiun Annie Lee, Sarubi Thillainathan, Shravan Nayak, Surangika Ranathunga, David Ifeoluwa Adelani, Ruisi Su, Arya D. McCarthy

Advancements in Natural Language Processing: Exploring Transformer-Based Architectures for Text Understanding

Natural Language Processing (NLP) has witnessed a transformative leap with the advent of transformer-based architectures, which have significantly enhanced the ability of machines to understand and generate human-like text. This paper…

Computation and Language · Computer Science 2025-03-27 Tianhao Wu, Yu Wang, Ngoc Quach

Automatic Textual Normalization for Hate Speech Detection

Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for the Vietnamese language…

Computation and Language · Computer Science 2024-07-26 Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Nguyet Thi Nguyen, Khanh Thanh-Duy Ho, Kiet Van Nguyen

Word-level Lexical Normalisation using Context-Dependent Embeddings

Lexical normalisation (LN) is the process of correcting each word in a dataset to its canonical form so that it may be more easily and more accurately analysed. Most lexical normalisation systems operate at the character-level, while…

Computation and Language · Computer Science 2019-11-15 Michael Stewart, Wei Liu, Rachel Cardell-Oliver

A Chat About Boring Problems: Studying GPT-based text normalization

Text normalization - the conversion of text from written to spoken form - is traditionally assumed to be an ill-formed task for language models. In this work, we argue otherwise. We empirically show the capacity of Large-Language Models…

Computation and Language · Computer Science 2024-01-18 Yang Zhang, Travis M. Bartley, Mariana Graterol-Fuenmayor, Vitaly Lavrukhin, Evelina Bakhturina, Boris Ginsburg

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard…

Computation and Language · Computer Science 2019-10-31 Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer

Neural Transition-based Syntactic Linearization

The task of linearization is to find a grammatical order given a set of words. Traditional models use statistical methods. Syntactic linearization systems, which generate a sentence along with its syntactic tree, have shown state-of-the-art…

Computation and Language · Computer Science 2018-10-24 Linfeng Song, Yue Zhang, Daniel Gildea

Improving Lemmatization of Non-Standard Languages with Joint Learning

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to…

Computation and Language · Computer Science 2019-03-19 Enrique Manjavacas, Ákos Kádár, Mike Kestemont

Symmetric Regularization based BERT for Pair-wise Semantic Reasoning

The ability of semantic reasoning over the sentence pair is essential for many natural language understanding tasks, e.g., natural language inference and machine reading comprehension. A recent significant improvement in these tasks comes…

Computation and Language · Computer Science 2021-06-18 Weidi Xu, Xingyi Cheng, Kunlong Chen, Wei Wang, Bin Bi, Ming Yan, Chen Wu, Luo Si, Wei Chu, Taifeng Wang

Unsupervised Neural Machine Translation with SMT as Posterior Regularization

Without real bilingual corpus available, unsupervised Neural Machine Translation (NMT) typically requires pseudo parallel data generated with the back-translation method for the model training. However, due to weak supervision, the pseudo…

Computation and Language · Computer Science 2019-01-15 Shuo Ren, Zhirui Zhang, Shujie Liu, Ming Zhou, Shuai Ma

Domain Specific Fine-tuning of Denoising Sequence-to-Sequence Models for Natural Language Summarization

Summarization of long-form text data is a problem especially pertinent in knowledge economy jobs such as medicine and finance, that require continuously remaining informed on a sophisticated and evolving body of knowledge. As such,…

Computation and Language · Computer Science 2022-04-22 Brydon Parker, Alik Sokolov, Mahtab Ahmed, Matt Kalebic, Sedef Akinli Kocak, Ofer Shai

A Sequence-to-Sequence Approach for Arabic Pronoun Resolution

This paper proposes a sequence-to-sequence learning approach for Arabic pronoun resolution, which explores the effectiveness of using advanced natural language processing (NLP) techniques, specifically Bi-LSTM and the BERT pre-trained…

Computation and Language · Computer Science 2023-05-22 Hanan S. Murayshid, Hafida Benhidour, Said Kerrache