Related papers: A Fast Randomized Algorithm for Massive Text Norma…

An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

A large fraction of textual data available today contains various types of 'noise', such as OCR noise in digitized documents, noise due to informal writing style of users on microblogging sites, and so on. To enable tasks such as…

Information Retrieval · Computer Science 2021-01-12 Anurag Roy, Shalmoli Ghosh, Kripabandhu Ghosh, Saptarshi Ghosh

HAAN: A Holistic Approach for Accelerating Normalization Operations in Large Language Models

Large language models (LLMs) have revolutionized natural language processing (NLP) tasks by achieving state-of-the-art performance across a range of benchmarks. Central to the success of these models is the integration of sophisticated…

Hardware Architecture · Computer Science 2025-02-18 Tianfan Peng, Jiajun Qin, Tianhua Xia, Sai Qian Zhang

Natural language processing (NLP) task has achieved excellent performance in many fields, including semantic understanding, automatic summarization, image recognition and so on. However, most of the neural network models for NLP extract the…

Computation and Language · Computer Science 2022-02-08 Peiying Zhang, Xingzhe Huang, Yaqi Wang, Chunxiao Jiang, Shuqing He, Haifeng Wang

Gradient Multi-Normalization for Stateless and Scalable LLM Training

Training large language models (LLMs) typically relies on adaptive optimizers like Adam (Kingma & Ba, 2015) which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as…

Machine Learning · Computer Science 2025-02-11 Meyer Scetbon, Chao Ma, Wenbo Gong, Edward Meeds

Adaptable and Reliable Text Classification using Large Language Models

Text classification is fundamental in Natural Language Processing (NLP), and the advent of Large Language Models (LLMs) has revolutionized the field. This paper introduces an adaptable and reliable text classification paradigm, which…

Computation and Language · Computer Science 2024-12-10 Zhiqiang Wang, Yiran Pang, Yanbin Lin, Xingquan Zhu

Improving Text Normalization by Optimizing Nearest Neighbor Matching

Text normalization is an essential task in the processing and analysis of social media that is dominated with informal writing. It aims to map informal words to their intended standard forms. Previously proposed text normalization…

Computation and Language · Computer Science 2017-12-29 Salman Ahmad Ansari, Usman Zafar, Asim Karim

Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness

Large language models (LLMs) trained with canonical tokenization exhibit surprising robustness to non-canonical inputs such as character-level tokenization, yet the mechanisms underlying this robustness remain unclear. We study this…

Computation and Language · Computer Science 2026-03-12 Zhipeng Yang, Shu Yang, Lijie Hu, Di Wang

Efficient Classification of Multi-Labelled Text Streams by Clashing

We present a method for the classification of multi-labelled text documents explicitly designed for data stream applications that require to process a virtually infinite sequence of data using constant memory and constant processing time.…

Artificial Intelligence · Computer Science 2016-04-13 Ricardo Ñanculef, Ilias Flaounas, Nello Cristianini

RNN Approaches to Text Normalization: A Challenge

This paper presents a challenge to the community: given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. We present a data set of general text where the…

Computation and Language · Computer Science 2017-01-26 Richard Sproat, Navdeep Jaitly

Automated Data Curation for Robust Language Model Fine-Tuning

Large Language Models have become the de facto approach to sequence-to-sequence text generation tasks, but for specialized tasks/domains, a pretrained LLM lacks specific capabilities to produce accurate or well-formatted responses.…

Computation and Language · Computer Science 2024-03-20 Jiuhai Chen, Jonas Mueller

Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network…

Computation and Language · Computer Science 2022-02-17 Hao Zhang, You-Chi Cheng, Shankar Kumar, W. Ronny Huang, Mingqing Chen, Rajiv Mathews

Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning

Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens…

Computation and Language · Computer Science 2026-03-12 Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu

Text Alignment Is An Efficient Unified Model for Massive NLP Tasks

Large language models (LLMs), typically designed as a function of next-word prediction, have excelled across extensive NLP tasks. Despite the generality, next-word prediction is often not an efficient formulation for many of the tasks,…

Computation and Language · Computer Science 2023-11-03 Yuheng Zha, Yichi Yang, Ruichen Li, Zhiting Hu

Can LLMs Help Localize Fake Words in Partially Fake Speech?

Large language models (LLMs), trained on large-scale text, have recently attracted significant attention for their strong performance across many tasks. Motivated by this, we investigate whether a text-trained LLM can help localize fake…

Audio and Speech Processing · Electrical Eng. & Systems 2026-03-13 Lin Zhang, Thomas Thebaud, Zexin Cai, Sanjeev Khudanpur, Daniel Povey, Leibny Paola García-Perera, Matthew Wiesner, Nicholas Andrews

Adapting Sequence to Sequence models for Text Normalization in Social Media

Social media offer an abundant source of valuable raw data, however informal writing can quickly become a bottleneck for many natural language processing (NLP) tasks. Off-the-shelf tools are usually trained on formal text and cannot…

Computation and Language · Computer Science 2019-04-15 Ismini Lourentzou, Kabir Manghnani, ChengXiang Zhai

FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine, that does not require…

Data Structures and Algorithms · Computer Science 2018-07-04 Yiqiu Wang, Anshumali Shrivastava, Jonathan Wang, Junghee Ryu

Neural text normalization leveraging similarities of strings and sounds

We propose neural models that can normalize text by considering the similarities of word strings and sounds. We experimentally compared a model that considers the similarities of both word strings and sounds, a model that considers only the…

Computation and Language · Computer Science 2020-11-05 Riku Kawamura, Tatsuya Aoki, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction…

Computation and Language · Computer Science 2024-02-21 Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei

The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs both due to time and space constraints, giving rise to faster…

Data Structures and Algorithms · Computer Science 2021-03-09 Marc Bury, Chris Schwiegelshohn, Mara Sorella

Normalizing Text using Language Modelling based on Phonetics and String Similarity

Social media networks and chatting platforms often use an informal version of natural text. Adversarial spelling attacks also tend to alter the input text by modifying the characters in the text. Normalizing these texts is an essential step…

Computation and Language · Computer Science 2020-06-26 Fenil Doshi, Jimit Gandhi, Deep Gosalia, Sudhir Bagul