Related papers: Improving Grammatical Error Correction via Context…

Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models

Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to generate the broad range of grammatical errors made by…

Computation and Language · Computer Science 2021-05-28 Felix Stahlberg, Shankar Kumar

Controllable Data Synthesis Method for Grammatical Error Correction

Due to the lack of parallel data in current Grammatical Error Correction (GEC) task, models based on Sequence to Sequence framework cannot be adequately trained to obtain higher performance. We propose two data synthesis methods which can…

Computation and Language · Computer Science 2021-12-28 Liner Yang, Chencheng Wang, Yun Chen, Yongping Du, Erhong Yang

Evaluation of large-scale synthetic data for Grammar Error Correction

Grammar Error Correction (GEC) mainly relies on the availability of high quality of large amount of synthetic parallel data of grammatically correct and erroneous sentence pairs. The quality of the synthetic data is evaluated on how well the…

Computation and Language · Computer Science 2022-11-01 Vanya Bannihatti Kumar

Judge a Sentence by Its Content to Generate Grammatical Errors

Data sparsity is a well-known problem for grammatical error correction (GEC). Generating synthetic training data is one widely proposed solution to this problem, and has allowed models to achieve state-of-the-art (SOTA) performance in…

Computation and Language · Computer Science 2022-08-23 Chowdhury Rafeed Rahman

Data Augmentation for Spoken Grammatical Error Correction

While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate…

Computation and Language · Computer Science 2025-07-28 Penny Karanasou, Mengjie Qian, Stefano Bannò, Mark J. F. Gales, Kate M. Knill

Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations

We propose a novel data augmentation for labeled sentences called contextual augmentation. We assume an invariance that sentences are natural even if the words in the sentences are replaced with other words with paradigmatic relations. We…

Computation and Language · Computer Science 2018-05-17 Sosuke Kobayashi

Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation

In this paper, we explore the artificial generation of typographical errors based on real-world statistics. We first draw on a small set of annotated data to compute spelling error statistics. These are then invoked to introduce errors into…

Computation and Language · Computer Science 2020-05-05 Kshitij Shah, Gerard de Melo

SDA: Improving Text Generation with Self Data Augmentation

Data augmentation has been widely used to improve deep neural networks in many research fields, such as computer vision. However, less work has been done in the context of text, partially due to its discrete nature and the complexity of…

Computation and Language · Computer Science 2021-01-12 Ping Yu, Ruiyi Zhang, Yang Zhao, Yizhe Zhang, Chunyuan Li, Changyou Chen

Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule

Progress in neural grammatical error correction (GEC) is hindered by the lack of annotated training data. Sufficient amounts of high-quality manually annotated data are not available, so recent research has relied on generating synthetic…

Computation and Language · Computer Science 2023-11-21 Andrey Bout, Alexander Podolskiy, Sergey Nikolenko, Irina Piontkovskaya

Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection

Grammatical error correction, like other machine learning tasks, greatly benefits from large quantities of high quality training data, which is typically expensive to produce. While writing a program to automatically generate realistic…

Computation and Language · Computer Science 2018-10-02 Sudhanshu Kasewa, Pontus Stenetorp, Sebastian Riedel

MixEdit: Revisiting Data Augmentation and Beyond for Grammatical Error Correction

Data Augmentation through generating pseudo data has been proven effective in mitigating the challenge of data scarcity in the field of Grammatical Error Correction (GEC). Various augmentation strategies have been widely explored, most of…

Computation and Language · Computer Science 2023-10-19 Jingheng Ye, Yinghui Li, Yangning Li, Hai-Tao Zheng

Self-Compositional Data Augmentation for Scientific Keyphrase Generation

State-of-the-art models for keyphrase generation require large amounts of training data to achieve good performance. However, obtaining keyphrase-labeled documents can be challenging and costly. To address this issue, we present a…

Computation and Language · Computer Science 2024-11-07 Mael Houbre, Florian Boudin, Beatrice Daille, Akiko Aizawa

Grammatical Error Generation Based on Translated Fragments

We perform neural machine translation of sentence fragments in order to create large amounts of training data for English grammatical error correction. Our method aims at simulating mistakes made by second language learners, and produces a…

Computation and Language · Computer Science 2021-04-21 Eetu Sjöblom, Mathias Creutz, Teemu Vahtola

Optimized Tokenization for Transcribed Error Correction

The challenges facing speech recognition systems, such as variations in pronunciations, adverse audio conditions, and the scarcity of labeled data, emphasize the necessity for a post-processing step that corrects recurring errors. Previous…

Computation and Language · Computer Science 2023-10-18 Tomer Wullach, Shlomo E. Chazan

R&D: Balancing Reliability and Diversity in Synthetic Data Augmentation for Semantic Segmentation

Collecting and annotating datasets for pixel-level semantic segmentation tasks are highly labor-intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real-world data collection.…

Computer Vision and Pattern Recognition · Computer Science 2026-03-20 Huy Che, Dinh-Duy Phan, Duc-Khai Lam

GenAug: Data Augmentation for Finetuning Text Generators

In this paper, we investigate data augmentation for text generation, which we call GenAug. Text generation and language modeling are important tasks within natural language processing, and are especially challenging for low-data regimes. We…

Computation and Language · Computer Science 2020-10-13 Steven Y. Feng, Varun Gangal, Dongyeop Kang, Teruko Mitamura, Eduard Hovy

GASE: Generatively Augmented Sentence Encoding

We propose a training-free approach to improve sentence embeddings leveraging test-time compute by applying generative text models for data augmentation at inference time. Unlike conventional data augmentation that utilises synthetic…

Computation and Language · Computer Science 2025-09-09 Manuel Frank, Haithem Afli

Towards Active Synthetic Data Generation for Finetuning Language Models

A common and effective means for improving language model capabilities involves finetuning a "student" language model's parameters on generations from a more proficient "teacher" model. Termed "synthetic data", these generations are…

Machine Learning · Computer Science 2026-02-10 Samuel Kessler, Menglin Xia, Daniel Madrigal Diaz, Dongge Han, Helia Heshemi, Saravan Rajmohan, Victor Ruehle, Jordan T. Ash

Good-Enough Compositional Data Augmentation

We propose a simple data augmentation protocol aimed at providing a compositional inductive bias in conditional and unconditional sequence models. Under this protocol, synthetic training examples are constructed by taking real training…

Computation and Language · Computer Science 2020-05-20 Jacob Andreas

DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks

Data augmentation techniques have been widely used to improve machine learning performance as they enhance the generalization capability of models. In this work, to generate high quality synthetic data for low-resource tagging tasks, we…

Computation and Language · Computer Science 2020-11-04 Bosheng Ding, Linlin Liu, Lidong Bing, Canasai Kruengkrai, Thien Hai Nguyen, Shafiq Joty, Luo Si, Chunyan Miao