Related papers: Controllable Data Synthesis Method for Grammatical…

Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models

Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to generate the broad range of grammatical errors made by…

Computation and Language · Computer Science 2021-05-28 Felix Stahlberg, Shankar Kumar

Evaluation of large-scale synthetic data for Grammar Error Correction

Grammar Error Correction (GEC) mainly relies on the availability of a large amount of high-quality synthetic parallel data consisting of grammatically correct and erroneous sentence pairs. The quality of the synthetic data is evaluated on how well the…

Computation and Language · Computer Science 2022-11-01 Vanya Bannihatti Kumar

Judge a Sentence by Its Content to Generate Grammatical Errors

Data sparsity is a well-known problem for grammatical error correction (GEC). Generating synthetic training data is one widely proposed solution to this problem, and has allowed models to achieve state-of-the-art (SOTA) performance in…

Computation and Language · Computer Science 2022-08-23 Chowdhury Rafeed Rahman

Improving Grammatical Error Correction via Contextual Data Augmentation

Nowadays, data augmentation through synthetic data has been widely used in the field of Grammatical Error Correction (GEC) to alleviate the problem of data scarcity. However, these synthetic data are mainly used in the pre-training phase…

Computation and Language · Computer Science 2024-06-26 Yixuan Wang, Baoxin Wang, Yijun Liu, Qingfu Zhu, Dayong Wu, Wanxiang Che

Improving Grammatical Error Correction with Machine Translation Pairs

We propose a novel data synthesis method to generate diverse error-corrected sentence pairs for improving grammatical error correction, which is based on a pair of machine translation models of different qualities (i.e., poor and good). The…

Computation and Language · Computer Science 2020-11-03 Wangchunshu Zhou, Tao Ge, Chang Mu, Ke Xu, Furu Wei, Ming Zhou

A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model

Synthetic data construction of Grammatical Error Correction (GEC) for non-English languages relies heavily on human-designed and language-specific rules, which produce limited error-corrected patterns. In this paper, we propose a generic…

Computation and Language · Computer Science 2022-01-27 Xin Sun, Tao Ge, Shuming Ma, Jingjing Li, Furu Wei, Houfeng Wang

Grammatical Error Generation Based on Translated Fragments

We perform neural machine translation of sentence fragments in order to create large amounts of training data for English grammatical error correction. Our method aims at simulating mistakes made by second language learners, and produces a…

Computation and Language · Computer Science 2021-04-21 Eetu Sjöblom, Mathias Creutz, Teemu Vahtola

Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection

Grammatical error correction, like other machine learning tasks, greatly benefits from large quantities of high quality training data, which is typically expensive to produce. While writing a program to automatically generate realistic…

Computation and Language · Computer Science 2018-10-02 Sudhanshu Kasewa, Pontus Stenetorp, Sebastian Riedel

How to Synthesize Text Data without Model Collapse?

Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem.…

Computation and Language · Computer Science 2025-05-29 Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou

A Simple Recipe for Multilingual Grammatical Error Correction

This paper presents a simple recipe to train state-of-the-art multilingual Grammatical Error Correction (GEC) models. We achieve this by first proposing a language-agnostic method to generate a large number of synthetic examples. The second…

Computation and Language · Computer Science 2022-08-10 Sascha Rothe, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn

Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule

Progress in neural grammatical error correction (GEC) is hindered by the lack of annotated training data. Sufficient amounts of high-quality manually annotated data are not available, so recent research has relied on generating synthetic…

Computation and Language · Computer Science 2023-11-21 Andrey Bout, Alexander Podolskiy, Sergey Nikolenko, Irina Piontkovskaya

Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection

Grammatical Error Detection (GED) methods rely heavily on human annotated error corpora. However, these annotations are unavailable in many low-resource languages. In this paper, we investigate GED in this context. Leveraging the zero-shot…

Computation and Language · Computer Science 2024-07-17 Gaetan Lopez Latouche, Marc-André Carbonneau, Ben Swanson

Type-Driven Multi-Turn Corrections for Grammatical Error Correction

Grammatical Error Correction (GEC) aims to automatically detect and correct grammatical errors. In this respect, dominant models are trained by one-iteration learning while performing multiple iterations of corrections during inference.…

Computation and Language · Computer Science 2022-03-18 Shaopeng Lai, Qingyu Zhou, Jiali Zeng, Zhongli Li, Chao Li, Yunbo Cao, Jinsong Su

Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction

The data-centric AI approach aims to enhance model performance without modifying the model and has been shown to impact model performance positively. While recent attention has been given to data-centric AI based on synthetic data, due to…

Computation and Language · Computer Science 2023-06-27 Chanjun Park, Seonmin Koo, Seolhwa Lee, Jaehyung Seo, Sugyeong Eo, Hyeonseok Moon, Heuiseok Lim

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled…

Computation and Language · Computer Science 2026-04-16 Joel Niklaus, Atsuki Yamaguchi, Michal Štefánik, Guilherme Penedo, Hynek Kydlíček, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, Thomas Wolf

GECTurk: Grammatical Error Correction and Detection Dataset for Turkish

Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages.…

Computation and Language · Computer Science 2023-09-21 Atakan Kara, Farrin Marouf Sofian, Andrew Bond, Gözde Gül Şahin

Bias-Corrected Data Synthesis for Imbalanced Learning

Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates. A common approach to…

Machine Learning · Statistics 2026-02-17 Pengfei Lyu, Zhengchi Ma, Linjun Zhang, Anru R. Zhang

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-21 Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg

Exploiting N-Best Hypotheses to Improve an SMT Approach to Grammatical Error Correction

Grammatical error correction (GEC) is the task of detecting and correcting grammatical errors in texts written by second language learners. The statistical machine translation (SMT) approach to GEC, in which sentences written by second…

Computation and Language · Computer Science 2016-06-02 Duc Tam Hoang, Shamil Chollampatt, Hwee Tou Ng

The Unbearable Weight of Generating Artificial Errors for Grammatical Error Correction

In recent years, sequence-to-sequence models have been very effective for end-to-end grammatical error correction (GEC). As creating a human-annotated parallel corpus for GEC is expensive and time-consuming, there has been work on artificial…

Computation and Language · Computer Science 2019-07-23 Phu Mon Htut, Joel Tetreault