Related papers: Efficient Embedding-based Synthetic Data Generatio…

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of…

Computation and Language · Computer Science 2024-06-24 Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang

Synthetic Embedding-based Data Generation Methods for Student Performance

Given the inherent class imbalance issue within student performance datasets, samples belonging to the edges of the target class distribution pose a challenge for predictive machine learning algorithms to learn. In this paper, we introduce…

Machine Learning · Computer Science 2021-01-05 Dom Huh

Synthetic Data Generation Using Large Language Models: Advances in Text and Code

This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment…

Computation and Language · Computer Science 2025-11-21 Mihai Nadas, Laura Diosan, Andreea Tomescu

Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks

Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources…

Machine Learning · Computer Science 2025-10-28 Amal Abed, Ivan Lukic, Jörg K. H. Franke, Frank Hutter

Enhancing Table Representations with LLM-powered Synthetic Data Generation

In the era of data-driven decision-making, accurate table-level representations and efficient table recommendation systems are becoming increasingly crucial for improving table management, discovery, and analysis. However, existing…

Machine Learning · Computer Science 2024-11-07 Dayu Yang, Natawut Monaikul, Amanda Ding, Bozhao Tan, Kishore Mosaliganti, Giri Iyengar

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

As large language models (LLMs) are applied to more use cases, creating high quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high quality human data has been the most common approach to…

Computation and Language · Computer Science 2024-10-31 Yung-Chieh Chan, George Pu, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton

Generative AI for Synthetic Data Generation: Methods, Challenges and the Future

The recent surge in research focused on generating synthetic data from large language models (LLMs), especially for scenarios with limited data availability, marks a notable shift in Generative Artificial Intelligence (AI). Their ability to…

Machine Learning · Computer Science 2024-03-08 Xu Guo, Yiqiang Chen

Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative…

Computation and Language · Computer Science 2023-05-25 Veniamin Veselovsky, Manoel Horta Ribeiro, Akhil Arora, Martin Josifoski, Ashton Anderson, Robert West

A Survey on Data Synthesis and Augmentation for Large Language Models

The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the…

Computation and Language · Computer Science 2024-10-18 Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang

Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

The collection and curation of high-quality training data is crucial for developing text classification models with superior performance, but it is often associated with significant costs and time investment. Researchers have recently…

Computation and Language · Computer Science 2023-10-16 Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ming Yin

Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models

The in-context learning ability of large language models (LLMs) enables them to generalize to novel downstream tasks with relatively few labeled examples. However, they require enormous computational resources to be deployed. Alternatively,…

Computation and Language · Computer Science 2024-01-09 Jean Kaddour, Qi Liu

Comprehensive Exploration of Synthetic Data Generation: A Survey

Recent years have witnessed a surge in the popularity of Machine Learning (ML), applied across diverse domains. However, progress is impeded by the scarcity of training data due to expensive acquisition and privacy legislation. Synthetic…

Machine Learning · Computer Science 2024-02-05 André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster

Little Giants: Synthesizing High-Quality Embedding Data at Scale

Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples,…

Computation and Language · Computer Science 2024-11-05 Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou

Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data

This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in…

Machine Learning · Computer Science 2025-04-04 Waris Gill, Justin Cechmanek, Tyler Hutcherson, Srijith Rajamohan, Jen Agarwal, Muhammad Ali Gulzar, Manvinder Singh, Benoit Dion

Synthetic Feature Augmentation Improves Generalization Performance of Language Models

Training and fine-tuning deep learning models, especially large language models (LLMs), on limited and imbalanced datasets poses substantial challenges. These issues often result in poor generalization, where models overfit to dominant…

Computation and Language · Computer Science 2025-01-14 Ashok Choudhary, Cornelius Thiels, Hojjat Salehinejad

FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs

Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing…

Machine Learning · Computer Science 2025-07-22 Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro

Data Generation Using Large Language Models for Text Classification: An Empirical Case Study

Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is…

Computation and Language · Computer Science 2024-07-23 Yinheng Li, Rogerio Bonatti, Sara Abdali, Justin Wagle, Kazuhito Koishida

Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data…

Machine Learning · Computer Science 2025-03-25 Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng

Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would…

Machine Learning · Statistics 2026-02-10 Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang

Synthetic Data Generation with LLM for Improved Depression Prediction

Automatic detection of depression is a rapidly growing field of research at the intersection of psychology and machine learning. However, with its exponential interest comes a growing concern for data privacy and scarcity due to the…

Machine Learning · Computer Science 2024-11-27 Andrea Kang, Jun Yu Chen, Zoe Lee-Youngzie, Shuhao Fu