Related papers: Efficient Embedding-based Synthetic Data Generatio…

200 papers

Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of…

Computation and Language · Computer Science · 2024-06-24 · Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang

Given the inherent class imbalance issue within student performance datasets, samples belonging to the edges of the target class distribution pose a challenge for predictive machine learning algorithms to learn. In this paper, we introduce…

Machine Learning · Computer Science · 2021-01-05 · Dom Huh

This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment…

Computation and Language · Computer Science · 2025-11-21 · Mihai Nadas, Laura Diosan, Andreea Tomescu

Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources…

Machine Learning · Computer Science · 2025-10-28 · Amal Abed, Ivan Lukic, Jörg K. H. Franke, Frank Hutter

In the era of data-driven decision-making, accurate table-level representations and efficient table recommendation systems are becoming increasingly crucial for improving table management, discovery, and analysis. However, existing…

Machine Learning · Computer Science · 2024-11-07 · Dayu Yang, Natawut Monaikul, Amanda Ding, Bozhao Tan, Kishore Mosaliganti, Giri Iyengar

As large language models (LLMs) are applied to more use cases, creating high quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high quality human data has been the most common approach to…

Computation and Language · Computer Science · 2024-10-31 · Yung-Chieh Chan, George Pu, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton

The recent surge in research focused on generating synthetic data from large language models (LLMs), especially for scenarios with limited data availability, marks a notable shift in Generative Artificial Intelligence (AI). Their ability to…

Machine Learning · Computer Science · 2024-03-08 · Xu Guo, Yiqiang Chen

Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative…

Computation and Language · Computer Science · 2023-05-25 · Veniamin Veselovsky, Manoel Horta Ribeiro, Akhil Arora, Martin Josifoski, Ashton Anderson, Robert West

The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the…

Computation and Language · Computer Science · 2024-10-18 · Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, Yunhong Wang

The collection and curation of high-quality training data is crucial for developing text classification models with superior performance, but it is often associated with significant costs and time investment. Researchers have recently…

Computation and Language · Computer Science · 2023-10-16 · Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ming Yin

The in-context learning ability of large language models (LLMs) enables them to generalize to novel downstream tasks with relatively few labeled examples. However, they require enormous computational resources to be deployed. Alternatively,…

Computation and Language · Computer Science · 2024-01-09 · Jean Kaddour, Qi Liu

Recent years have witnessed a surge in the popularity of Machine Learning (ML), applied across diverse domains. However, progress is impeded by the scarcity of training data due to expensive acquisition and privacy legislation. Synthetic…

Machine Learning · Computer Science · 2024-02-05 · André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster

Synthetic data generation has become an increasingly popular way of training models without the need for large, manually labeled datasets. For tasks like text embedding, synthetic data offers diverse and scalable training examples,…

Computation and Language · Computer Science · 2024-11-05 · Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou

This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in…

Training and fine-tuning deep learning models, especially large language models (LLMs), on limited and imbalanced datasets poses substantial challenges. These issues often result in poor generalization, where models overfit to dominant…

Computation and Language · Computer Science · 2025-01-14 · Ashok Choudhary, Cornelius Thiels, Hojjat Salehinejad

Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing…

Machine Learning · Computer Science · 2025-07-22 · Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro

Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is…

Computation and Language · Computer Science · 2024-07-23 · Yinheng Li, Rogerio Bonatti, Sara Abdali, Justin Wagle, Kazuhito Koishida

Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data…

Machine Learning · Computer Science · 2025-03-25 · Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng

Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance, with certain groups of data samples significantly underrepresented, which in turn would…

Machine Learning · Statistics · 2026-02-10 · Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang

Automatic detection of depression is a rapidly growing field of research at the intersection of psychology and machine learning. However, with its exponential interest comes a growing concern for data privacy and scarcity due to the…

Machine Learning · Computer Science · 2024-11-27 · Andrea Kang, Jun Yu Chen, Zoe Lee-Youngzie, Shuhao Fu