Related papers: Data-Augmentation-Based Dialectal Adaptation for L…

LLM-powered Data Augmentation for Enhanced Cross-lingual Performance

This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited. To achieve this, we utilise several…

Computation and Language · Computer Science 2023-10-24 Chenxi Whitehouse, Monojit Choudhury, Alham Fikri Aji

Diversity-oriented Data Augmentation with Large Language Models

Data augmentation is an essential technique in natural language processing (NLP) for enriching training datasets by generating diverse samples. This process is crucial for improving the robustness and generalization capabilities of NLP…

Computation and Language · Computer Science 2025-10-16 Zaitian Wang, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu, Pengfei Wang, Yuanchun Zhou

Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges

In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This…

Computation and Language · Computer Science 2024-07-03 Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wenhan Xia, Junjie Hu, Anh Tuan Luu, Shafiq Joty

Enhancing SLM via ChatGPT and Dataset Augmentation

This paper explores the enhancement of small language models through strategic dataset augmentation via ChatGPT-3.5-Turbo, in the domain of Natural Language Inference (NLI). By employing knowledge distillation-based techniques and synthetic…

Computation and Language · Computer Science 2024-09-20 Tom Pieper, Mohamad Ballout, Ulf Krumnack, Gunther Heidemann, Kai-Uwe Kühnberger

DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models

Most of the world's languages and dialects are low-resource, and lack support in mainstream machine translation (MT) models. However, many of them have a closely-related high-resource language (HRL) neighbor, and differ in linguistically…

Computation and Language · Computer Science 2025-10-22 Niyati Bafna, Emily Chang, Nathaniel R. Robinson, David R. Mortensen, Kenton Murray, David Yarowsky, Hale Sirin

Multimodal Large Language Models for Image, Text, and Speech Data Augmentation: A Survey

In the past five years, research has shifted from traditional Machine Learning (ML) and Deep Learning (DL) approaches to leveraging Large Language Models (LLMs), including multimodality, for data augmentation to enhance generalization, and…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Ranjan Sapkota, Shaina Raza, Maged Shoman, Achyut Paudel, Manoj Karkee

Data Augmentation Integrating Dialogue Flow and Style to Adapt Spoken Dialogue Systems to Low-Resource User Groups

This study addresses the interaction challenges encountered by spoken dialogue systems (SDSs) when engaging with users who exhibit distinct conversational behaviors, particularly minors, in scenarios where data are scarce. We propose a…

Computation and Language · Computer Science 2024-08-21 Zhiyang Qi, Michimasa Inaba

Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding

Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high-resource ones. Unfortunately, dialect variations are still underrepresented due to limited data and linguistic variation.…

Computation and Language · Computer Science 2026-02-11 Abdulhai Alali, Abderrahmane Issam

Fine-Tuning LLMs for Low-Resource Dialect Translation: The Case of Lebanese

This paper examines the effectiveness of Large Language Models (LLMs) in translating the low-resource Lebanese dialect, focusing on the impact of culturally authentic data versus larger translated datasets. We compare three fine-tuning…

Computation and Language · Computer Science 2025-05-02 Silvana Yakhni, Ali Chehab

Improving Small Language Models on PubMedQA via Generative Data Augmentation

Large Language Models (LLMs) have made remarkable advancements in the field of natural language processing. However, their increasing size poses challenges in terms of computational cost. On the other hand, Small Language Models (SLMs) are…

Computation and Language · Computer Science 2023-08-03 Zhen Guo, Peiqi Wang, Yanwei Wang, Shangdi Yu

Quantifying the Dialect Gap and its Correlates Across Languages

Historically, researchers and consumers have noticed a decrease in quality when applying NLP tools to minority variants of languages (e.g., Puerto Rican Spanish or Swiss German), but studies exploring this have been limited to a select few…

Computation and Language · Computer Science 2023-10-24 Anjali Kantharuban, Ivan Vulić, Anna Korhonen

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of…

Computation and Language · Computer Science 2024-07-16 Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipalli, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Adaptive Augmentation Policy Optimization with LLM Feedback

Data augmentation is a critical component of deep learning pipelines, enhancing model generalization by increasing dataset diversity. Traditional augmentation strategies rely on manually designed transformations, stochastic sampling, or…

Computer Vision and Pattern Recognition · Computer Science 2025-08-06 Ant Duru, Alptekin Temizel

Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data

Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data…

Computation and Language · Computer Science 2025-10-31 Zhenqing Ling, Daoyuan Chen, Liuyi Yao, Qianli Shen, Yaliang Li, Ying Shen

Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages

Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data. While current state-of-the-art large language models (LLMs) still struggle with LRLs, smaller multilingual models (mLMs)…

Computation and Language · Computer Science 2025-02-17 Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann

CultureLLM: Incorporating Cultural Differences into Large Language Models

Large language models (LLMs) are reported to be partial to certain cultures owing to the training data dominance from the English corpora. Since multilingual cultural data are often expensive to collect, existing efforts handle this by…

Computation and Language · Computer Science 2024-12-04 Cheng Li, Mengzhou Chen, Jindong Wang, Sunayana Sitaram, Xing Xie

Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on…

Computation and Language · Computer Science 2026-04-21 H S V N S Kowndinya Renduchintala, Sumit Bhatia

Empowering Large Language Models for Textual Data Augmentation

With the capabilities of understanding and executing natural language instructions, Large language models (LLMs) can potentially act as a powerful tool for textual data augmentation. However, the quality of augmented data depends heavily on…

Computation and Language · Computer Science 2024-04-30 Yichuan Li, Kaize Ding, Jianling Wang, Kyumin Lee

Controllable and Diverse Data Augmentation with Large Language Model for Low-Resource Open-Domain Dialogue Generation

Data augmentation (DA) is crucial to mitigate model training instability and over-fitting problems in low-resource open-domain dialogue generation. However, traditional DA methods often neglect semantic data diversity, restricting the…

Computation and Language · Computer Science 2024-04-02 Zhenhua Liu, Tong Zhu, Jianxiang Xiang, Wenliang Chen

LPC Augment: An LPC-Based ASR Data Augmentation Algorithm for Low and Zero-Resource Children's Dialects

This paper proposes a novel linear prediction coding-based data augmentation method for children's low- and zero-resource dialect ASR. The data augmentation procedure consists of perturbing the formant peaks of the LPC spectrum during LPC…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-23 Alexander Johnson, Ruchao Fan, Robin Morris, Abeer Alwan