Related papers: Task Oriented In-Domain Data Augmentation

Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities

The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could…

Computation and Language · Computer Science 2025-02-03 Yaping Chai, Haoran Xie, Joe S. Qin

Multi-Stage Pre-training for Low-Resource Domain Adaptation

Transfer learning techniques are particularly useful in NLP tasks where a sizable amount of high-quality annotated data is difficult to obtain. Current approaches directly adapt a pre-trained language model (LM) on in-domain text before…

Computation and Language · Computer Science 2020-10-13 Rong Zhang, Revanth Gangi Reddy, Md Arafat Sultan, Vittorio Castelli, Anthony Ferritto, Radu Florian, Efsun Sarioglu Kayi, Salim Roukos, Avirup Sil, Todd Ward

TAIA: Large Language Models are Out-of-Distribution Data Learners

Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or…

Computation and Language · Computer Science 2024-10-18 Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks

Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the…

Computation and Language · Computer Science 2020-05-07 Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith

Empowering Large Language Models for Textual Data Augmentation

With the capabilities of understanding and executing natural language instructions, large language models (LLMs) can potentially act as a powerful tool for textual data augmentation. However, the quality of augmented data depends heavily on…

Computation and Language · Computer Science 2024-04-30 Yichuan Li, Kaize Ding, Jianling Wang, Kyumin Lee

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies indicate that excessive IT data can degrade LLM performance, while carefully selecting…

Computation and Language · Computer Science 2026-03-16 Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Min Yang, Shujian Huang, Lidia S. Chao, Derek F. Wong

TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use

Large language models (LLMs) achieve remarkable advancements by leveraging tools to interact with environments, a critical step toward generalized AI. However, the standard supervised fine-tuning (SFT) approach, which relies on large-scale…

Computation and Language · Computer Science 2025-08-27 Junjie Ye, Yilong Wu, Sixian Li, Yuming Yang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan, Zhengyin Du

Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords

We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target…

Computation and Language · Computer Science 2023-07-17 Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour

MAML-en-LLM: Model Agnostic Meta-Training of LLMs for Improved In-Context Learning

Adapting large language models (LLMs) to unseen tasks with in-context training samples without fine-tuning remains an important research problem. To learn a robust LLM that adapts well to unseen tasks, multiple meta-training approaches have…

Computation and Language · Computer Science 2024-05-21 Sanchit Sinha, Yuguang Yue, Victor Soto, Mayank Kulkarni, Jianhua Lu, Aidong Zhang

Thinking Augmented Pre-training

This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an…

Computation and Language · Computer Science 2025-10-20 Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei

A Compact Pretraining Approach for Neural Language Models

Domain adaptation for large neural language models (NLMs) is coupled with massive amounts of unstructured data in the pretraining phase. In this study, however, we show that pretrained NLMs learn in-domain information more effectively and…

Computation and Language · Computer Science 2022-08-30 Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour

TART: A plug-and-play Transformer module for task-agnostic reasoning

Large language models (LLMs) exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the…

Machine Learning · Computer Science 2023-06-14 Kush Bhatia, Avanika Narayan, Christopher De Sa, Christopher Ré

In-Place Test-Time Training

The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT)…

Machine Learning · Computer Science 2026-04-08 Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai

Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance -- A Case Study in Finance

The application of large language models (LLMs) in domain-specific contexts, including finance, has expanded rapidly. Domain-specific LLMs are typically evaluated based on their performance in various downstream tasks relevant to the…

Artificial Intelligence · Computer Science 2024-12-06 Meni Brief, Oded Ovadia, Gil Shenderovitz, Noga Ben Yoash, Rachel Lemberg, Eitam Sheetrit

Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud

Specializing LLMs in various domain-specific tasks has emerged as a critical step towards achieving high performance. However, the construction and annotation of datasets in specific domains are always very costly. Apart from using superior…

Computation and Language · Computer Science 2024-12-09 Yuanhao Yue, Chengyu Wang, Jun Huang, Peng Wang

Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation

Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge…

Computation and Language · Computer Science 2025-09-30 Chaojun Nie, Jun Zhou, Guanxiang Wang, Shisong Wu, Zichen Wang

Fine-tuning Large Language Models for Domain-specific Machine Translation

Large language models (LLMs) have shown great potential in domain-specific machine translation (MT). However, one major issue is that LLMs pre-trained on general domain corpus might not generalize well to specific domains due to the lack of…

Computation and Language · Computer Science 2024-12-18 Jiawei Zheng, Hanghai Hong, Feiyan Liu, Xiaoli Wang, Jingsong Su, Yonggui Liang, Shikai Wu

Test-Time Learning for Large Language Models

While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known…

Computation and Language · Computer Science 2025-05-28 Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, Mingkui Tan

Rethink the Effectiveness of Text Data Augmentation: An Empirical Analysis

In recent years, language models (LMs) have made remarkable progress in advancing the field of natural language processing (NLP). However, the impact of data augmentation (DA) techniques on the fine-tuning (FT) performance of these LMs has…

Computation and Language · Computer Science 2023-06-14 Zhengxiang Shi, Aldo Lipani

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science 2023-08-24 Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos