Related papers: Capacity-Aware Mixture Law Enables Efficient LLM D…

InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do…

Computation and Language · Computer Science 2026-05-05 Fengze Liu, Weidong Zhou, Binbin Liu, Ping Guo, Zijun Wang, Bingni Zhang, Yifan Zhang, Yifeng Yu, Xiaohuan Zhou, Taifeng Wang

Scaling Laws for Optimal Data Mixtures

Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on…

Machine Learning · Computer Science 2025-10-03 Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin

Scaling Laws for Mixture Pretraining Under Data Constraints

As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable…

Machine Learning · Computer Science 2026-05-18 Anastasiia Sedova, Skyler Seto, Natalie Schluter, Pierre Ablin

The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models

Recent advances in large language models (LLMs) have been largely driven by scaling laws for individual models, which predict performance improvements as model parameters and data volume increase. However, the capabilities of any single LLM…

Machine Learning · Computer Science 2026-01-29 Dakuan Lu, Jiaqi Zhang, Cheng Yuan, Jiawei Shao, Xuelong Li

BiMix: A Bivariate Data Mixing Law for Language Model Pretraining

Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly…

Machine Learning · Computer Science 2025-01-28 Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Pretraining data of large language models composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or…

Computation and Language · Computer Science 2025-03-21 Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, Xipeng Qiu

Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the…

Computation and Language · Computer Science 2025-10-22 Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

Merge to Mix: Mixing Datasets via Model Merging

Mixing datasets for fine-tuning large models (LMs) has become critical for maximizing performance on downstream tasks. However, composing effective dataset mixtures typically relies on heuristics and trial-and-error, often requiring…

Machine Learning · Computer Science 2025-05-23 Zhixu Silvia Tao, Kasper Vinken, Hao-Wei Yeh, Avi Cooper, Xavier Boix

Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework

Careful curation of data sources can significantly improve the performance of LLM pre-training, but predominant approaches rely heavily on intuition or costly trial-and-error, making them difficult to generalize across different data…

Machine Learning · Computer Science 2025-03-28 Thomson Yen, Andrew Wei Tung Siah, Haozhe Chen, Tianyi Peng, Daniel Guetta, Hongseok Namkoong

MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging

Optimizing data mixtures is essential for unlocking the full potential of large language models (LLMs), yet identifying the optimal composition remains computationally prohibitive due to reliance on heuristic trials or expensive proxy…

Machine Learning · Computer Science 2026-01-27 Jiapeng Wang, Changxin Tian, Kunlong Chen, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou

Densing Law of LLMs

Large Language Models (LLMs) have emerged as a milestone in artificial intelligence, and their performance can improve as the model size increases. However, this scaling brings great challenges to training and inference efficiency,…

Artificial Intelligence · Computer Science 2024-12-09 Chaojun Xiao, Jie Cai, Weilin Zhao, Guoyang Zeng, Biyuan Lin, Jie Zhou, Zhi Zheng, Xu Han, Zhiyuan Liu, Maosong Sun

CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpus. Continual pre-training (CPT) enhances LLM capabilities by imbuing new domain-specific or…

Computation and Language · Computer Science 2024-10-08 Jiawei Gu, Zacc Yang, Chuanghao Ding, Rui Zhao, Fei Tan

AutoMixAlign: Adaptive Data Mixing for Multi-Task Preference Optimization in LLMs

When aligning large language models (LLMs), their performance on various tasks (such as being helpful, harmless, and honest) depends heavily on the composition of their training data. However, selecting a data mixture that achieves strong…

Machine Learning · Computer Science 2025-06-03 Nicholas E. Corrado, Julian Katz-Samuels, Adithya Devraj, Hyokun Yun, Chao Zhang, Yi Xu, Yi Pan, Bing Yin, Trishul Chilimbi

Predicting Task Performance with Context-aware Scaling Laws

Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture…

Computation and Language · Computer Science 2025-10-17 Kyle Montgomery, David Park, Jianhong Tu, Michael Bendersky, Beliz Gunel, Dawn Song, Chenguang Wang

Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training…

Machine Learning · Computer Science 2026-05-12 Xinran Gu, Kaifeng Lyu, Jiazheng Li, Jingzhao Zhang

A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets

We introduce a scaling law for fine-tuning large language models (LLMs) under fixed compute budgets that explicitly accounts for data composition. Conventional approaches measure training data solely by total tokens, yet the number of…

Computation and Language · Computer Science 2025-06-04 Ryan Lagasse, Aidan Kierans, Avijit Ghosh, Shiri Dori-Hacohen

Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models

We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture via a Mixture of Data Experts (MDE). We use this approximation as a…

Machine Learning · Computer Science 2025-02-25 Lior Belenki, Alekh Agarwal, Tianze Shi, Kristina Toutanova

Cache Management for Mixture-of-Experts LLMs -- extended version

Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. One of the main challenges towards the successful deployment of LLMs is memory management, since they typically involve billions of…

Machine Learning · Computer Science 2025-09-03 Spyros Angelopoulos, Loris Marchal, Adrien Obrecht, Bertrand Simon

Scaling Laws for Upcycling Mixture-of-Experts Language Models

Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches of mitigating such computational demands: reusing smaller models to train…

Machine Learning · Computer Science 2025-06-17 Seng Pei Liew, Takuya Kato, Sho Takase

APreQEL: Adaptive Mixed Precision Quantization For Edge LLMs

Today, large language models have demonstrated their strengths in various tasks ranging from reasoning, code generation, and complex problem solving. However, this advancement comes with a high computational cost and memory requirements,…

Machine Learning · Computer Science 2026-03-26 Meriem Bouzouad, Yuan-Hao Chang, Jalil Boukhobza