Related papers: Scaling Efficient LLMs

Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington, Leo Phan, Chris Pierre Paul, Evan Shoemaker, Priyanka Ranade, Torstein Collett, Grant Hodgson Perez, Christopher Krieger

Scaling-laws for Large Time-series Models

Scaling laws for large language models (LLMs) have provided useful guidance in training ever larger models for predictable performance gains. Time series forecasting shares a similar sequential structure to language, and is amenable to…

Machine Learning · Computer Science 2025-01-09 Thomas D. P. Edwards, James Alvey, Justin Alsing, Nam H. Nguyen, Benjamin D. Wandelt

Continued AI Scaling Requires Repeated Efficiency Doublings

This paper argues that continued AI scaling requires repeated efficiency doublings. Classical AI scaling laws remain useful because they make progress predictable despite diminishing returns, but the compute variable in those laws is best…

Machine Learning · Computer Science 2026-04-10 Chien-Ping Lu

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly…

Computation and Language · Computer Science 2024-11-07 Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre

On the origin of neural scaling laws: from random graphs to natural language

Scaling laws have played a major role in the modern AI revolution, providing practitioners predictive power over how the model performance will improve with increasing data, compute, and number of model parameters. This has spurred an…

Machine Learning · Computer Science 2026-01-16 Maissam Barkeshli, Alberto Alfarano, Andrey Gromov

Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers

Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their…

Machine Learning · Computer Science 2021-10-07 Narsimha Chilkuri, Eric Hunsberger, Aaron Voelker, Gurshaant Malik, Chris Eliasmith

Parallel Scaling Law for Language Models

It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more…

Machine Learning · Computer Science 2025-05-16 Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu

Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct…

Machine Learning · Computer Science 2025-12-10 Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant…

Machine Learning · Computer Science 2024-09-05 Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, Irina Rish

Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple…

Artificial Intelligence · Computer Science 2025-11-11 Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Hangfan Zhang, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, Shuyue Hu

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This…

Machine Learning · Computer Science 2025-02-18 Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein

Language models scale reliably with over-training and on downstream tasks

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are…

Computation and Language · Computer Science 2024-06-18 Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the ability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong…

Computation and Language · Computer Science 2025-08-14 Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng

Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches

Recently, Large Language Models (LLMs) have achieved remarkable success. A key factor behind this success is the scaling law observed by OpenAI. Specifically, for models with Transformer architecture, the test loss exhibits a power-law…

Machine Learning · Computer Science 2025-03-04 Yifang Chen, Xuyang Guo, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song

Scaling Laws and In-Context Learning: A Unified Theoretical Framework

In-context learning (ICL) enables large language models to adapt to new tasks from demonstrations without parameter updates. Despite extensive empirical studies, a principled understanding of ICL emergence at scale remains more elusive. We…

Machine Learning · Computer Science 2025-11-11 Sushant Mehta, Ishan Gupta

Reversing Large Language Models for Efficient Training and Fine-Tuning

Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, oftentimes, LLMs are fine-tuned to address a specific task, given the pretrained weights of a pre-trained LLM considered a foundation model. In…

Computation and Language · Computer Science 2025-12-05 Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber

RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals

Transformers have achieved great success in effectively processing sequential data such as text. Their architecture consisting of several attention and feedforward blocks can model relations between elements of a sequence in parallel…

Machine Learning · Computer Science 2025-02-20 Jaemu Heo, Eldor Fozilov, Hyunmin Song, Taehwan Kim

Learning, Fast and Slow: Towards LLMs That Adapt Continually

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of…

Machine Learning · Computer Science 2026-05-15 Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia, Kurt Keutzer, Inderjit S Dhillon, Rishabh Agarwal, Devvrit Khatri

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

Neural scaling laws define a predictable relationship between a model's parameter count and its performance after training in the form of a power law. However, most research to date has not explicitly investigated whether scaling laws can…

Computation and Language · Computer Science 2022-10-19 Maor Ivgi, Yair Carmon, Jonathan Berant

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

In recent years, Large Language Models (LLMs) have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In…

Computation and Language · Computer Science 2024-10-03 Wenzhen Zheng, Wenbo Pan, Xu Xu, Libo Qin, Li Yue, Ming Zhou