Related papers: How predictable is language model benchmark perfor…

How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench

We investigate the predictability of large language model (LLM) capabilities: given records of past experiments using different model families, numbers of parameters, tasks, and numbers of in-context examples, can we accurately predict LLM…

Computation and Language · Computer Science 2023-11-01 Qinyuan Ye , Harvey Yiyun Fu , Xiang Ren , Robin Jia

General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it…

Artificial Intelligence · Computer Science 2025-03-18 Lexin Zhou , Lorenzo Pacchiardi , Fernando Martínez-Plumed , Katherine M. Collins , Yael Moros-Daval , Seraphina Zhang , Qinlin Zhao , Yitian Huang , Luning Sun , Jonathan E. Prunty , Zongqian Li , Pablo Sánchez-García , Kexin Jiang Chen , Pablo A. M. Casares , Jiyun Zu , John Burden , Behzad Mehrbakhsh , David Stillwell , Manuel Cebrian , Jindong Wang , Peter Henderson , Sherry Tongshuang Wu , Patrick C. Kyllonen , Lucy Cheke , Xing Xie , José Hernández-Orallo

Observational Scaling Laws and the Predictability of Language Model Performance

Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different…

Machine Learning · Computer Science 2024-10-03 Yangjun Ruan , Chris J. Maddison , Tatsunori Hashimoto

A Survey on Large Language Model Benchmarks

In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model…

Computation and Language · Computer Science 2025-08-22 Shiwen Ni , Guhong Chen , Shuaimin Li , Xuanang Chen , Siyi Li , Bingli Wang , Qiyao Wang , Xingjian Wang , Yifan Zhang , Liyang Fan , Chengming Li , Ruifeng Xu , Le Sun , Min Yang

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform…

Computation and Language · Computer Science 2023-06-13 Aarohi Srivastava , Abhinav Rastogi , Abhishek Rao , Abu Awal Md Shoeb , Abubakar Abid , Adam Fisch , Adam R. Brown , Adam Santoro , Aditya Gupta , Adrià Garriga-Alonso , Agnieszka Kluska , Aitor Lewkowycz , Akshat Agarwal , Alethea Power , Alex Ray , Alex Warstadt , Alexander W. Kocurek , Ali Safaya , Ali Tazarv , Alice Xiang , Alicia Parrish , Allen Nie , Aman Hussain , Amanda Askell , Amanda Dsouza , Ambrose Slone , Ameet Rahane , Anantharaman S. Iyer , Anders Andreassen , Andrea Madotto , Andrea Santilli , Andreas Stuhlmüller , Andrew Dai , Andrew La , Andrew Lampinen , Andy Zou , Angela Jiang , Angelica Chen , Anh Vuong , Animesh Gupta , Anna Gottardi , Antonio Norelli , Anu Venkatesh , Arash Gholamidavoodi , Arfa Tabassum , Arul Menezes , Arun Kirubarajan , Asher Mullokandov , Ashish Sabharwal , Austin Herrick , Avia Efrat , Aykut Erdem , Ayla Karakaş , B. Ryan Roberts , Bao Sheng Loe , Barret Zoph , Bartłomiej Bojanowski , Batuhan Özyurt , Behnam Hedayatnia , Behnam Neyshabur , Benjamin Inden , Benno Stein , Berk Ekmekci , Bill Yuchen Lin , Blake Howald , Bryan Orinion , Cameron Diao , Cameron Dour , Catherine Stinson , Cedrick Argueta , César Ferri Ramírez , Chandan Singh , Charles Rathkopf , Chenlin Meng , Chitta Baral , Chiyu Wu , Chris Callison-Burch , Chris Waites , Christian Voigt , Christopher D. Manning , Christopher Potts , Cindy Ramirez , Clara E. 
Rivera , Clemencia Siro , Colin Raffel , Courtney Ashcraft , Cristina Garbacea , Damien Sileo , Dan Garrette , Dan Hendrycks , Dan Kilman , Dan Roth , Daniel Freeman , Daniel Khashabi , Daniel Levy , Daniel Moseguí González , Danielle Perszyk , Danny Hernandez , Danqi Chen , Daphne Ippolito , Dar Gilboa , David Dohan , David Drakard , David Jurgens , Debajyoti Datta , Deep Ganguli , Denis Emelin , Denis Kleyko , Deniz Yuret , Derek Chen , Derek Tam , Dieuwke Hupkes , Diganta Misra , Dilyar Buzan , Dimitri Coelho Mollo , Diyi Yang , Dong-Ho Lee , Dylan Schrader , Ekaterina Shutova , Ekin Dogus Cubuk , Elad Segal , Eleanor Hagerman , Elizabeth Barnes , Elizabeth Donoway , Ellie Pavlick , Emanuele Rodola , Emma Lam , Eric Chu , Eric Tang , Erkut Erdem , Ernie Chang , Ethan A. Chi , Ethan Dyer , Ethan Jerzak , Ethan Kim , Eunice Engefu Manyasi , Evgenii Zheltonozhskii , Fanyue Xia , Fatemeh Siar , Fernando Martínez-Plumed , Francesca Happé , Francois Chollet , Frieda Rong , Gaurav Mishra , Genta Indra Winata , Gerard de Melo , Germán Kruszewski , Giambattista Parascandolo , Giorgio Mariani , Gloria Wang , Gonzalo Jaimovitch-López , Gregor Betz , Guy Gur-Ari , Hana Galijasevic , Hannah Kim , Hannah Rashkin , Hannaneh Hajishirzi , Harsh Mehta , Hayden Bogar , Henry Shevlin , Hinrich Schütze , Hiromu Yakura , Hongming Zhang , Hugh Mee Wong , Ian Ng , Isaac Noble , Jaap Jumelet , Jack Geissinger , Jackson Kernion , Jacob Hilton , Jaehoon Lee , Jaime Fernández Fisac , James B. Simon , James Koppel , James Zheng , James Zou , Jan Kocoń , Jana Thompson , Janelle Wingfield , Jared Kaplan , Jarema Radom , Jascha Sohl-Dickstein , Jason Phang , Jason Wei , Jason Yosinski , Jekaterina Novikova , Jelle Bosscher , Jennifer Marsh , Jeremy Kim , Jeroen Taal , Jesse Engel , Jesujoba Alabi , Jiacheng Xu , Jiaming Song , Jillian Tang , Joan Waweru , John Burden , John Miller , John U. 
Balis , Jonathan Batchelder , Jonathan Berant , Jörg Frohberg , Jos Rozen , Jose Hernandez-Orallo , Joseph Boudeman , Joseph Guerr , Joseph Jones , Joshua B. Tenenbaum , Joshua S. Rule , Joyce Chua , Kamil Kanclerz , Karen Livescu , Karl Krauth , Karthik Gopalakrishnan , Katerina Ignatyeva , Katja Markert , Kaustubh D. Dhole , Kevin Gimpel , Kevin Omondi , Kory Mathewson , Kristen Chiafullo , Ksenia Shkaruta , Kumar Shridhar , Kyle McDonell , Kyle Richardson , Laria Reynolds , Leo Gao , Li Zhang , Liam Dugan , Lianhui Qin , Lidia Contreras-Ochando , Louis-Philippe Morency , Luca Moschella , Lucas Lam , Lucy Noble , Ludwig Schmidt , Luheng He , Luis Oliveros Colón , Luke Metz , Lütfi Kerem Şenel , Maarten Bosma , Maarten Sap , Maartje ter Hoeve , Maheen Farooqi , Manaal Faruqui , Mantas Mazeika , Marco Baturan , Marco Marelli , Marco Maru , Maria Jose Ramírez Quintana , Marie Tolkiehn , Mario Giulianelli , Martha Lewis , Martin Potthast , Matthew L. Leavitt , Matthias Hagen , Mátyás Schubert , Medina Orduna Baitemirova , Melody Arnaud , Melvin McElrath , Michael A. Yee , Michael Cohen , Michael Gu , Michael Ivanitskiy , Michael Starritt , Michael Strube , Michał Swędrowski , Michele Bevilacqua , Michihiro Yasunaga , Mihir Kale , Mike Cain , Mimee Xu , Mirac Suzgun , Mitch Walker , Mo Tiwari , Mohit Bansal , Moin Aminnaseri , Mor Geva , Mozhdeh Gheini , Mukund Varma T , Nanyun Peng , Nathan A. Chi , Nayeon Lee , Neta Gur-Ari Krakover , Nicholas Cameron , Nicholas Roberts , Nick Doiron , Nicole Martinez , Nikita Nangia , Niklas Deckers , Niklas Muennighoff , Nitish Shirish Keskar , Niveditha S. 
Iyer , Noah Constant , Noah Fiedel , Nuan Wen , Oliver Zhang , Omar Agha , Omar Elbaghdadi , Omer Levy , Owain Evans , Pablo Antonio Moreno Casares , Parth Doshi , Pascale Fung , Paul Pu Liang , Paul Vicol , Pegah Alipoormolabashi , Peiyuan Liao , Percy Liang , Peter Chang , Peter Eckersley , Phu Mon Htut , Pinyu Hwang , Piotr Miłkowski , Piyush Patil , Pouya Pezeshkpour , Priti Oli , Qiaozhu Mei , Qing Lyu , Qinlang Chen , Rabin Banjade , Rachel Etta Rudolph , Raefer Gabriel , Rahel Habacker , Ramon Risco , Raphaël Millière , Rhythm Garg , Richard Barnes , Rif A. Saurous , Riku Arakawa , Robbe Raymaekers , Robert Frank , Rohan Sikand , Roman Novak , Roman Sitelew , Ronan LeBras , Rosanne Liu , Rowan Jacobs , Rui Zhang , Ruslan Salakhutdinov , Ryan Chi , Ryan Lee , Ryan Stovall , Ryan Teehan , Rylan Yang , Sahib Singh , Saif M. Mohammad , Sajant Anand , Sam Dillavou , Sam Shleifer , Sam Wiseman , Samuel Gruetter , Samuel R. Bowman , Samuel S. Schoenholz , Sanghyun Han , Sanjeev Kwatra , Sarah A. Rous , Sarik Ghazarian , Sayan Ghosh , Sean Casey , Sebastian Bischoff , Sebastian Gehrmann , Sebastian Schuster , Sepideh Sadeghi , Shadi Hamdan , Sharon Zhou , Shashank Srivastava , Sherry Shi , Shikhar Singh , Shima Asaadi , Shixiang Shane Gu , Shubh Pachchigar , Shubham Toshniwal , Shyam Upadhyay , Shyamolima , Debnath , Siamak Shakeri , Simon Thormeyer , Simone Melzi , Siva Reddy , Sneha Priscilla Makini , Soo-Hwan Lee , Spencer Torene , Sriharsha Hatwar , Stanislas Dehaene , Stefan Divic , Stefano Ermon , Stella Biderman , Stephanie Lin , Stephen Prasad , Steven T. Piantadosi , Stuart M. 
Shieber , Summer Misherghi , Svetlana Kiritchenko , Swaroop Mishra , Tal Linzen , Tal Schuster , Tao Li , Tao Yu , Tariq Ali , Tatsu Hashimoto , Te-Lin Wu , Théo Desbordes , Theodore Rothschild , Thomas Phan , Tianle Wang , Tiberius Nkinyili , Timo Schick , Timofei Kornev , Titus Tunduny , Tobias Gerstenberg , Trenton Chang , Trishala Neeraj , Tushar Khot , Tyler Shultz , Uri Shaham , Vedant Misra , Vera Demberg , Victoria Nyamai , Vikas Raunak , Vinay Ramasesh , Vinay Uday Prabhu , Vishakh Padmakumar , Vivek Srikumar , William Fedus , William Saunders , William Zhang , Wout Vossen , Xiang Ren , Xiaoyu Tong , Xinran Zhao , Xinyi Wu , Xudong Shen , Yadollah Yaghoobzadeh , Yair Lakretz , Yangqiu Song , Yasaman Bahri , Yejin Choi , Yichi Yang , Yiding Hao , Yifu Chen , Yonatan Belinkov , Yu Hou , Yufang Hou , Yuntao Bai , Zachary Seid , Zhuoye Zhao , Zijian Wang , Zijie J. Wang , Zirui Wang , Ziyi Wu

Scaling Performance of Large Language Model Pretraining

Large language models (LLMs) show best-in-class performance across a wide range of natural language processing applications. Training these models is an extremely computationally expensive task; frontier Artificial Intelligence (AI)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-10 Alexander Interrante-Grant , Carla Varela-Rosa , Suhaas Narayan , Chris Connelly , Albert Reuther

Don't Make Your LLM an Evaluation Benchmark Cheater

Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for…

Computation and Language · Computer Science 2023-11-06 Kun Zhou , Yutao Zhu , Zhipeng Chen , Wentong Chen , Wayne Xin Zhao , Xu Chen , Yankai Lin , Ji-Rong Wen , Jiawei Han

Third-Party Language Model Performance Prediction from Instruction

Language model-based instruction-following systems have lately shown increasing performance on many benchmark tasks, demonstrating the capability of adapting to a broad variety of instructions. However, such systems are often not designed…

Computation and Language · Computer Science 2024-03-20 Rahul Nadkarni , Yizhong Wang , Noah A. Smith

LLMPerf: GPU Performance Modeling meets Large Language Models

Performance modeling, a pivotal domain in program cost analysis, currently relies on manually crafted models constrained by various program and hardware limitations, especially in the intricate landscape of GPGPU. Meanwhile, Large Language…

Performance · Computer Science 2025-03-17 Khoi N. M. Nguyen , Hoang Duy Nguyen Do , Huyen Thao Le , Thanh Tuan Dao

How Benchmark Prediction from Fewer Data Misses the Mark

Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset…

Machine Learning · Computer Science 2025-06-10 Guanhua Zhang , Florian E. Dorner , Moritz Hardt

Large Scale Language Modeling in Automatic Speech Recognition

Large language models have been proven quite beneficial for a variety of automatic speech recognition tasks in Google. We summarize results on Voice Search and a few YouTube speech transcription tasks to highlight the impact that one can…

Computation and Language · Computer Science 2012-11-01 Ciprian Chelba , Dan Bikel , Maria Shugrina , Patrick Nguyen , Shankar Kumar

Predicting Language Models' Success at Zero-Shot Probabilistic Prediction

Recent work has investigated the capabilities of large language models (LLMs) as zero-shot models for generating individual-level characteristics (e.g., to serve as risk models or augment survey datasets). However, when should a user have…

Machine Learning · Computer Science 2025-09-22 Kevin Ren , Santiago Cortes-Gomez , Carlos Miguel Patiño , Ananya Joshi , Ruiqi Lyu , Jingjing Tang , Alistair Turcan , Khurram Yamin , Steven Wu , Bryan Wilder

Language models scale reliably with over-training and on downstream tasks

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are…

Computation and Language · Computer Science 2024-06-18 Samir Yitzhak Gadre , Georgios Smyrnis , Vaishaal Shankar , Suchin Gururangan , Mitchell Wortsman , Rulin Shao , Jean Mercat , Alex Fang , Jeffrey Li , Sedrick Keh , Rui Xin , Marianna Nezhurina , Igor Vasiljevic , Jenia Jitsev , Luca Soldaini , Alexandros G. Dimakis , Gabriel Ilharco , Pang Wei Koh , Shuran Song , Thomas Kollar , Yair Carmon , Achal Dave , Reinhard Heckel , Niklas Muennighoff , Ludwig Schmidt

Evaluating Large Language Models on Controlled Generation Tasks

While recent studies have looked into the abilities of large language models in various benchmark tasks, including question generation, reading comprehension, multilingual and etc, there have been few studies looking into the…

Computation and Language · Computer Science 2023-10-24 Jiao Sun , Yufei Tian , Wangchunshu Zhou , Nan Xu , Qian Hu , Rahul Gupta , John Frederick Wieting , Nanyun Peng , Xuezhe Ma

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Large Language Models (LLMs) have seen great advance in both academia and industry, and their popularity results in numerous open-source frameworks and techniques in accelerating LLM pre-training, fine-tuning, and inference. Training and…

Performance · Computer Science 2023-12-04 Longteng Zhang , Xiang Liu , Zeyu Li , Xinglin Pan , Peijie Dong , Ruibo Fan , Rui Guo , Xin Wang , Qiong Luo , Shaohuai Shi , Xiaowen Chu

Latent Performance Profiling of Large Language Models

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data…

Computation and Language · Computer Science 2026-05-29 Tanmoy Chakraborty , Ayan Sengupta , Suparna Bhattacharya , Partha Pratim Chakrabarti , Amlan Chakrabarti , Supratik Chakraborty , Partha Pratim Das , Lipika Dey , Richa Singh , Mayank Vatsa

Generalization Ability of Feature-based Performance Prediction Models: A Statistical Analysis across Benchmarks

This study examines the generalization ability of algorithm performance prediction models across various benchmark suites. Comparing the statistical similarity between the problem collections with the accuracy of performance prediction…

Machine Learning · Computer Science 2024-05-22 Ana Nikolikj , Ana Kostovska , Gjorgjina Cenikj , Carola Doerr , Tome Eftimov

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks -- which hopefully measure, with some validity, the presence of capabilities that are also challenged in real application. But what makes the…

Computation and Language · Computer Science 2024-06-21 Nidhir Bhavsar , Jonathan Jordan , Sherzod Hakimov , David Schlangen

Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level

Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)? Some work suggests larger models have higher out-of-distribution robustness, while other work suggests they have lower…

Computation and Language · Computer Science 2021-05-14 Ruiqi Zhong , Dhruba Ghosh , Dan Klein , Jacob Steinhardt

Data Similarity is Not Enough to Explain Language Model Performance

Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's…

Computation and Language · Computer Science 2023-11-16 Gregory Yauney , Emily Reif , David Mimno