Related papers: Grokking Modular Polynomials

Grokking modular arithmetic

We present a simple neural network that can learn modular arithmetic tasks and exhibits a sudden jump in generalization known as "grokking". Concretely, we present (i) fully-connected two-layer networks that exhibit grokking on various…

Machine Learning · Computer Science 2023-01-10 Andrey Gromov

Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding

Grokking is the intriguing phenomenon where a model learns to generalize long after it has fit the training data. We show both analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks in a…

Machine Learning · Statistics 2024-02-06 Noam Levi, Alon Beck, Yohai Bar-Sinai

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great…

Machine Learning · Computer Science 2022-01-07 Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra

NeuralGrok: Accelerate Grokking by Neural Gradient Transformation

Grokking is proposed and widely studied as an intricate phenomenon in which generalization is achieved after a long-lasting period of overfitting. In this work, we propose NeuralGrok, a novel gradient-based approach that learns an optimal…

Machine Learning · Computer Science 2025-04-28 Xinyu Zhou, Simin Fan, Martin Jaggi, Jie Fu

Explaining grokking through circuit efficiency

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation. We propose that grokking…

Machine Learning · Computer Science 2023-09-06 Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar

To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets

Robust generalization is a major challenge in deep learning, particularly when the number of trainable parameters is very large. In general, it is very difficult to know if the network has memorized a particular set of examples or…

Machine Learning · Computer Science 2024-03-06 Darshil Doshi, Aritra Das, Tianyu He, Andrey Gromov

Deep Grokking: Would Deep Neural Networks Generalize Better?

Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise of the network's generalization accuracy on the test set,…

Machine Learning · Computer Science 2024-05-31 Simin Fan, Razvan Pascanu, Martin Jaggi

Latent Algorithmic Structure Precedes Grokking: A Mechanistic Study of ReLU MLPs on Modular Arithmetic

Grokking, the phenomenon where validation accuracy of neural networks on modular addition of two integers rises long after training data has been memorized, has been characterized in previous works as producing sinusoidal input weight…

Machine Learning · Computer Science 2026-03-26 Anand Swaroop

Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition

We present a theoretical explanation of the "grokking" phenomenon, where a model generalizes long after overfitting, for the originally-studied problem of modular addition. First, we show that early in gradient descent, when the "kernel…

Machine Learning · Computer Science 2024-07-18 Mohamad Amin Mohamadi, Zhiyuan Li, Lei Wu, Danica J. Sutherland

Mechanistic Insights into Grokking from the Embedding Layer

Grokking, a delayed generalization in neural networks after perfect training performance, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking:…

Machine Learning · Computer Science 2025-05-22 H. V. AlquBoj, Hilal AlQuabeh, Velibor Bojkovic, Munachiso Nwadike, Kentaro Inui

Breaking Neural Network Scaling Laws with Modularity

Modular neural networks outperform nonmodular neural networks on tasks ranging from visual question answering to robotics. These performance improvements are thought to be due to modular networks' superior ability to model the compositional…

Machine Learning · Computer Science 2025-03-12 Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete

Understanding Grokking Through A Robustness Viewpoint

Recently, an interesting phenomenon called grokking has gained much attention, where generalization occurs long after the models have initially overfitted the training data. We try to understand this seemingly strange phenomenon through the…

Machine Learning · Computer Science 2024-02-05 Zhiquan Tan, Weiran Huang

Pruned Neural Networks are Surprisingly Modular

The learned weights of a neural network are often considered devoid of scrutable internal structure. To discern structure in these weights, we introduce a measurable notion of modularity for multi-layer perceptrons (MLPs), and investigate…

Neural and Evolutionary Computing · Computer Science 2022-02-09 Daniel Filan, Shlomi Hod, Cody Wild, Andrew Critch, Stuart Russell

Grokking as Compression: A Nonlinear Complexity Perspective

We attribute grokking, the phenomenon where generalization is much delayed after memorization, to compression. To do so, we define linear mapping number (LMN) to measure network complexity, which is a generalized version of linear region…

Machine Learning · Computer Science 2023-10-10 Ziming Liu, Ziqian Zhong, Max Tegmark

Grokking in the Ising Model

Delayed generalization, termed grokking, in a machine learning calculation occurs when the increase in test accuracy is delayed relative to the training accuracy. This paper examines grokking in the context of a dense neural network trained…

Disordered Systems and Neural Networks · Physics 2026-02-06 Karolina Hutchison, David Yevick

Omnigrok: Grokking Beyond Algorithmic Data

Grokking, the unusual phenomenon for algorithmic datasets where generalization happens long after overfitting the training data, has remained elusive. We aim to understand grokking by analyzing the loss landscapes of neural networks,…

Machine Learning · Computer Science 2023-03-24 Ziming Liu, Eric J. Michaud, Max Tegmark

Memorize or generalize? Searching for a compositional RNN in a haystack

Neural networks are very powerful learning systems, but they do not readily generalize from one task to the other. This is partly due to the fact that they do not learn in a compositional way, that is, by discovering skills that are shared…

Artificial Intelligence · Computer Science 2018-07-27 Adam Liška, Germán Kruszewski, Marco Baroni

Grokking Beyond the Euclidean Norm of Model Parameters

Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or…

Machine Learning · Computer Science 2025-07-14 Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau

Learning words in groups: fusion algebras, tensor ranks and grokking

In this work, we demonstrate that a simple two-layer neural network with standard activation functions can learn an arbitrary word operation in any finite group, provided sufficient width is available and exhibits grokking while doing so.…

Machine Learning · Computer Science 2025-09-09 Maor Shutman, Oren Louidor, Ran Tessler

Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity

In some settings neural networks exhibit a phenomenon known as grokking, where they achieve perfect or near-perfect accuracy on the validation set long after the same performance has been achieved on the training set. In this…

Machine Learning · Computer Science 2024-04-02 Jack Miller, Charles O'Neill, Thang Bui