Related papers: Local Grammar-Based Coding Revisited

On vocabulary size of grammar-based codes

We discuss inequalities holding between the vocabulary size, i.e., the number of distinct nonterminal symbols in a grammar-based compression for a string, and the excess length of the respective universal code, i.e., the code-based analog…

Information Theory · Computer Science 2020-03-11 Lukasz Debowski

Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models

We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this…

Computation and Language · Computer Science 2025-11-25 Vladimir Berman

On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

The article presents a new interpretation for Zipf-Mandelbrot's law in natural language which rests on two areas of information theory. Firstly, we construct a new class of grammar-based codes and, secondly, we investigate properties of…

Information Theory · Computer Science 2020-03-11 Łukasz Dębowski

Compression and the origins of Zipf's law of abbreviation

Languages across the world exhibit Zipf's law of abbreviation, namely more frequent words tend to be shorter. The generalized version of the law - an inverse relationship between the frequency of a unit and its magnitude - holds also for…

Information Theory · Computer Science 2016-05-05 R. Ferrer-i-Cancho, C. Bentz, C. Seguin

How (Non-)Optimal is the Lexicon?

The mapping of lexical meanings to wordforms is a major feature of natural languages. While usage pressures might assign short words to frequent meanings (Zipf's law of abbreviation), the need for a productive and open-ended vocabulary,…

Computation and Language · Computer Science 2021-05-04 Tiago Pimentel, Irene Nikkarinen, Kyle Mahowald, Ryan Cotterell, Damián Blasi

A decentralized route to the origins of scaling in human language

The Zipf's law establishes that if the words of a (large) text are ordered by decreasing frequency, the frequency versus the rank decreases as a power law with exponent close to $-1$. Previous work has stressed that this pattern arises from…

Physics and Society · Physics 2019-04-03 Felipe Urbina, Javier Vera

Direct and indirect evidence of compression of word lengths. Zipf's law of abbreviation revisited

Zipf's law of abbreviation, the tendency of more frequent words to be shorter, is one of the most solid candidates for a linguistic universal, in the sense that it has the potential for being exceptionless or with a number of exceptions…

Computation and Language · Computer Science 2023-10-13 Sonia Petrini, Antoni Casas-i-Muñoz, Jordi Cluet-i-Martinell, Mengxue Wang, Chris Bentz, Ramon Ferrer-i-Cancho

Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?

Grammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code…

Programming Languages · Computer Science 2025-12-11 Qingyuan Liang, Zhao Zhang, Zeyu Sun, Zheng Lin, Qi Luo, Yueyi Xiao, Yizhou Chen, Yuqun Zhang, Haotian Zhang, Lu Zhang, Bin Chen, Yingfei Xiong

Encoding Arguments

Many proofs in discrete mathematics and theoretical computer science are based on the probabilistic method. To prove the existence of a good object, we pick a random object and show that it is bad with low probability. This method is…

Information Theory · Computer Science 2017-08-01 Pat Morin, Wolfgang Mulzer, Tommy Reddad

Zipf's law is a consequence of coherent language production

The task of text segmentation may be undertaken at many levels in text analysis: paragraphs, sentences, words, or even letters. Here, we focus on a relatively fine scale of segmentation, hypothesizing it to be in accord with a stochastic…

Computation and Language · Computer Science 2016-08-09 Jake Ryland Williams, James P. Bagrow, Andrew J. Reagan, Sharon E. Alajajian, Christopher M. Danforth, Peter Sheridan Dodds

Locally Typical Sampling

Today's probabilistic language generators fall short when it comes to producing coherent and fluent text despite the fact that the underlying models perform well under standard metrics, e.g., perplexity. This discrepancy has puzzled the…

Computation and Language · Computer Science 2025-06-06 Clara Meister, Tiago Pimentel, Gian Wiher, Ryan Cotterell

Entropic Bounds on the Average Length of Codes with a Space

We consider the problem of constructing prefix-free codes in which a designated symbol, a space, can only appear at the end of codewords. We provide a linear-time algorithm to construct almost-optimal codes with this property, meaning that…

Information Theory · Computer Science 2024-05-13 Roberto Bruno, Ugo Vaccaro

Optimal coding and the origins of Zipfian laws

The problem of compression in standard information theory consists of assigning codes as short as possible to numbers. Here we consider the problem of optimal coding, under an arbitrary coding scheme, and show that it predicts Zipf's…

Computation and Language · Computer Science 2020-09-24 Ramon Ferrer-i-Cancho, Christian Bentz, Caio Seguin

An Upper Bound On the Size of Locally Recoverable Codes

In a locally recoverable or repairable code, any symbol of a codeword can be recovered by reading only a small (constant) number of other symbols. The notion of local recoverability is important in the area of distributed…

Information Theory · Computer Science 2016-11-17 Viveck Cadambe, Arya Mazumdar

Language Modeling at Scale

We show how Zipf's Law can be used to scale up language modeling (LM) to take advantage of more training data and more GPUs. LM plays a key role in many important natural language applications such as speech recognition and machine…

Computation and Language · Computer Science 2018-10-25 Mostofa Patwary, Milind Chabbi, Heewoo Jun, Jiaji Huang, Gregory Diamos, Kenneth Church

On the Spectral Flattening of Quantized Embeddings

Training Large Language Models (LLMs) at ultra-low precision is critically impeded by instability rooted in the conflict between discrete quantization constraints and the intrinsic heavy-tailed spectral nature of linguistic data. By…

Machine Learning · Computer Science 2026-02-03 Junlin Huang, Wenyi Fang, Zhenheng Tang, Yuxin Wang, Xueze Kang, Yang Zheng, Bo Li, Xiaowen Chu

The brevity law as a scaling law, and a possible origin of Zipf's law for word frequencies

An important body of quantitative linguistics is constituted by a series of statistical laws about language usage. Despite the importance of these linguistic laws, some of them are poorly formulated, and, more importantly, there is no…

Physics and Society · Physics 2020-11-09 Alvaro Corral, Isabel Serra

Local stability and robustness of sparse dictionary learning in the presence of noise

A popular approach within the signal processing and machine learning communities consists in modelling signals as sparse linear combinations of atoms selected from a learned dictionary. While this paradigm has led to numerous empirical…

Machine Learning · Statistics 2012-10-03 Rodolphe Jenatton, Rémi Gribonval, Francis Bach

On the Local Correctness of L^1 Minimization for Dictionary Learning

The idea that many important classes of signals can be well-represented by linear combinations of a small set of atoms selected from a given dictionary has had dramatic impact on the theory and practice of signal processing. For practical…

Information Theory · Computer Science 2015-03-18 Quan Geng, Huan Wang, John Wright

A Universal Grammar-Based Code For Lossless Compression of Binary Trees

We consider the problem of lossless compression of binary trees, with the aim of reducing the number of code bits needed to store or transmit such trees. A lossless grammar-based code is presented which encodes each binary tree into a…

Information Theory · Computer Science 2013-04-30 Jie Zhang, En-hui Yang, John C. Kieffer