Related papers: Local Grammar-Based Coding Revisited

200 papers

We discuss inequalities holding between the vocabulary size, i.e., the number of distinct nonterminal symbols in a grammar-based compression for a string, and the excess length of the respective universal code, i.e., the code-based analog…

Information Theory · Computer Science 2020-03-11 Lukasz Debowski

We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this…

Computation and Language · Computer Science 2025-11-25 Vladimir Berman

The article presents a new interpretation for Zipf-Mandelbrot's law in natural language which rests on two areas of information theory. Firstly, we construct a new class of grammar-based codes and, secondly, we investigate properties of…

Information Theory · Computer Science 2020-03-11 Łukasz Dębowski

Languages across the world exhibit Zipf's law of abbreviation, namely more frequent words tend to be shorter. The generalized version of the law - an inverse relationship between the frequency of a unit and its magnitude - holds also for…

Information Theory · Computer Science 2016-05-05 R. Ferrer-i-Cancho , C. Bentz , C. Seguin

The mapping of lexical meanings to wordforms is a major feature of natural languages. While usage pressures might assign short words to frequent meanings (Zipf's law of abbreviation), the need for a productive and open-ended vocabulary,…

Computation and Language · Computer Science 2021-05-04 Tiago Pimentel , Irene Nikkarinen , Kyle Mahowald , Ryan Cotterell , Damián Blasi

Zipf's law establishes that if the words of a (large) text are ordered by decreasing frequency, the frequency versus the rank decreases as a power law with exponent close to $-1$. Previous work has stressed that this pattern arises from…

Physics and Society · Physics 2019-04-03 Felipe Urbina , Javier Vera

Zipf's law of abbreviation, the tendency of more frequent words to be shorter, is one of the most solid candidates for a linguistic universal, in the sense that it has the potential for being exceptionless or with a number of exceptions…

Computation and Language · Computer Science 2023-10-13 Sonia Petrini , Antoni Casas-i-Muñoz , Jordi Cluet-i-Martinell , Mengxue Wang , Chris Bentz , Ramon Ferrer-i-Cancho

Grammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code…

Programming Languages · Computer Science 2025-12-11 Qingyuan Liang , Zhao Zhang , Zeyu Sun , Zheng Lin , Qi Luo , Yueyi Xiao , Yizhou Chen , Yuqun Zhang , Haotian Zhang , Lu Zhang , Bin Chen , Yingfei Xiong

Many proofs in discrete mathematics and theoretical computer science are based on the probabilistic method. To prove the existence of a good object, we pick a random object and show that it is bad with low probability. This method is…

Information Theory · Computer Science 2017-08-01 Pat Morin , Wolfgang Mulzer , Tommy Reddad

The task of text segmentation may be undertaken at many levels in text analysis: paragraphs, sentences, words, or even letters. Here, we focus on a relatively fine scale of segmentation, hypothesizing it to be in accord with a stochastic…

Today's probabilistic language generators fall short when it comes to producing coherent and fluent text despite the fact that the underlying models perform well under standard metrics, e.g., perplexity. This discrepancy has puzzled the…

Computation and Language · Computer Science 2025-06-06 Clara Meister , Tiago Pimentel , Gian Wiher , Ryan Cotterell

We consider the problem of constructing prefix-free codes in which a designated symbol, a space, can only appear at the end of codewords. We provide a linear-time algorithm to construct almost-optimal codes with this property, meaning that…

Information Theory · Computer Science 2024-05-13 Roberto Bruno , Ugo Vaccaro

The problem of compression in standard information theory consists of assigning codes as short as possible to numbers. Here we consider the problem of optimal coding, under an arbitrary coding scheme, and show that it predicts Zipf's…

Computation and Language · Computer Science 2020-09-24 Ramon Ferrer-i-Cancho , Christian Bentz , Caio Seguin

In a locally recoverable or repairable code, any symbol of a codeword can be recovered by reading only a small (constant) number of other symbols. The notion of local recoverability is important in the area of distributed…

Information Theory · Computer Science 2016-11-17 Viveck Cadambe , Arya Mazumdar

We show how Zipf's Law can be used to scale up language modeling (LM) to take advantage of more training data and more GPUs. LM plays a key role in many important natural language applications such as speech recognition and machine…

Computation and Language · Computer Science 2018-10-25 Mostofa Patwary , Milind Chabbi , Heewoo Jun , Jiaji Huang , Gregory Diamos , Kenneth Church

Training Large Language Models (LLMs) at ultra-low precision is critically impeded by instability rooted in the conflict between discrete quantization constraints and the intrinsic heavy-tailed spectral nature of linguistic data. By…

Machine Learning · Computer Science 2026-02-03 Junlin Huang , Wenyi Fang , Zhenheng Tang , Yuxin Wang , Xueze Kang , Yang Zheng , Bo Li , Xiaowen Chu

An important body of quantitative linguistics is constituted by a series of statistical laws about language usage. Despite the importance of these linguistic laws, some of them are poorly formulated, and, more importantly, there is no…

Physics and Society · Physics 2020-11-09 Alvaro Corral , Isabel Serra

A popular approach within the signal processing and machine learning communities consists in modelling signals as sparse linear combinations of atoms selected from a learned dictionary. While this paradigm has led to numerous empirical…

Machine Learning · Statistics 2012-10-03 Rodolphe Jenatton , Rémi Gribonval , Francis Bach

The idea that many important classes of signals can be well-represented by linear combinations of a small set of atoms selected from a given dictionary has had dramatic impact on the theory and practice of signal processing. For practical…

Information Theory · Computer Science 2015-03-18 Quan Geng , Huan Wang , John Wright

We consider the problem of lossless compression of binary trees, with the aim of reducing the number of code bits needed to store or transmit such trees. A lossless grammar-based code is presented which encodes each binary tree into a…

Information Theory · Computer Science 2013-04-30 Jie Zhang , En-hui Yang , John C. Kieffer