Related papers: Toward a statistical mechanics of four letter word…
Beyond the local constraints imposed by grammar, words concatenated in long sequences carrying a complex message show statistical regularities that may reflect their linguistic role in the message. In this paper, we perform a systematic…
The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. The entropy has been established as a measure of this average uncertainty - also called average…
We build models for the distribution of social states in Twitter communities. States can be defined by the participation vs silence of individuals in conversations that surround key words, and we approximate the joint distribution of these…
The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80…
The crossword-like patterns of tiles in Scrabble form connected graphs of occupied sites on a square lattice. We find the most structureless description that reproduces means and covariances observed in real Scrabble games by adapting a…
As is the case for many signals produced by complex systems, language presents a statistical structure that is balanced between order and disorder. Here we review and extend recent results from quantitative characterisations of the degree of…
The word-frequency distribution provides the fundamental building blocks that generate discourse in language. It is well known, from empirical evidence, that the word-frequency distribution of almost any text is described by Zipf's law, at…
We show that the predictability of letters in written English texts depends strongly on their position in the word. The first letters are usually the least easy to predict. This agrees with the intuitive notion that words are well defined…
We use large language models (LLMs) to uncover long-ranged structure in English texts from a variety of sources. The conditional entropy or code length in many cases continues to decrease with context length at least to $N\sim 10^4$…
The problem addressed is to determine the average number of successive attempts needed to guess a word of a certain length consisting of letters with given probabilities of occurrence. Both first- and second-order approximations…
Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in…
The principle of maximum entropy provides a useful method for inferring statistical mechanics models from observations in correlated systems, and is widely used in a variety of fields where accurate data are available. While the assumptions…
We review recent progress in understanding the meaning of mutual information in natural language. Let us define words in a text as strings that occur sufficiently often. In a few previous papers, we have shown that a power-law distribution…
Maximum entropy models are the least structured probability distributions that exactly reproduce a chosen set of statistics measured in an interacting network. Here we use this principle to construct probabilistic models which describe the…
Recently, long-range correlations were detected in nucleotide sequences and in human writings by several authors. Here we undertake a systematic investigation of two books, Moby Dick by H. Melville and Grimm's tales, with respect to the…
We investigated long-range correlations in two literary texts, Moby Dick by H. Melville and Grimm's tales. The analysis is based on the calculation of entropy-like quantities such as the mutual information for pairs of letters and the entropy,…
Among the several findings deriving from the application of complex network formalism to the investigation of natural phenomena, the fact that linguistic constructions follow power laws presents special interest for its potential…
A simple method is presented for finding the entropy and redundancy of a reasonably long sample of English text by direct computer processing and from first principles according to Shannon's theory. As an example, results on the entropy of…
The frequency with which the letters of the English alphabet appear in writings has been applied to the field of cryptography, the development of keyboard mechanics, and the study of linguistics. We expanded on the statistical analysis of…
We define two words in a language to be connected if they express similar concepts. The network of connections among the many thousands of words that make up a language is important not only for the study of the structure and evolution of…