Related papers: On the Entropy of Written Spanish

Complexity measurement of natural and artificial languages

We compared entropy for texts written in natural languages (English, Spanish) and artificial languages (computer software) based on a simple expression for the entropy as a function of message length and specific word diversity. Code text…

Computation and Language · Computer Science 2015-12-03 Gerardo Febres, Klaus Jaffe, Carlos Gershenson

A New Look at the Classical Entropy of Written English

A simple method for finding the entropy and redundancy of a reasonably long sample of English text by direct computer processing and from first principles according to Shannon theory is presented. As an example, results on the entropy of…

Computation and Language · Computer Science 2009-11-19 Fabio G. Guerrero

Quantifying literature quality using complexity criteria

We measured entropy and symbolic diversity for English and Spanish texts including literature Nobel laureates and other famous authors. Entropy, symbol diversity and symbol frequency profiles were compared for these four groups. We also…

Computation and Language · Computer Science 2017-01-17 Gerardo Febres, Klaus Jaffe

Entropy analysis of word-length series of natural language texts: Effects of text language and genre

We estimate the $n$-gram entropies of natural language texts in word-length representation and find that these are sensitive to text language and genre. We attribute this sensitivity to changes in the probability distribution of the lengths…

Computation and Language · Computer Science 2014-01-20 Maria Kalimeri, Vassilios Constantoudis, Constantinos Papadimitriou, Kostantinos Karamanos, Fotis K. Diakonos, Haris Papageorgiou

Entropy and type-token ratio in gigaword corpora

There are different ways of measuring diversity in complex systems. In particular, in language, lexical diversity is characterized in terms of the type-token ratio and the word entropy. We here investigate both diversity metrics in six…

Computation and Language · Computer Science 2025-07-16 Pablo Rosillo-Rodes, Maxi San Miguel, David Sanchez

The word entropy of natural languages

The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. The entropy has been established as a measure of this average uncertainty - also called average…

Computation and Language · Computer Science 2016-06-23 Christian Bentz, Dimitrios Alikaniotis

Semantic Chunking and the Entropy of Natural Language

The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80…

Computation and Language · Computer Science 2026-02-19 Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks

Entropy in Large Language Models

In this study, the output of large language models (LLM) is considered an information source generating an unlimited sequence of symbols drawn from a finite alphabet. Given the probabilistic nature of modern LLMs, we assume a probabilistic…

Computation and Language · Computer Science 2026-02-24 Marco Scharringhausen

Exact Probability Distribution versus Entropy

The problem addressed concerns the determination of the average number of successive attempts of guessing a word of a certain length consisting of letters with given probabilities of occurrence. Both first- and second-order approximations…

Information Theory · Computer Science 2015-06-19 Kerstin Andersson

Translation Entropy: A Statistical Framework for Evaluating Translation Systems

The translation of written language has been known since the 3rd century BC; however, its necessity has become increasingly common in the information age. Today, many translators exist, based on encoder-decoder deep architectures,…

Computation and Language · Computer Science 2025-11-18 Ronit D. Gross, Yanir Harel, Ido Kanter

An open diachronic corpus of historical Spanish: annotation criteria and automatic modernisation of spelling

The IMPACT-es diachronic corpus of historical Spanish compiles over one hundred books --containing approximately 8 million words-- in addition to a complementary lexicon which links more than 10 thousand lemmas with attestations of the…

Computation and Language · Computer Science 2013-07-01 Felipe Sánchez-Martínez, Isabel Martínez-Sempere, Xavier Ivars-Ribes, Rafael C. Carrasco

Quantitative Entropy Study of Language Complexity

We study the entropy of Chinese and English texts, based on characters in the case of Chinese texts and based on words for both languages. Significant differences are found between the languages and between different personal styles of debating…

Computation and Language · Computer Science 2017-01-17 R. R. Xie, W. B. Deng, D. J. Wang, L. P. Csernai

Entropy of Ukrainian

In natural language processing, the entropy of a language is a measure of its unpredictability and complexity. The first study on this subject was conducted by Claude Shannon in 1951. By having participants predict the next character in a…

Computation and Language · Computer Science 2026-05-01 Anton Lavreniuk, Mykyta Mudryi, Markiian Chaklosh

Calculating entropy at different scales among diverse communication systems

We evaluated the impact of changing the observation scale over the entropy measures for text descriptions. MIDI coded Music, computer code and two human natural languages were studied at the scale of characters, words, and at the…

Information Theory · Computer Science 2017-01-13 Gerardo Febres, Klaus Jaffe

Complexity-entropy analysis at different levels of organization in written language

Written language is complex. A written text can be considered an attempt to convey a meaningful message which ends up being constrained by language rules, context dependence and highly redundant in its use of resources. Despite all these…

Computation and Language · Computer Science 2019-05-20 E. Estevez-Rams, A. Mesa Rodriguez, D. Estevez-Moya

Entropic analysis of the role of words in literary texts

Beyond the local constraints imposed by grammar, words concatenated in long sequences carrying a complex message show statistical regularities that may reflect their linguistic role in the message. In this paper, we perform a systematic…

Statistical Mechanics · Physics 2007-05-23 Marcelo A. Montemurro, Damian H. Zanette

Toward a statistical mechanics of four letter words

We consider words as a network of interacting letters, and approximate the probability distribution of states taken on by this network. Despite the intuition that the rules of English spelling are highly combinatorial (and arbitrary), we…

Neurons and Cognition · Quantitative Biology 2025-02-13 Greg J. Stephens, William Bialek

Estimating the Entropy of Linguistic Distributions

Shannon entropy is often a quantity of interest to linguists studying the communicative capacity of human language. However, entropy must typically be estimated from observed data because researchers do not have access to the underlying…

Computation and Language · Computer Science 2022-04-06 Aryaman Arora, Clara Meister, Ryan Cotterell

Determining the Number of Samples Required to Estimate Entropy in Natural Sequences

Calculating the Shannon entropy for symbolic sequences has been widely considered in many fields. For descriptive statistical problems such as estimating the N-gram entropy of English language text, a common approach is to use as much data…

Information Theory · Computer Science 2018-05-24 Andrew D. Back, Daniel Angus, Janet Wiles

Generación automática de frases literarias en español

In this work we present a state of the art in the area of Computational Creativity (CC). In particular, we address the automatic generation of literary sentences in Spanish. We propose three models of text generation based mainly on…

Computation and Language · Computer Science 2020-01-31 Luis-Gil Moreno-Jiménez, Juan-Manuel Torres-Moreno, Roseli S. Wedemann