Related papers: On the Entropy of Written Spanish
We compared entropy for texts written in natural languages (English, Spanish) and artificial languages (computer software) based on a simple expression for the entropy as a function of message length and specific word diversity. Code text…
A simple method for finding the entropy and redundancy of a reasonable long sample of English text by direct computer processing and from first principles according to Shannon theory is presented. As an example, results on the entropy of…
We measured entropy and symbolic diversity for English and Spanish texts including literature Nobel laureates and other famous authors. Entropy, symbol diversity and symbol frequency profiles were compared for these four groups. We also…
We estimate the $n$-gram entropies of natural language texts in word-length representation and find that these are sensitive to text language and genre. We attribute this sensitivity to changes in the probability distribution of the lengths…
There are different ways of measuring diversity in complex systems. In particular, in language, lexical diversity is characterized in terms of the type-token ratio and the word entropy. We here investigate both diversity metrics in six…
The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. The entropy has been established as a measure of this average uncertainty - also called average…
The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80…
In this study, the output of large language models (LLM) is considered an information source generating an unlimited sequence of symbols drawn from a finite alphabet. Given the probabilistic nature of modern LLMs, we assume a probabilistic…
The problem addressed concerns the determination of the average number of successive attempts of guessing a word of a certain length consisting of letters with given probabilities of occurrence. Both first- and second-order approximations…
The translation of written language has been known since the 3rd century BC; however, its necessity has become increasingly common in the information age. Today, many translators exist, based on encoder-decoder deep architectures,…
The IMPACT-es diachronic corpus of historical Spanish compiles over one hundred books --containing approximately 8 million words-- in addition to a complementary lexicon which links more than 10 thousand lemmas with attestations of the…
We study the entropy of Chinese and English texts, based on characters in case of Chinese texts and based on words for both languages. Significant differences are found between the languages and between different personal styles of debating…
In natural language processing, the entropy of a language is a measure of its unpredictability and complexity. The first study on this subject was conducted by Claude Shannon in 1951. By having participants predict the next character in a…
We evaluated the impact of changing the observation scale over the entropy measures for text descriptions. MIDI coded Music, computer code and two human natural languages were studied at the scale of characters, words, and at the…
Written language is complex. A written text can be considered an attempt to convey a meaningful message which ends up being constrained by language rules, context dependence and highly redundant in its use of resources. Despite all these…
Beyond the local constraints imposed by grammar, words concatenated in long sequences carrying a complex message show statistical regularities that may reflect their linguistic role in the message. In this paper, we perform a systematic…
We consider words as a network of interacting letters, and approximate the probability distribution of states taken on by this network. Despite the intuition that the rules of English spelling are highly combinatorial (and arbitrary), we…
Shannon entropy is often a quantity of interest to linguists studying the communicative capacity of human language. However, entropy must typically be estimated from observed data because researchers do not have access to the underlying…
Calculating the Shannon entropy for symbolic sequences has been widely considered in many fields. For descriptive statistical problems such as estimating the N-gram entropy of English language text, a common approach is to use as much data…
In this work we present a state of the art in the area of Computational Creativity (CC). In particular, we address the automatic generation of literary sentences in Spanish. We propose three models of text generation based mainly on…