Related papers: Approaching the linguistic complexity

Linguistic complexity: English vs. Polish, text vs. corpus

We analyze the rank-frequency distributions of words in selected English and Polish texts. We show that for the lemmatized (basic) word forms the scale-invariant regime breaks after about two decades, while it might be consistent for the…

Computation and Language · Computer Science 2010-07-07 Jaroslaw Kwapien, Stanislaw Drozdz, Adam Orczyk

Complex network analysis of literary and scientific texts

We present results from our quantitative study of statistical and network properties of literary and scientific texts written in two languages: English and Polish. We show that Polish texts are described by the Zipf law with the scaling…

Physics and Society · Physics 2013-12-16 Iwona Grabska-Gradzinska, Andrzej Kulig, Jaroslaw Kwapien, Stanislaw Drozdz

Rank diversity of languages: Generic behavior in computational linguistics

Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity…

Computation and Language · Computer Science 2015-05-15 Germinal Cocho, Jorge Flores, Carlos Gershenson, Carlos Pineda, Sergio Sánchez

Word-length entropies and correlations of natural language written texts

We study the frequency distributions and correlations of the word lengths of ten European languages. Our findings indicate that a) the word-length distribution of short words quantified by the mean value and the entropy distinguishes the…

Computation and Language · Computer Science 2014-01-27 Maria Kalimeri, Vassilios Constantoudis, Constantinos Papadimitriou, Konstantinos Karamanos, Fotis K. Diakonos, Harris Papageorgiou

Is language evolution grinding to a halt? The scaling of lexical turbulence in English fiction suggests it is not

Of basic interest is the quantification of the long term growth of a language's lexicon as it develops to more completely cover both a culture's communication requirements and knowledge space. Here, we explore the usage dynamics of words in…

Computation and Language · Computer Science 2017-03-27 Eitan Adam Pechenick, Christopher M. Danforth, Peter Sheridan Dodds

A scaling law beyond Zipf's law and its relation to Heaps' law

The dependence on text length of the statistical properties of word occurrences has long been considered a severe limitation of quantitative linguistics. We propose a simple scaling form for the distribution of absolute word frequencies…

Physics and Society · Physics 2015-06-15 Francesc Font-Clos, Gemma Boleda, Álvaro Corral

Scaling relations for diversity of languages

The distribution of living languages is investigated and scaling relations are found for the diversity of languages as a function of the country area and population. These results are compared with data from Ecology and from computer…

Physics and Society · Physics 2009-11-11 M. A. F. Gomes, G. L. Vasconcelos, I. J. Tsang, I. R. Tsang

The Dependence of Frequency Distributions on Multiple Meanings of Words, Codes and Signs

The dependence of the frequency distributions due to multiple meanings of words in a text is investigated by deleting letters. By coding the words with fewer letters the number of meanings per coded word increases. This increase is measured…

Computation and Language · Computer Science 2017-10-04 Xiaoyong Yan, Petter Minnhagen

Text mixing shapes the anatomy of rank-frequency distributions: A modern Zipfian mechanics for natural language

Natural languages are full of rules and exceptions. One of the most famous quantitative rules is Zipf's law, which states that the frequency of occurrence of a word is approximately inversely proportional to its rank. Though this 'law' of…

Computation and Language · Computer Science 2015-05-27 Jake Ryland Williams, James P. Bagrow, Christopher M. Danforth, Peter Sheridan Dodds

Scaling laws in human speech, decreasing emergence of new words and a generalized model

The organization and evolution of human language, a typical complex system, is an attractive topic for both physical and cultural researchers. In this paper, we present the first exhaustive analysis of the text organization of human speech.…

Computation and Language · Computer Science 2015-01-08 Ruokuang Lin, Qianli D. Y. Ma, Chunhua Bian

Quantitative Entropy Study of Language Complexity

We study the entropy of Chinese and English texts, based on characters in the case of Chinese texts and on words for both languages. Significant differences are found between the languages and between different personal styles of debating…

Computation and Language · Computer Science 2017-01-17 R. R. Xie, W. B. Deng, D. J. Wang, L. P. Csernai

Languages cool as they expand: Allometric scaling and the decreasing need for new words

We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two…

Physics and Society · Physics 2012-12-12 Alexander M. Petersen, Joel N. Tenenbaum, Shlomo Havlin, H. Eugene Stanley, Matjaz Perc

Rank dynamics of word usage at multiple scales

The recent dramatic increase in online data availability has allowed researchers to explore human culture with unprecedented detail, such as the growth and diversification of language. In particular, it provides statistical tools to explore…

Physics and Society · Physics 2026-02-04 José A. Morales, Ewan Colman, Sergio Sánchez, Fernanda Sánchez-Puig, Carlos Pineda, Gerardo Iñiguez, Germinal Cocho, Jorge Flores, Carlos Gershenson

Redefining part-of-speech classes with distributional semantic models

This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of…

Computation and Language · Computer Science 2016-08-15 Andrey Kutuzov, Erik Velldal, Lilja Øvrelid

Scaling Laws in Human Language

Zipf's law on word frequency is observed in English, French, Spanish, Italian, and so on, yet it does not hold for Chinese, Japanese or Korean characters. A model for writing process is proposed to explain the above difference, which takes…

Data Analysis, Statistics and Probability · Physics 2013-05-03 Linyuan Lu, Zi-Ke Zhang, Tao Zhou

Polish phonology and morphology through the lens of distributional semantics

This study investigates the relationship between the phonological and morphological structure of Polish words and their meanings using Distributional Semantics. In the present analysis, we ask whether there is a relationship between the…

Computation and Language · Computer Science 2026-04-02 Paula Orzechowska, R. Harald Baayen

The 'Letter' Distribution in the Chinese Language

Corpus-based statistical analysis plays a significant role in linguistic research, and ample evidence has shown that different languages exhibit some common laws. Studies have found that letters in some alphabetic writing languages have…

Computation and Language · Computer Science 2020-06-03 Qinghua Chen, Yan Wang, Mengmeng Wang, Xiaomeng Li

Estimation of English and non-English Language Use on the WWW

The World Wide Web has grown so big, in such an anarchic fashion, that it is difficult to describe. One of the evident intrinsic characteristics of the World Wide Web is its multilinguality. Here, we present a technique for estimating the…

Computation and Language · Computer Science 2021-08-23 Gregory Grefenstette, Julien Nioche

Large language models and the entropy of English

We use large language models (LLMs) to uncover long-ranged structure in English texts from a variety of sources. The conditional entropy or code length in many cases continues to decrease with context length at least to $N\sim 10^4$…

Statistical Mechanics · Physics 2026-01-01 Colin Scheibner, Lindsay M. Smith, William Bialek

Rank-frequency distribution of natural languages: a difference of probabilities approach

The time variation of the rank $k$ of words for six Indo-European languages is obtained using data from Google Books. For low ranks the distinct languages behave differently, possibly due to syntactic rules, whereas for $k>50$ the law of large…

Physics and Society · Physics 2026-02-04 Germinal Cocho, R. F. Rodríguez, Sergio Sánchez, Jorge Flores, Carlos Pineda, Carlos Gershenson