Related papers: Approaching the linguistic complexity

200 papers

We analyze the rank-frequency distributions of words in selected English and Polish texts. We show that for the lemmatized (basic) word forms the scale-invariant regime breaks after about two decades, while it might be consistent for the…

Computation and Language · Computer Science 2010-07-07 Jaroslaw Kwapien, Stanislaw Drozdz, Adam Orczyk

We present results from our quantitative study of statistical and network properties of literary and scientific texts written in two languages: English and Polish. We show that Polish texts are described by the Zipf law with the scaling…

Physics and Society · Physics 2013-12-16 Iwona Grabska-Gradzinska, Andrzej Kulig, Jaroslaw Kwapien, Stanislaw Drozdz

Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity…

Computation and Language · Computer Science 2015-05-15 Germinal Cocho, Jorge Flores, Carlos Gershenson, Carlos Pineda, Sergio Sánchez

We study the frequency distributions and correlations of the word lengths of ten European languages. Our findings indicate that a) the word-length distribution of short words quantified by the mean value and the entropy distinguishes the…

Of basic interest is the quantification of the long term growth of a language's lexicon as it develops to more completely cover both a culture's communication requirements and knowledge space. Here, we explore the usage dynamics of words in…

Computation and Language · Computer Science 2017-03-27 Eitan Adam Pechenick, Christopher M. Danforth, Peter Sheridan Dodds

The dependence on text length of the statistical properties of word occurrences has long been considered a severe limitation of quantitative linguistics. We propose a simple scaling form for the distribution of absolute word frequencies…

Physics and Society · Physics 2015-06-15 Francesc Font-Clos, Gemma Boleda, Álvaro Corral

The distribution of living languages is investigated and scaling relations are found for the diversity of languages as a function of the country area and population. These results are compared with data from Ecology and from computer…

Physics and Society · Physics 2009-11-11 M. A. F. Gomes, G. L. Vasconcelos, I. J. Tsang, I. R. Tsang

The dependence of the frequency distributions due to multiple meanings of words in a text is investigated by deleting letters. By coding the words with fewer letters the number of meanings per coded word increases. This increase is measured…

Computation and Language · Computer Science 2017-10-04 Xiaoyong Yan, Petter Minnhagen

Natural languages are full of rules and exceptions. One of the most famous quantitative rules is Zipf's law, which states that the frequency of occurrence of a word is approximately inversely proportional to its rank. Though this 'law' of…

Computation and Language · Computer Science 2015-05-27 Jake Ryland Williams, James P. Bagrow, Christopher M. Danforth, Peter Sheridan Dodds

The organization and evolution of human language, a typical complex system, is an attractive topic for both physical and cultural researchers. In this paper, we present the first exhaustive analysis of the text organization of human speech.…

Computation and Language · Computer Science 2015-01-08 Ruokuang Lin, Qianli D. Y. Ma, Chunhua Bian

We study the entropy of Chinese and English texts, based on characters in the case of Chinese texts and on words for both languages. Significant differences are found between the languages and between different personal styles of debating…

Computation and Language · Computer Science 2017-01-17 R. R. Xie, W. B. Deng, D. J. Wang, L. P. Csernai

We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two…

Physics and Society · Physics 2012-12-12 Alexander M. Petersen, Joel N. Tenenbaum, Shlomo Havlin, H. Eugene Stanley, Matjaz Perc

The recent dramatic increase in online data availability has allowed researchers to explore human culture with unprecedented detail, such as the growth and diversification of language. In particular, it provides statistical tools to explore…

This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of…

Computation and Language · Computer Science 2016-08-15 Andrey Kutuzov, Erik Velldal, Lilja Øvrelid

Zipf's law on word frequency is observed in English, French, Spanish, Italian, and so on, yet it does not hold for Chinese, Japanese or Korean characters. A model for the writing process is proposed to explain the above difference, which takes…

Data Analysis, Statistics and Probability · Physics 2013-05-03 Linyuan Lu, Zi-Ke Zhang, Tao Zhou

This study investigates the relationship between the phonological and morphological structure of Polish words and their meanings using Distributional Semantics. In the present analysis, we ask whether there is a relationship between the…

Computation and Language · Computer Science 2026-04-02 Paula Orzechowska, R. Harald Baayen

Corpus-based statistical analysis plays a significant role in linguistic research, and ample evidence has shown that different languages exhibit some common laws. Studies have found that letters in some alphabetic writing languages have…

Computation and Language · Computer Science 2020-06-03 Qinghua Chen, Yan Wang, Mengmeng Wang, Xiaomeng Li

The World Wide Web has grown so big, in such an anarchic fashion, that it is difficult to describe. One of the evident intrinsic characteristics of the World Wide Web is its multilinguality. Here, we present a technique for estimating the…

Computation and Language · Computer Science 2021-08-23 Gregory Grefenstette, Julien Nioche

We use large language models (LLMs) to uncover long-ranged structure in English texts from a variety of sources. The conditional entropy or code length in many cases continues to decrease with context length at least to $N\sim 10^4$…

Statistical Mechanics · Physics 2026-01-01 Colin Scheibner, Lindsay M. Smith, William Bialek

The time variation of the rank $k$ of words for six Indo-European languages is obtained using data from Google Books. For low ranks the distinct languages behave differently, perhaps due to syntax rules, whereas for $k>50$ the law of large…

Physics and Society · Physics 2026-02-04 Germinal Cocho, R. F. Rodríguez, Sergio Sánchez, Jorge Flores, Carlos Pineda, Carlos Gershenson