Related papers: Remarks on "Random Sequences"
We define a notion of randomness for individual and collections of formal languages based on automatic martingales acting on sequences of words from some underlying domain. An automatic martingale bets if the incoming word belongs to the…
In this paper, we describe an approach to sentence categorization which has the originality to be based on natural properties of languages with no training set dependency. The implementation is fast, small, robust and textual errors…
Traditional linguistic theories have largely regard language as a formal system composed of rigid rules. However, their failures in processing real language, the recent successes in statistical natural language processing, and the findings…
The established language for statistical testing --- significance levels, power, and p-values --- is overly complicated and deceptively conclusive. Even teachers of statistics and scientists who use statistics misinterpret the results of…
Based on data from a large-scale experiment with human subjects, we conclude that the logarithm of probability to guess a word in context (unpredictability) depends linearly on the word length. This result holds both for poetry and prose,…
Stanford typed dependencies are a widely desired representation of natural language sentences, but parsing is one of the major computational bottlenecks in text analysis systems. In light of the evolving definition of the Stanford…
The words of a language are randomly replaced in time by new ones, but it has long been known that words corresponding to some items (meanings) are less frequently replaced than others. Usually, the rate of replacement for a given item is…
Lexical resemblances among a group of languages indicate that the languages could be genetically related, i.e., they could have descended from a common ancestral language. However, such resemblances can arise by chance and, hence, need not…
Some asymptotic notions for random variables are discussed. In particular, different versions of O and o for sequences of random variables are studied. The results are elementary and more or less well-known, but collected here for future…
This paper presents a comparison of the quality of randomness of D sequences based on diehard tests. Since D sequences can model any random sequence, this comparison is of value beyond this specific class.
In this paper we propose several variants to perform the independence test between two random elements based on recurrence rates. We will show how to calculate the test statistic in each one of these cases. From simulations we obtain that…
Are pairs of words that tend to occur together also likely to stand in a linguistic dependency? This empirical question is motivated by a long history of literature in cognitive science, psycholinguistics, and NLP. In this work we…
We analyze the frequency-rank relationship in sub-vocabularies corresponding to three different grammatical classes (nouns, verbs, and others) in a collection of literary works in English, whose words have been automatically tagged…
The statistical properties of letters frequencies in European literature texts are investigated. The determination of logarithmic dependence of letters sequence for one-language and two-language texts are examined. The pare of languages is…
Transformer models are now a cornerstone in natural language processing. Yet, explaining their decisions remains a challenge. It was shown recently that the same model trained on the same data with a different randomness can lead to very…
We present a theoretical and empirical investigation of the statistical behaviour of the words in a text produced by human language. To this aim, we analyse the word distribution of various texts of Italian language selected from a specific…
The paper presents a language model that develops syntactic structure and uses it to extract meaningful information from the word history, thus enabling the use of long distance dependencies. The model assigns probability to every joint…
We consider the conditional randomization test as a way to account for covariate imbalance in randomized experiments. The test accounts for covariate imbalance by comparing the observed test statistic to the null distribution of the test…
Language change is a cultural evolutionary process in which variants of linguistic variables change in frequency through processes analogous to mutation, selection and genetic drift. In this work, we apply a recently-introduced method to…
In this paper new test statistics are introduced and studied for the important problem of testing hypothesis that involves inequality constraint on proportions when the sample comes from independent binomial random variables: Wald type and…