Related papers: Constructions for Clumps Statistics
A correlation is a binary vector that encodes all possible positions of overlaps of two words, where an overlap for an ordered pair of words (u,v) occurs if a suffix of word u matches a prefix of word v. As multiple pairs can have the same…
The analysis of strings of $n$ random variables with geometric distribution has recently attracted renewed interest: Archibald et al. consider the number of distinct adjacent pairs in geometrically distributed words. They obtain the…
We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and using the acquired word classes to improve the accuracy of syntactic disambiguation. We view this problem as that of estimating a…
As suggested by Currie, we apply the probabilistic method to problems regarding pattern avoidance. Using techniques from analytic combinatorics, we calculate asymptotic pattern occurrence statistics and use them in conjunction with the…
The problem addressed concerns the determination of the average number of successive attempts of guessing a word of a certain length consisting of letters with given probabilities of occurrence. Both first- and second-order approximations…
We present a clustering method and provide a theoretical analysis and an explanation of a phenomenon encountered in the applied statistical literature since the 1990s. This phenomenon is the natural adaptability of the order when using a…
We develop a simple and unified approach to investigate several aspects of the cluster statistics of random expansive (multi-)sets. In particular, we determine the limiting distribution of the size of the smallest and largest clusters, we…
Given a random text over a finite alphabet, we study the frequencies at which fixed-length words occur as subsequences. As the data size grows, the joint distribution of word counts exhibits a rich asymptotic structure. We investigate all…
A composite likelihood is an inference function derived by multiplying a set of likelihood components. This approach provides a flexible framework for drawing inference when the likelihood function of a statistical model is computationally…
We focus on the statistics of word occurrences and of the waiting times between such occurrences in Blogs. Due to the heterogeneity of words' frequencies, the empirical analysis is performed by studying classes of "frequently-equivalent"…
One way of getting a better view of data is using frequent patterns. In this paper, frequent patterns are subsets that occur a minimal number of times in a stream of itemsets. However, the discovery of frequent patterns in streams has always…
We analyse the correspondence of a text to a simple probabilistic model. The model assumes that the words are selected independently from an infinite dictionary. The probability distribution corresponds to the Zipf–Mandelbrot law. We count…
Words are sequences of letters over a finite alphabet. We study two intimately related topics for this object: quasi-randomness and limit theory. With respect to the first topic we investigate the notion of uniform distribution of letters…
A composite likelihood is a combination of low-dimensional likelihood objects useful in applications where the data have complex structure. Although composite likelihood construction is a crucial aspect influencing both computing and…
In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat…
This paper considers the entropy of the sum of (possibly dependent and non-identically distributed) Bernoulli random variables. Upper bounds on the error that follows from an approximation of this entropy by the entropy of a Poisson random…
Chang's Lemma is a widely employed result in additive combinatorics. It gives bounds on the dimension of the large spectrum of probability distributions on finite abelian groups. Recently, Bloom (2016) presented a powerful variant of…
Clustering is a technique for the analysis of datasets obtained by empirical studies in several disciplines with a major application for biomedical research. Essentially, clustering algorithms are executed by machines aiming at finding…
The paper introduces the concept of a cluster structure to define a joint distribution of the sample size and its exchangeable random partitions. The cluster structure allows the probability distribution of the random partitions of a subset…
Invitation to the statistical study of language: The topic of this presentation is the interdisciplinary nexus between linguistics and statistics. It targets linguists, for whom it may have a theoretical interest, or professionals who work…