Related papers: Constructions for Clumps Statistics

Counting overlapping pairs of words

A correlation is a binary vector that encodes all possible positions of overlaps of two words, where an overlap for an ordered pair of words (u,v) occurs if a suffix of word u matches a prefix of word v. As multiple pairs can have the same…

Discrete Mathematics · Computer Science 2025-06-03 Eric Rivals, Pengfei Wang

The number of distinct adjacent pairs in geometrically distributed words: a probabilistic and combinatorial analysis

The analysis of strings of $n$ random variables with geometric distribution has recently attracted renewed interest: Archibald et al. consider the number of distinct adjacent pairs in geometrically distributed words. They obtain the…

Probability · Mathematics 2024-02-14 Guy Louchard, Werner Schachinger, Mark Daniel Ward

Word Clustering and Disambiguation Based on Co-occurrence Data

We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and using the acquired word classes to improve the accuracy of syntactic disambiguation. We view this problem as that of estimating a…

cmp-lg · Computer Science 2007-05-23 Hang Li, Naoki Abe

Pattern occurrence statistics and applications to the Ramsey theory of unavoidable patterns

As suggested by Currie, we apply the probabilistic method to problems regarding pattern avoidance. Using techniques from analytic combinatorics, we calculate asymptotic pattern occurrence statistics and use them in conjunction with the…

Combinatorics · Mathematics 2014-06-03 Jim Tao

Exact Probability Distribution versus Entropy

The problem addressed concerns the determination of the average number of successive attempts of guessing a word of a certain length consisting of letters with given probabilities of occurrence. Both first- and second-order approximations…

Information Theory · Computer Science 2015-06-19 Kerstin Andersson

Adaptative clustering by minimization of the mixing entropy criterion

We present a clustering method and provide a theoretical analysis and an explanation to a phenomenon encountered in the applied statistical literature since the 1990's. This phenomenon is the natural adaptability of the order when using a…

Statistics Theory · Mathematics 2022-03-23 Thierry Dumont

Cluster Statistics in Expansive Combinatorial Structures

We develop a simple and unified approach to investigate several aspects of the cluster statistics of random expansive (multi-)sets. In particular, we determine the limiting distribution of the size of the smallest and largest clusters, we…

Probability · Mathematics 2022-08-02 Konstantinos Panagiotou, Leon Ramzews

Spectral Analysis of Word Statistics

Given a random text over a finite alphabet, we study the frequencies at which fixed-length words occur as subsequences. As the data size grows, the joint distribution of word counts exhibits a rich asymptotic structure. We investigate all…

Probability · Mathematics 2026-05-06 Chaim Even-Zohar, Tsviqa Lakrec, Ran J. Tessler

When Composite Likelihood Meets Stochastic Approximation

A composite likelihood is an inference function derived by multiplying a set of likelihood components. This approach provides a flexible framework for drawing inference when the likelihood function of a statistical model is computationally…

Methodology · Statistics 2024-12-10 Giuseppe Alfonzetti, Ruggero Bellio, Yunxiao Chen, Irini Moustaki

Word statistics in Blogs and RSS feeds: Towards empirical universal evidence

We focus on the statistics of word occurrences and of the waiting times between such occurrences in Blogs. Due to the heterogeneity of words' frequencies, the empirical analysis is performed by studying classes of "frequently-equivalent"…

Information Theory · Computer Science 2012-09-25 R. Lambiotte, M. Ausloos, M. Thelwall

Clustering Co-occurrence of Maximal Frequent Patterns in Streams

One way of getting a better view of data is using frequent patterns. In this paper frequent patterns are subsets that occur a minimal number of times in a stream of itemsets. However, the discovery of frequent patterns in streams has always…

Artificial Intelligence · Computer Science 2007-05-23 Edgar H. de Graaf, Joost N. Kok, Walter A. Kosters

A statistical test for correspondence of texts to the Zipf-Mandelbrot law

We analyse correspondence of a text to a simple probabilistic model. The model assumes that the words are selected independently from an infinite dictionary. The probability distribution corresponds to the Zipf-Mandelbrot law. We count…

Statistics Theory · Mathematics 2019-12-30 Anik Chakrabarty, Mikhail Chebunin, Artyom Kovalevskii, Ilya Pupyshev, Natalia Zakrevskaya, Qianqian Zhou

Quasi-random words and limits of word sequences

Words are sequences of letters over a finite alphabet. We study two intimately related topics for this object: quasi-randomness and limit theory. With respect to the first topic we investigate the notion of uniform distribution of letters…

Combinatorics · Mathematics 2021-09-01 Hiêp Hàn, Marcos Kiwi, Matías Pavez-Signé

Fast construction of optimal composite likelihoods

A composite likelihood is a combination of low-dimensional likelihood objects useful in applications where the data have complex structure. Although composite likelihood construction is a crucial aspect influencing both computing and…

Methodology · Statistics 2022-04-26 Zhendong Huang, Davide Ferrari

In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat…

cmp-lg · Computer Science 2008-02-03 Ido Dagan, Fernando Pereira, Lillian Lee

On the Entropy of Sums of Bernoulli Random Variables via the Chen-Stein Method

This paper considers the entropy of the sum of (possibly dependent and non-identically distributed) Bernoulli random variables. Upper bounds on the error that follows from an approximation of this entropy by the entropy of a Poisson random…

Information Theory · Computer Science 2016-11-17 Igal Sason

Covering the large spectrum and generalized Riesz products

Chang's Lemma is a widely employed result in additive combinatorics. It gives bounds on the dimension of the large spectrum of probability distributions on finite abelian groups. Recently, Bloom (2016) presented a powerful variant of…

Combinatorics · Mathematics 2016-12-30 James R. Lee

A shortest-path based clustering algorithm for joint human-machine analysis of complex datasets

Clustering is a technique for the analysis of datasets obtained by empirical studies in several disciplines with a major application for biomedical research. Essentially, clustering algorithms are executed by machines aiming at finding…

Quantitative Methods · Quantitative Biology 2024-09-30 Diego Ulisse Pizzagalli, Santiago Fernandez Gonzalez, Rolf Krause

Generalized Negative Binomial Processes and the Representation of Cluster Structures

The paper introduces the concept of a cluster structure to define a joint distribution of the sample size and its exchangeable random partitions. The cluster structure allows the probability distribution of the random partitions of a subset…

Methodology · Statistics 2013-10-08 Mingyuan Zhou

Invitación al estudio estadístico del lenguaje

Invitation to the statistical study of language: The topic of this presentation is the interdisciplinary nexus between linguistics and statistics. It targets linguists, for whom it may have a theoretical interest, or professionals that work…

Applications · Statistics 2018-04-23 Rogelio Nazar