Related papers: Non-Parametric Bayesian Areal Linguistics
Kingman's coalescent is one of the most popular models in population genetics. It describes the genealogy of a population whose genetic composition evolves in time according to the Wright-Fisher model, or suitable approximations of it…
Is it possible to develop a `physics of language' which can explain the spatial, temporal and social patterns we see, and which can predict future change like we forecast the weather? Such a theory is likely to involve ideas from…
Pretrained language models (PLMs) have become remarkably adept at task and language generalization. Nonetheless, they often fail when faced with unseen languages. In this work, we present LinguAlchemy, a regularization method that…
In audio signal processing, probabilistic time-frequency models have many benefits over their non-probabilistic counterparts. They adapt to the incoming signal, quantify uncertainty, and measure correlation between the signal's amplitude…
In this article we propose a novel method to estimate the frequency distribution of linguistic variables while controlling for statistical non-independence due to shared ancestry. Unlike previous approaches, our technique uses all available…
In this thesis, we investigate three problems involving the probabilistic modeling of language: smoothing n-gram models, statistical grammar induction, and bilingual sentence alignment. These three problems employ models at three different…
This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of statistical models to discover a good explanation of a data set, and then produces a detailed…
A neural probabilistic language model (NPLM) provides an idea to achieve the better perplexity than n-gram language model and their smoothed language models. This paper investigates application area in bilingual NLP, specifically…
Quantifying the speed of linguistic change is challenging due to the fact that the historical evolution of languages is sparsely documented. Consequently, traditional methods rely on phylogenetic reconstruction. In this paper, we propose a…
We propose a simple, scalable, fully generative model for transition-based dependency parsing with high accuracy. The model, parameterized by Hierarchical Pitman-Yor Processes, overcomes the limitations of previous generative models by…
The statistical over-representation of phonological features in the basic vocabulary of languages is often interpreted as reflecting potentially universal sound symbolic patterns. However, most of those results have not been tested…
The distribution of human linguistic groups presents a number of interesting and non-trivial patterns. The distributions of the number of speakers per language and the area each group covers follow log-normal distributions, while population…
In this paper we develop a functorial language of probabilistic morphisms and apply it to some basic problems in Bayesian nonparametrics. First we extend and unify the Kleisli category of probabilistic morphisms proposed by Lawvere and Giry…
A simple machine learning model of pluralisation as a linear regression problem minimising a p-adic metric substantially outperforms even the most robust of Euclidean-space regressors on languages in the Indo-European, Austronesian, Trans…
This paper proposes a nonparametric Bayesian method for exploratory data analysis and feature construction in continuous time series. Our method focuses on understanding shared features in a set of time series that exhibit significant…
Tree structured graphical models are powerful at expressing long range or hierarchical dependency among many variables, and have been widely applied in different areas of computer science and statistics. However, existing methods for…
Natural language processing often involves computations with semantic or syntactic graphs to facilitate sophisticated reasoning based on structural relationships. While convolution kernels provide a powerful tool for comparing graph…
Large pretrained language models are critical components of modern NLP pipelines. Yet, they suffer from spurious correlations, poor out-of-domain generalization, and biases. Inspired by recent progress in causal machine learning, in…
This paper proposes methods of predicting dynamic time series (including non-stationary ones) based on a linguistic approach, namely, the study of occurrences and repetition of so-called N-grams. This approach is used in computational…
We introduce the Probabilistic Worldbuilding Model (PWM), a new fully-symbolic Bayesian model of semantic parsing and reasoning, as a first step in a research program toward more domain- and task-general NLU and AI. Humans create internal…