Related papers: Context tree selection and linguistic rhythm retri…
We study a variable length Markov chain model associated with a group of stationary processes that share the same context tree but each process has potentially different conditional probabilities. We propose a new model selection and…
Speech technologies rely on capturing a speaker's voice variability while obtaining comprehensive language information. Textual prompts and sentence selection methods have been proposed in the literature to comprise such adequate phonetic…
Languages have long been described according to their perceived rhythmic attributes. The associated typologies are of interest in psycholinguistics as they partly predict newborns' abilities to discriminate between languages and provide…
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text.…
The present paper investigates non-asymptotic properties of two popular procedures of context tree (or Variable Length Markov Chains) estimation: Rissanen's algorithm Context and the Penalized Maximum Likelihood criterion. First showing how…
We study a problem of model selection for data produced by two different context tree sources. Motivated by linguistic questions, we consider the case where the probabilistic context trees corresponding to the two sources are finite and…
Measuring similarities between strings is central for many established and fast growing research areas including information retrieval, biology, and natural language processing. The traditional approach for string similarity measurements is…
We consider the problem of estimating the context tree of a stationary ergodic process with finite alphabet without imposing additional conditions on the process. As a starting point we introduce a Hamming metric in the space of irreducible…
Concepts of complex networks have been used to obtain metrics that were correlated to text quality established by scores assigned by human judges. Texts produced by high-school students in Portuguese were represented as scale-free networks…
Rhetoric, both spoken and written, involves not only content but also style. One common stylistic tool is $\textit{parallelism}$: the juxtaposition of phrases which have the same sequence of linguistic ($\textit{e.g.}$, phonological,…
Many efficient algorithms with strong theoretical guarantees have been proposed for the contextual multi-armed bandit problem. However, applying these algorithms in practice can be difficult because they require domain expertise to build…
In natural speech, the speaker does not pause between words, yet a human listener somehow perceives this continuous stream of phonemes as a series of distinct words. The detection of boundaries between spoken words is an instance of a…
With the constant growth of the World Wide Web and the number of documents in different languages accordingly, the need for reliable language detection tools has increased as well. Platforms such as Twitter with predominantly short texts…
The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval…
We address the issue of context tree estimation in variable length hidden Markov models. We propose an estimator of the context tree of the hidden Markov process which needs no prior upper bound on the depth of the context tree. We prove…
Tokenising continuous speech into sequences of discrete tokens and modelling them with language models (LMs) has led to significant success in text-to-speech (TTS) synthesis. Although these models can generate speech with high quality and…
There is extensive interest in metric learning methods for image retrieval. Many metric learning loss functions focus on learning a correct ranking of training samples, but strongly overfit semantically inconsistent labels and require a…
The quota system in Brazil made it possible to include blind students in higher education. Teachers' lack of knowledge about the braille system can represent a barrier between them and students who use it for writing and reading.…
The context tree source is a source model in which the occurrence probability of symbols is determined from a finite past sequence, and is a broader class of sources that includes i.i.d. and Markov sources. The proposed source model in this…
Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current…