Related papers: A statistical learning algorithm for word segmenta…

A Statistical Model for Word Discovery in Transcribed Speech

A statistical model for segmentation and word discovery in continuous speech is presented. An incremental unsupervised learning algorithm to infer word boundaries based on this model is described. Results of empirical tests showing that the…

Computation and Language · Computer Science 2007-05-23 Anand Venkataraman

An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery

This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text.…

Computation and Language · Computer Science 2007-05-23 Michael R. Brent

Comparing Neural- and N-Gram-Based Language Models for Word Segmentation

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and…

Computation and Language · Computer Science 2018-12-04 Yerai Doval, Carlos Gómez-Rodríguez

Consensus Sequence Segmentation

In this paper we introduce a method to detect words or phrases in a given sequence of alphabets without knowing the lexicon. Our linear time unsupervised algorithm relies entirely on statistical relationships among alphabets in the input…

Computation and Language · Computer Science 2013-12-31 Tamal Chowdhury, Rabindra Rakshit, Arko Banerjee

On the Difficulty of Segmenting Words with Attention

Word segmentation, the problem of finding word boundaries in speech, is of interest for a range of tasks. Previous papers have suggested that for sequence-to-sequence models trained on tasks such as speech translation or speech recognition,…

Computation and Language · Computer Science 2021-09-22 Ramon Sanabria, Hao Tang, Sharon Goldwater

DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation use a Dirichlet process to jointly segment…

Computation and Language · Computer Science 2022-06-24 Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot, Emmanuel Dupoux

Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data

Inspired by early research on exploring naturally annotated data for Chinese word segmentation (CWS), and also by recent research on integration of speech and text processing, this work for the first time proposes to mine word boundaries…

Computation and Language · Computer Science 2023-10-31 Lei Zhang, Zhenghua Li, Shilin Zhou, Chen Gong, Zhefeng Wang, Baoxing Huai, Min Zhang

A statistical model for word discovery in child directed speech

A statistical model for segmentation and word discovery in child directed speech is presented. An incremental unsupervised learning algorithm to infer word boundaries based on this model is described and results of empirical tests showing…

Computation and Language · Computer Science 2007-05-23 Anand Venkataraman

Continuous speech separation: dataset and analysis

This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior studies on speech separation use pre-segmented signals of artificially mixed speech utterances which are mostly fully…

Sound · Computer Science 2020-05-08 Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Xiong Xiao, Jinyu Li

Text Segmentation Using Exponential Models

This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To…

cmp-lg · Computer Science 2008-02-03 Doug Beeferman, Adam Berger, John Lafferty

Neural Sequence Segmentation as Determining the Leftmost Segments

Prior methods to text segmentation are mostly at token level. Despite the adequacy, this nature limits their full potential to capture the long-term dependencies among segments. In this work, we propose a novel framework that incrementally…

Computation and Language · Computer Science 2021-04-16 Yangming Li, Lemao Liu, Kaisheng Yao

CNN-based Spoken Term Detection and Localization without Dynamic Programming

In this paper, we propose a spoken term detection algorithm for simultaneous prediction and localization of in-vocabulary and out-of-vocabulary terms within an audio segment. The proposed algorithm infers whether a term was uttered within a…

Audio and Speech Processing · Electrical Eng. & Systems 2021-03-11 Tzeviya Sylvia Fuchs, Yael Segal, Joseph Keshet

Automatic Discovery of Non-Compositional Compounds in Parallel Data

Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine…

cmp-lg · Computer Science 2008-02-03 I. Dan Melamed

Text segmentation with character-level text embeddings

Learning word representations has recently seen much success in computational linguistics. However, assuming sequences of word tokens as input to linguistic analysis is often unjustified. For many languages word segmentation is a…

Computation and Language · Computer Science 2013-09-19 Grzegorz Chrupała

A Learning Approach to Natural Language Understanding

In this paper we propose a learning paradigm for the problem of understanding spoken language. The basis of the work is in a formalization of the understanding problem as a communication problem. This results in the definition of a…

cmp-lg · Computer Science 2008-02-03 Roberto Pieraccini, Esther Levin

BabyLM's First Words: Word Segmentation as a Phonological Probing Task

Language models provide a key framework for studying linguistic theories based on prediction, but phonological analysis using large language models (LLMs) is difficult; there are few phonological benchmarks beyond English and the standard…

Computation and Language · Computer Science 2025-06-13 Zébulon Goriely, Paula Buttery

Language Acquisition in Computers

This project explores the nature of language acquisition in computers, guided by techniques similar to those used in children. While existing natural language processing methods are limited in scope and understanding, our system aims to…

Computation and Language · Computer Science 2012-06-04 Megan Belzner, Sean Colin-Ellerin, Jorge H. Roman

Combining Multiple Knowledge Sources for Discourse Segmentation

We predict discourse segment boundaries from linguistic features of utterances, using a corpus of spoken narratives as data. We present two methods for developing segmentation algorithms from training data: hand tuning and machine learning.…

cmp-lg · Computer Science 2008-02-03 Diane J. Litman, Rebecca J. Passonneau

XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words

Due to the absence of explicit word boundaries in the speech stream, the task of segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage the most recent self-supervised…

Computation and Language · Computer Science 2023-10-10 Robin Algayres, Pablo Diego-Simon, Benoit Sagot, Emmanuel Dupoux

Sentence Segmentation in Narrative Transcripts from Neuropsychological Tests using Recurrent Convolutional Neural Networks

Automated discourse analysis tools based on Natural Language Processing (NLP) aiming at the diagnosis of language-impairing dementias generally extract several textual metrics of narrative transcripts. However, the absence of sentence…

Computation and Language · Computer Science 2017-08-17 Marcos Vinícius Treviso, Christopher Shulby, Sandra Maria Aluísio