Related papers: Consensus Sequence Segmentation

A Statistical Model for Word Discovery in Transcribed Speech

A statistical model for segmentation and word discovery in continuous speech is presented. An incremental unsupervised learning algorithm to infer word boundaries based on this model is described. Results of empirical tests showing that the…

Computation and Language · Computer Science 2007-05-23 Anand Venkataraman

Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-14 Simon Malan, Benjamin van Niekerk, Herman Kamper

Corpus analysis without prior linguistic knowledge - unsupervised mining of phrases and subphrase structure

When looking at the structure of natural language, "phrases" and "words" are central notions. We consider the problem of identifying such "meaningful subparts" of language of any length and underlying composition principles in a completely…

Computation and Language · Computer Science 2016-02-19 Stefan Gerdjikov, Klaus U. Schulz

Unsupervised Word Segmentation Using Temporal Gradient Pseudo-Labels

Unsupervised word segmentation in audio utterances is challenging as, in speech, there is typically no gap between words. In a preliminary experiment, we show that recent deep self-supervised features are very effective for word…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-04 Tzeviya Sylvia Fuchs, Yedid Hoshen

Unsupervised word segmentation and lexicon discovery using acoustic word embeddings

In settings where only unlabelled speech data is available, speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text. A similar problem is faced when modelling infant language…

Computation and Language · Computer Science 2016-03-10 Herman Kamper, Aren Jansen, Sharon Goldwater

Linear-Time Sequence Comparison Using Minimal Absent Words & Applications

Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realized by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free…

Data Structures and Algorithms · Computer Science 2015-12-23 Maxime Crochemore, Gabriele Fici, Robert Mercaş, Solon P. Pissis

A statistical learning algorithm for word segmentation

In natural speech, the speaker does not pause between words, yet a human listener somehow perceives this continuous stream of phonemes as a series of distinct words. The detection of boundaries between spoken words is an instance of a…

Computation and Language · Computer Science 2011-06-28 Jerry R. Van Aken

Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models

Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language…

Computation and Language · Computer Science 2025-05-27 Zihong Zhang, Liqi He, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du

An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery

This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text.…

Computation and Language · Computer Science 2007-05-23 Michael R. Brent

A procedure for unsupervised lexicon learning

We describe an incremental unsupervised procedure to learn words from transcribed continuous speech. The algorithm is based on a conservative and traditional statistical model, and results of empirical tests show that it is competitive with…

Computation and Language · Computer Science 2007-05-23 Anand Venkataraman

Alignment-free sequence comparison using absent words

Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realised by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free…

Data Structures and Algorithms · Computer Science 2018-06-08 Panagiotis Charalampopoulos, Maxime Crochemore, Gabriele Fici, Robert Mercas, Solon P. Pissis

Unsupervised Word Segmentation using K Nearest Neighbors

In this paper, we propose an unsupervised kNN-based approach for word segmentation in speech utterances. Our method relies on self-supervised pre-trained speech representations, and compares each audio segment of a given utterance to its K…

Sound · Computer Science 2022-04-28 Tzeviya Sylvia Fuchs, Yedid Hoshen, Joseph Keshet

CNN-based Spoken Term Detection and Localization without Dynamic Programming

In this paper, we propose a spoken term detection algorithm for simultaneous prediction and localization of in-vocabulary and out-of-vocabulary terms within an audio segment. The proposed algorithm infers whether a term was uttered within a…

Audio and Speech Processing · Electrical Eng. & Systems 2021-03-11 Tzeviya Sylvia Fuchs, Yael Segal, Joseph Keshet

Training-Free Semantic Segmentation via LLM-Supervision

Recent advancements in open vocabulary models, like CLIP, have notably advanced zero-shot classification and segmentation by utilizing natural language for class-specific embeddings. However, most research has focused on improving model…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Wenfang Sun, Yingjun Du, Gaowen Liu, Ramana Kompella, Cees G. M. Snoek

Grouping Words Using Statistical Context

This paper (cmp-lg/yymmnnn) has been accepted for publication in the student session of EACL-95. It outlines ongoing work using statistical and unsupervised neural network methods for clustering words in untagged corpora. Such approaches…

cmp-lg · Computer Science 2008-02-03 Christopher C. Huckle

Comparing Neural- and N-Gram-Based Language Models for Word Segmentation

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and…

Computation and Language · Computer Science 2018-12-04 Yerai Doval, Carlos Gómez-Rodríguez

Universal Word Segmentation: Implementation and Interpretation

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different…

Computation and Language · Computer Science 2018-07-10 Yan Shao, Christian Hardmeier, Joakim Nivre

Expedition: A System for the Unsupervised Learning of a Hierarchy of Concepts

We present a system for bottom-up cumulative learning of myriad concepts corresponding to meaningful character strings, and their part-related and prediction edges. The learning is self-supervised in that the concepts discovered are used as…

Machine Learning · Computer Science 2021-12-20 Omid Madani

Unsupervised Learning for Lexicon-Based Classification

In lexicon-based classification, documents are assigned labels by comparing the number of words that appear from two opposed lexicons, such as positive and negative sentiment. Creating such words lists is often easier than labeling…

Machine Learning · Computer Science 2016-11-22 Jacob Eisenstein

Learning to Discover, Ground and Use Words with Segmental Neural Language Models

We propose a segmental neural language model that combines the generalization power of neural networks with the ability to discover word-like units that are latent in unsegmented character sequences. In contrast to previous segmentation…

Computation and Language · Computer Science 2019-06-19 Kazuya Kawakami, Chris Dyer, Phil Blunsom