Related papers: Language Segmentation
Text segmentation, the task of dividing a document into contiguous segments based on its semantic structure, is a longstanding challenge in language understanding. Previous work on text segmentation focused on unsupervised methods such as…
Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language…
Word segmentation, the problem of finding word boundaries in speech, is of interest for a range of tasks. Previous papers have suggested that for sequence-to-sequence models trained on tasks such as speech translation or speech recognition,…
Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and…
This paper presents a novel approach for multi-lingual sentiment classification in short texts. This is a challenging task as the amount of training data in languages other than English is very limited. Previously proposed multi-lingual…
Segmentation remains an important preprocessing step both in languages where "words" or other important syntactic/semantic units (like morphemes) are not clearly delineated by white space, as well as when dealing with continuous speech…
Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different…
In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving…
Easy Read text is one of the main forms of access to information for people with reading difficulties. One of the key characteristics of this type of text is the requirement to split sentences into smaller grammatical segments, to…
Unsupervised word segmentation in audio utterances is challenging as, in speech, there is typically no gap between words. In a preliminary experiment, we show that recent deep self-supervised features are very effective for word…
Text segmentation is a challenging vision task with many downstream applications. Current text segmentation methods require pixel-level annotations, which are expensive in the cost of human labor and limited in application scenarios. In…
We present an unsupervised word segmentation model, in which the learning objective is to maximize the generation probability of a sentence given its all possible segmentation. Such generation probability can be factorized into the…
Recent advancements in open vocabulary models, like CLIP, have notably advanced zero-shot classification and segmentation by utilizing natural language for class-specific embeddings. However, most research has focused on improving model…
We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation.…
Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user's intent. Existing approaches to semi-supervised…
Text segmentation aims to divide text into contiguous, semantically coherent segments, while segment labeling deals with producing labels for each segment. Past work has shown success in tackling segmentation and labeling for documents and…
Cross-lingual transfer of word embeddings aims to establish the semantic mappings among words in different languages by learning the transformation functions over the corresponding word embedding spaces. Successfully solving this problem…
We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature…
Existing weakly-supervised semantic segmentation methods using image-level annotations typically rely on initial responses to locate object regions. However, such response maps generated by the classification network usually focus on…
Solving text classification in a weakly supervised manner is important for real-world applications where human annotations are scarce. In this paper, we propose to query a masked language model with cloze style prompts to obtain supervision…