Related papers: Text Segmentation Using Exponential Models

Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images

Text segmentation, the task of dividing a document into sections, is often a prerequisite for performing additional natural language processing tasks. Existing text segmentation methods have typically been developed and tested using clean,…

Computer Vision and Pattern Recognition · Computer Science 2023-12-21 Carol Anderson, Phil Crone

OntoSeg: a Novel Approach to Text Segmentation using Ontological Similarity

Text segmentation (TS) aims at dividing long text into coherent segments which reflect the subtopic structure of the text. It is beneficial to many natural language processing tasks, such as Information Retrieval (IR) and document…

Computation and Language · Computer Science 2015-11-30 Mostafa Bayomi, Killian Levacher, M. Rami Ghorab, Séamus Lawless

Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation

We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topically coherent units. We propose two methods for combining lexical and prosodic information using hidden Markov…

Computation and Language · Computer Science 2022-02-28 G. Tur, D. Hakkani-Tur, A. Stolcke, E. Shriberg

An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery

This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text.…

Computation and Language · Computer Science 2007-05-23 Michael R. Brent

Text Segmentation as a Supervised Learning Task

Text segmentation, the task of dividing a document into contiguous segments based on its semantic structure, is a longstanding challenge in language understanding. Previous work on text segmentation focused on unsupervised methods such as…

Computation and Language · Computer Science 2018-03-28 Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman, Jonathan Berant

Recent Trends in Linear Text Segmentation: a Survey

Linear Text Segmentation is the task of automatically tagging text documents with topic shifts, i.e. the places in the text where the topics change. A well-established area of research in Natural Language Processing, drawing from…

Computation and Language · Computer Science 2024-11-26 Iacopo Ghinassi, Lin Wang, Chris Newell, Matthew Purver

Segmentation of Expository Texts by Hierarchical Agglomerative Clustering

We propose a method for segmentation of expository texts based on hierarchical agglomerative clustering. The method uses paragraphs as the basic segments for identifying hierarchical discourse structure in the text, applying lexical…

cmp-lg · Computer Science 2016-08-31 Yaakov Yaari

Two-Level Transformer and Auxiliary Coherence Modeling for Improved Text Segmentation

Breaking down the structure of long texts into semantically coherent segments makes the texts more readable and supports downstream applications like summarization and retrieval. Starting from an apparent link between text coherence and…

Computation and Language · Computer Science 2020-01-06 Goran Glavaš, Swapna Somasundaran

Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models

Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks…

Computation and Language · Computer Science 2025-06-04 Boheng Sheng, Jiacheng Yao, Meicong Zhang, Guoxiu He

Multi-Paragraph Segmentation of Expository Text

This paper describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and…

cmp-lg · Computer Science 2008-02-03 Marti A. Hearst

Structural Text Segmentation of Legal Documents

The growing complexity of legal cases has led to an increasing interest in legal information retrieval systems that can effectively satisfy user-specific information needs. However, such downstream systems typically require documents to be…

Computation and Language · Computer Science 2021-05-18 Dennis Aumiller, Satya Almasian, Sebastian Lackner, Michael Gertz

Topic Segmentation Model Focusing on Local Context

Topic segmentation is important in understanding scientific documents since it can not only provide better readability but also facilitate downstream tasks such as information retrieval and question answering by creating appropriate…

Computation and Language · Computer Science 2023-01-06 Jeonghwan Lee, Jiyeong Han, Sunghoon Baek, Min Song

SECTOR: A Neural Model for Coherent Topic Segmentation and Classification

When searching for information, a human reader first glances over a document, spots relevant sections and then focuses on a few sentences for resolving her intention. However, the high variance of document structure complicates to identify…

Computation and Language · Computer Science 2019-02-14 Sebastian Arnold, Rudolf Schneider, Philippe Cudré-Mauroux, Felix A. Gers, Alexander Löser

Domain and Language Independent Feature Extraction for Statistical Text Categorization

A generic system for text categorization is presented which uses a representative text corpus to adapt the processing steps: feature extraction, dimension reduction, and classification. Feature extraction automatically learns features from…

cmp-lg · Computer Science 2008-02-03 Thomas Bayer, Ingrid Renz, Michael Stein, Ulrich Kressel

Attention-based Neural Text Segmentation

Text segmentation plays an important role in various Natural Language Processing (NLP) tasks like summarization, context understanding, document indexing and document noise removal. Previous methods for this task require manual feature…

Machine Learning · Computer Science 2018-08-30 Pinkesh Badjatiya, Litton J Kurisinkel, Manish Gupta, Vasudeva Varma

Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer…

Computation and Language · Computer Science 2024-10-04 Markus Frohmann, Igor Sterner, Ivan Vulić, Benjamin Minixhofer, Markus Schedl

Text Segmentation by Cross Segment Attention

Document and discourse segmentation are two fundamental NLP tasks pertaining to breaking up text into constituents, which are commonly used to help downstream tasks such as information retrieval or text summarization. In this work, we…

Computation and Language · Computer Science 2020-12-08 Michal Lukasik, Boris Dadachev, Gonçalo Simões, Kishore Papineni

Text Segmentation based on Semantic Word Embeddings

We explore the use of semantic word embeddings in text segmentation algorithms, including the C99 segmentation algorithm and new algorithms inspired by the distributed word vector representation. By developing a general framework for…

Computation and Language · Computer Science 2015-03-19 Alexander A Alemi, Paul Ginsparg

A Sequential Algorithm for Training Text Classifiers

The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential…

cmp-lg · Computer Science 2008-02-03 David D. Lewis, William A. Gale

Automatic Discovery of Non-Compositional Compounds in Parallel Data

Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine…

cmp-lg · Computer Science 2008-02-03 I. Dan Melamed