Related papers: Representing Text Chunks

Neural Models for Sequence Chunking

Many natural language understanding (NLU) tasks, such as shallow parsing (i.e., text chunking) and semantic slot filling, require the assignment of representative labels to the meaningful chunks in a sentence. Most of the current deep…

Computation and Language · Computer Science 2017-01-17 Feifei Zhai, Saloni Potdar, Bing Xiang, Bowen Zhou

Discovering Chunks in Neural Embeddings for Interpretability

Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this…

Machine Learning · Computer Science 2025-02-05 Shuchen Wu, Stephan Alaniz, Eric Schulz, Zeynep Akata

Text Chunking using Transformation-Based Learning

Eric Brill introduced transformation-based learning and showed that it can do part-of-speech tagging with fairly high accuracy. The same method can be applied at a higher level of textual interpretation for locating chunks in the tagged…

cmp-lg · Computer Science 2009-09-25 Lance A. Ramshaw, Mitchell P. Marcus

Is Semantic Chunking Worth the Computational Cost?

Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the…

Computation and Language · Computer Science 2024-10-18 Renyi Qu, Ruixuan Tu, Forrest Bao

Same Representation, Different Attentions: Shareable Sentence Representation Learning from Multiple Tasks

Distributed representation plays an important role in deep learning based natural language processing. However, the representation of a sentence often varies in different tasks, which is usually learned from scratch and suffers from the…

Computation and Language · Computer Science 2018-04-24 Renjie Zheng, Junkun Chen, Xipeng Qiu

Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception

While Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for boosting large language models (LLMs) in knowledge-intensive tasks, it often overlooks the crucial aspect of text chunking within its workflow. This paper…

Computation and Language · Computer Science 2025-05-22 Jihao Zhao, Zhiyuan Ji, Yuchen Feng, Pengnian Qi, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

ChuLo: Chunk-Level Key Information Representation for Long Document Understanding

Transformer-based models have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet their ability to handle long documents is constrained by computational limitations. Traditional approaches, such as truncating…

Computation and Language · Computer Science 2025-08-21 Yan Li, Soyeon Caren Han, Yue Dai, Feiqi Cao

Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis

Chunking is a crucial preprocessing step in retrieval-augmented generation (RAG) systems, significantly impacting retrieval effectiveness across diverse datasets. In this study, we systematically evaluate fixed-size chunking strategies and…

Information Retrieval · Computer Science 2025-05-30 Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, Nicolas Flores-Herr

Nugget: Neural Agglomerative Embeddings of Text

Embedding text sequences is a widespread requirement in modern language understanding. Existing approaches focus largely on constant-size representations. This is problematic, as the amount of information contained in text often varies with…

Computation and Language · Computer Science 2023-10-04 Guanghui Qin, Benjamin Van Durme

Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently,…

Computation and Language · Computer Science 2025-07-08 Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, Han Xiao

The Impact of Visual Segmentation on Lexical Word Recognition

When a reader encounters a word in English, they split the word into smaller orthographic units in the process of recognizing its meaning. For example, "rough", when split according to phonemes, is decomposed as r-ou-gh (not as r-o-ugh or…

Human-Computer Interaction · Computer Science 2025-08-26 Matthew Termuende, Kevin Larson, Miguel Nacenta

Chunk Tagger - Statistical Recognition of Noun Phrases

We describe a stochastic approach to partial parsing, i.e., the recognition of syntactic structures of limited depth. The technique utilises Markov Models, but goes beyond usual bracketing approaches, since it is capable of recognising not…

cmp-lg · Computer Science 2007-05-23 Wojciech Skut, Thorsten Brants

Improving cross-lingual model transfer by chunking

We present a shallow parser guided cross-lingual model transfer approach in order to address the syntactic differences between source and target languages more effectively. In this work, we assume the chunks or phrases in a sentence as…

Computation and Language · Computer Science 2020-02-28 Ayan Das, Sudeshna Sarkar

Mimicking Human Process: Text Representation via Latent Semantic Clustering for Classification

Considering that words with different characteristics in the text have different importance for classification, grouping them together separately can strengthen the semantic expression of each part. Thus we propose a new text representation…

Computation and Language · Computer Science 2019-06-19 Xiaoye Tan, Rui Yan, Chongyang Tao, Mingrui Wu

A Systematic Comparison of English Noun Compound Representations

Building meaningful representations of noun compounds is not trivial since many of them scarcely appear in the corpus. To that end, composition functions approximate the distributional representation of a noun compound by combining its…

Computation and Language · Computer Science 2019-06-13 Vered Shwartz

A Cross-Task Analysis of Text Span Representations

Many natural language processing (NLP) tasks involve reasoning with textual spans, including question answering, entity recognition, and coreference resolution. While extensive research has focused on functional architectures for…

Computation and Language · Computer Science 2020-06-09 Shubham Toshniwal, Haoyue Shi, Bowen Shi, Lingyu Gao, Karen Livescu, Kevin Gimpel

The Emergence of Chunking Structures with Hierarchical RNN

In Natural Language Processing (NLP), predicting linguistic structures, such as parsing and chunking, has mostly relied on manual annotations of syntactic structures. This paper introduces an unsupervised approach to chunking, a syntactic…

Computation and Language · Computer Science 2025-12-19 Zijun Wu, Anup Anand Deshmukh, Yongkang Wu, Jimmy Lin, Lili Mou

Chunk Knowledge Generation Model for Enhanced Information Retrieval: A Multi-task Learning Approach

Traditional query expansion techniques for addressing vocabulary mismatch problems in information retrieval are context-sensitive and may lead to performance degradation. As an alternative, document expansion research has gained attention,…

Information Retrieval · Computer Science 2025-09-22 Jisu Kim, Jinhee Park, Changhyun Jeon, Jungwoo Choi, Keonwoo Kim, Minji Hong, Sehyun Kim

Recurrent Chunking Mechanisms for Long-Text Machine Reading Comprehension

In this paper, we study machine reading comprehension (MRC) on long texts, where a model takes as inputs a lengthy document and a question and then extracts a text span from the document as an answer. State-of-the-art models tend to use a…

Computation and Language · Computer Science 2020-05-20 Hongyu Gong, Yelong Shen, Dian Yu, Jianshu Chen, Dong Yu

Learning Robust, Transferable Sentence Representations for Text Classification

Although deep recurrent neural networks (RNNs) demonstrate strong performance in text classification, training RNN models is often expensive and requires an extensive collection of annotated data, which may not be available. To overcome the…

Computation and Language · Computer Science 2018-10-02 Wasi Uddin Ahmad, Xueying Bai, Nanyun Peng, Kai-Wei Chang