Related papers: Minimum Entropy Approach to Word Segmentation Probl…

Exact Probability Distribution versus Entropy

The problem addressed concerns the determination of the average number of successive attempts of guessing a word of a certain length consisting of letters with given probabilities of occurrence. Both first- and second-order approximations…

Information Theory · Computer Science 2015-06-19 Kerstin Andersson

Segmenting DNA sequence into 'words'

This paper presents a novel method to segment/decode DNA sequences based on n-grams statistical language model. Firstly, we find the length of most DNA 'words' is 12 to 15 bps by analyzing the genomes of 12 model species. Then we design an…

Genomics · Quantitative Biology 2015-03-13 Wang Liang

Language Segmentation

Language segmentation consists in finding the boundaries where one language ends and another language begins in a text written in more than one language. This is important for all natural language processing tasks. The problem can be solved…

Computation and Language · Computer Science 2015-10-07 David Alfter

On the Difficulty of Segmenting Words with Attention

Word segmentation, the problem of finding word boundaries in speech, is of interest for a range of tasks. Previous papers have suggested that for sequence-to-sequence models trained on tasks such as speech translation or speech recognition,…

Computation and Language · Computer Science 2021-09-22 Ramon Sanabria, Hao Tang, Sharon Goldwater

Approaches to the classification of complex systems: Words, texts, and more

The Chapter starts with introductory information about quantitative linguistics notions, like rank-frequency dependence, Zipf's law, frequency spectra, etc. Similarities in distributions of words in texts with level occupation in quantum…

Data Analysis, Statistics and Probability · Physics 2024-01-04 Andrij Rovenchak

Text segmentation with character-level text embeddings

Learning word representations has recently seen much success in computational linguistics. However, assuming sequences of word tokens as input to linguistic analysis is often unjustified. For many languages word segmentation is a…

Computation and Language · Computer Science 2013-09-19 Grzegorz Chrupała

Neural Sequence Segmentation as Determining the Leftmost Segments

Prior methods to text segmentation are mostly at token level. Despite the adequacy, this nature limits their full potential to capture the long-term dependencies among segments. In this work, we propose a novel framework that incrementally…

Computation and Language · Computer Science 2021-04-16 Yangming Li, Lemao Liu, Kaisheng Yao

Towards Lossless Encoding of Sentences

A lot of work has been done in the field of image compression via machine learning, but not much attention has been given to the compression of natural language. Compressing text into lossless representations while making features easily…

Computation and Language · Computer Science 2019-08-05 Gabriele Prato, Mathieu Duchesneau, Sarath Chandar, Alain Tapp

Truncation Sampling as Language Model Desmoothing

Long samples of text from neural language models can be of poor quality. Truncation sampling algorithms, like top-$p$ or top-$k$, address this by setting some words' probabilities to zero at each step. This work provides framing for the…

Computation and Language · Computer Science 2022-10-28 John Hewitt, Christopher D. Manning, Percy Liang

Unsupervised Word Segmentation with Bi-directional Neural Language Model

We present an unsupervised word segmentation model, in which the learning objective is to maximize the generation probability of a sentence given its all possible segmentation. Such generation probability can be factorized into the…

Computation and Language · Computer Science 2021-03-03 Lihao Wang, Zongyi Li, Xiaoqing Zheng

Semantic Chunking and the Entropy of Natural Language

The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80…

Computation and Language · Computer Science 2026-02-19 Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks

Effective Subword Segmentation for Text Comprehension

Representation learning is the foundation of machine reading comprehension and inference. In state-of-the-art models, character-level representations have been broadly adopted to alleviate the problem of effectively representing rare or…

Computation and Language · Computer Science 2019-06-12 Zhuosheng Zhang, Hai Zhao, Kangwei Ling, Jiangtong Li, Zuchao Li, Shexia He, Guohong Fu

Alignment Entropy Regularization

Existing training criteria in automatic speech recognition (ASR) permit the model to freely explore more than one time alignments between the feature and label sequences. In this paper, we use entropy to measure a model's uncertainty, i.e.…

Computation and Language · Computer Science 2022-12-26 Ehsan Variani, Ke Wu, David Rybach, Cyril Allauzen, Michael Riley

Comparing Neural- and N-Gram-Based Language Models for Word Segmentation

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and…

Computation and Language · Computer Science 2018-12-04 Yerai Doval, Carlos Gómez-Rodríguez

Maximum Entropy Regularization and Chinese Text Recognition

Chinese text recognition is more challenging than Latin text due to the large amount of fine-grained Chinese characters and the great imbalance over classes, which causes a serious overfitting problem. We propose to apply Maximum Entropy…

Computer Vision and Pattern Recognition · Computer Science 2020-07-10 Changxu Cheng, Wuheng Xu, Xiang Bai, Bin Feng, Wenyu Liu

A Maximum-Entropy Partial Parser for Unrestricted Text

This paper describes a partial parser that assigns syntactic structures to sequences of part-of-speech tags. The program uses the maximum entropy parameter estimation method, which allows a flexible combination of different knowledge…

cmp-lg · Computer Science 2007-05-23 Wojciech Skut, Thorsten Brants

Entropy-based closure for probabilistic learning on manifolds

In a recent paper, the authors proposed a general methodology for probabilistic learning on manifolds. The method was used to generate numerical samples that are statistically consistent with an existing dataset construed as a realization…

Probability · Mathematics 2018-03-30 C. Soizea, R. Ghanem, C. Safta, X. Huan, Z. P. Vane, J. Oefelein, G. Lacaz, H. N. Najm, Q. Tang, X. Chen

MEP-Net: Generating Solutions to Scientific Problems with Limited Knowledge by Maximum Entropy Principle

Maximum entropy principle (MEP) offers an effective and unbiased approach to inferring unknown probability distributions when faced with incomplete information, while neural networks provide the flexibility to learn complex distributions…

Machine Learning · Statistics 2024-12-04 Wuyue Yang, Liangrong Peng, Guojie Li, Liu Hong

Minimum Description Length Principle for Maximum Entropy Model Selection

Model selection is central to statistics, and many learning problems can be formulated as model selection problems. In this paper, we treat the problem of selecting a maximum entropy model given various feature subsets and their moments, as…

Information Theory · Computer Science 2013-11-28 Gaurav Pandey, Ambedkar Dukkipati

Universal Word Segmentation: Implementation and Interpretation

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different…

Computation and Language · Computer Science 2018-07-10 Yan Shao, Christian Hardmeier, Joakim Nivre