Related papers: Modeling Text Complexity using a Multi-Scale Probi…

Adaptive Text Recognition through Visual Matching

In this work, our objective is to address the problems of generalization and flexibility for text recognition in documents. We introduce a new model that exploits the repetitive nature of characters in languages, and decouples the visual…

Computer Vision and Pattern Recognition · Computer Science 2020-09-15 Chuhan Zhang, Ankush Gupta, Andrew Zisserman

A Corpus-Based Investigation of Definite Description Use

We present the results of a study of definite description use in written texts aimed at assessing the feasibility of annotating corpora with information about definite description interpretation. We ran two experiments, in which subjects…

cmp-lg · Computer Science 2007-05-23 Massimo Poesio, Renata Vieira

Graph Topic Modeling for Documents with Spatial or Covariate Dependencies

We address the challenge of incorporating document-level metadata into topic modeling to improve topic mixture estimation. To overcome the computational complexity and lack of theoretical guarantees in existing Bayesian methods, we extend…

Machine Learning · Computer Science 2025-03-18 Yeo Jin Jung, Claire Donnat

Neural Text Classification by Jointly Learning to Cluster and Align

Distributional text clustering delivers semantically informative representations and captures the relevance between each word and semantic clustering centroids. We extend the neural text clustering approach to text classification tasks by…

Computation and Language · Computer Science 2020-11-25 Yekun Chai, Haidong Zhang, Shuo Jin

How Many Topics? Stability Analysis for Topic Models

Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling…

Machine Learning · Computer Science 2014-06-20 Derek Greene, Derek O'Callaghan, Pádraig Cunningham

Large-Scale Evaluation of Topic Models and Dimensionality Reduction Methods for 2D Text Spatialization

Topic models are a class of unsupervised learning algorithms for detecting the semantic structure within a text corpus. Together with a subsequent dimensionality reduction algorithm, topic models can be used for deriving spatializations for…

Computation and Language · Computer Science 2023-10-26 Daniel Atzberger, Tim Cech, Willy Scheibel, Matthias Trapp, Rico Richter, Jürgen Döllner, Tobias Schreck

Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization

We consider the problem of modeling the content structure of texts within a specific domain, in terms of the topics the texts address and the order in which these topics appear. We first present an effective knowledge-lean method for…

Computation and Language · Computer Science 2007-05-23 Regina Barzilay, Lillian Lee

Extending Multi-Text Sentence Fusion Resources via Pyramid Annotations

NLP models that compare or consolidate information across multiple documents often struggle when challenged with recognizing substantial information redundancies across the texts. For example, in multi-document summarization it is crucial…

Computation and Language · Computer Science 2021-10-12 Daniela Brook Weiss, Paul Roit, Ori Ernst, Ido Dagan

Text characterization based on recurrence networks

Several complex systems are characterized by presenting intricate characteristics taking place at several scales of time and space. These multiscale characterizations are used in various applications, including better understanding…

Computation and Language · Computer Science 2023-05-12 Bárbara C. e Souza, Filipi N. Silva, Henrique F. de Arruda, Giovana D. da Silva, Luciano da F. Costa, Diego R. Amancio

Controlling Pre-trained Language Models for Grade-Specific Text Simplification

Text simplification (TS) systems rewrite text to make it more readable while preserving its content. However, what makes a text easy to read depends on the intended readers. Recent work has shown that pre-trained language models can…

Computation and Language · Computer Science 2023-12-01 Sweta Agrawal, Marine Carpuat

Measuring Societal Biases from Text Corpora with Smoothed First-Order Co-occurrence

Text corpora are widely used resources for measuring societal biases and stereotypes. The common approach to measuring such biases using a corpus is by calculating the similarities between the embedding vector of a word (like nurse) and the…

Computation and Language · Computer Science 2021-04-28 Navid Rekabsaz, Robert West, James Henderson, Allan Hanbury

On Smoothing and Inference for Topic Models

Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling high-dimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling,…

Machine Learning · Computer Science 2012-05-14 Arthur Asuncion, Max Welling, Padhraic Smyth, Yee Whye Teh

Computer-Assisted Text Analysis for Social Science: Topic Models and Beyond

Topic models are a family of statistical-based algorithms to summarize, explore and index large collections of text documents. After a decade of research led by computer scientists, topic models have spread to social science as a new…

Computation and Language · Computer Science 2018-04-04 Ryan Wesslen

Text vectorization via transformer-based language models and n-gram perplexities

As the probability (and thus perplexity) of a text is calculated based on the product of the probabilities of individual tokens, it may happen that one unlikely token significantly reduces the probability (i.e., increases the perplexity) of…

Computation and Language · Computer Science 2023-07-19 Mihailo Škorić

A discomfort-informed adaptive Gibbs sampler for finite mixture models

Finite mixture models are frequently used to uncover latent structures in high-dimensional datasets (e.g. identifying clusters of patients in electronic health records). The inference of such structures can be performed in a Bayesian…

Methodology · Statistics 2025-12-02 Davide Fabbrico, Andi Q. Wang, Sebastiano Grazzi, Alice Corbella, Gareth O. Roberts, Sylvia Richardson, Filippo Pagani, Paul D. W. Kirk

A Probabilistic Model of Compound Nouns

Compound nouns such as example noun compound are becoming more common in natural language and pose a number of difficult problems for NLP systems, notably increasing the complexity of parsing. In this paper we develop a probabilistic model…

cmp-lg · Computer Science 2008-02-03 Mark Lauer, Mark Dras

Dealing with Difficult Minority Labels in Imbalanced Mutilabel Data Sets

Multilabel classification is an emergent data mining task with a broad range of real world applications. Learning from imbalanced multilabel data is being deeply studied latterly, and several resampling methods have been proposed in the…

Machine Learning · Computer Science 2018-02-15 Francisco Charte, Antonio J. Rivera, María J. del Jesus, Francisco Herrera

Multi-Level Attention and Contrastive Learning for Enhanced Text Classification with an Optimized Transformer

This paper studies a text classification algorithm based on an improved Transformer to improve the performance and efficiency of the model in text classification tasks. Aiming at the shortcomings of the traditional Transformer model in…

Computation and Language · Computer Science 2025-01-24 Jia Gao, Guiran Liu, Binrong Zhu, Shicheng Zhou, Hongye Zheng, Xiaoxuan Liao

Provable Algorithms for Inference in Topic Models

Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference…

Machine Learning · Computer Science 2016-05-30 Sanjeev Arora, Rong Ge, Frederic Koehler, Tengyu Ma, Ankur Moitra

TexComp - A Text Complexity Analyzer for Student Texts

This paper describes a method for providing feedback about the degree of complexity that is present in particular texts. Both the method and the software tool called TexComp are designed for use during the assessment of student compositions…

Computers and Society · Computer Science 2012-06-29 T. Kakkonen