Related papers: Modeling Text Complexity using a Multi-Scale Probi…

Text Modeling using Unsupervised Topic Models and Concept Hierarchies

Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. While topic models can potentially discover a broad range of themes in a data set,…

Artificial Intelligence · Computer Science 2008-08-08 Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers

Text Segmentation Using Exponential Models

This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To…

cmp-lg · Computer Science 2008-02-03 Doug Beeferman, Adam Berger, John Lafferty

Analyses of Multi-collection Corpora via Compound Topic Modeling

As electronically stored data grow in daily life, obtaining novel and relevant information becomes challenging in text mining. Thus people have sought statistical methods based on term frequency, matrix algebra, or topic modeling for text…

Information Retrieval · Computer Science 2019-07-04 Clint P. George, Wei Xia, George Michailidis

Text Classification Based on Knowledge Graphs and Improved Attention Mechanism

To resolve the semantic ambiguity in texts, we propose a model, which innovatively combines a knowledge graph with an improved attention mechanism. An existing knowledge base is utilized to enrich the text with relevant contextual concepts.…

Computation and Language · Computer Science 2024-01-30 Siyu Li, Lu Chen, Chenwei Song, Xinyi Liu

A Two-Sample Test of Text Generation Similarity

The surge in digitized text data requires reliable inferential methods on observed textual patterns. This article proposes a novel two-sample text test for comparing similarity between two groups of documents. The hypothesis is whether the…

Machine Learning · Statistics 2025-05-09 Jingbin Xu, Chen Qian, Meimei Liu, Feng Guo

FastText.zip: Compressing text classification models

We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method…

Computation and Language · Computer Science 2016-12-19 Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, Tomas Mikolov

Text as Environment: A Deep Reinforcement Learning Text Readability Assessment Model

Evaluating the readability of a text can significantly facilitate the precise expression of information in written form. The formulation of text readability assessment involves the identification of meaningful properties of the text…

Computation and Language · Computer Science 2023-10-24 Hamid Mohammadi, Seyed Hossein Khasteh, Tahereh Firoozi, Taha Samavati

Unveiling the semantic structure of text documents using paragraph-aware Topic Models

Classic Topic Models are built under the Bag Of Words assumption, in which word position is ignored for simplicity. Besides, symmetric priors are typically used in most applications. In order to easily learn topics with different properties…

Computation and Language · Computer Science 2018-06-27 Simón Roca-Sotelo, Jerónimo Arenas-García

Hierarchical Ranking Neural Network for Long Document Readability Assessment

Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text…

Computation and Language · Computer Science 2025-11-27 Yurui Zheng, Yijun Chen, Shaohong Zhang

Understanding Text Classification Data and Models Using Aggregated Input Salience

Realizing when a model is right for a wrong reason is not trivial and requires a significant effort by model developers. In some cases an input salience method, which highlights the most important parts of the input, may reveal problematic…

Computation and Language · Computer Science 2023-01-12 Sebastian Ebert, Alice Shoshana Jakobovits, Katja Filippova

Text Classification with Few Examples using Controlled Generalization

Training data for text classification is often limited in practice, especially for applications with many output classes or involving many related classification problems. This means classifiers must generalize from limited evidence, but…

Computation and Language · Computer Science 2020-05-19 Abhijit Mahabal, Jason Baldridge, Burcu Karagol Ayan, Vincent Perot, Dan Roth

Joint Embedding of Words and Labels for Text Classification

Word embeddings are effective intermediate representations for capturing semantic regularities between words, when learning the representations of text sequences. We propose to view text classification as a label-word joint embedding…

Computation and Language · Computer Science 2018-05-14 Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, Lawrence Carin

Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts

A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts' rich stories into a single number is often…

Computation and Language · Computer Science 2021-02-05 Ryan J. Gallagher, Morgan R. Frank, Lewis Mitchell, Aaron J. Schwartz, Andrew J. Reagan, Christopher M. Danforth, Peter Sheridan Dodds

Unsupervised Matching of Data and Text

Entity resolution is a widely studied problem with several proposals to match records across relations. Matching textual content is a widespread task in many applications, such as question answering and search. While recent methods achieve…

Databases · Computer Science 2021-12-17 Naser Ahmadi, Hansjorg Sand, Paolo Papotti

A Multiplicative Model for Learning Distributed Text-Based Attribute Representations

In this paper we propose a general framework for learning distributed representations of attributes: characteristics of text whose representations can be jointly learned with word embeddings. Attributes can correspond to document indicators…

Machine Learning · Computer Science 2014-06-12 Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov

A modified model for topic detection from a corpus and a new metric evaluating the understandability of topics

This paper presents a modified neural model for topic detection from a corpus and proposes a new metric to evaluate the detected topics. The new model builds upon the embedded topic model incorporating some modifications such as document…

Computation and Language · Computer Science 2023-06-09 Tomoya Kitano, Yuto Miyatake, Daisuke Furihata

CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data

Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few…

Computation and Language · Computer Science 2020-06-12 Matthew Shardlow, Michael Cooper, Marcos Zampieri

Modelling the semantics of text in complex document layouts using graph transformer networks

Representing structured text from complex documents typically calls for different machine learning techniques, such as language models for paragraphs and convolutional neural networks (CNNs) for table extraction, which prohibits drawing…

Computation and Language · Computer Science 2022-02-21 Thomas Roland Barillot, Jacob Saks, Polena Lilyanova, Edward Torgas, Yachen Hu, Yuanqing Liu, Varun Balupuri, Paul Gaskell

Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

Supervised topic models with a logistic likelihood have two issues that potentially limit their practical use: 1) response variables are usually over-weighted by document word counts; and 2) existing variational inference methods make…

Machine Learning · Computer Science 2013-10-10 Jun Zhu, Xun Zheng, Bo Zhang

Topic Modeling of Hierarchical Corpora

We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy. We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by…

Machine Learning · Statistics 2015-04-14 Do-kyum Kim, Geoffrey M. Voelker, Lawrence K. Saul