Related papers: Modeling Text Complexity using a Multi-Scale Probi…

Learning Supervised Topic Models for Classification and Regression from Crowds

The growing need to analyze large collections of documents has led to great developments in topic modeling. Since documents are frequently associated with other related variables, such as labels or ratings, much interest has been placed on…

Machine Learning · Statistics 2018-08-20 Filipe Rodrigues, Mariana Lourenço, Bernardete Ribeiro, Francisco Pereira

Cluster Based Symbolic Representation for Skewed Text Categorization

In this work, a problem associated with imbalanced text corpora is addressed. A method of converting an imbalanced text corpus into a balanced one is presented. The presented method employs a clustering algorithm for conversion. Initially…

Information Retrieval · Computer Science 2017-06-27 Lavanya Narayana Raju, Mahamad Suhil, D S Guru, Harsha S Gowda

Aggressive Sampling for Multi-class to Binary Reduction with Applications to Text Classification

We address the problem of multi-class classification in the case where the number of classes is very large. We propose a double sampling strategy on top of a multi-class to binary reduction strategy, which transforms the original…

Machine Learning · Statistics 2021-09-15 Bikash Joshi, Massih-Reza Amini, Ioannis Partalas, Franck Iutzeler, Yury Maximov

Advancing Text Classification with Large Language Models and Neural Attention Mechanisms

This study proposes a text classification algorithm based on large language models, aiming to address the limitations of traditional methods in capturing long-range dependencies, understanding contextual semantics, and handling class…

Computation and Language · Computer Science 2025-12-11 Ning Lyu, Yuxi Wang, Feng Chen, Qingyuan Zhang

Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies

With the ongoing growth in number of digital articles in a wider set of languages and the expanding use of different languages, we need annotation methods that enable browsing multi-lingual corpora. Multilingual probabilistic topic models…

Computation and Language · Computer Science 2021-01-11 Carlos Badenes-Olmedo, Jose-Luis Redondo García, Oscar Corcho

Quantifying French Document Complexity

Measuring a document's complexity level is an open challenge, particularly when one is working on a diverse corpus of documents rather than comparing several documents on a similar topic or working on a language other than English. In this…

Computation and Language · Computer Science 2022-09-01 Vincent Primpied, David Beauchemin, Richard Khoury

Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models

Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We…

Machine Learning · Statistics 2017-08-16 Måns Magnusson, Leif Jonsson, Mattias Villani, David Broman

Rethinking the Role of Text Complexity in Language Model Pretraining

Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity--how hard a text is to read--remains less explored. We reduce surface-level complexity (shorter sentences, simpler words,…

Computation and Language · Computer Science 2025-10-07 Dan John Velasco, Matthew Theodore Roque

Probit Normal Correlated Topic Models

The logistic normal distribution has recently been adapted via the transformation of multivariate Gaussian variables to model the topical distribution of documents in the presence of correlations among topics. In this paper, we propose a…

Machine Learning · Statistics 2014-10-06 Xingchen Yu, Ernest Fokoue

Fast Approximation in Subspaces by Doubling Metric Decomposition

In this paper we propose and study a new complexity model for approximation algorithms. The main motivation are practical problems over large data sets that need to be solved many times for different scenarios, e.g., many multicast trees…

Data Structures and Algorithms · Computer Science 2010-06-18 Marek Cygan, Lukasz Kowalik, Marcin Mucha, Marcin Pilipczuk, Piotr Sankowski

Evaluating prose style transfer with the Bible

In the prose style transfer task a system, provided with text input and a target prose style, produces output which preserves the meaning of the input text but alters the style. These systems require parallel data for evaluation of results…

Computation and Language · Computer Science 2021-09-01 Keith Carlson, Allen Riddell, Daniel Rockmore

A Scalable Document-based Architecture for Text Analysis

Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps (e.g.,…

Databases · Computer Science 2016-12-20 Ciprian-Octavian Truică, Jérôme Darmont, Julien Velcin

Analyzing Text Representations under Tight Annotation Budgets: Measuring Structural Alignment

Annotating large collections of textual data can be time consuming and expensive. That is why the ability to train models with limited annotation budgets is of great importance. In this context, it has been shown that under tight annotation…

Computation and Language · Computer Science 2022-10-13 César González-Gutiérrez, Audi Primadhanty, Francesco Cazzaro, Ariadna Quattoni

Modeling, comprehending and summarizing textual content by graphs

Automatic Text Summarization strategies have been successfully employed to digest text collections and extract its essential content. Usually, summaries are generated using textual corpora that belongs to the same domain area where the…

Computation and Language · Computer Science 2018-07-03 Vinicius Woloszyn, Guilherme Medeiros Machado, Leandro Krug Wives, José Palazzo Moreira de Oliveira

GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation

Large-scale language models such as GPT-3 are excellent few-shot learners, allowing them to be controlled via natural text prompts. Recent studies report that prompt-based direct classification eliminates the need for fine-tuning but lacks…

Computation and Language · Computer Science 2021-11-19 Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, Woomyeong Park

Boosting Short Text Classification with Multi-Source Information Exploration and Dual-Level Contrastive Learning

Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and insufficient labeled samples in practical scenarios. We propose a novel model named MI-DELIGHT for short…

Computation and Language · Computer Science 2025-01-17 Yonghao Liu, Mengyu Li, Wei Pang, Fausto Giunchiglia, Lan Huang, Xiaoyue Feng, Renchu Guan

Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation

In this paper, we focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer and aims to preserve text styles while altering the content. In detail, the input is a set of structured…

Computation and Language · Computer Science 2020-02-25 Xiaocheng Feng, Yawei Sun, Bing Qin, Heng Gong, Yibo Sun, Wei Bi, Xiaojiang Liu, Ting Liu

Analyzing Text Representations by Measuring Task Alignment

Textual representations based on pre-trained language models are key, especially in few-shot learning scenarios. What makes a representation good for text classification? Is it due to the geometric properties of the space or because it is…

Computation and Language · Computer Science 2023-06-01 Cesar Gonzalez-Gutierrez, Audi Primadhanty, Francesco Cazzaro, Ariadna Quattoni

Linear Complexity Gibbs Sampling for Generalized Labeled Multi-Bernoulli Filtering

Generalized Labeled Multi-Bernoulli (GLMB) densities arise in a host of multi-object system applications analogous to Gaussians in single-object filtering. However, computing the GLMB filtering density requires solving NP-hard problems. To…

Machine Learning · Statistics 2023-12-29 Changbeom Shim, Ba-Tuong Vo, Ba-Ngu Vo, Jonah Ong, Diluka Moratuwage

Clustering and Classification in Text Collections Using Graph Modularity

A new fast algorithm for clustering and classification of large collections of text documents is introduced. The new algorithm employs the bipartite graph that realizes the word-document matrix of the collection. Namely, the modularity of…

Information Retrieval · Computer Science 2011-05-31 Grigory Pivovarov, Sergei Trunov