Related papers: Modeling Text Complexity using a Multi-Scale Probi…

200 papers

The growing need to analyze large collections of documents has led to great developments in topic modeling. Since documents are frequently associated with other related variables, such as labels or ratings, much interest has been placed on…

Machine Learning · Statistics 2018-08-20 Filipe Rodrigues, Mariana Lourenço, Bernardete Ribeiro, Francisco Pereira

In this work, a problem associated with imbalanced text corpora is addressed. A method of converting an imbalanced text corpus into a balanced one is presented. The presented method employs a clustering algorithm for conversion. Initially…

Information Retrieval · Computer Science 2017-06-27 Lavanya Narayana Raju, Mahamad Suhil, D S Guru, Harsha S Gowda

We address the problem of multi-class classification in the case where the number of classes is very large. We propose a double sampling strategy on top of a multi-class to binary reduction strategy, which transforms the original…

Machine Learning · Statistics 2021-09-15 Bikash Joshi, Massih-Reza Amini, Ioannis Partalas, Franck Iutzeler, Yury Maximov

This study proposes a text classification algorithm based on large language models, aiming to address the limitations of traditional methods in capturing long-range dependencies, understanding contextual semantics, and handling class…

Computation and Language · Computer Science 2025-12-11 Ning Lyu, Yuxi Wang, Feng Chen, Qingyuan Zhang

With the ongoing growth in number of digital articles in a wider set of languages and the expanding use of different languages, we need annotation methods that enable browsing multi-lingual corpora. Multilingual probabilistic topic models…

Computation and Language · Computer Science 2021-01-11 Carlos Badenes-Olmedo, Jose-Luis Redondo García, Oscar Corcho

Measuring a document's complexity level is an open challenge, particularly when one is working on a diverse corpus of documents rather than comparing several documents on a similar topic or working on a language other than English. In this…

Computation and Language · Computer Science 2022-09-01 Vincent Primpied, David Beauchemin, Richard Khoury

Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We…

Machine Learning · Statistics 2017-08-16 Måns Magnusson, Leif Jonsson, Mattias Villani, David Broman

Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity--how hard a text is to read--remains less explored. We reduce surface-level complexity (shorter sentences, simpler words,…

Computation and Language · Computer Science 2025-10-07 Dan John Velasco, Matthew Theodore Roque

The logistic normal distribution has recently been adapted via the transformation of multivariate Gaussian variables to model the topical distribution of documents in the presence of correlations among topics. In this paper, we propose a…

Machine Learning · Statistics 2014-10-06 Xingchen Yu, Ernest Fokoue

In this paper we propose and study a new complexity model for approximation algorithms. The main motivation are practical problems over large data sets that need to be solved many times for different scenarios, e.g., many multicast trees…

Data Structures and Algorithms · Computer Science 2010-06-18 Marek Cygan, Lukasz Kowalik, Marcin Mucha, Marcin Pilipczuk, Piotr Sankowski

In the prose style transfer task a system, provided with text input and a target prose style, produces output which preserves the meaning of the input text but alters the style. These systems require parallel data for evaluation of results…

Computation and Language · Computer Science 2021-09-01 Keith Carlson, Allen Riddell, Daniel Rockmore

Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps (e.g.,…

Databases · Computer Science 2016-12-20 Ciprian-Octavian Truică, Jérôme Darmont, Julien Velcin

Annotating large collections of textual data can be time consuming and expensive. That is why the ability to train models with limited annotation budgets is of great importance. In this context, it has been shown that under tight annotation…

Computation and Language · Computer Science 2022-10-13 César González-Gutiérrez, Audi Primadhanty, Francesco Cazzaro, Ariadna Quattoni

Automatic Text Summarization strategies have been successfully employed to digest text collections and extract its essential content. Usually, summaries are generated using textual corpora that belongs to the same domain area where the…

Computation and Language · Computer Science 2018-07-03 Vinicius Woloszyn, Guilherme Medeiros Machado, Leandro Krug Wives, José Palazzo Moreira de Oliveira

Large-scale language models such as GPT-3 are excellent few-shot learners, allowing them to be controlled via natural text prompts. Recent studies report that prompt-based direct classification eliminates the need for fine-tuning but lacks…

Computation and Language · Computer Science 2021-11-19 Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, Woomyeong Park

Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and insufficient labeled samples in practical scenarios. We propose a novel model named MI-DELIGHT for short…

Computation and Language · Computer Science 2025-01-17 Yonghao Liu, Mengyu Li, Wei Pang, Fausto Giunchiglia, Lan Huang, Xiaoyue Feng, Renchu Guan

In this paper, we focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer and aims to preserve text styles while altering the content. In detail, the input is a set of structured…

Computation and Language · Computer Science 2020-02-25 Xiaocheng Feng, Yawei Sun, Bing Qin, Heng Gong, Yibo Sun, Wei Bi, Xiaojiang Liu, Ting Liu

Textual representations based on pre-trained language models are key, especially in few-shot learning scenarios. What makes a representation good for text classification? Is it due to the geometric properties of the space or because it is…

Computation and Language · Computer Science 2023-06-01 Cesar Gonzalez-Gutierrez, Audi Primadhanty, Francesco Cazzaro, Ariadna Quattoni

Generalized Labeled Multi-Bernoulli (GLMB) densities arise in a host of multi-object system applications analogous to Gaussians in single-object filtering. However, computing the GLMB filtering density requires solving NP-hard problems. To…

Machine Learning · Statistics 2023-12-29 Changbeom Shim, Ba-Tuong Vo, Ba-Ngu Vo, Jonah Ong, Diluka Moratuwage

A new fast algorithm for clustering and classification of large collections of text documents is introduced. The new algorithm employs the bipartite graph that realizes the word-document matrix of the collection. Namely, the modularity of…

Information Retrieval · Computer Science 2011-05-31 Grigory Pivovarov, Sergei Trunov