Related papers: Modeling Text Complexity using a Multi-Scale Probi…
The growing need to analyze large collections of documents has led to great developments in topic modeling. Since documents are frequently associated with other related variables, such as labels or ratings, much interest has been placed on…
In this work, a problem associated with imbalanced text corpora is addressed. A method of converting an imbalanced text corpus into a balanced one is presented. The presented method employs a clustering algorithm for conversion. Initially…
We address the problem of multi-class classification in the case where the number of classes is very large. We propose a double sampling strategy on top of a multi-class to binary reduction strategy, which transforms the original…
This study proposes a text classification algorithm based on large language models, aiming to address the limitations of traditional methods in capturing long-range dependencies, understanding contextual semantics, and handling class…
With the ongoing growth in number of digital articles in a wider set of languages and the expanding use of different languages, we need annotation methods that enable browsing multi-lingual corpora. Multilingual probabilistic topic models…
Measuring a document's complexity level is an open challenge, particularly when one is working on a diverse corpus of documents rather than comparing several documents on a similar topic or working on a language other than English. In this…
Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We…
Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity--how hard a text is to read--remains less explored. We reduce surface-level complexity (shorter sentences, simpler words,…
The logistic normal distribution has recently been adapted via the transformation of multivariate Gaus- sian variables to model the topical distribution of documents in the presence of correlations among topics. In this paper, we propose a…
In this paper we propose and study a new complexity model for approximation algorithms. The main motivation are practical problems over large data sets that need to be solved many times for different scenarios, e.g., many multicast trees…
In the prose style transfer task a system, provided with text input and a target prose style, produces output which preserves the meaning of the input text but alters the style. These systems require parallel data for evaluation of results…
Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps %(e.g.,…
Annotating large collections of textual data can be time consuming and expensive. That is why the ability to train models with limited annotation budgets is of great importance. In this context, it has been shown that under tight annotation…
Automatic Text Summarization strategies have been successfully employed to digest text collections and extract its essential content. Usually, summaries are generated using textual corpora that belongs to the same domain area where the…
Large-scale language models such as GPT-3 are excellent few-shot learners, allowing them to be controlled via natural text prompts. Recent studies report that prompt-based direct classification eliminates the need for fine-tuning but lacks…
Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and insufficient labeled samples in practical scenarios. We propose a novel model named MI-DELIGHT for short…
In this paper, we focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer and aims to preserve text styles while altering the content. In detail, the input is a set of structured…
Textual representations based on pre-trained language models are key, especially in few-shot learning scenarios. What makes a representation good for text classification? Is it due to the geometric properties of the space or because it is…
Generalized Labeled Multi-Bernoulli (GLMB) densities arise in a host of multi-object system applications analogous to Gaussians in single-object filtering. However, computing the GLMB filtering density requires solving NP-hard problems. To…
A new fast algorithm for clustering and classification of large collections of text documents is introduced. The new algorithm employs the bipartite graph that realizes the word-document matrix of the collection. Namely, the modularity of…