Related papers: Modeling Text Complexity using a Multi-Scale Probi…

An empirical study on large scale text classification with skip-gram embeddings

We investigate the integration of word embeddings as classification features in the setting of large scale text classification. Such representations have been used in a plethora of tasks, however their application in classification…

Computation and Language · Computer Science 2016-06-22 Georgios Balikas, Massih-Reza Amini

Locating the Leading Edge of Cultural Change

Measures of textual similarity and divergence are increasingly used to study cultural change. But which measures align, in practice, with social evidence about change? We apply three different representations of text (topic models, document…

Computation and Language · Computer Science 2024-11-25 Sarah Griebel, Becca Cohen, Lucian Li, Jaihyun Park, Jiayu Liu, Jana Perkins, Ted Underwood

An Iterative Contextualization Algorithm with Second-Order Attention

Combining the representations of the words that make up a sentence into a cohesive whole is difficult, since it needs to account for the order of words, and to establish how the words present relate to each other. The solution we propose…

Computation and Language · Computer Science 2021-03-04 Diego Maupomé, Marie-Jean Meurs

Understanding the Properties of Generated Corpora

Models for text generation have become focal for many research tasks and especially for the generation of sentence corpora. However, understanding the properties of an automatically generated text corpus remains challenging. We propose a…

Computation and Language · Computer Science 2022-10-28 Naama Zwerdling, Segev Shlomov, Esther Goldbraich, George Kour, Boaz Carmeli, Naama Tepper, Inbal Ronen, Vitaly Zabershinsky, Ateret Anaby-Tavor

A Conceptual Model for Measuring the Complexity of Spreadsheets

Spreadsheets are widely used in industry, even for critical business processes. This implies the need for proper risk assessment in spreadsheets to evaluate the reliability and validity of the spreadsheet's outcome. As related research has…

Software Engineering · Computer Science 2017-04-06 Thomas Reschenhofer, Bernhard Waltl, Klym Shumaiev, Florian Matthes

Multilevel linear models, Gibbs samplers and multigrid decompositions

We study the convergence properties of the Gibbs Sampler in the context of posterior distributions arising from Bayesian analysis of conditionally Gaussian hierarchical models. We develop a multigrid approach to derive analytic expressions…

Computation · Statistics 2019-06-27 Giacomo Zanella, Gareth Roberts

Generalised Spherical Text Embedding

This paper aims to provide an unsupervised modelling approach that allows for a more flexible representation of text embeddings. It jointly encodes the words and the paragraphs as individual matrices of arbitrary column dimension with unit…

Computation and Language · Computer Science 2022-12-01 Souvik Banerjee, Bamdev Mishra, Pratik Jawanpuria, Manish Shrivastava

MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information

We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only and without any annotated training document provided. Most existing…

Computation and Language · Computer Science 2023-10-24 Yu Zhang, Shweta Garg, Yu Meng, Xiusi Chen, Jiawei Han

Identifying and Reducing Gender Bias in Word-Level Language Models

Many text corpora exhibit socially problematic biases, which can be propagated or amplified in the models trained on such data. For example, doctor cooccurs more frequently with male pronouns than female pronouns. In this study we (i)…

Computation and Language · Computer Science 2019-04-08 Shikha Bordia, Samuel R. Bowman

Density-Based Dynamic Curriculum Learning for Intent Detection

Pre-trained language models have achieved noticeable performance on the intent detection task. However, due to assigning an identical weight to each sample, they suffer from the overfitting of simple samples and the failure to learn complex…

Computation and Language · Computer Science 2021-08-25 Yantao Gong, Cao Liu, Jiazhen Yuan, Fan Yang, Xunliang Cai, Guanglu Wan, Jiansong Chen, Ruiyao Niu, Houfeng Wang

CLASS: Enhancing Cross-Modal Text-Molecule Retrieval Performance and Training Efficiency

Cross-modal text-molecule retrieval task bridges molecule structures and natural language descriptions. Existing methods predominantly focus on aligning text modality and molecule modality, yet they overlook adaptively adjusting the…

Computation and Language · Computer Science 2025-02-18 Hongyan Wu, Peijian Zeng, Weixiong Zheng, Lianxi Wang, Nankai Lin, Shengyi Jiang, Aimin Yang

SelfMix: Robust Learning Against Textual Label Noise with Self-Mixup Training

The conventional success of textual classification relies on annotated data, and the new paradigm of pre-trained language models (PLMs) still requires a few labeled data for downstream tasks. However, in real-world applications, label noise…

Computation and Language · Computer Science 2022-10-14 Dan Qiao, Chenchen Dai, Yuyang Ding, Juntao Li, Qiang Chen, Wenliang Chen, Min Zhang

Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections

Summarizing data samples by quantitative measures has a long history, with descriptive statistics being a case in point. However, as natural language processing methods flourish, there are still insufficient characteristic metrics to…

Computation and Language · Computer Science 2020-03-20 Yi-An Lai, Xuan Zhu, Yi Zhang, Mona Diab

Interpretable Recognition of Cognitive Distortions in Natural Language Texts

We propose a new approach to multi-factor classification of natural language texts based on weighted structured patterns such as N-grams, taking into account the heterarchical relationships between them, applied to solve such a socially…

Computation and Language · Computer Science 2025-11-11 Anton Kolonin, Anna Arinicheva

Scalable Bayesian shrinkage and uncertainty quantification in high-dimensional regression

Bayesian shrinkage methods have generated a lot of recent interest as tools for high-dimensional regression and model selection. These methods naturally facilitate tractable uncertainty quantification and incorporation of prior information.…

Computation · Statistics 2017-04-17 Bala Rajaratnam, Doug Sparks, Kshitij Khare, Liyuan Zhang

Automated Sized-Type Inference and Complexity Analysis

This paper introduces a new methodology for the complexity analysis of higher-order functional programs, which is based on three components: a powerful type system for size analysis and a sound type inference procedure for it, a ticking…

Logic in Computer Science · Computer Science 2017-04-20 Martin Avanzini, Ugo Dal Lago

Content Modeling Using Latent Permutations

We present a novel Bayesian topic model for learning discourse-level document structure. Our model leverages insights from discourse theory to constrain latent topic assignments in a way that reflects the underlying organization of document…

Information Retrieval · Computer Science 2014-01-16 Harr Chen, S. R. K. Branavan, Regina Barzilay, David R. Karger

Textual Spatial Cosine Similarity

When dealing with document similarity many methods exist today, like cosine similarity. More complex methods are also available based on the semantic analysis of textual information, which are computationally expensive and rarely used in…

Information Retrieval · Computer Science 2015-05-18 Giancarlo Crocetti

An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages

The availability of parallel sentence simplification (SS) is scarce for neural SS modelings. We propose an unsupervised method to build SS corpora from large-scale bilingual translation corpora, alleviating the need for SS supervised…

Computation and Language · Computer Science 2021-09-02 Xinyu Lu, Jipeng Qiang, Yun Li, Yunhao Yuan, Yi Zhu

Seeing The Whole Patient: Using Multi-Label Medical Text Classification Techniques to Enhance Predictions of Medical Codes

Machine learning-based multi-label medical text classifications can be used to enhance the understanding of the human body and aid the need for patient care. We present a broad study on clinical natural language processing techniques to…

Information Retrieval · Computer Science 2020-04-02 Vithya Yogarajan, Jacob Montiel, Tony Smith, Bernhard Pfahringer