Related papers: Grouping Words Using Statistical Context
Word groupings useful for language processing tasks are increasingly available, as thesauri appear on-line, and as distributional word clustering techniques improve. However, for many tasks, one is interested in relationships among word…
Considering that words with different characteristic in the text have different importance for classification, grouping them together separately can strengthen the semantic expression of each part. Thus we propose a new text representation…
Clustering a lexicon of words is a well-studied problem in natural language processing (NLP). Word clusters are used to deal with sparse data in statistical language processing, as well as features for solving various NLP tasks (text…
We describe and experimentally evaluate a method for automatically clustering words according to their distribution in particular syntactic contexts. Deterministic annealing is used to find lowest distortion sets of clusters. As the…
Distributional text clustering delivers semantically informative representations and captures the relevance between each word and semantic clustering centroids. We extend the neural text clustering approach to text classification tasks by…
Words unknown to the lexicon present a substantial problem to part-of-speech tagging. In this paper we present a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words. Three…
We compare the performance of different clustering algorithms applied to the task of unsupervised text categorization. We consider agglomerative clustering algorithms, principal direction divisive partitioning and (for the first time)…
To make sense of massive data, we often fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often…
Any approach aimed at pasteurizing and quantifying a particular phenomenon must include the use of robust statistical methodologies for data analysis. With this in mind, the purpose of this study is to present statistical approaches that…
The advent of online social networks has led to the development of an abundant literature on the study of online social groups and their relationship to individuals' personalities as revealed by their textual productions. Social structures…
The deployment of language models brings challenges in generating reliable information, especially when these models are fine-tuned using human preferences. To extract encoded knowledge without (potentially) biased human labels,…
Lexical ambiguity makes it difficult to compute various useful statistics of a corpus. A given word form might represent any of several morphological feature bundles. One can, however, use unsupervised learning (as in EM) to fit a model…
In this paper we introduce a method to detect words or phrases in a given sequence of alphabets without knowing the lexicon. Our linear time unsupervised algorithm relies entirely on statistical relationships among alphabets in the input…
Several methods have been explored for automating parts of Systematic Mapping (SM) and Systematic Review (SR) methodologies. Challenges typically evolve around the gaps in semantic understanding of text, as well as lack of domain and…
This paper presents a new Bayesian non-parametric model by extending the usage of Hierarchical Dirichlet Allocation to extract tree structured word clusters from text data. The inference algorithm of the model collects words in a cluster if…
We introduce two different approaches for clustering semantically similar words. We accommodate ambiguity by allowing a word to belong to several clusters. Both methods use a graph-theoretic representation of words and their paradigmatic…
Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together aiming to the construction of well-established clusters that their elements are classified according to their similarity. The…
When looking at the structure of natural language, "phrases" and "words" are central notions. We consider the problem of identifying such "meaningful subparts" of language of any length and underlying composition principles in a completely…
We develop and test a novel unsupervised algorithm for word sense induction and disambiguation which uses topological data analysis. Typical approaches to the problem involve clustering, based on simple low level features of distance in…
Despite the predominance of contextualized embeddings in NLP, approaches to detect semantic change relying on these embeddings and clustering methods underperform simpler counterparts based on static word embeddings. This stems from the…