Related papers: Grouping Words Using Statistical Context

Disambiguating Noun Groupings with Respect to WordNet Senses

Word groupings useful for language processing tasks are increasingly available, as thesauri appear on-line, and as distributional word clustering techniques improve. However, for many tasks, one is interested in relationships among word…

cmp-lg · Computer Science 2008-02-03 Philip Resnik

Mimicking Human Process: Text Representation via Latent Semantic Clustering for Classification

Considering that words with different characteristic in the text have different importance for classification, grouping them together separately can strengthen the semantic expression of each part. Thus we propose a new text representation…

Computation and Language · Computer Science 2019-06-19 Xiaoye Tan, Rui Yan, Chongyang Tao, Mingrui Wu

Computing Word Classes Using Spectral Clustering

Clustering a lexicon of words is a well-studied problem in natural language processing (NLP). Word clusters are used to deal with sparse data in statistical language processing, as well as features for solving various NLP tasks (text…

Computation and Language · Computer Science 2018-08-17 Effi Levi, Saggy Herman, Ari Rappoport

Distributional Clustering of English Words

We describe and experimentally evaluate a method for automatically clustering words according to their distribution in particular syntactic contexts. Deterministic annealing is used to find lowest distortion sets of clusters. As the…

cmp-lg · Computer Science 2008-02-03 Fernando Pereira, Naftali Tishby, Lillian Lee

Neural Text Classification by Jointly Learning to Cluster and Align

Distributional text clustering delivers semantically informative representations and captures the relevance between each word and semantic clustering centroids. We extend the neural text clustering approach to text classification tasks by…

Computation and Language · Computer Science 2020-11-25 Yekun Chai, Haidong Zhang, Shuo Jin

Unsupervised Learning of Word-Category Guessing Rules

Words unknown to the lexicon present a substantial problem to part-of-speech tagging. In this paper we present a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words. Three…

cmp-lg · Computer Science 2008-02-03 Andrei Mikheev

Resampling methods for document clustering

We compare the performance of different clustering algorithms applied to the task of unsupervised text categorization. We consider agglomerative clustering algorithms, principal direction divisive partitioning and (for the first time)…

Disordered Systems and Neural Networks · Physics 2007-05-23 D. Volk, M. G. Stepanov

Explaining Datasets in Words: Statistical Models with Natural Language Parameters

To make sense of massive data, we often fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often…

Artificial Intelligence · Computer Science 2025-01-14 Ruiqi Zhong, Heng Wang, Dan Klein, Jacob Steinhardt

On Language Clustering: A Non-parametric Statistical Approach

Any approach aimed at pasteurizing and quantifying a particular phenomenon must include the use of robust statistical methodologies for data analysis. With this in mind, the purpose of this study is to present statistical approaches that…

Computation and Language · Computer Science 2023-06-29 Anagh Chattopadhyay, Soumya Sankar Ghosh, Samir Karmakar

Testing network clustering algorithms with Natural Language Processing

The advent of online social networks has led to the development of an abundant literature on the study of online social groups and their relationship to individuals' personalities as revealed by their textual productions. Social structures…

Social and Information Networks · Computer Science 2024-06-26 Ixandra Achitouv, David Chavalarias, Bruno Gaume

Cluster-norm for Unsupervised Probing of Knowledge

The deployment of language models brings challenges in generating reliable information, especially when these models are fine-tuned using human preferences. To extract encoded knowledge without (potentially) biased human labels,…

Artificial Intelligence · Computer Science 2024-10-07 Walter Laurito, Sharan Maiya, Grégoire Dhimoïla, Owen Yeung, Kaarel Hänni

Unsupervised Disambiguation of Syncretism in Inflected Lexicons

Lexical ambiguity makes it difficult to compute various useful statistics of a corpus. A given word form might represent any of several morphological feature bundles. One can, however, use unsupervised learning (as in EM) to fit a model…

Computation and Language · Computer Science 2020-02-26 Ryan Cotterell, Christo Kirov, Sabrina J. Mielke, Jason Eisner

Consensus Sequence Segmentation

In this paper we introduce a method to detect words or phrases in a given sequence of alphabets without knowing the lexicon. Our linear time unsupervised algorithm relies entirely on statistical relationships among alphabets in the input…

Computation and Language · Computer Science 2013-12-31 Tamal Chowdhury, Rabindra Rakshit, Arko Banerjee

Unsupervised Key-phrase Extraction and Clustering for Classification Scheme in Scientific Publications

Several methods have been explored for automating parts of Systematic Mapping (SM) and Systematic Review (SR) methodologies. Challenges typically evolve around the gaps in semantic understanding of text, as well as lack of domain and…

Computation and Language · Computer Science 2021-02-10 Xiajing Li, Marios Daoutis

Hierarchical Latent Word Clustering

This paper presents a new Bayesian non-parametric model by extending the usage of Hierarchical Dirichlet Allocation to extract tree structured word clusters from text data. The inference algorithm of the model collects words in a cluster if…

Computation and Language · Computer Science 2016-01-22 Halid Ziya Yerebakan, Fitsum Reda, Yiqiang Zhan, Yoshihisa Shinagawa

Using Curvature and Markov Clustering in Graphs for Lexical Acquisition and Word Sense Discrimination

We introduce two different approaches for clustering semantically similar words. We accommodate ambiguity by allowing a word to belong to several clusters. Both methods use a graph-theoretic representation of words and their paradigmatic…

Other Condensed Matter · Physics 2009-09-29 Beate Dorow, Dominic Widdows, Katarina Ling, Jean-Pierre Eckmann, Danilo Sergi, Elisha Moses

Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach

Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together aiming to the construction of well-established clusters that their elements are classified according to their similarity. The…

Machine Learning · Statistics 2023-10-20 Dimitrios Saligkaras, Vasileios E. Papageorgiou

Corpus analysis without prior linguistic knowledge - unsupervised mining of phrases and subphrase structure

When looking at the structure of natural language, "phrases" and "words" are central notions. We consider the problem of identifying such "meaningful subparts" of language of any length and underlying composition principles in a completely…

Computation and Language · Computer Science 2016-02-19 Stefan Gerdjikov, Klaus U. Schulz

Topological Data Analysis for Word Sense Disambiguation

We develop and test a novel unsupervised algorithm for word sense induction and disambiguation which uses topological data analysis. Typical approaches to the problem involve clustering, based on simple low level features of distance in…

Computation and Language · Computer Science 2022-03-02 Michael Rawson, Samuel Dooley, Mithun Bharadwaj, Rishabh Choudhary

Graph-based Clustering for Detecting Semantic Change Across Time and Languages

Despite the predominance of contextualized embeddings in NLP, approaches to detect semantic change relying on these embeddings and clustering methods underperform simpler counterparts based on static word embeddings. This stems from the…

Computation and Language · Computer Science 2024-02-05 Xianghe Ma, Michael Strube, Wei Zhao