Related papers: German Text Embedding Clustering Benchmark

Text Clustering with Large Language Model Embeddings

Text clustering is an important method for organising the increasing volume of digital content, aiding in the structuring and discovery of hidden patterns in uncategorised data. The effectiveness of text clustering largely depends on the…

Computation and Language · Computer Science 2024-12-06 Alina Petukhova , João P. Matos-Carvalho , Nuno Fachada

Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics

Recent work incorporates pre-trained word embeddings such as BERT embeddings into Neural Topic Models (NTMs), generating highly coherent topics. However, with high-quality contextualized document representations, do we really need…

Computation and Language · Computer Science 2022-04-22 Zihan Zhang , Meng Fang , Ling Chen , Mohammad-Reza Namazi-Rad

Embedding And Clustering Your Data Can Improve Contrastive Pretraining

Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore…

Machine Learning · Computer Science 2024-07-29 Luke Merrick

Language Modeling by Clustering with Word Embeddings for Text Readability Assessment

We present a clustering-based language model using word embeddings for text readability prediction. Presumably, an Euclidean semantic space hypothesis holds true for word embeddings whose training is done by observing word co-occurrences.…

Computation and Language · Computer Science 2017-09-07 Miriam Cha , Youngjune Gwon , H. T. Kung

Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences

Sentence embedding methods offer a powerful approach for working with short textual constructs or sequences of words. By representing sentences as dense numerical vectors, many natural language processing (NLP) applications have improved…

Computation and Language · Computer Science 2021-10-05 Yuan An , Alexander Kalinowski , Jane Greenberg

Text Clustering as Classification with LLMs

Text clustering serves as a fundamental technique for organizing and interpreting unstructured textual data, particularly in contexts where manual annotation is prohibitively costly. With the rapid advancement of Large Language Models…

Computation and Language · Computer Science 2025-10-08 Chen Huang , Guoxiu He

Cluster & Tune: Boost Cold Start Performance in Text Classification

In real-world scenarios, a text classification task often begins with a cold start, when labeled data is scarce. In such cases, the common practice of fine-tuning pre-trained models, such as BERT, for a target classification task, is prone…

Computation and Language · Computer Science 2022-03-22 Eyal Shnarch , Ariel Gera , Alon Halfon , Lena Dankin , Leshem Choshen , Ranit Aharonov , Noam Slonim

Publicly Available Clinical BERT Embeddings

Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been…

Computation and Language · Computer Science 2019-06-24 Emily Alsentzer , John R. Murphy , Willie Boag , Wei-Hung Weng , Di Jin , Tristan Naumann , Matthew B. A. McDermott

An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets

In this paper, an improved clustering technique for large textual datasets by leveraging fine-tuned word embeddings is presented. WEClustering technique is used as the base model. WEClustering model is fur-ther improvements incorporating…

Machine Learning · Computer Science 2025-05-22 Vijay Kumar Sutrakar , Nikhil Mogre

BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation

The success of bidirectional encoders using masked language models, such as BERT, on numerous natural language processing tasks has prompted researchers to attempt to incorporate these pre-trained models into neural machine translation…

Computation and Language · Computer Science 2021-09-13 Haoran Xu , Benjamin Van Durme , Kenton Murray

Bias at a Second Glance: A Deep Dive into Bias for German Educational Peer-Review Data Modeling

Natural Language Processing (NLP) has become increasingly utilized to provide adaptivity in educational applications. However, recent research has highlighted a variety of biases in pre-trained language models. While existing studies…

Computation and Language · Computer Science 2022-09-23 Thiemo Wambsganss , Vinitra Swamy , Roman Rietsche , Tanja Käser

Visual Grounding of Inter-lingual Word-Embeddings

Visual grounding of Language aims at enriching textual representations of language with multiple sources of visual knowledge such as images and videos. Although visual grounding is an area of intense research, inter-lingual aspects of…

Computation and Language · Computer Science 2022-11-22 Wafaa Mohammed , Hassan Shahmohammadi , Hendrik P. A. Lensch , R. Harald Baayen

Short Text Clustering with Transformers

Recent techniques for the task of short text clustering often rely on word embeddings as a transfer learning component. This paper shows that sentence vector representations from Transformers in conjunction with different clustering methods…

Computation and Language · Computer Science 2021-02-02 Leonid Pugachev , Mikhail Burtsev

Enhancing Lexicon-Based Text Embeddings with Large Language Models

Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first lexicon-based embeddings (LENS) leveraging…

Computation and Language · Computer Science 2026-03-20 Yibin Lei , Tao Shen , Yu Cao , Andrew Yates

Influence of various text embeddings on clustering performance in NLP

With the advent of e-commerce platforms, reviews are crucial for customers to assess the credibility of a product. The star ratings do not always match the review text written by the customer. For example, a three star rating (out of five)…

Machine Learning · Computer Science 2023-05-08 Rohan Saha

Classification and Clustering of Arguments with Contextualized Word Embeddings

We experiment with two recent contextualized word embedding methods (ELMo and BERT) in the context of open-domain argument search. For the first time, we show how to leverage the power of contextualized word embeddings to classify and…

Computation and Language · Computer Science 2019-06-25 Nils Reimers , Benjamin Schiller , Tilman Beck , Johannes Daxenberger , Christian Stab , Iryna Gurevych

Domain Lexical Knowledge-based Word Embedding Learning for Text Classification under Small Data

Pre-trained language models such as BERT have been proved to be powerful in many natural language processing tasks. But in some text classification applications such as emotion recognition and sentiment analysis, BERT may not lead to…

Computation and Language · Computer Science 2025-06-03 Zixiao Zhu , Kezhi Mao

Enhancing Legal Argument Mining with Domain Pre-training and Neural Networks

The contextual word embedding model, BERT, has proved its ability on downstream tasks with limited quantities of annotated data. BERT and its variants help to reduce the burden of complex annotation work in many interdisciplinary research…

Computation and Language · Computer Science 2022-04-07 Gechuan Zhang , Paul Nulty , David Lillis

Contextual Embedding-based Clustering to Identify Topics for Healthcare Service Improvement

Understanding patient feedback is crucial for improving healthcare services, yet analyzing unlabeled short-text feedback presents challenges due to limited data and domain-specific nuances. Traditional supervised approaches require…

Machine Learning · Computer Science 2026-01-21 K M Sajjadul Islam , Ravi Teja Karri , Srujan Vegesna , Jiawei Wu , Praveen Madiraju

Improving Medical Short Text Classification with Semantic Expansion Using Word-Cluster Embedding

Automatic text classification (TC) research can be used for real-world problems such as the classification of in-patient discharge summaries and medical text reports, which is beneficial to make medical documents more understandable to…

Computation and Language · Computer Science 2018-12-06 Ying Shen , Qiang Zhang , Jin Zhang , Jiyue Huang , Yuming Lu , Kai Lei