Related papers: German Text Embedding Clustering Benchmark

200 papers

Text clustering is an important method for organising the increasing volume of digital content, aiding in the structuring and discovery of hidden patterns in uncategorised data. The effectiveness of text clustering largely depends on the…

Computation and Language · Computer Science 2024-12-06 Alina Petukhova, João P. Matos-Carvalho, Nuno Fachada

Recent work incorporates pre-trained word embeddings such as BERT embeddings into Neural Topic Models (NTMs), generating highly coherent topics. However, with high-quality contextualized document representations, do we really need…

Computation and Language · Computer Science 2022-04-22 Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad

Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore…

Machine Learning · Computer Science 2024-07-29 Luke Merrick

We present a clustering-based language model using word embeddings for text readability prediction. Presumably, a Euclidean semantic space hypothesis holds true for word embeddings whose training is done by observing word co-occurrences.…

Computation and Language · Computer Science 2017-09-07 Miriam Cha, Youngjune Gwon, H. T. Kung

Sentence embedding methods offer a powerful approach for working with short textual constructs or sequences of words. By representing sentences as dense numerical vectors, many natural language processing (NLP) applications have improved…

Computation and Language · Computer Science 2021-10-05 Yuan An, Alexander Kalinowski, Jane Greenberg

Text clustering serves as a fundamental technique for organizing and interpreting unstructured textual data, particularly in contexts where manual annotation is prohibitively costly. With the rapid advancement of Large Language Models…

Computation and Language · Computer Science 2025-10-08 Chen Huang, Guoxiu He

In real-world scenarios, a text classification task often begins with a cold start, when labeled data is scarce. In such cases, the common practice of fine-tuning pre-trained models, such as BERT, for a target classification task, is prone…

Computation and Language · Computer Science 2022-03-22 Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov, Noam Slonim

Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been…

Computation and Language · Computer Science 2019-06-24 Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, Matthew B. A. McDermott

In this paper, an improved clustering technique for large textual datasets by leveraging fine-tuned word embeddings is presented. The WEClustering technique is used as the base model. The WEClustering model is further improved by incorporating…

Machine Learning · Computer Science 2025-05-22 Vijay Kumar Sutrakar, Nikhil Mogre

The success of bidirectional encoders using masked language models, such as BERT, on numerous natural language processing tasks has prompted researchers to attempt to incorporate these pre-trained models into neural machine translation…

Computation and Language · Computer Science 2021-09-13 Haoran Xu, Benjamin Van Durme, Kenton Murray

Natural Language Processing (NLP) has become increasingly utilized to provide adaptivity in educational applications. However, recent research has highlighted a variety of biases in pre-trained language models. While existing studies…

Computation and Language · Computer Science 2022-09-23 Thiemo Wambsganss, Vinitra Swamy, Roman Rietsche, Tanja Käser

Visual grounding of language aims at enriching textual representations of language with multiple sources of visual knowledge such as images and videos. Although visual grounding is an area of intense research, inter-lingual aspects of…

Computation and Language · Computer Science 2022-11-22 Wafaa Mohammed, Hassan Shahmohammadi, Hendrik P. A. Lensch, R. Harald Baayen

Recent techniques for the task of short text clustering often rely on word embeddings as a transfer learning component. This paper shows that sentence vector representations from Transformers in conjunction with different clustering methods…

Computation and Language · Computer Science 2021-02-02 Leonid Pugachev, Mikhail Burtsev

Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first lexicon-based embeddings (LENS) leveraging…

Computation and Language · Computer Science 2026-03-20 Yibin Lei, Tao Shen, Yu Cao, Andrew Yates

With the advent of e-commerce platforms, reviews are crucial for customers to assess the credibility of a product. The star ratings do not always match the review text written by the customer. For example, a three star rating (out of five)…

Machine Learning · Computer Science 2023-05-08 Rohan Saha

We experiment with two recent contextualized word embedding methods (ELMo and BERT) in the context of open-domain argument search. For the first time, we show how to leverage the power of contextualized word embeddings to classify and…

Computation and Language · Computer Science 2019-06-25 Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, Iryna Gurevych

Pre-trained language models such as BERT have been proved to be powerful in many natural language processing tasks. But in some text classification applications such as emotion recognition and sentiment analysis, BERT may not lead to…

Computation and Language · Computer Science 2025-06-03 Zixiao Zhu, Kezhi Mao

The contextual word embedding model, BERT, has proved its ability on downstream tasks with limited quantities of annotated data. BERT and its variants help to reduce the burden of complex annotation work in many interdisciplinary research…

Computation and Language · Computer Science 2022-04-07 Gechuan Zhang, Paul Nulty, David Lillis

Understanding patient feedback is crucial for improving healthcare services, yet analyzing unlabeled short-text feedback presents challenges due to limited data and domain-specific nuances. Traditional supervised approaches require…

Machine Learning · Computer Science 2026-01-21 K M Sajjadul Islam, Ravi Teja Karri, Srujan Vegesna, Jiawei Wu, Praveen Madiraju

Automatic text classification (TC) research can be used for real-world problems such as the classification of in-patient discharge summaries and medical text reports, which is beneficial to make medical documents more understandable to…

Computation and Language · Computer Science 2018-12-06 Ying Shen, Qiang Zhang, Jin Zhang, Jiyue Huang, Yuming Lu, Kai Lei