Related papers: RETVec: Resilient and Efficient Text Vectorizer

200 papers

This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication…

Computation and Language · Computer Science 2023-11-30 Marina Zhang, Owen Vallis, Aysegul Bumin, Tanay Vakharia, Elie Bursztein

Vector embeddings have become ubiquitous tools for many language-related tasks. A leading embedding model is OpenAI's text-ada-002 which can embed approximately 6,000 words into a 1,536-dimensional vector. While powerful, text-ada-002 is…

Computation and Language · Computer Science 2023-06-23 Andrew Kean Gao

We present Tweet2Vec, a novel method for generating general-purpose vector representation of tweets. The model learns tweet embeddings using character-level CNN-LSTM encoder-decoder. We trained our model on 3 million, randomly selected…

Computation and Language · Computer Science 2016-07-27 Soroush Vosoughi, Prashanth Vijayaraghavan, Deb Roy

Tokenization and sub-tokenization based models like word2vec, BERT and the GPTs are the state-of-the-art in natural language processing. Typically, these approaches have limitations with respect to their input representation. They fail to…

Computation and Language · Computer Science 2026-02-26 Felix Schneider, Maria Gogolev, Sven Sickert, Joachim Denzler

Embedding benchmarks like MTEB report a single score per model, implicitly treating robustness as a static, scalar property. We argue that embedding robustness is multidimensional, since models respond differently to different types of…

Computation and Language · Computer Science 2026-05-28 Manuel Frank, Haithem Afli

In this paper, we propose a novel deep neural network architecture, Speech2Vec, for learning fixed-length vector representations of audio segments excised from a speech corpus, where the vectors contain semantic information pertaining to…

Computation and Language · Computer Science 2018-06-12 Yu-An Chung, James Glass

PSDVec is a Python/Perl toolbox that learns word embeddings, i.e. the mapping of words in a natural language to continuous vectors which encode the semantic/syntactic regularities between the words. PSDVec implements a word embedding…

Computation and Language · Computer Science 2016-07-05 Shaohua Li, Jun Zhu, Chunyan Miao

Vector representation of sentences is important for many text processing tasks that involve clustering, classifying, or ranking sentences. Recently, distributed representation of sentences learned by neural models from unlabeled data has…

Computation and Language · Computer Science 2016-10-27 Tanay Kumar Saha, Shafiq Joty, Naeemul Hassan, Mohammad Al Hasan

Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can…

Computer Vision and Pattern Recognition · Computer Science 2025-01-03 Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen

Over the past decade, recommender systems have experienced a surge in popularity. Despite notable progress, they grapple with challenging issues, such as high data dimensionality and sparseness. Representing users and items as…

Information Retrieval · Computer Science 2025-07-28 Pedro R. Pires, Tiago A. Almeida

We present SeVeN (Semantic Vector Networks), a hybrid resource that encodes relationships between words in the form of a graph. Different from traditional semantic networks, these relations are represented as vectors in a continuous vector…

Computation and Language · Computer Science 2018-08-21 Luis Espinosa-Anke, Steven Schockaert

We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process…

This project intends to study the image representation based on attention mechanism and multimodal data. By adding multiple pattern layers to the attribute model, the semantic and hidden layers of image content are integrated. The word…

Computation and Language · Computer Science 2024-06-14 Dan Sun, Yaxin Liang, Yining Yang, Yuhan Ma, Qishi Zhan, Erdi Gao

Pre-trained language models (PLMs) have consistently demonstrated outstanding performance across a diverse spectrum of natural language processing tasks. Nevertheless, despite their success with unseen data, current PLM-based…

Computation and Language · Computer Science 2024-03-19 Javad Rafiei Asl, Prajwal Panzade, Eduardo Blanco, Daniel Takabi, Zhipeng Cai

Word embedding or vector representation of word holds syntactical and semantic characteristics of a word which can be an informative feature for any machine learning-based models of natural language processing. There are several deep…

Computation and Language · Computer Science 2021-05-05 Rifat Rahman

Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval…

Computer Vision and Pattern Recognition · Computer Science 2021-03-31 Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be…

Computation and Language · Computer Science 2023-03-21 Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers

Embedding words in a vector space has gained a lot of attention in recent years. While state-of-the-art methods provide efficient computation of word similarities via a low-dimensional matrix embedding, their motivation is often left…

Computation and Language · Computer Science 2016-09-29 Shihao Ji, Hyokun Yun, Pinar Yanardag, Shin Matsushima, S. V. N. Vishwanathan

In this paper, we introduce ReasonEmbed, a novel text embedding model developed for reasoning-intensive document retrieval. Our work includes three key technical contributions. First, we propose ReMixer, a new data synthesis method that…

Information Retrieval · Computer Science 2026-04-21 Jianlyu Chen, Junwei Lan, Chaofan Li, Defu Lian, Zheng Liu

While Large Language Models (LLMs) become ever more dominant, classic pre-trained word embeddings sustain their relevance through computational efficiency and nuanced linguistic interpretation. Drawing from recent studies demonstrating that…

Computation and Language · Computer Science 2023-11-21 Haoran Zhao, Jake Ryland Williams