Related papers: RETVec: Resilient and Efficient Text Vectorizer

RETSim: Resilient and Efficient Text Similarity

This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication…

Computation and Language · Computer Science 2023-11-30 Marina Zhang, Owen Vallis, Aysegul Bumin, Tanay Vakharia, Elie Bursztein

Vec2Vec: A Compact Neural Network Approach for Transforming Text Embeddings with High Fidelity

Vector embeddings have become ubiquitous tools for many language-related tasks. A leading embedding model is OpenAI's text-ada-002 which can embed approximately 6,000 words into a 1,536-dimensional vector. While powerful, text-ada-002 is…

Computation and Language · Computer Science 2023-06-23 Andrew Kean Gao

Tweet2Vec: Learning Tweet Embeddings Using Character-level CNN-LSTM Encoder-Decoder

We present Tweet2Vec, a novel method for generating general-purpose vector representation of tweets. The model learns tweet embeddings using character-level CNN-LSTM encoder-decoder. We trained our model on 3 million, randomly selected…

Computation and Language · Computer Science 2016-07-27 Soroush Vosoughi, Prashanth Vijayaraghavan, Deb Roy

Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages

Tokenization and sub-tokenization based models like word2vec, BERT and the GPTs are the state-of-the-art in natural language processing. Typically, these approaches have limitations with respect to their input representation. They fail to…

Computation and Language · Computer Science 2026-02-26 Felix Schneider, Maria Gogolev, Sven Sickert, Joachim Denzler

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

Embedding benchmarks like MTEB report a single score per model, implicitly treating robustness as a static, scalar property. We argue that embedding robustness is multidimensional, since models respond differently to different types of…

Computation and Language · Computer Science 2026-05-28 Manuel Frank, Haithem Afli

Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

In this paper, we propose a novel deep neural network architecture, Speech2Vec, for learning fixed-length vector representations of audio segments excised from a speech corpus, where the vectors contain semantic information pertaining to…

Computation and Language · Computer Science 2018-06-12 Yu-An Chung, James Glass

PSDVec: a Toolbox for Incremental and Scalable Word Embedding

PSDVec is a Python/Perl toolbox that learns word embeddings, i.e. the mapping of words in a natural language to continuous vectors which encode the semantic/syntactic regularities between the words. PSDVec implements a word embedding…

Computation and Language · Computer Science 2016-07-05 Shaohua Li, Jun Zhu, Chunyan Miao

Dis-S2V: Discourse Informed Sen2Vec

Vector representation of sentences is important for many text processing tasks that involve clustering, classifying, or ranking sentences. Recently, distributed representation of sentences learned by neural models from unlabeled data has…

Computation and Language · Computer Science 2016-10-27 Tanay Kumar Saha, Shafiq Joty, Naeemul Hassan, Mohammad Al Hasan

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can…

Computer Vision and Pattern Recognition · Computer Science 2025-01-03 Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen

Interact2Vec -- An efficient neural network-based model for simultaneously learning users and items embeddings in recommender systems

Over the past decade, recommender systems have experienced a surge in popularity. Despite notable progress, they grapple with challenging issues, such as high data dimensionality and sparseness. Representing users and items as…

Information Retrieval · Computer Science 2025-07-28 Pedro R. Pires, Tiago A. Almeida

SeVeN: Augmenting Word Embeddings with Unsupervised Relation Vectors

We present SeVeN (Semantic Vector Networks), a hybrid resource that encodes relationships between words in the form of a graph. Different from traditional semantic networks, these relations are represented as vectors in a continuous vector…

Computation and Language · Computer Science 2018-08-21 Luis Espinosa-Anke, Steven Schockaert

Gecko: Versatile Text Embeddings Distilled from Large Language Models

We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process…

Computation and Language · Computer Science 2024-04-01 Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, Iftekhar Naim

Research on Optimization of Natural Language Processing Model Based on Multimodal Deep Learning

This project intends to study the image representation based on attention mechanism and multimodal data. By adding multiple pattern layers to the attribute model, the semantic and hidden layers of image content are integrated. The word…

Computation and Language · Computer Science 2024-06-14 Dan Sun, Yaxin Liang, Yining Yang, Yuhan Ma, Qishi Zhan, Erdi Gao

RobustSentEmbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive Learning

Pre-trained language models (PLMs) have consistently demonstrated outstanding performance across a diverse spectrum of natural language processing tasks. Nevertheless, despite their success with unseen data, current PLM-based…

Computation and Language · Computer Science 2024-03-19 Javad Rafiei Asl, Prajwal Panzade, Eduardo Blanco, Daniel Takabi, Zhipeng Cai

Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model

Word embedding or vector representation of word holds syntactical and semantic characteristics of a word which can be an informative feature for any machine learning-based models of natural language processing. There are several deep…

Computation and Language · Computer Science 2021-05-05 Rifat Rahman

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval…

Computer Vision and Pattern Recognition · Computer Science 2021-03-31 Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

MTEB: Massive Text Embedding Benchmark

Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be…

Computation and Language · Computer Science 2023-03-21 Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers

WordRank: Learning Word Embeddings via Robust Ranking

Embedding words in a vector space has gained a lot of attention in recent years. While state-of-the-art methods provide efficient computation of word similarities via a low-dimensional matrix embedding, their motivation is often left…

Computation and Language · Computer Science 2016-09-29 Shihao Ji, Hyokun Yun, Pinar Yanardag, Shin Matsushima, S. V. N. Vishwanathan

ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval

In this paper, we introduce ReasonEmbed, a novel text embedding model developed for reasoning-intensive document retrieval. Our work includes three key technical contributions. First, we propose ReMixer, a new data synthesis method that…

Information Retrieval · Computer Science 2026-04-21 Jianlyu Chen, Junwei Lan, Chaofan Li, Defu Lian, Zheng Liu

Bit Cipher -- A Simple yet Powerful Word Representation System that Integrates Efficiently with Language Models

While Large Language Models (LLMs) become ever more dominant, classic pre-trained word embeddings sustain their relevance through computational efficiency and nuanced linguistic interpretation. Drawing from recent studies demonstrating that…

Computation and Language · Computer Science 2023-11-21 Haoran Zhao, Jake Ryland Williams