Related papers: hyperdoc2vec: Distributed Representations of Hyper…

An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation

Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word2vec (Mikolov et al., 2013a) to learn document-level embeddings. Despite promising results in the original paper, others have struggled to reproduce those results. This…

Computation and Language · Computer Science 2016-12-19 Jey Han Lau, Timothy Baldwin

DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging

Tagging news articles or blog posts with relevant tags from a collection of predefined ones is coined as document tagging in this work. Accurate tagging of articles can benefit several downstream applications such as recommendation and…

Computation and Language · Computer Science 2017-07-18 Sheng Chen, Akshay Soni, Aasish Pappu, Yashar Mehdad

KeyVec: Key-semantics Preserving Document Representations

Previous studies have demonstrated the empirical success of word embeddings in various applications. In this paper, we investigate the problem of learning distributed representations for text documents which many machine learning algorithms…

Computation and Language · Computer Science 2017-09-29 Bin Bi, Hao Ma

Citation Recommendations Considering Content and Structural Context Embedding

The number of academic papers being published is increasing exponentially in recent years, and recommending adequate citations to assist researchers in writing papers is a non-trivial task. Conventional approaches may not be optimal, as the…

Information Retrieval · Computer Science 2020-01-09 Yang Zhang, Qiang Ma

Efficient Vector Representation for Documents through Corruption

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such…

Computation and Language · Computer Science 2017-07-11 Minmin Chen

Coherence-Based Distributed Document Representation Learning for Scientific Documents

Distributed document representation is one of the basic problems in natural language processing. Currently distributed document representation methods mainly consider the context information of words or sentences. These methods do not take…

Computation and Language · Computer Science 2022-01-11 Shicheng Tan, Shu Zhao, Yanping Zhang

Document Embedding with Paragraph Vectors

Paragraph Vectors has been recently proposed as an unsupervised method for learning distributed representations for pieces of texts. In their work, the authors showed that the method can learn an embedding of movie review texts which can be…

Computation and Language · Computer Science 2015-07-30 Andrew M. Dai, Christopher Olah, Quoc V. Le

Utility of General and Specific Word Embeddings for Classifying Translational Stages of Research

Conventional text classification models make a bag-of-words assumption reducing text into word occurrence counts per document. Recent algorithms such as word2vec are capable of learning semantic meaning and similarity between words in an…

Computation and Language · Computer Science 2018-07-11 Vincent Major, Alisa Surkis, Yindalon Aphinyanaphongs

Keyword Embeddings for Query Suggestion

Nowadays, search engine users commonly rely on query suggestions to improve their initial inputs. Current systems are very good at recommending lexical adaptations or spelling corrections to users' queries. However, they often struggle to…

Information Retrieval · Computer Science 2023-01-24 Jorge Gabín, M. Eduardo Ares, Javier Parapar

Paper2vec: Citation-Context Based Document Distributed Representation for Scholar Recommendation

Due to the availability of references of research papers and the rich information contained in papers, various citation analysis approaches have been proposed to identify similar documents for scholar recommendation. Despite the success…

Information Retrieval · Computer Science 2017-03-21 Han Tian, Hankz Hankui Zhuo

Document-as-Image Representations Fall Short for Scientific Retrieval

Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA…

Information Retrieval · Computer Science 2026-04-21 Ghazal Khalighinejad, Raghuveer Thirukovalluru, Alexander H. Oh, Bhuwan Dhingra

HDLTex: Hierarchical Deep Learning for Text Classification

The continually increasing number of documents produced each year necessitates ever improving information processing methods for searching, retrieving, and organizing text. Central to these information processing methods is document…

Machine Learning · Computer Science 2018-03-29 Kamran Kowsari, Donald E. Brown, Mojtaba Heidarysafa, Kiana Jafari Meimandi, Matthew S. Gerber, Laura E. Barnes

Effective Distributed Representations for Academic Expert Search

Expert search aims to find and rank experts based on a user's query. In academia, retrieving experts is an efficient way to navigate through a large amount of academic knowledge. Here, we study how different distributed representations of…

Information Retrieval · Computer Science 2022-11-10 Mark Berger, Jakub Zavrel, Paul Groth

Top2Vec: Distributed Representations of Topics

Topic modeling is used for discovering latent semantic structure, usually referred to as topics, in a large collection of documents. The most widely used methods are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis.…

Computation and Language · Computer Science 2020-08-24 Dimo Angelov

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data

Vector representations of graphs and relational structures, whether hand-crafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of…

Machine Learning · Computer Science 2020-03-31 Martin Grohe

The Influence of Feature Representation of Text on the Performance of Document Classification

In this paper we perform a comparative analysis of three models for feature representation of text documents in the context of document classification. In particular, we consider the most often used family of models bag-of-words, recently…

Computation and Language · Computer Science 2017-07-06 Sanda Martinčić-Ipšić, Tanja Miličić, Ljupčo Todorovski

A Dual Embedding Space Model for Document Ranking

A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words…

Information Retrieval · Computer Science 2016-02-04 Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana

Utilizing Embeddings for Ad-hoc Retrieval by Document-to-document Similarity

Latent semantic representations of words or paragraphs, namely the embeddings, have been widely applied to information retrieval (IR). One of the common approaches of utilizing embeddings for IR is to estimate the document-to-query (D2Q)…

Information Retrieval · Computer Science 2017-08-11 Chenhao Yang, Ben He, Yanhua Ran

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME…

Computer Vision and Pattern Recognition · Computer Science 2025-07-08 Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz

Terminology-based Text Embedding for Computing Document Similarities on Technical Content

We propose in this paper a new, hybrid document embedding approach in order to address the problem of document similarities with respect to the technical content. To do so, we employ state-of-the-art graph techniques to first extract the…

Computation and Language · Computer Science 2019-07-02 Hamid Mirisaee, Eric Gaussier, Cedric Lagnier, Agnes Guerraz