Related papers: Improving a tf-idf weighted document vector embedd…

Comparative Analysis of Document-Level Embedding Methods for Similarity Scoring on Shakespeare Sonnets and Taylor Swift Lyrics

This study evaluates the performance of TF-IDF weighting, averaged Word2Vec embeddings, and BERT embeddings for document similarity scoring across two contrasting textual domains. By analysing cosine similarity scores, the methods'…

Computation and Language · Computer Science 2024-12-24 Klara Kramer

Context Aware Document Embedding

Recently, doc2vec has achieved excellent results in different tasks. In this paper, we present a context aware variant of doc2vec. We introduce a novel weight estimating mechanism that generates weights for each word occurrence according to…

Computation and Language · Computer Science 2017-07-07 Zhaocheng Zhu, Junfeng Hu

A Comparison of Semantic Similarity Methods for Maximum Human Interpretability

The inclusion of semantic information in any similarity measure improves the efficiency of the similarity measure and provides human interpretable results for further analysis. The similarity calculation method that focuses on features…

Information Retrieval · Computer Science 2019-11-01 Pinky Sitikhu, Kritish Pahi, Pujan Thapa, Subarna Shakya

Fusing Vector Space Models for Domain-Specific Applications

We address the problem of tuning word embeddings for specific use cases and domains. We propose a new method that automatically combines multiple domain-specific embeddings, selected from a wide range of pre-trained domain-specific…

Computation and Language · Computer Science 2019-09-06 Laura Rettig, Julien Audiffren, Philippe Cudré-Mauroux

Combining Word Embeddings and N-grams for Unsupervised Document Summarization

Graph-based extractive document summarization relies on the quality of the sentence similarity graph. Bag-of-words or tf-idf based sentence similarity uses exact word matching, but fails to measure the semantic similarity between individual…

Computation and Language · Computer Science 2020-04-30 Zhuolin Jiang, Manaj Srivastava, Sanjay Krishna, David Akodes, Richard Schwartz

Testing different Log Bases For Vector Model Weighting Technique

Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are initially indexed and the words in the documents are assigned weights using a weighting technique called TFIDF which is the…

Information Retrieval · Computer Science 2023-07-13 Kamel Assaf

Toward Incorporation of Relevant Documents in word2vec

Recent advances in neural word embedding provide significant benefit to various information retrieval tasks. However as shown by recent studies, adapting the embedding models for the needs of IR tasks can bring considerable further…

Information Retrieval · Computer Science 2018-04-05 Navid Rekabsaz, Bhaskar Mitra, Mihai Lupu, Allan Hanbury

Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF

The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP) and Computational Linguistics. Transferring data to…

Information Retrieval · Computer Science 2022-11-23 Bakhyt Bakiyev

Word Embedding Dimension Reduction via Weakly-Supervised Feature Selection

As a fundamental task in natural language processing, word embedding converts each word into a representation in a vector space. A challenge with word embedding is that as the vocabulary grows, the vector space's dimension increases, which…

Computation and Language · Computer Science 2024-11-05 Jintang Xue, Yun-Cheng Wang, Chengwei Wei, C.-C. Jay Kuo

Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval

A major difficulty in applying word vector embeddings in IR is in devising an effective and efficient strategy for obtaining representations of compound units of text, such as whole documents, (in comparison to the atomic words), for the…

Information Retrieval · Computer Science 2016-06-28 Dwaipayan Roy, Debasis Ganguly, Mandar Mitra, Gareth J. F. Jones

Utilizing Embeddings for Ad-hoc Retrieval by Document-to-document Similarity

Latent semantic representations of words or paragraphs, namely the embeddings, have been widely applied to information retrieval (IR). One of the common approaches of utilizing embeddings for IR is to estimate the document-to-query (D2Q)…

Information Retrieval · Computer Science 2017-08-11 Chenhao Yang, Ben He, Yanhua Ran

Semantic Sensitive TF-IDF to Determine Word Relevance in Documents

Keyword extraction has received increasing attention as an important research topic that can lead to advancements in diverse applications such as document context categorization, text indexing and document classification. In this…

Information Retrieval · Computer Science 2021-01-27 Amir Jalilifard, Vinicius F. Caridá, Alex F. Mansano, Rogers S. Cristo, Felipe Penhorate C. da Fonseca

An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation

Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word2vec (Mikolov et al., 2013a) to learn document-level embeddings. Despite promising results in the original paper, others have struggled to reproduce those results. This…

Computation and Language · Computer Science 2016-12-19 Jey Han Lau, Timothy Baldwin

Improving the Accuracy of Pre-trained Word Embeddings for Sentiment Analysis

Sentiment analysis is one of the well-known tasks and fast-growing research areas in natural language processing (NLP) and text classification. This technique has become an essential part of a wide range of applications including politics,…

Computation and Language · Computer Science 2017-11-27 Seyed Mahdi Rezaeinia, Ali Ghodsi, Rouhollah Rahmani

A Dual Embedding Space Model for Document Ranking

A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words…

Information Retrieval · Computer Science 2016-02-04 Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana

Text Similarity in Vector Space Models: A Comparative Study

Automatic measurement of semantic text similarity is an important task in natural language processing. In this paper, we evaluate the performance of different vector space models to perform this task. We address the real-world problem of…

Computation and Language · Computer Science 2018-10-02 Omid Shahmirzadi, Adam Lugowski, Kenneth Younge

Words are not Equal: Graded Weighting Model for building Composite Document Vectors

Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging,…

Computation and Language · Computer Science 2015-12-14 Pranjal Singh, Amitabha Mukerjee

Word Mover's Embedding: From Word2Vec to Document Embedding

While the celebrated Word2Vec technique yields semantically rich representations for individual words, there has been relatively less success in extending it to generate unsupervised sentence or document embeddings. Recent work has…

Computation and Language · Computer Science 2018-11-06 Lingfei Wu, Ian E. H. Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep Ravikumar, Michael J. Witbrock

Inverse-Category-Frequency based supervised term weighting scheme for text categorization

Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifiers and SVMs. The widely used term weighting scheme in text categorization, i.e., tf.idf, originated from information retrieval…

Machine Learning · Computer Science 2012-06-07 Deqing Wang, Hui Zhang

Latent Semantic Analysis Approach for Document Summarization Based on Word Embeddings

Since the amount of information on the internet is growing rapidly, it is not easy for a user to find relevant information for his/her query. To tackle this issue, much attention has been paid to Automatic Document Summarization. The key…

Computation and Language · Computer Science 2019-02-05 Kamal Al-Sabahi, Zhang Zuping, Yang Kang